Ontology-Based Information
Extraction and Structuring
Stephen W. Liddle†
School of Accountancy and Information Systems
Brigham Young University
Douglas M. Campbell, David W. Embley,‡ and Randy D. Smith
Research funded in part by †Faneuil Research and ‡Novell, Inc.
Copyright © 1998
Motivation

• Database-style queries are effective
  – Find red cars, 1993 or newer, < $5,000
  • Select * From Car Where Color=“red” And Year >= 1993 And Price < 5000
• Web is not a database
  – Uses keyword search
  – Retrieves documents, not records
  – Assuming we have a range operator:
  • “red” and (1993 to 1998) and (1 to 5000)
Solutions


• Web query languages
• Wait for XML to emerge
  – Interoperation/standards?
  – XML query language?
• Wrappers
  – Hand-written or semi-automatically generated parsers
  – Specific to source site, subject to change
Our Approach


• Automatic wrapper generation
• Based on an application ontology
  – Augmented conceptual model
  – Defines constants, keywords, and their relationships
• Best for:
  – Narrow ontological breadth
  – Data-rich documents
Car-Ad Ontology

Object-Relationship Model + Data Frames
Graphical:
[ORM diagram: Car is connected by “has”/“is for” relationship sets to Year, Make, Model, Mileage, Price, Feature, and PhoneNr, and PhoneNr to Extension, with participation constraints matching the textual form below (e.g., Car 0..1 has Year 1..*, Car 0..* has Feature 1..*).]
Textual:
Car [0:1] has Year [1:*];
Year {regexp[2]: “\d{2} : \b’\d{2}\b”, … };
Car [0:1] has Make [1:*];
Make {regexp[10]: “\bchev\b”, “\bchevy\b”, … };
Car [0:1] has Model [1:*];
Model {…};
Car [0:1] has Mileage [1:*];
Mileage {regexp[8]: “\b[1-9]\d{1,2}k”,
  “[1-9]\d?,\d{3} : [^\$\d][1-9]\d?,\d{3}[^\d]” }
  {context: “\bmiles\b”, “\bmi\.”, “\bmi\b”};
Car [0:*] has Feature [1:*];
Feature {regexp[20]:
  -- Colors
  “\baqua\s+metallic\b”, “\bbeige\b”, …
  -- Transmission
  “(5|6)\s*spd\b”, “auto : \bauto(\.|,)”,
  -- Accessories
  “\broof\s+rack\b”, “\bspoiler\b”, …
  ...
(See Figures 2 & 3 of Paper)
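For readers who think in code, the following is a minimal Python sketch, purely illustrative, of how these data-frame patterns might be held in memory; the dictionary layout and names are assumptions, and the pattern lists are abbreviated from the slide:

import re

# Hypothetical in-memory form of the car-ad data frames: each attribute has
# constant patterns (values to extract) and optional context keywords that
# signal the attribute nearby. Pattern lists are abbreviated from the slide.
DATA_FRAMES = {
    "Year":    {"constants": [r"\b'\d{2}\b"],                        "keywords": []},
    "Make":    {"constants": [r"\bchev\b", r"\bchevy\b"],            "keywords": []},
    "Mileage": {"constants": [r"\b[1-9]\d{1,2}k", r"[1-9]\d?,\d{3}"],
                "keywords":  [r"\bmiles\b", r"\bmi\.", r"\bmi\b"]},
    "Feature": {"constants": [r"\bbeige\b", r"[56]\s*spd\b", r"\bspoiler\b"],
                "keywords":  []},
}

# Compile once so a recognizer can scan many documents cheaply.
COMPILED = {
    attr: {kind: [re.compile(p, re.IGNORECASE) for p in pats]
           for kind, pats in frame.items()}
    for attr, frame in DATA_FRAMES.items()
}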
Fixed Processes
(Overall architecture; see Figure 1 of Paper.)
• Ontology Parser: takes the Application Ontology and produces
  – Constant/Keyword Matching Rules
  – a List of Objects, Relationships, and Constraints
  – a Database Scheme
• Constant/Keyword Recognizer: applies the matching rules to an Unstructured Document, producing a Data-Record Table
• Database-Instance Generator: combines the Data-Record Table with the object list, constraints, and Database Scheme to produce a Populated Database
Ontology Parser
(Highlights the Ontology Parser step of Figure 1: the Application Ontology is parsed into its three outputs.)

Constant/Keyword Matching Rules:
Make : \bchev\b
…
KEYWORD(Mileage) : \bmiles\b
KEYWORD(Mileage) : \bmi\.
...

List of Objects, Relationships, and Constraints:
Object: Car;
Car: Year [0:1];
Car: Make [0:1];
...
CarFeature: Car [0:*] has Feature [1:*];
...

Database Scheme:
create table Car (
  Car integer,
  Year varchar(2),
  … );
create table CarFeature (
  Car integer,
  Feature varchar(10));
Constant/Keyword Recognizer
(Highlights the Constant/Keyword Recognizer step of Figure 1: the matching rules are applied to an Unstructured Document to produce a Data-Record Table.)

Unstructured Document:
'97 CHEV Cavalier, Red, 5 spd, only 7,000 miles on her.
Previous owner heart broken! Asking only $11,995. #1415.
JERRY SEINER MIDVALE, 566-3800

Data-Record Table (Descriptor | String | Position(start/end)):
Year|97|1|3
Make|CHEV|5|8
Model|Cavalier|10|17
Feature|Red|20|22
Feature|5 spd|25|29
Mileage|7,000|37|41
KEYWORD(Mileage)|miles|43|47
Price|11,995|108|114
PhoneNr|566-3800|146|153
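A minimal recognizer sketch in Python. This is not the generated wrapper itself: the patterns are abbreviated, Model is omitted (open lexicon), and offsets are zero-based, so positions differ slightly from the slide:

import re

# Hypothetical recognizer: scan the ad with abbreviated patterns and emit
# (descriptor, string, start, end) tuples like the data-record table above.
RULES = {
    "Year":             [r"(?<=')\d{2}\b"],
    "Make":             [r"\bCHEV\b", r"\bCHEVROLET\b"],
    "Feature":          [r"\bRed\b", r"\b[56]\s*spd\b"],
    "Mileage":          [r"\b[1-9]\d?,\d{3}\b"],   # also fires on 11,995; heuristics decide later
    "KEYWORD(Mileage)": [r"\bmiles\b", r"\bmi\."],
    "Price":            [r"(?<=\$)[1-9]\d?,\d{3}"],
    "PhoneNr":          [r"\b\d{3}-\d{4}\b"],
}

def recognize(text):
    hits = []
    for descriptor, patterns in RULES.items():
        for pattern in patterns:
            for m in re.finditer(pattern, text, re.IGNORECASE):
                hits.append((descriptor, m.group(), m.start(), m.end() - 1))
    return sorted(hits, key=lambda h: h[2])

ad = ("'97 CHEV Cavalier, Red, 5 spd, only 7,000 miles on her. "
      "Previous owner heart broken! Asking only $11,995. #1415. "
      "JERRY SEINER MIDVALE, 566-3800")
for descriptor, s, start, end in recognize(ad):
    print(f"{descriptor}|{s}|{start}|{end}")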
Database-Instance Generator
(Highlights the Database-Instance Generator step of Figure 1: the Data-Record Table, the List of Objects, Relationships, and Constraints, and the Database Scheme are combined into a Populated Database.)

insert into Car values(1001, “97”, “CHEV”, “Cavalier”,
  “7,000”, “11,995”, “566-3800”)
insert into CarFeature values(1001, “Red”)
insert into CarFeature values(1001, “5 spd”)
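A hypothetical sketch of this instance-generation step: given a disambiguated data-record table for one ad, emit insert statements like those above. The function name, the fixed attribute lists, and the quoting are illustrative assumptions only:

# Hypothetical sketch of the Database-Instance Generator for one ad. The split
# into functional and nonfunctional attributes comes from the ontology.
FUNCTIONAL = ["Year", "Make", "Model", "Mileage", "Price", "PhoneNr"]
NONFUNCTIONAL = ["Feature"]

def generate_inserts(record_id, records):
    by_attr = {}
    for descriptor, value in records:
        by_attr.setdefault(descriptor, []).append(value)
    # One row in Car, one value per functional attribute (empty if missing).
    car_values = [str(record_id)] + [
        '"%s"' % by_attr.get(attr, [""])[0] for attr in FUNCTIONAL
    ]
    inserts = ["insert into Car values(%s)" % ", ".join(car_values)]
    # One row in CarFeature per feature value.
    for attr in NONFUNCTIONAL:
        for value in by_attr.get(attr, []):
            inserts.append('insert into Car%s values(%d, "%s")' % (attr, record_id, value))
    return inserts

records = [("Year", "97"), ("Make", "CHEV"), ("Model", "Cavalier"),
           ("Feature", "Red"), ("Feature", "5 spd"), ("Mileage", "7,000"),
           ("Price", "11,995"), ("PhoneNr", "566-3800")]
print("\n".join(generate_inserts(1001, records)))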
Heuristics





• Keyword proximity
• Subsumed and overlapping constants
• Functional relationships
• Nonfunctional relationships
• First occurrence without constraint violation
Keyword Proximity
Candidate matches (Descriptor | String | Start | End):
Year|97|2|3
Make|CHEV|5|8
Model|Cavalier|10|17
Feature|Red|20|22
Feature|5 spd|25|29
Mileage|7,000|37|41            (D = 2 to KEYWORD(Mileage))
KEYWORD(Mileage)|miles|43|47
Price|11,995|101|106
Mileage|11,995|101|106         (D = 54 to KEYWORD(Mileage); the closer candidate, 7,000, wins)
PhoneNr|566-3800|140|147

'97 CHEV Cavalier, Red, 5 spd, only 7,000 miles on her.
Previous owner heart broken! Asking only $11,995. #1415.
JERRY SEINER MIDVALE, 566-3800
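A small sketch of the keyword-proximity idea, under the assumption that distance is the character gap between spans; with the spans above it reproduces D = 2 and D = 54 and keeps 7,000. The function names are illustrative:

# Hypothetical keyword-proximity heuristic: when several constants could fill
# the same attribute (e.g. Mileage), keep the one closest to a keyword for
# that attribute, measured as the gap between character spans.
def distance(a, b):
    # Gap between two (start, end) spans; 0 if they touch or overlap.
    return max(b[0] - a[1], a[0] - b[1], 0)

def closest_to_keyword(candidates, keywords):
    """candidates/keywords: lists of (value, start, end); return best candidate."""
    return min(candidates,
               key=lambda c: min(distance((c[1], c[2]), (k[1], k[2])) for k in keywords))

mileage_candidates = [("7,000", 37, 41), ("11,995", 101, 106)]
mileage_keywords = [("miles", 43, 47)]
print(closest_to_keyword(mileage_candidates, mileage_keywords))  # ('7,000', 37, 41)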
Subsumed/Overlapping Constants
Candidate matches (Descriptor | String | Start | End):
Make|CHEV|5|8                  (subsumed by the longer, overlapping match below; discarded)
Make|CHEVROLET|5|13            (kept)
Model|Cavalier|15|22
Feature|Red|25|27
Feature|5 spd|30|34
Mileage|7,000|42|46
KEYWORD(Mileage)|miles|48|52
Price|11,995|101|106
Mileage|11,995|101|106
PhoneNr|566-3800|140|147

'97 CHEVROLET Cavalier, Red, 5 spd, only 7,000 miles.
Previous owner heart broken! Asking only $11,995. #1415.
JERRY SEINER MIDVALE, 566-3800
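A sketch of one way to drop subsumed or overlapping constants: among overlapping candidates for the same descriptor, keep only the longest match. The exact rule in the system may differ; this illustrates the idea with the CHEV/CHEVROLET example:

# Hypothetical subsumed/overlapping-constants heuristic: CHEVROLET at 5..13
# subsumes CHEV at 5..8, so the shorter match is dropped.
def drop_subsumed(matches):
    """matches: list of (descriptor, value, start, end)."""
    kept = []
    for m in sorted(matches, key=lambda m: m[3] - m[2], reverse=True):  # longest first
        if not any(k[0] == m[0] and m[2] <= k[3] and k[2] <= m[3] for k in kept):
            kept.append(m)
    return sorted(kept, key=lambda m: m[2])

matches = [("Make", "CHEV", 5, 8), ("Make", "CHEVROLET", 5, 13),
           ("Model", "Cavalier", 15, 22)]
print(drop_subsumed(matches))
# [('Make', 'CHEVROLET', 5, 13), ('Model', 'Cavalier', 15, 22)]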
Functional Relationships
Candidate matches (Descriptor | String | Start | End):
Year|97|2|3
Make|CHEV|5|8
Model|Cavalier|10|17
Feature|Red|20|22
Feature|5 spd|25|29
Mileage|7,000|37|41
KEYWORD(Mileage)|miles|43|47
Price|11,995|101|106
Mileage|11,995|101|106         (Car [0:1] has Mileage is functional, so at most one Mileage value can be kept)
PhoneNr|566-3800|140|147

'97 CHEV Cavalier, Red, 5 spd, only 7,000 miles on her.
Previous owner heart broken! Asking only $11,995. #1415.
JERRY SEINER MIDVALE, 566-3800
Nonfunctional Relationships
Candidate matches (Descriptor | String | Start | End):
Year|97|2|3
Make|CHEV|5|8
Model|Cavalier|10|17
Feature|Red|20|22              (Car [0:*] has Feature is nonfunctional,
Feature|5 spd|25|29             so every Feature value is kept)
Mileage|7,000|37|41
KEYWORD(Mileage)|miles|43|47
Price|11,995|101|106
Mileage|11,995|101|106
PhoneNr|566-3800|140|147

'97 CHEV Cavalier, Red, 5 spd, only 7,000 miles on her.
Previous owner heart broken! Asking only $11,995. #1415.
JERRY SEINER MIDVALE, 566-3800
First Occurrence without Constraint Violation
Candidate matches (Descriptor | String | Start | End):
Year|97|2|3
Make|CHEV|5|8
Model|Cavalier|10|17
Feature|Red|20|22
Feature|5 spd|25|29
Mileage|7,000|37|41
KEYWORD(Mileage)|miles|43|47
Price|11,995|101|106
Mileage|11,995|101|106
PhoneNr|566-3800|140|147       (first occurrence, kept)
PhoneNr|566-3802|149|156       (a second PhoneNr would violate the constraint; rejected)

'97 CHEV Cavalier, Red, 5 spd, only 7,000 miles on her.
Previous owner heart broken! Asking only $11,995. #1415.
JERRY SEINER MIDVALE, 566-3800, 566-3802
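A combined sketch of the last three heuristics, assuming matches are processed left to right: functional attributes keep only the first value that does not violate Car [0:1] has that attribute, while nonfunctional attributes (Feature) keep every value. Names are illustrative:

# Hypothetical constraint-driven selection over already-filtered matches.
FUNCTIONAL = {"Year", "Make", "Model", "Mileage", "Price", "PhoneNr"}

def apply_constraints(matches):
    """matches: (descriptor, value, start, end) tuples; returns kept (descriptor, value) pairs."""
    chosen, seen = [], set()
    for descriptor, value, start, end in sorted(matches, key=lambda m: m[2]):
        if descriptor in FUNCTIONAL and descriptor in seen:
            continue  # a second value would violate Car [0:1] has <descriptor>
        seen.add(descriptor)
        chosen.append((descriptor, value))
    return chosen

matches = [("Feature", "Red", 20, 22), ("Feature", "5 spd", 25, 29),
           ("PhoneNr", "566-3800", 140, 147), ("PhoneNr", "566-3802", 149, 156)]
print(apply_constraints(matches))
# [('Feature', 'Red'), ('Feature', '5 spd'), ('PhoneNr', '566-3800')]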
Recall & Precision
N = number of facts in source
C = number of facts declared correctly
I = number of facts declared incorrectly
Recall = C / N            (of facts available, how many did we find?)
Precision = C / (C + I)   (of facts retrieved, how many were relevant?)
'97 CHEV Cavalier, Red, 5 spd, only 7,000 miles on her.
Previous owner heart broken! Asking only $11,995. #1415.
JERRY SEINER MIDVALE, 566-3800
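The two measures in code, with illustrative numbers only (not results from the paper):

# Recall and precision as defined on the slide.
def recall(correct, available):
    return correct / available              # C / N

def precision(correct, incorrect):
    return correct / (correct + incorrect)  # C / (C + I)

# Hypothetical counts for a single ad: 9 facts present, 8 declared correctly,
# 0 declared incorrectly.
print(f"recall    = {recall(8, 9):.2f}")     # 0.89
print(f"precision = {precision(8, 0):.2f}")  # 1.00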
Experimental Results
Salt Lake Tribune car ads (Tuning set: 100; Test set: 116)

Attribute    Recall %   Precision %
Year            100         100
Make             97         100
Model            82         100
Mileage          90         100
Price           100         100
PhoneNr          94         100
Extension        50         100
Feature          91          99

(See Table 1 of Paper)
Trouble Spots

• Unbounded sets
  – missed: MERC, Town Car, 98 Royale
  – could use a lexicon of makes and models
• Unspecified variation in lexical patterns
  – missed: 5 speed (instead of 5 spd), p.l (instead of p.l.)
  – could adjust lexical patterns
• Misidentification of attributes
  – classified AUTO in AUTO SALES as automatic transmission
  – could adjust exceptions in lexical patterns
• Typographical errors
  – “Chrystler”, “DODG ENeon”, “I-15566-2441”
  – could look for spelling variations and common typos
Contributions





• Fully automatic technique for wrapper generation
• Uses syntactic, not semantic, constant-recognition techniques
• Adapts readily to different unstructured document formats
• Good precision & recall ratios
• Implemented (Perl, C++, Lex/Yacc, Java)
Limitations


• Works best for data-rich documents, narrow ontological domains
• Ontology creation is still manual
  – Domain expert
  – Trained in our conceptual model & tools
Future Work


• Graphical ontology editor
• Improve automatic record-boundary recognition
  – Make suitable for broader domains (obituaries, university catalogs, etc.)
• Improve heuristics
  – Use a declarative language
  – Employ more of OSM’s rich constraints
Future Work (cont.)

• Add operations to data frames
  – General constraints
  – Canonical representations
  – Inferred information
• Develop ontology libraries
• Finish porting to 100% Java
• Incorporate learning/feedback
• Ontology-enabled agents
Our Web Site



• I have a demo on my laptop
• Can download from our Web site
• BYU Data Extraction Group
  http://osm7.cs.byu.edu/deg
  (See Reference 13 of Paper)
Other Domains



• Job Listings
• Obituaries
• University Course Catalogs
Job Listings Results
Los Angeles Times (Tuning set: 50; Test set: 50)

Attribute   Recall %   Precision %
Degree         100         100
Skill           74         100
Contact        100         100
Email           91          83
Fax             91         100
Voice           79          92

(See Table 2 of Paper)
Obituaries Results
Salt Lake Tribune (Tuning set: ~40; Test set: 38)

Attribute            Recall %   Precision %
Deceased Name          100         100
Age                     91          95
Birth Date             100          97
Death Date              94         100
Funeral Date            92         100
Funeral Address         96          96
Funeral Time            97         100
Interment Address      100         100
Viewing                 93          96
Viewing Date            70         100
Viewing Address         76         100
Beginning Time          88         100
Ending Time             90         100
Relationship            81          93
Relative Name           88          71

(See our forthcoming ER’98 paper for details.)
Obituaries Results
Arizona Daily Star (Tuning set: ~40; Test set: 90)

Attribute            Recall %   Precision %
Deceased Name          100         100
Age                     86          98
Birth Date              96          96
Death Date              84          99
Funeral Date            96          93
Funeral Address         82          82
Funeral Time            92          87
Interment Address      100         100
Viewing                 97         100
Viewing Date           100         100
Viewing Address         95         100
Beginning Time          93          96
Ending Time             95         100
Relationship            92          97
Relative Name           95          74

(See our forthcoming ER’98 paper for details.)