A Survey of WEB
Information
Extraction Systems
Chia-Hui Chang
National Central University
Sep. 22, 2005
Introduction
• Abundant information on the Web
– Static Web pages
– Searchable databases: Deep Web
• Information Integration
– Information for life
• e.g. shopping agents, travel agents
– Data for research purpose
• e.g. bioinformatics, auction economy
Introduction (Cont.)
• Information Extraction (IE)
– is to identify relevant information from
documents, pulling information from a
variety of sources and aggregates it into
a homogeneous form
• An IE task is defined by its input and
output
An IE Task
Web Data Extraction
Data
Record
Data
Record
IE Systems
• Wrappers
– Programs that perform the task of
IE are referred to as extractors or
wrappers.
• Wrapper Induction
– IE systems are software tools that
are designed to generate wrappers.
Various IE Survey
• Muslea
• Hsu and Dung
• Chang
• Kushmerick
• Laender
• Sarawagi
• Kuhlins and Tredwell
Related Work: Time
• MUC Approaches
– AutoSolg [Riloff, 1993], LIEP [Huffman,
1996], PALKA [Kim, 1995], HASTEN
[Krupka, 1995], and CRYSTAL [Soderland,
1995]
• Post-MUC Approaches
– WHISK [Soderland, 1999], RAPIER [califf,
1998], SRV [Freitag, 1998], WIEN
[Kushmerick, 1997], SoftMealy [Hsu,
1998] and STALKER [Muslea, 1999]
Related Work: Automation Degree
• Hsu and Dung [1998]
– hand-crafted wrappers using general
programming languages
– specially designed programming
languages or tools
– heuristic-based wrappers, and
– WI approaches
Related Work: Automation Degree
• Chang and Kuo [2003]
– systems that need programmers,
– systems that need annotation
examples,
– annotation-free systems and
– semi-supervised systems
Related Work:
Input and Extraction Rules
• Muslea [1999]
– IE from free text using extraction
patterns that are mainly based on
syntactic/semantic constraints.
– The second class is Wrapper induction
systems which rely on the use of
delimiter-based rules.
– The third class also processes IE from
online documents; however the patterns
of these tools are based on both
delimiters and syntactic/semantic
constraints.
Related Work: Extraction Rules
• Kushmerick [2003]
– Finite-state tools (regular expressions)
– Relational learning tools (logic rules)
Related Work: Techniques
• Laender [2002]
–
–
–
–
languages for wrapper development
HTML-aware tools
NLP-based tools
Wrapper induction tools (e.g., WIEN, SoftMealy and
STALKER),
– Modeling-based tools
– Ontology-based tools
• New Criteria:
– degree of automation, support for complex objects, page
contents, availability of a GUI, XML output, support for
non-HTML sources, resilience and adaptiveness.
Related Work: Output Targets
• Sarawagi [VLDB 2002]
– Record-level
– Page-level
– Site-level
Related Work: Usability
• Kuhlins and Tredwell [2002]
– Commercial
– Noncommercial
Three Dimensions
• Task Domain
– Input (Unstructured, semi-structured)
– Output Targets (record-level, page-level, sitelevel)
• Automation Degree
– Programmer-involved, learning-based or
annotation-free approaches
• Techniques
– Regular expression rules vs Prolog-like logic
rules
– Deterministic finite-state transducer vs
probabilistic hidden Markov models
Task Domain: Input
Task Domain: Output
• Missing Attributes
• Multi-valued Attributes
• Multiple Permutations
• Nested Data Objects
• Various Templates for an attribute
• Common Templates for various
attributes
• Untokenized Attributes
Classification by Automation Degree
• Manually
– TSIMMIS, Minerva, WebOQL, W4F, XWrap
• Supervised
– WIEN, Stalker, Softmealy
• Semi-supervised
– IEPAD, OLERA
• Unsupervised
– DeLa, RoadRunner, EXALG
Automation Degree
• Page-fetching Support
• Annotation Requirement
• Output Support
• API Support
Technologies
• Scan passes
• Extraction rule types
• Learning algorithms
• Tokenization schemes
• Feature used
A Survey of Contemporary
IE Systems
• Manually-constructed IE tools
– Programmer-aided
• Supervised IE systems
– Labeled based
• Semi-supervised IE systems
• Unsupervised IE systems
– Annotation-free
Semi-Supervised
Un-Labeled
Training
Web Pages
GUI
Un-Supervised
GUI
Wrapper
User
Test
Page
Wrapper
User
Induction
User
Supervised
System
Manual
Extracted
data
Manually-constructed IE
Systems
• TSIMMIS [Hammer, et al, 1997]
• Minerva [Crescenzi, 1998]
• WebOQL [Arocena and Mendelzon,
1998]
• W4F [Saiiuguet and Azavant, 2001]
• XWrap [Liu, et al. 2000]
A Running Example
TSIMMIS
1 [ [ "root", "get('pe1.html')", "#"],
2 [ "Book", "root", "*<body>#</body>"],
3 [ "BookName", "Book", "*</b>#<b>"],
4 [ "Reviews", "Book", "*<ol>#</ol>"],
5 [ "_Reviewer", "split(Reviews, '<li>')", "#"],
6 [ "Reviewer", "_Reviewer[0:0]", "#"],
7 [ "ReviewerName, Rating, T ext", "Reviewer",
8
"*</b>#<b>*</b>#<b>*</b>#*"] ]
(a)
• Each command is of the form:
[variables, source, pattern] where
root complex {
book_name string "Dat abases"
reviews complex {
Reviewer_Name string John
Rating
int
7
T ext
string …
}
}
(b)
– source specifies the input text to be considered
– pattern specifies how to find the text of interest within the source, and
– variables are a list of variables that hold the extracted results.
• Note:
– # means “save in the variable”
– * means “discard”
Minerva
• The grammar used by Minerva is defined in an
EBNF style
P age Book_Reviews
$Book_Reviews: <html><body> $Book </body></html>;
$Book :
<b>Book Name </b> $bName <b> Reviews </b>
[<ol> ( <li><b> Reviewer Name </b> $rName <b>
Rating </b>$rate <b> T ext </b> $text $T P )* </ol>];
$bName :
*(?<b>);
$rName :
*(?<b>);
$rate :
*(?<b>);
$text :
*(?</li>);
$T P : {
$bName, $rName
$rate
$text
}
END
WebOQL
Select [ Z!’.Text]
From x in browse (“pe2.html”)’, y in x’, Z in y’
Where x.Tag = “ol” and Z.Text=”Reviewer Name”
Tag: Body,
Source:
<Body>…</Body>
Text: Book Name …
Tag: <b>
Source:<b>Book
Name</b>
Text: Book Name
Tag: <b>
Source:<b>Reviewer
Name</b>
Text: Reviewer Name
Tag: OL,
Source: <ol>…</ol>
Text: Reviewer Name …
Tag: NOTAG
Tag: <b>
Source: Databases Source:<b>Reviews</b
Tag: LI,
Text: Database
>
Source: <li>…</li>
Text: Reviews
Text: Reviewer
Name …
Tag: NOTAG
Source: …
Tag: NOTAG Tag: <b>
Tag: NOTAGTag: <b> Text: …
Source: John Source:<b>Rating</
Source:<b>Text</b>
Source: 7
Text: John
b>
Text: Text
Text: 7
Text: Rating
W4F
• Wysiwyg support
• Java toolkit
• Extraction rule
– HTML parse tree (DOM object)
• e.g. html.body.ol[0].li[*].pcdata[0].txt
– Regular expression to address finer
pieces of information
Supervised IE systems
• SRV [Freitag, 1998]
• Rapier [Califf and Mooney, 1998]
• WIEN [Kushmerick, 1997]
• WHISK [Soderland, 1999]
• NoDoSE [Adelberg, 1998]
• Softmealy [Hsu and Dung, 1998]
• Stalker [Muslea, 1999]
• DEByE [Laender, 2002b ]
SRV
• Single-slot information extraction
• Top-down (general to specific)
relational learning algorithm
– Positive examples
– Negative examples
• Learning algorithm work like FOIL
– Token-oriented features
– Logic rule
Rating extraction rule:Length(=1),
Every(numeric true),
Every(in_list true).
Rapier
• Field-level (Single-slot) data extraction
• Bottom-up (specific to general)
• The extraction rules consist of 3 parts:
– Pre-filler
– Slot-filler
– Post-filler
Book Title extraction rule:Pre-filler
slot-filler
post-filler
word: Book
Length=2
word=<b>
word: Name
Tag: [nn, nns]
word: </b>
WIEN
• LR Wrapper
– (‘Reviewer name </b>’, ‘<b>’, ‘Rating
</b>’, ‘<b>’, ‘Text </b>’, ‘</li>’)
• HLRT Wrapper (Head LR Tail)
• OCLR Wrapper (Open-Close LR)
• HOCLRT Wrapper
• N-LR Wrapper (Nested LR)
• N-HLRT Wrapper (Nested HLRT)
WHISK
• Top-down (general to specific) learning
• Example
– To generate 3-slot book reviews, it start with
empty rule “*(*)*(*)*(*)*”
– Each parenthesis indicates a phrase to be
extracted
– The phrase in the first set of parenthesis is
bound to variable $1, and 2nd to $2, etc.
– The extraction logic is similar to the LR
wrapper for WIEN.
Pattern:: * ‘Reviewer Name </b>’ (Person) ‘<b>’ * (Digit) ‘<b>Text</b>’(*) ‘</li>’
Output:: BookReview {Name $1} {Rating $2} {Comment $3}
NoDoSE
• Assume the order of attributes within a
record to be fixed
• The user interacts with the system to
decompose the input.
• For the running example
– a book title (an attribute of type string) and
– a list of Reviewer
• RName (string), Rate (integer), and Text (string).
Softmealy
• Finite transducer
• Contextual rules
?/ε
?/next_token
s<b,N>/
“N=”+
next_tokn
b
?/ε
?/ε
s<,R>/
“R=”+
next_tokn
s<N, >
/ε
N
?/next_token
N
R
?/next_token
s<,T>/
“T=”+
next_tokn
R
T
s<R, e>/ ε
s<,R>L ::= HTML(<b>) C1Alph(Rating) HTML(</b>)
s<,R>R ::= Spc(-) Num(-)
s<R,>L ::= Num(-)
s<R,>R ::= NL(-) HTML(<b>)
s<T,e>
/ε
e
Stalker
• Embedded Category Tree
• Multipass Softmealy
Whole document
Name List(Reviewer)
Name
Rate
(a)
T ext
Extraction ru le for Lis t(Re vie we
: r)
SkipT o(<ol>)
SkipT o(</ol>)
Ite ration ru le for List(Re vie we r):
SkipT o(<li>)
SkipT o(</li>)
Extraction ru le for Ratin g:
SkipT o(Rating </b>)
SkpT o(<b>)
(b)
DEByE
• Bottom-up extraction strategy
• Comparison
– DEByE: the user marks only atomic
(attribute) values to assemble nested
tables
– NoDoSE: the user decomposes the
whole document in a top-down fashion
Semi-supervised Approaches
• IEPAD [Chang and Lui, 2001]
• OLERA [Chang and Kuo, 2003]
• Thresher [Hogue, 2005]
IEPAD
• Encoding of the input page
• Multiple-record pages
– Pattern Mining by PAT Tree
• Multiple string alignment
• For the running example
– <li><b>T</b>T<b>T</b>T<b>T</b>T</li>
OLERA
• Online extraction rule analysis
– Enclosing
– Drill-down / Roll-up
– Attribute Assignment
Thresher
• Work similar to OLERA
• Apply tree alignment instead of
string alignment
Unsupervised Approaches
• Roadrunner [Crescenzi, 2001]
• DeLa [Wang, 2002; 2003]
• EXALG [Arasu and Garcia-Molina, 2003]
• DEPTA [Zhai, et al., 2005]
Roadrunner
• Input: multiple
pages with the
same template
• Match two
input pages at
one time
Wrapper (initially)
01: <html><body>
02: <b>
03:
Book Name
04: </b>
05: Databases
06: <b>
07:
Reviews
08: </b>
09: <OL>
10: <LI>
11:
<b> Reviewer Name </b>
12:
John
13:
<b> Rating </b>
14:
7
15:
<b>Text </b>
16:
…
17: </LI>
10: </OL>
11:</body></html>
parsing
String mismatch
String mismatch
String mismatch
String mismatch
tag mismatch
Terminal search match
Wrapper after solving mismatch
<html><body><b> Book Name </b>
#PCDATA<b> Reviews </b>
<OL>
( <LI><b> Reviewer Name </b> #PCDATA
<b> Rating </b> #PCDATA
<b> Text </b> #PCDATA </LI> )+
</OL></body></html>
Sample page
01: <html><body>
02: <b>
03:
Book Name
04: </b>
05: Data mining
06: <b>
07:
Reviews
08: </b>
09: <OL>
10: <LI>
11:
<b> Reviewer Name </b>
12:
Jeff
13:
<b> Rating </b>
14:
2
15:
<b>Text </b>
16:
…
17: </LI>
18: <LI>
19:
<b> Reviewer Name </b>
20:
Jane
21:
<b> Rating </b>
22:
6
23:
<b>Text </b>
24:
…
25: </LI>
26: </OL>
27:</body></html>
DeLa
• Similar to IEPAD
– Works for one input page
• Handle nested data structure
• Example
– <P><A>T</A><A>T</A>
T</P><P><A>T</A>T</P>
– <P><A>T</A>T</P><P><A>T</A>T</P>
– (<P>(<A>T</A>)*T<P>)*
EXALG
• Input: multiple pages with the same
template
• Techniques:
– Differentiating token roles
– Equivalence class (EC) form a template
• Tokens with the same occurrence vector
DEPTA
• Identify data region
– Allow mismatch between data records
• Identify data record
– Data records may not be continuous
• Identify data items
– By partial tree alignment
Comparison
• How do we differentiate template token
from data token?
– DeLa and DEPTA assume HTML tags are
template while others are data tokens
– IEPAD and OLERA leaves the problems to users
• How to apply the information from
multiple pages?
– DeLa and DEPTA conduct the mining from
single page
– Roadrunner and EXALG do the analysis from
multiple pages
Comparison (Cont.)
• Techniques improvement
– From string alignment (IEPAD,
RoadRunner) to tree alignment (DEPTA,
Thresher)
– From full alignment (IEPAD) to partial
alignment (DEPTA)
Task domain comparison
• Page type
– structured, semi-structured or free-text
Web pages
• Non-HTML support
• Extraction level
– Field level, record-level, page-level
Task domain comparison
(Cont.)
• Extraction target variation
– Missing attributes, multiple-value
attributes, multi-order attribute
permutation
• Template variation
• Untokernized Attributes
Manual
Supervi
sed
SemiSupervi
sed
UnSupervis
ed
Extraction Targets Variation
Template Variation
Page
Type
NHS
Extraction
Level
MA/MVA
MOA
Nested
VF
CT
Minerva
Semi-S
Yes
Record Level
Yes
Yes
Yes
Both
No
Yes
TSIMMIS
Semi-S
Yes
Record Level
Yes
No
Yes
Disj
No
No
WebOQL
Semi-S
No
Record Level
Yes
Yes
Yes
Disj
No
No
W4F
Temp
No
Record Level
Yes
Yes
Yes
SP
No
Yes
XWRAP
Temp
No
Record Level
Yes
No
Yes
SP
No
Yes
RAPIER
Free
Yes
Field Level
Yes
--
--
Disj
Yes
No
SRV
Free
Yes
Field Level
Yes
--
--
Disj
Yes
No
WHISK
Free
Yes
Record Level
Yes
Yes
No
Disj
Yes
No
NoDoSE
Semi-S
Yes
Page/Record
Yes
Limited
Yes
No
No
No
DEByE
Semi-S
Yes
Record Level
Yes
Yes
Yes
Disj
No
No
WIEN
Semi-S
Yes
Record Level
No
No
No
No
No
No
STALKER
Semi-S
Yes
Record Level
Yes
Yes
Yes
Both
No
Yes
SoftMealy
Semi-S
Yes
Record Level
Yes
Limited
Multi
Pass
Disj
No
Yes
IEPAD
Temp
No
Record Level
Yes
Limited
Limited
Both
No
Yes
OLERA
Temp
No
Record Level
Yes
Limited
Limited
Both
No
Yes
DeLa
Temp
No
Record Level
Yes
Limited
Yes
Both
No
No
RoadRunner
Temp
No
Page Level
Yes
No
Yes
SP
No
No
EXALG
Temp
Yes
Page Level
Yes
No
Yes
Both
No
No
DEPTA
Temp
No
Record Level
Yes
No
Limited
Disj
No
No
Tools
UTA
Technique-based comparison
• Scan pass
– Single pass vs mutiple pass
• Extraction rule type
– Regular expression vs. logic rules
• Feature used
– DOM tree information, POS tags, etc.
• Learning algorithm
– Machine learning vs pattern mining
• Tokernization schemes
Tools
Scan Pass
Extraction
Rule Type
Features Used
Learning Algorithm
Tokenization
Schemes
Minerva
Single
Regular exp.
HTML tags/Literal words
None
Manually
TSIMMIS
Single
Regular exp.
HTML tags/Literal words
None
Manually
WebOQL
Single
Regular exp.
Hypertree
None
Manually
W4F
Single
Regular exp.
DOM tree path addressing
None
Tag Level
XWRAP
Single
ContextFree
DOM tree
None
Tag Level
RAPIER
Multiple
Logic rules
Syntactic/Semantic
ILP (bottom-up)
Word Level
SRV
Multiple
Logic rules
Syntactic/Semantic
ILP (top-down)
Word Level
WHISK
Single
Regular exp.
Syntactic/Semantic
Set covering (top-down)
Word Level
NoDoSE
Single
Regular exp.
HTML tags/Literal words
Data Modeling
Word Level
DEByE
Single
Regular exp.
HTML tags/Literal words
Data Modeling
Word Level
WIEN
Single
Regular exp.
HTML tags/Literal words
Ad-hoc (bottom-up)
Word Level
STALKER
Multiple
Regular exp.
HTML tags/Literal words
Ad-hoc (bottom-up)
Word Level
SoftMealy
Both
Regular exp.
HTML tags/Literal words
Ad-hoc (bottom-up)
Word Level
IEPAD
Single
Regular exp.
HTML tags
Pattern Mining, String
Alignment
Multi-Level
OLERA
Single
Regular exp.
HTML tags
String Alignment
Multi-Level
RoadRunner
Single
Regular exp.
HTML tags
String Alignment
Tag Level
EXALG
Single
Regular exp.
HTML tags/Literal words
Equivalent Class and
Role Differentiation
Word Level
DeLa
Single
Regular exp.
HTML tags
Pattern Mining
Tag Level
GUI
support
PageFetching
support
Output Support
Training
Examples
API.
Support
Minerva
No
No
XML
No
Yes
TSIMMIS
No
No
Text
No
Yes
WebOQL
No
No
Text
No
Yes
W4F
Yes
Yes
XML
Labeled
Yes
XWRAP
Yes
Yes
XML
Labeled
Yes
RAPIER
No
No
Text
Labeled
No
SRV
No
No
Text
Labeled
No
WHISK
No
No
Text
Labeled
No
NoDoSE
Yes
No
XML, OEM
Labeled
Yes
DEByE
Yes
Yes
XML, SQL DB
Labeled
Yes
WIEN
Yes
No
Text
Labeled
Yes
STALKER
Yes
No
Text
Labeled
Yes
SoftMealy
Yes
Yes
XML, SQL DB
Labeled
Yes
IEPAD
Yes
No
Text
Unlabeled
No
OLERA
Yes
No
XML
Unlabeled
No
RoadRunner
No
Yes
XML
Unlabeled
Yes
EXALG
No
No
Text
Unlabeled
No
DeLa
No
Yes
Text
Unlabeled
Yes
Tools
Conclusion
• Criteria for evaluating IE systems
from the task domain
• Comparison of IE systems from
various automation degree
• The use of various techniques in IE
systems
Future Work
• Page Fetching
– XWrap, W4F, WNDL
• Schema Mapping
– Full information
– Partial information
• Query Interface Integration
References
• C.-H. Chang, M. Kayed, M. R. Girgis, K. Shaalan,
A survey of Web Information Extraction Systems.
Descargar

Comparison of IE Survey