Schema Matching Algorithms
Phil Bernstein
CSE 590sw
February 2003
Acknowledgments
Many of these slides are from
presentations by
– Erhard Rahm, Hong Hai Do, and
Sergey Melnik (Univ. of Leipzig)
2
Mapping Schemas
Given two schemas, return an expression
that translates instances of one schema
into instances of the other (i.e., performs
data translation).
Applications
–
–
–
–
Web site integration
Catalog integration
Schema evolution
Data translation
–
–
–
–
Reverse engineering
Data warehouse loading
XML message translation
Ontology integration
3
Partitioning the Problem
Schema matching (aka mapping discovery)
– Given two schemas, return a set of correspondences that specify pairs of related terms
Semantic Mapping (aka query discovery)
– Given correspondences between two schemas,
return an expression that translates instances of
one schema into instances of the other (i.e.,
performs data translation).
4
The Schema Matching Problem

Types of schemas:


Database schemas, XML schemas, ontologies, …,
Input:




Phils’ mods are in underscored Times Roman
Two (or more) schemas, S1 and S2
Possibly data instances for S1 and S2
Background knowledge – thesauri, validated matches,
constraints (keys, data types), standard schemas, ontologies,
NLP, etc.
Output:

A mapping between S1 and S2
Used by permission of Erhard Rahm, Hong Hai Do, and Sergey Melnik
5
Generic Match Implementation
Tool 1
(Portal
schemas)
Tool 2
(E-Business
schemas)
Tool 3 (Data
Warehousing
schemas)
Tool 4
(Database
Design)
Schema import/ export
Global
libraries
(dictionaries,
schemas …)
Generic Match
Implementation
Internal Schema
Representation
Used by permission of Erhard Rahm, Hong Hai Do, and Sergey Melnik
6
Match Example 1
Yahoo! Shopping
Electronics and Photography
Camcorders
DV
Computers and Software
Digital Cameras
PDAs and Handhelds
Used by permission of Erhard Rahm, Hong Hai Do, and Sergey Melnik
Epinions.com
Electronics
Video
Camcorders
Remote Controls
Home
Computers & Internet
Computer Hardware
PDAs
7
Match Example 2
CREATE TABLE PurchaseOrder.Customer (
custNo
INT,
custName VARCHAR(50),
custStreet VARCHAR(50),
custCity
VARCHAR(50),
custZip
VARCHAR(10),
PRIMARY KEY (custNo) ) ;
CREATE TABLE PurchaseOrder.ShipTo (
poNo
INT,
custNo
INT REFERENCES PO1.Customer,
shipToStreet VARCHAR(50),
shipToCity VARCHAR(50),
shipToZip VARCHAR(10),
PRIMARY KEY (poNo) ) ;
<xsd:schema xmlns:xsd="http://www.w3.org/XMLSchema">
<xsd:complexType name=“POrder" >
<xsd:sequence>
<xsd:element name=“DeliverTo" type="Address"/>
<xsd:element name=“BillTo" type="Address"/>
</xsd:sequence>
</xsd:complexType>
<xsd:complexType name="Address" >
<xsd:sequence>
<xsd:element name=“Street" type="xsd:string"/>
<xsd:element name=“City" type="xsd:string"/>
<xsd:element name=“Zip" type="xsd:decimal"/>
</xsd:sequence>
</xsd:complexType>
</xsd:schema>
a) A relational schema and an XML schema
POrder
PurchaseOrder
Customer
custName
custStreet
custCity
Legends:
ShipTo
custZip
Node
Containment link
shipToStreet
DeliverTo
Address
shipToCity
shipToZip
BillTo
City
b) Their corresponding graph representation
Used by permission of Erhard Rahm, Hong Hai Do, and Sergey Melnik
Street
Zip
8
Tool Example (Biztalk Mapper)
Used by permission of Erhard Rahm, Hong Hai Do, and Sergey Melnik
9
Current Situation

Finding mappings is now the bottleneck!



Will only get worse





largely done by hand
labor intensive, tedious & error prone
data sharing & XML become pervasive
proliferation of DTDs and XML schemas
translation of legacy data
reconciling ontologies on semantic web
Need semi-automatic approaches to scale up!
Used by permission of Erhard Rahm, Hong Hai Do, and Sergey Melnik
10
Why Matching is Difficult

Aims to identify same real-world entity


using names, structures, types, data values, etc.
Schemas represent same entity differently

different names => same entity (synonyms):



client & user => customer
same names => different entities (homonyms):
 bug => insect or software error
Schema & data never fully capture semantics!

not adequately documented, not sufficiently expressive

data values suffer from synonyms and homonyms too

Intended semantics is typically subjective!

Cannot be fully automated. Often hard for humans.
Used by permission of Erhard Rahm, Hong Hai Do, and Sergey Melnik
11
Desiderata for Match Solution
Low degree of manual work
 Accuracy, efficiency, ease of use
 Extensibility




Support for Reuse


Exploit additional match techniques
Exploit additional background knowledge
Exploit previous manual or automatically generated
matchings
Generic approach


Different schema languages
Different application areas
Used by permission of Erhard Rahm, Hong Hai Do, and Sergey Melnik
12
Special Situations
Match schema S to an incremental
modification of S
– Can ignore homonyms and possibly synonyms
– Little if any reshaping of the structure
– Instances probably don’t help
Lightweight integration for the semantic web vs
E-commerce or data warehouse loading.
– The former can’t afford much human review
– The latter needs “perfect” mappings and hence
13
Automatic Match Approaches

Individual approaches
Schema-based
Element
Structure
Instance-based
Element
Linguistic
Constraintbased
Constraintbased
Linguistic
• Names
• Descriptions
• Types
• Keys
• Parents
• Children
• Leaves
• IR (word
frequencies,
key terms)

Reuse-oriented
Element
Structure
• Dictionaries
• Thesauri
• Previous match
results
Constraintbased
• Value pattern
and ranges
Combining approaches: hybrid vs. composite
Used by permission of Erhard Rahm, Hong Hai Do, and Sergey Melnik
14
Match Quality Measures

Comparison of automatically with manually (i.e. real)
derived match correspondences
Real matches
A
Suggested matches
B
A: False Negatives
B: True Positives
C: False Positives
D: True Negatives
C
D

Quality measures:
Precision 
SimilarityFlooding [ICDE02]: Overall  1 

B
Recall 
BC
AC
A B

B
A B
B C
1


 Recall *  2 

Precision
A B


Overall : post-match effort to add missed (A) and to remove
false matches (C): negative Overall  no gain
Used by permission of Erhard Rahm, Hong Hai Do, and Sergey Melnik
15
UW & MSR Contributions
LSD [SIGMOD ]
– Learning algorithm based on structures and
instances
– AnHai Doan, Pedro Domingos, Alon Halvey
Cupid [VLDB 02]
– Structure matching
– Jayant Madhavan, Phil Bernstein
Glue [WWW 03]
– Taxonomy matching. Uses relaxation labeling.
– An Hai Doan, Jayant Madhavan, Pedro
Domingos,
Alon Halevy
Mapping Knowledge Base (MKB) [submitted]
16
Where’s the Research Action?
There’s always room for new techniques
– Compare distribution of data values between
elements of two schemas
– Create a global schema by clustering
elements from different sample schemas
Re-using mappings
Combining techniques
Better user interfaces
17
Descargar

Document