A Survey of Approaches to
Automatic Schema Matching
Name: Samer Samarah
Number: 3740535
Email: [email protected]
This Presentation based on the following
paper:
E. Rahm, P. Bernstein,"A Survey of
Approaches to Automatic Schema
Matching“, VLDB Journal, 10(4): 334-350
(2001).
1
Presentation Outlines






Definition of Schema Matching.
Schema Matching problem.
Applications of Schema Matching.
Schema matching approaches.
Personal contribution.
Conclusion.
2
Schema Matching: Definition
 Schema Matching: is the process of finding
semantic correspondence between
elements of two schemas.
 Schema Matching is achieved through
Match operation.
 Match Operator is a function that takes
two schemas as input and returns a
mapping between those two schemas as
output, called the match result.
Match(S1,S2)  Match Result.
3
The Match operator
Match(S1,S2)  Match Result
 The schema (either S1 or S2) defined to be a
set of elements connected by some structure
(ER model, OO model,..).
 The Match Result is a set of mapping elements,
each of which indicate that certain elements of
S1 are mapped to certain elements of S2,
expressed by(  ).
 A mapping expressions, which specifies how the
S1 and S2 elements are related, may be
associated with the mapping elements.
4
The Match Operator
5
Schema Matching Problem
 Schema Matching currently performed
manually which makes it:
 Error-prone.
 Tedious.
 Time-consuming.
 So, the solution is to automate the match
function, but
 There is no mathematical model that capture the
matching process.
 Application dependent.
6
Schema Matching Applications





Schema Integration.
Data Warehouses.
E-commerce.
Semantic Query Processing.
Data Integration system
7
1-Schema Integration
 Is the process of constricting a global
view from a set of independently
developed schema.
 Schema integration achieved through
identifying the interschema
relationships (applying schema
matching) then unify these matched
elements.
8
Schema Matching
Example
Example:
S1
{SNam,….}
S2
{FName,LName,
…}
.
.
Sn
Match + Unify
Sg
{Name,…}
Determining the correspondence
between SName and Fname,Lname
achived through the match
operation.
9
2-Data Warehouses
 DWH is a decision support database that is
extracted from a set of data sources.
 DWH and sources represent data in
different format (i.e. relations or XML
versus multidimensional view)
 Constructing DWH require transform data
from its format into DWH format.
 match operation can be used to identify
those elements in the sources that are
represented in the DWH, according to this
mapping appropriate transformation can be
designed.
10
3- E-commerce
 Trading partners exchange messages
that describes business transactions.
 Each partners uses its own message
format (EDI, XML,..) or different
message schema.
 In order to exchange messages, there
is a need to translate the messages
to the format required by different
partners (Matching problem).
11
4- Semantic Query Processing
 A run time scenarios where the user
specifies the output of the query (in terms
of concepts familiar to him, which may be
not the same concepts presented in the DB
{the Select Clause}), and the system
figures out how to produce that output (i.e.
the from and where clauses in SQL).
 The match operation is used to determent
the mapping between the user concepts
and the DB concepts).
12
4-Semantic Query Processing
(Example)
Example
All employees earn more than 2000$
Applying Match
Producing New Query in SQL
format:
{Select FName, LName
From Employee
Where Salary >2000}
Mapping elements
SQL Q
Emplyee(FName,LName,Salary)
Output
13
5- Data Integration System
 The major component of data
integration system is the source
description.
 Source description maps the sources
schema to the mediated schema.
 Match operation applied in order to
specify this mapping.
14
Generic Match Architecture
 Schemas to be matched are represented
in a uniform internal representation.
 Importer translates input schemas from
their native representation into the
internal representation.
 Exporter translates the match result
produced by the match from the internal
representation into the representation
required by each tool.
15
Generic Match Architecture
Tool 1
(Portal Schemas)
Tool 2
(E-business Schemas)
Tool 3
(DWH Schemas)
Tool 4
(DB Schemas)
Schema import / export
Global Libraries
(dictionaries,
schemas,..)
Generic Match Implementation
Internal Schema Representation
16
Classification of schema
matching approaches
 Instance Vs schema: consider instance data or
schema information.
 Element Vs structure: matching performed for
individual schema element (attribute), or for
combinations of elements (structure).
 Language Vs constraint: use linguistic information
(names and textual description) or constraint
information (key, relationship)
 Matching cardinality: the overall match result may
relate one or more elements of one schema to one or
more elements of the other (1:1, 1:n, m:n).
 Auxiliary information: the use of auxiliary information
(dictionaries, pervious matching results, user input,..)
17
Classification of schema
matching approaches
Schema Matching Approaches
Individual Matcher
Schema Based
Element Level
Linguistic
Combine Matcher
Instance Based
Structure Level
Constraint
Constraint
Hybrid
Matcher
Element Level
Linguistic
Composite Matcher
Manual
Automatic
Constraint
18
Schema level matcher
 Consider only schema information, not
instance data, such as:






Name
Description
Data Type
Relationship (is-a, part-of)
Constraints
Schema structure
 Multiple match candidates could be
founded, each of which assigned with a
similarity degree.
19
Granularity of match
 Element level matching: for each
element in the first schema, determines
the matching elements in the second
schema.
 Structural level matching: matching
combinations of elements that appear
together in a structure.
 Could be fully match or partial match.
20
Granularity of match
Example
S1 Elements
S2 Elements
Match Granularity

Address
CustomerAddress
Address  CustomerAddress (S.L full Match)
Street
City
Street
State
USState
State

ZIP
PostalCode
Zip
PostalCode (E.L)
AccountOwner
Name
Address
Birthdate

Customer


City (E.L)
USState (E. L)
AccountOwner  Customer (S.L partial Match)

Cname
Name
CAddress
Address
CName (E.L)

CAddress (E.L)
CPhone
TaxExempt
21
Linguistic approaches
 Linguistic matchers use names and
text (word or sentence) to find
semantically similar schema elements:
 Name Matching
 Description matching
22
Name Matching
 Name based matching matches schema
elements with equal names or similar names.
 Equality of names.
 Equality of canonical name representation
after stemming and preprocessing.
 Equality of synonyms.
 Equality of hypernyms (X is a hypernym of Y
if Y is a kind of X.
 Similarity based on common substring, edit
distance, soundex.
 User- provided name matches.
 Thesauri or dictionary should be exploited.
23
Name Matching
Example: two schema S1, S2 represent two automobile suppliers


S1 Elements
S2 Elements
Matching Based on
CarID
TruckID
Car  Truck (Hypernyms)
Car is an automobile and truck
is an automobile.
Brand
Make
Brand
Price
Price
Price  Price (Equality of Names)
Sold2
SoldTo
CustomerAddress
CAddress  CustomerAddress
(Preprocessing)
SoldTO
CAddress


Make (Synonyms)

Sold2 (Soundex)
24
Constraint- based approaches
 Exploit constraints information
associated with the input schemas to
determine the similarity of schema
elements.




Data types and domain constraints
Key characteristics (primary, unique)
Relationship cardinality
Structural constraints such as foreign key
(used by structural matches approaches)
25
Constraint- based approaches
S1 Elements
S2 Elements
Employee
Personal
EmpNo {int, PK}

EmpName {String}
Matching
Pno {int, PK}

Born  Birthdate {Type}
Pno  EmpNo| DeptNo {Key}
Pname {string}
Pname

DeptName {Type}
Several match
candidate could
results so this
approach could
be sued to
limit the
number of
candidate.
Pname  EmpName {Type}
Dept  DeptName {Type}
Dept
DeptNo {int, ref dep}
salary {single}
BirthDate {date}
Department
DeptNo {int,PK}
DeptName {String}
Dept {String}

Born {date}

EmpName
{Type}
S2.Personal {Pno, Pname, Dept, born} 
Select S1.Employee.EmpNo,
S1.Employee.EmpName,
S1.Department.DeptName,
S1.Employee.BirthDate
From S1.Employee, S1.Department
Where (S1.Employee.DeptNo =
S1.Department.DeptNo)
Note: Structural Matching
26
Description Matching
 Based on linguistic evaluation for the
comment associated with schema
elements.
Example:
S1: empn // employee name
S2: name // name of employee
NL Understanding
technology
Empn name
27
Reusing Schema and mapping
information
 This approach support and exploit the
reuse of common schema components
and previously determined mapping.
 Useful when matching applied to
different but similar schemas to the
same destination schemas.
28
Reusing Schema and mapping
information
EX:
Schema S1
Schema S2
Schema S
Purchase order2
Purchase order
POrder
Product
Product
Article
BillTo
BillTo
Payee
Name
Name
BillAddress
Address
Address
Recipient
ShipTo
ShipTo
Name
Name
Address
Address
ContactPhone
ShipAddress
Contact
Name
Goal: Mapping S1 to S
Address
matching result between S2
and S are previously
determent and can be
reused to map S1 to S
29
Match cardinality
 Global cardinality: how many mapping
elements S1 or (S2) elements can
participate in the matching results.
 Local cardinality: how many elements in S1
match how many elements in S2 within
individual mapping element.
 Most of approaches restricted to 1:1 local
and 1:1 or 1:n global cardinality.
30
Match cardinality
Example:
Local Match
Cardinality
S1
elements
S2
elements
Matching Expression
1
1:1 element level
Price
Amount
Amount = Price
2
n:1 element level
Price,Tax
Cost
Cost = Price* (1 +
Tax/100)
3
1:n element level
Name
FirstName
LastName
FirstName, LastName =
Extract(Name,..)
4
n:1 structure level
(n:m element level)
B.Title
B.PuNo
P.PuNo
P.Name
A.Book
A.Publisher
A.Book, A.Publisher =
Select B.Title, P.Name
From B.P
Where B.PuNo =P.PuNo
Price has
1:n Global
Cardinality
31
Instance level approaches
 Consider data contents.
 Useful when schema information limited (or
no schema at all).
 Enhance schema matching by considering
elements whose instances are more similar.
 Linguistic approach based on IR techniques for
text elements.
 Constraint based such as value range and
average for numeric element.
32
Instance level approach
Example:
EmpNo
Dept
SSN
Works for
234
Marketing
230
Accounting
235
Accounting
229
Marketing
236
Marketing
228
Marketing
{Dept ≈ works for} (based on “Marketing” Frequency)
{EmpNo ≈ SSN} (based on value range)
33
Combining matchers
 Combine several approaches to achieve good match
candidates.


Hybrid Matcher: combine several matching approaches
to determine match candidate based on multiple criteria
(name, type constraints).
 More effective (poor candidates filtered out early)
 Better performance (reducing number of pass over
the schema).
Composite matcher: combines the results of several
independently executed matchers, including hybrid
matchers.
 More flexible than hybrid matchers, (allow us to
select from set of matchers).
 The combination of results could be automatic or
manually).
34
personal contribution
 A prototype Datalog program designed to
implement a composite matcher.
 The program was tested on DES system (
Datalog Educational System) available at:
http//www.fdi.ucm.es/professor/fernan/DES/
 The program takes advantage of linguistic
based approach over constraint approach.
35
Database Description
 The DataBase contains the following
predicates:
 Source(ElementName, DataType,
constraints,…). // to describe sources
 Dictionary(Name1,Name2)// a dictionary
to provides synonyms.
36
The Program Rules

















cand1(N,N)
cand2(N,N1)
ok1(N)
ok1(N1)
ok2(N)
ok2(N1)
cand3(N,N1)
cand4(N,N1)
match(N,N1)
match(N,N1)
match(N,N1)
match(N,N1)
match(N,N1)
:- s1(N,D,C), s2(N,DataType,Constraint).
:- s1(N,D,C), s2(N1,DataType,Constraint), d(N,N1).
:- cand1(N,N1).
:- cand1(N,N1).
:- cand2(N,N1).
:- cand2(N,N1).
:- s1(N,D,C),s2(N1,D,Constraint),
not(ok1(N)),not(ok2(N1)),
not(ok1(N1)),not(ok2(N)).
:- s1(N,DataType,C),s2(N1,D,C),
not(ok1(N)),not(ok2(N1)),
not(ok1(N1)),not(ok2(N)).
:- cand1(N,N1).
:- cand2(N,N1).
:- cand3(N,N1),cand4(N,N1).
:- cand3(N,N1),not(cand4(N,N1)).
:- cand4(N,N1),not(cand3(N,N1)).
37
EXAMPLE
 The program was tested on the following schemas.
s1(eNo,integer,pk).
s1(city,string,20).
s1(street,string,30).
s1(state,string,12).
s2(eID,integer,pk).
s2(cName,string,15).
s2(street,string,10).
s2(province,string,25).
d(state, province).
38
Results
39
Conclusion
 A taxonomy for schema approaches was
presented, in order to compare between
different approaches to schema matching.
 The generic implementation described in
the paper, could be base for any new
implantation.
 Different techniques could be used in order
to automate schema matching, such as
Natural language, machine learning,IR ...
 Having full automated matching (without
user interaction ) not achieved yet.
40
Thank You
41
References
 E. Rahm, Ph.A. Bernstein,"A Survey of Approaches to
Automatic Schema Matching“, VLDB Journal, 10(4): 334350 (2001).
 DES system ( Datalog Educational System) available at:
http//www.fdi.ucm.es/professor/fernan/DES/.
 Jayant Madhavan, Philip A.Bernstein, Erhard Rahm,
“Genric Schema Matching with Cupid”, Proceedings of
27th VLDB conference, Roma, Italy,2001.
 Hong-Hai Do, Sergey Melnik, Erhard Rahm, “Comparison
of Schema Matching Evaluations”, University of Leipzig
Augustusplatz 10-11, 04109, Leipzig, Germany.
 AnHai Doan, Pedro Domigos, Alon Levy, “Learning
Source Description for Data Integration”, University of
Washington, Sattle, WA 98195
42
Appendix
1- Cupid (Microsoft Research)
2- LSD (Learning Source Description)
43
1- Cupid (Microsoft Research)
 A hybrid matcher based on element and
structure matching.
 What is new in this approach, that the
schemas represented as a graph which
encode the referential constraints into
structure the can be matched just like other
structures.
 The algorithm has two phase:
 Linguistic matching: matches individual schema
elements based on their names, data types,
domains,..
 Structural matching: matches schema elements
based on the similarity of their context.
44
2- LSD (Learning Source
Description)
 Composite matcher, with autonomic combination of
match results.
 An attempt to automate the mappings between
source schemas and mediated schema in data
integration system.
 Uses machine learning techniques to match a new
data source against previously determined global
schema.
 After a set of data sources have been manually
mapped to a mediated schema, the system should be
able to glean significant information from these
mapping for subsequent data sources.
45
Descargar

Document