CSE 636
Data Integration
Data Integration Approaches
Fall 2006
Virtual Integration Architecture
• Leave the data in the sources
• When a query comes in:
– Determine the relevant sources to the query
– Break down the query into sub-queries for the sources
– Get the answers from the sources, filter them if needed
and combine them appropriately
• Data is fresh
• Otherwise known as
On Demand Integration
2
Virtual Integration Architecture
Run-Time
Design-Time

Mapping Tool
Mediation
Language
Mediator
Web Services
1
Query
Reformulation
Query
Result
End User
Global
Schema
Optimization
& Execution
XML
Wrapper
Local
Data
Schema
Source
Wrapper
Local
Data
Schema
Source
3
Virtual Integration Architecture
Run-Time
Design-Time

Mapping Tool
Mediation
Language
Mediator
2 Web Services
1
Query
Reformulation
Query
Result
End User
Global
Schema
Optimization
& Execution
XML
Wrapper
Local
Data
Schema
Source
Wrapper
Local
Data
Schema
Source
4
Virtual Integration Architecture
Run-Time
Design-Time

Mapping Tool
3
Mediation
Language
Mediator
2 Web Services
1
Query
Reformulation
Query
Result
End User
Global
Schema
Optimization
& Execution
XML
Wrapper
Local
Data
Schema
Source
Wrapper
Local
Data
Schema
Source
5
Virtual Integration Architecture
Run-Time
Design-Time
4
Mapping Tool
3
Mediation
Language

Mediator
2 Web Services
1
Query
Reformulation
Query
Result
End User
Global
Schema
Optimization
& Execution
XML
Wrapper
Local
Data
Schema
Source
Wrapper
Local
Data
Schema
Source
6
Virtual Integration Architecture
Run-Time
Design-Time
4
Mapping Tool
3
Mediation
Language

Mediator
2 Web Services
1
5
Query
Result
End User
Global
Schema
Query
Reformulation
Optimization
& Execution
XML
Wrapper
Local
Data
Schema
Source
Wrapper
Local
Data
Schema
Source
7
Virtual Integration Architecture
Run-Time
Design-Time
4
Mapping Tool
3
Mediation
Language

Query
Result
End User
Mediator
2 Web Services
1
Global
Schema
5
Query
Reformulation
6
Optimization
& Execution
XML
Wrapper
Local
Data
Schema
Source
Wrapper
Local
Data
Schema
Source
8
Virtual Integration Approaches
Dimensions to Consider:
• How many sources are we accessing?
• How autonomous are they?
• Meta-data about sources?
• Is the data structured?
• Queries or also updates?
• Requirements: accuracy, completeness,
performance, handling inconsistencies.
• Closed world assumption vs. open world?
9
Mediation Languages
Global Schema
CD
ASIN
Title
Genre
…
Artist
ASIN
Name
…
Logic
CDs
Books
CDCategories
BookCategories
Album
ASIN
Price
DiscountPrice
Studio
ASIN
Category
Title
ISBN
Price
DiscountPrice
Edition
ISBN
Category
Authors
ISBN
FirstName
LastName
Artists
ASIN
ArtistName
GroupName
10
Desiderata from Source Descriptions
• Expressive power: distinguish between sources
with closely related data. Hence, be able to
prune access to irrelevant sources.
• Easy addition: make it easy to add new data
sources.
• Reformulation: be able to reformulate a user
query into a query on the sources efficiently
and effectively.
11
Source Descriptions
Elements of source descriptions:
• Contents: source contains movies, directors,
cast.
• Constraints: only movies produced after 1965.
• Completeness: contains all American movies.
• Capabilities:
– Negative: source requires movie title or
director as input
– Positive: source can perform selections, joins,
…
12
Approaches to Specification of Source
Descriptions
• Global-as-View (GAV):
Mediator relation defined as a view over source
relations
Ex: TSIMMIS (Stanford), HERMES (Maryland).
• Local-as-View (LAV):
Source relation defined as view over mediator
relations
Ex: Information Manifold (AT&T), Tukwila(UW),
InfoMaster (Stanford).
• GLAV: combines both (Friedman & Millstein
1999)
13
Approaches to Specification of Source
Descriptions
Q
Global
MediatorSchema
Schema
Mediated
GAV
LAV
Q’
Q’
GLAV
Q’
Q’
Q’
Source
Source
Source
Source
Source
Local
Schema
Local
Schema
Local
Schema
Local
Schema
Local
Schema
14
Global-as-View (GAV)
Global Schema:
Movie(title, dir, year, genre)
Schedule(cinema, title, time)
Integrating View of Movie:
SELECT * FROM S1
[S1(title,dir,year,genre)]
union
SELECT * FROM S2
[S2(title,dir,year,genre)]
union
SELECT S3.title, S3.dir, S4.year, S4.genre
FROM S3, S4
[S3(title,dir),
WHERE S3.title = S4.title
S4(title,year,genre)]
15
Global-as-View: Example 2
Global Schema:
Movie(title, dir, year, genre)
Schedule(cinema, title, time)
Integrating View of Schedule:
SELECT title, dir, year, NULL
FROM S1
[S1(title,dir,year)]
union
SELECT title, dir, NULL, genre
FROM S2
[S2(title,dir,genre)]
16
Global-as-View: Example 3
Global Schema:
Movie(title, dir, year, genre)
Schedule(cinema, title, time)
Integrating Views:
SELECT NULL, NULL, NULL, genre
FROM S4
[S4(cinema, genre)]
SELECT cinema, NULL, NULL
FROM S4
[S4(cinema, genre)]
17
Global-as-View (GAV): Example 4
Global Schema:
MovieActor(title,actor)
MovieReview(title, review)
Integrating Views:
MovieActor(title,actor) ← S1(id,title,actor,year)
union
MovieActor(title,actor) ←
S2(title,director,actor,year)
MovieReview(title, review) ←
S1(id,title,actor,year), S3(id,review)
18
Query Reformulation in GAV
Query reformulation= rule unfolding+simplification
Query: Find reviews for ‘DeNiro’ movies
q(title,review) :- MovieActor(title,‘DeNiro’),
MovieReview(title,review)
1. q’(title,review) :- S1(id,title,‘DeNiro’, year),
S3(id,review)
2. q’(title,review) :S2(title,director,‘DeNiro’,year),
S1(id,title, ‘DeNiro’,year), S3(id,review)
19
Global-as-View Summary
• Query reformulation boils down to view
unfolding.
• Very easy conceptually.
• Can build hierarchies of global schemas.
• You sometimes loose information. Not always
natural.
• Adding sources is hard. Need to consider all
other sources that are available.
20
Local-as-View (LAV)
Create
SELECT
FROM
WHERE
AND
View R1 AS
B.ISBN, B.Title, A.Name
Book B, Author A
A.ISBN = B.ISBN
B.Year < 1970
Mediator
Create
SELECT
FROM
WHERE
View R5 AS
B.ISBN, B.Title
Book B
B.Genre = ‘Humor’
Global Schema
Book
Books before 1970
ISBN
Title
Genre
Year
Author
Humor Books
ISBN
Name
Mediated Schema
Source
1
Source
2
Source
3
Source
4
Source
5
Local Schema
Local
Schema
Local
Schema
Local
Schema
Local Schema
R1
ISBN
Title
Name
R5
ISBN
Title
21
Reformulation Problem
Given:
• A query Q posed over the global schema
• Descriptions of the data sources
Find:
• A query Q’ over the data source relations, such
that:
– Q’ provides only correct answers to Q, and
– Q’ provides all possible answers to Q given the sources.
22
Query Reformulation
Query: Find authors of humor books
Plan: R1 Join R5
Mediator
Global Schema
Book
Books before 1970
ISBN
Title
Genre
Year
Author
Humor Books
ISBN
Name
Mediated Schema
Source
1
Source
2
Source
3
Source
4
Source
5
Local Schema
Local
Schema
Local
Schema
Local
Schema
Local Schema
R1
ISBN
Title
Name
R5
ISBN
Title
23
Query Reformulation
Query: Find authors of humor books after 1970
Plan: Can’t do it!
Mediator
Global Schema
Book
Books before 1970
ISBN
Title
Genre
Year
Author
Humor Books
ISBN
Name
Mediated Schema
Source
1
Source
2
Source
3
Source
4
Source
5
Local Schema
Local
Schema
Local
Schema
Local
Schema
Local Schema
R1
ISBN
Title
Name
R5
ISBN
Title
24
Local-as-View: Example 1
Global Schema:
Movie(title, dir, year, genre)
Schedule(cinema, title, time)
Source Views:
Create Source S1 AS
[S1(title, dir, year, genre)]
SELECT * FROM Movie
Create Source S3 AS
[S3(title, dir)]
SELECT title, dir FROM Movie
Create Source S5 AS
[S5(title, dir, year)]
SELECT title, dir, year
FROM Movie
WHERE year > 1960 AND genre=‘Comedy’
25
Local-as-View: Example 2
Global Schema:
Movie(title, dir, year, genre)
Schedule(cinema, title, time)
Source Views:
Create Source S4
[S4(cinema, genre)]
SELECT cinema, genre
FROM Movie M, Schedule S
WHERE M.title=S.title
26
Local-as-View (LAV): Example 3
Global Schema:
Movie(title,year,director,genre)
American(director)
MovieReview(title, review)
Source Views:
S1(title, year, director)→
Movie(title,year,director,genre),
American(director), year ≥1960,
genre= ‘Comedy’
S2(title, review)→Movie(title,year,director,genre),
year≥1990, MovieReview(title, review)
27
Query Reformulation in LAV
• Query: Reviews for comedies produced after
1950
q(title,review) :- Movie(title,year,director,’Comedy’),
year ≥1950, MovieReview (title,review)
• Reformulated query:
q’(title,review) :- S1(title,year,director), S2(title, review)
S1(title, year, director)→
Movie(title,year,director,genre), American(director),
year ≥1960, genre= ‘Comedy’
S2(title, review)→Movie(title,year,director,genre),
year≥1990, MovieReview(title, review)
28
Local-as-View Summary
• Very flexible. You have the power of the entire
query language to define the contents of the
source.
• Hence, can easily distinguish between contents
of closely related sources.
• Adding sources is easy: they’re independent of
each other.
• Query reformulation: answering queries using
views!
29
LAV vs. GAV
See [Ullman,ICDT-1997] for a detailed
comparison.
• Local-as-View:
– Easier to add sources: specify the query
expression.
– Easier to specify constraints on contents of the
sources: they are part of the query expression
describing them.
• Global-as-View:
– Easier query reformulation
• GLAV combines both (Friedman & Millstein
1999)
30
The General Problem
• Given a set of views V1,…,Vn, and a query Q,
can we answer Q using only the answers to
V1,…,Vn?
• Many, many papers on this problem
• The best performing algorithm:
The MiniCon Algorithm
(Pottinger & Halevy, VLDB 2000)
31
Local Completeness Information
• If sources are incomplete, we need to look at
each one of them.
• Often, sources are locally complete.
• Movie(title, director, year) complete for years
after 1960, or for American directors.
• Question: given a set of local completeness
statements, is a query Q’ a complete answer to
Q?
32
Example
• Movie(title, director, year)
– complete after 1960
• Show(title, theater, city, hour)
• Query: find movies (and directors) playing in
Seattle:
SELECT M.title, M.director
FROM
Movie M, Show S
WHERE M.title=S.title
AND
city=‘Seattle’
• Complete or not?
33
Example #2
• Movie(title, director, year), Oscar(title, year)
• Query: find directors whose movies won Oscars
after 1965:
SELECT M.director
FROM Movie M, Oscar O
WHERE M.title=O.title
AND M.year=O.year
AND O.year > 1965
• Complete or not?
34
References
• Information integration
– Maurizio Lenzerini
– Eighteenth International Joint Conference on Artificial
Intelligence, IJCAI 2003
– Invited Tutorial
• Data Integration: a Status Report
– Alon Halevy
– German Database Conference (BTW), 2003
– Invited Talk
35
Descargar

Query Languages for Semistructured Data