Data Integration
1
Motivating Examples
• An organization has on average 49 databases
– can talk about the same topic, but use different
vocabularies, different schemas
– how can we access them as if accessing a single db?
• Hundreds of online bookstores
– amazon.com, barnes&noble.com, etc.
– how can we query them as if querying a single
source?
• Hundreds of CS websites in US, in text format
– can we consolidate information about all of them and
query them as if querying a giant relational database?
2
Data Integration
• The general problem
– how can we access a set of heterogeneous,
distributed, autonomous databases as if
accessing a single database?
• Arises in numerous contexts
– on the Web, at enterprises, military, scientific
cooperation, bio-informatics domains, ecommerce, etc.
• Currently very hot
– in both database research and industry
3
What is Data Integration
• Providing
– Uniform (same query interface to all sources)
– Access to (queries; eventually updates too)
– Multiple (we want many, but 2 is hard too)
– Autonomous (DBA doesn’t report to you)
– Heterogeneous (data models are different)
– Structured (or at least semi-structured)
– Data Sources (not only databases).
4
The Problem: Data Integration
mybooks.com Mediated Schema
Books
Internet
Inventory
Orders
WAN
MorganKaufman
PrenticeHall
...
Shipping
Internet
East
West
Orders
Reviews
Internet
FedEx
Customer
Reviews
UPS
NYTimes
...
alt.books.
reviews
Uniform query capability across autonomous, heterogeneous
data sources on LAN, WAN, or Internet
5
Motivation(s)
• Enterprise data integration; web-site construction.
• WWW:
– Comparison shopping
– Portals integrating data from multiple sources
– B2B, electronic marketplaces
• Science and culture:
– Medical genetics: integrating genomic data
– Astrophysics: monitoring events in the sky.
– Environment: Puget Sound Regional Synthesis Model
– Culture: uniform access to all cultural databases
produced by countries in Europe.
6
Current State of Affairs
• Mostly ad-hoc programming: create a special
solution for every case; pay consultants a lot of
money.
• Long-standing challenge in the DB community
• AI/WWW communities are on board
• Annual workshops, vision papers, ...
• Companies
– Informatica, many others, ...
7
A Brief History
• Many early ad-hoc solutions
• Converged into two approaches
– data warehousing vs. virtual DI systems
•
•
•
•
Semi-structured data, XML
Wrappers
Other issues: query optimization, schema matching, ...
Current directions
– DI for specialized domains (e.g., bioinformatics)
– on-the-fly DI, entity-centric DI
• New types of data sharing systems
– P2P systems, Semantic Web
8
Data warehousing vs.
Virtual DI systems
9
Data Warehouse Architecture
User queries
OLAP / Decision support/
Data cubes/ data mining
Relational database (warehouse)
Data extraction
programs
Data
source
Data cleaning/
scrubbing
Data
source
Data
source
10
Data warehousing
• Data warehousing: load all the data periodically
into a warehouse.
– 6-18 months lead time
– Separates operational DBMS from decision
support DBMS. (not only a solution to data
integration).
– Performance is good; data may not be fresh.
– Need to clean, scrub you data.
11
The Virtual Integration Architecture
• Leave the data in the sources.
• When a query comes in:
– Determine the relevant sources to the query
– Break down the query into sub-queries for the
sources.
– Get the answers from the sources, and combine
them appropriately.
• Data is fresh.
• Challenge: many
12
Virtual Integration Architecture
User queries
Mediated schema
Mediator:
Which data
model?
Reformulation engine
optimizer
Execution engine
Data source
catalog
wrapper
wrapper
wrapper
Data
source
Data
source
Data
source
13
Sources can be: relational, hierarchical (IMS), structure files, web sites.
Architecture of (Virtual) Data Integration
System
Find books written by Isaac
Asimov & priced under $15
global query interface
query interface 1
amazon.com
query interface 2
bn.com
query interface 3
powell.com
14
A Brief History
• Many early ad-hoc solutions
• Converged into two approaches
– data warehousing vs. virtual DI systems
•
•
•
•
Semi-structured data, XML
Wrappers
Other issues: query optimization, schema matching, ...
Current directions
– DI for specialized domains (e.g., bioinformatics)
– on-the-fly DI, entity-centric DI
• New types of data sharing systems
– P2P systems, Semantic Web
15
Semi-structured data
XML
16
Semi-structured Data
• What should be the underlying data model for DI
contexts?
– relational model is not an ideal choice
• Developed semi-structured data model
– started with the OEM (object exchange model)
• Then XML came along
• It is now the most well-known semi-structured data
model
• Generating much research in the DB community
17
HTML
<h1> Bibliography </h1>
<p> <i> Foundations of Databases </i>
Abiteboul, Hull, Vianu
<br> Addison Wesley, 1995
<p> <i> Data on the Web </i>
Abiteboul, Buneman, Suciu
<br> Morgan Kaufmann, 1999
HTML is hard for applications
18
XML
<bibliography>
<book> <title> Foundations… </title>
<author> Abiteboul </author>
<author> Hull </author>
<author> Vianu </author>
<publisher> Addison Wesley </publisher>
<year> 1995 </year>
</book>
…
</bibliography>
XML describes the content: easy for applications19
DTDs as Grammars
Same thing as:
db
book
title
author
year
publisher
• A DTD is a EBNF (Extended
::= (book|publisher)*
::= (title,author*,year?)
::= string
::= string
::= string
::= string
BNF)
grammar
• An XML tree is precisely a derivation tree
XML Documents that have a DTD and conform to it are called valid
20
More on DTDs as Grammars
<!DOCTYPE paper [
<!ELEMENT paper (section*)>
<!ELEMENT section ((title,section*) | text)>
<!ELEMENT title
(#PCDATA)>
<!ELEMENT text
(#PCDATA)>
]>
<paper> <section> <text> </text> </section>
<section> <title> </title> <section> … </section>
<section> … </section>
</section>
</paper>
XML documents can be nested arbitrarily deep21
XML for Representing Data
XML:
persons
persons
row
nam e
row
phone
name
John
row
3634
“John”
phone name phone
3634 “Sue”
name
6343 “Dick”
phone
6363
<persons>
Sue
6343
D ic k
6363
<row> <name>John</name>
<phone> 3634</phone></row>
<row> <name>Sue</name>
<phone> 6343</phone>
<row> <name>Dick</name>
<phone> 6363</phone></row>
</persons>
22
XML vs Data Models
• XML is self-describing
• Schema elements become part of the data
– Relational schema: persons(name,phone)
– In XML <persons>, <name>, <phone> are part of the
data, and are repeated many times
• Consequence: XML is much more flexible
• XML = semistructured data
23
Semi-structured Data Explained
• Missing attributes:
<person> <name> John</name>
<phone>1234</phone>
</person>
<person> <name>Joe</name>
</person>
 no phone !
• Repeated attributes
<person> <name> Mary</name>
<phone>2345</phone>
<phone>3456</phone>
</person>
 two phones !
24
Semistructured Data Explained
• Attributes with different types in different objects
<person> <name> <first> John </first>
<last> Smith </last>
</name>
<phone>1234</phone>
</person>
 structured name !
• Nested collections (no 1NF)
• Heterogeneous collections:
– <db> contains both <book>s and <publisher>s
25
XML Data v.s. E/R, ODL, Relational
• Q: is XML better or worse ?
• A: serves different purposes
– E/R, ODL, Relational models:
• For centralized processing, when we control the data
– XML:
• Data sharing between different systems
• we do not have control over the entire data
• E.g. on the Web
• Do NOT use XML to model your data ! Use E/R, ODL,
or relational instead.
26
Exporting Relational Data to XML
product
makes
company
• Product(pid, name, weight)
• Company(cid, name, address)
• Makes(pid, cid, price)
27
Export data grouped by companies
<db><company> <name> GizmoWorks </name>
<address> Tacoma </address>
<product> <name> gizmo </name>
<price> 19.99 </price>
</product>
<product> …</product>
…
</company>
<company> <name> Bang </name>
<address> Kirkland </address>
<product> <name> gizmo </name>
<price> 22.99 </price>
</product>
…
</company>
…
</db>
Redundant
representation
of products
28
The DTD
<!ELEMENT db (company*)>
<!ELEMENT company (name, address, product*)>
<!ELEMENT product (name,price)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT address (#PCDATA)>
<!ELEMENT price (#PCDATA)>
29
Export Data by Products
<db> <product> <name> Gizmo </name>
<manufacturer>
<name> GizmoWorks </name>
<price> 19.99 </price>
<address> Tacoma </address>
</manufacturer>
<manufacturer>
<name> Bang </name>
<price> 22.99 </price>
<address> Kirkland </address>
</manufacturer>
…
</product>
<product> <name> OneClick </name> …
</db>
Redundant
Representation
of companies
30
Which One Do We Choose ?
• The structure of the XML data is determined by
agreement, with our partners, or dictated by committees
– Many XML dialects (called applications)
• XML Data is often nested, irregular, etc
• No normal forms for XML 
31
XML Query Languages
• Xpath
• XML-QL
• Xquery
32
A Brief History
• Many early ad-hoc solutions
• Converged into two approaches
– data warehousing vs. virtual DI systems
•
•
•
•
Semi-structured data, XML
Wrappers
Other issues: query optimization, schema matching, ...
Current directions
– DI for specialized domains (e.g., bioinformatics)
– on-the-fly DI, entity-centric DI
• New types of data sharing systems
– P2P systems, Semantic Web
33
Wrappers
Information Extraction
34
Virtual Integration Architecture
User queries
Mediated schema
Mediator:
Which data
model?
Reformulation engine
optimizer
Execution engine
Data source
catalog
wrapper
wrapper
wrapper
Data
source
Data
source
Data
source
35
Sources can be: relational, hierarchical (IMS), structure files, web sites.
Wrapper Programs
• Task: to communicate with the data sources and
do format translations.
• They are built w.r.t. a specific source.
• They can sit either at the source or at the mediator.
• Often hard to build (very little science).
• Can be “intelligent”: perform source-specific
optimizations.
36
Example
Transform:
<b> Introduction to DB </b>
<i> Phil Bernstein </i>
<i> Eric Newcomer </i>
Addison Wesley, 1999
into:
<book>
<title> Introduction to DB </title>
<author> Phil Bernstein </author>
<author> Eric Newcomer </author>
<publisher> Addison Wesley </publisher>
<year> 1999 </year>
</book>
37
Wrapper Construction
• Huge amount of research in the past decade
• Two major approaches
– machine learning: typically requires some hand-labeled data
– data-intensive, completely automatic
• Different focuses
– pull out each record (i.e., segment page into records)
– pull out fields in each record
– remove junk portions (ads, etc.)
• Current solutions are still brittle
• Unclear whether “standards” such as XML & Web services will
eliminate the problem
– the need likely will still remain
38
Information Extraction
• If the source cannot be wrapped with a grammar
or some easy-to-parse rules
– must do information extraction
• Huge research in the AI community
39
A Brief History
• Many early ad-hoc solutions
• Converged into two approaches
– data warehousing vs. virtual DI systems
•
•
•
•
Semi-structured data, XML
Wrappers
Other issues: query optimization, schema matching, ...
Current directions
– DI for specialized domains (e.g., bioinformatics)
– on-the-fly DI, entity-centric DI
• New types of data sharing systems
– P2P systems, Semantic Web
40
Other Issues
41
Data Source Catalog
• Contains all meta-information about the sources:
– Logical source contents (books, new cars).
– Source capabilities (can answer SQL queries)
– Source completeness (has all books).
– Physical properties of source and network.
– Statistics about the data (like in an RDBMS)
– Source reliability
– Mirror sources
– Update frequency.
42
Content Descriptions
• User queries refer to the mediated schema.
• Data is stored in the sources in a local schema.
• Content descriptions provide the semantic
mappings between the different schemas.
• Data integration system uses the descriptions to
translate user queries into queries on the sources.
43
Desiderata from Source Descriptions
• Expressive power: distinguish between sources
with closely related data. Hence, be able to prune
access to irrelevant sources.
• Easy addition: make it easy to add new data
sources.
• Reformulation: be able to reformulate a user query
into a query on the sources efficiently and
effectively.
44
Reformulation Problem
• Given:
– A query Q posed over the mediated schema
– Descriptions of the data sources
• Find:
– A query Q’ over the data source relations, such
that:
• Q’ provides only correct answers to Q, and
• Q’ provides all possible answers from to Q given the
sources.
45
Approaches to Specifying Source
Descriptions
• Global-as-view: express the mediated schema
relations as a set of views over the data source
relations
• Local-as-view: express the source relations as
views over the mediated schema.
• Can be combined with no additional cost.
46
Global-as-View
Mediated schema:
Movie(title, dir, year, genre),
Schedule(cinema, title, time).
Create View Movie AS
select * from S1 [S1(title,dir,year,genre)]
union
select * from S2 [S2(title, dir,year,genre)]
union
[S3(title,dir), S4(title,year,genre)]
select S3.title, S3.dir, S4.year, S4.genre
from S3, S4
where S3.title=S4.title
47
Global-as-View: Example 2
Mediated schema:
Movie(title, dir, year, genre),
Schedule(cinema, title, time).
Create View Movie AS [S1(title,dir,year)]
select title, dir, year, NULL
from S1
union
[S2(title, dir,genre)]
select title, dir, NULL, genre
from S2
48
Global-as-View: Example 3
Mediated schema:
Movie(title, dir, year, genre),
Schedule(cinema, title, time).
Source S4: S4(cinema, genre)
Create View Movie AS
select NULL, NULL, NULL, genre
from S4
Create View Schedule AS
select cinema, NULL, NULL
from S4.
But what if we want to find which cinemas are playing comedies?
49
Global-as-View Summary
•
•
•
•
•
Query reformulation boils down to view unfolding.
Very easy conceptually.
Can build hierarchies of mediated schemas.
You sometimes loose information. Not always natural.
Adding sources is hard. Need to consider all other
sources that are available.
50
Local-as-View: example 1
Mediated schema:
Movie(title, dir, year, genre),
Schedule(cinema, title, time).
Create Source S1 AS
select * from Movie
Create Source S3 AS [S3(title, dir)]
select title, dir from Movie
Create Source S5 AS
select title, dir, year
from Movie
where year > 1960 AND genre=“Comedy”
51
Local-as-View: Example 2
Mediated schema:
Movie(title, dir, year, genre),
Schedule(cinema, title, time).
Source S4: S4(cinema, genre)
Create Source S4
select cinema, genre
from Movie m, Schedule s
where m.title=s.title
.
Now if we want to find which cinemas are playing comedies,
there is hope!
52
Local-as-View Summary
• Very flexible. You have the power of the entire
query language to define the contents of the
source.
• Hence, can easily distinguish between contents of
closely related sources.
• Adding sources is easy: they’re independent of
each other.
• Query reformulation: answering queries using
views!
53
The General Problem
• Given a set of views V1,…,Vn, and a query Q, can
we answer Q using only the answers to V1,…,Vn?
• Many, many papers on this problem.
• The best performing algorithm: The MiniCon
Algorithm, (Pottinger & Levy, 2000).
• Great survey on the topic: (Halevy, 2001).
54
Query Optimization
• Very related to query reformulation!
• Goal of the optimizer: find a physical plan with minimal
cost.
• Key components in optimization:
– Search space of plans
– Search strategy
– Cost model
55
Optimization in Distributed DBMS
• A distributed database (2-minute tutorial):
– Data is distributed over multiple nodes, but is
uniform.
– Query execution can be distributed to sites.
– Communication costs are significant.
• Consequences for optimization:
– Optimizer needs to decide locality
– Need to exploit independent parallelism.
– Need operators that reduce communication
costs (semi-joins).
56
DDBMS vs. Data Integration
• In a DDBMS, data is distributed over a set of
uniform sites with precise rules.
• In a data integration context:
– Data sources may provide only limited access
patterns to the data.
– Data sources may have additional query
capabilities.
– Cost of answering queries at sources unknown.
– Statistics about data unknown.
– Transfer rates unpredictable.
57
Descargar

Database Design - Amirkabir University of Technology