Crossing the Structure Chasm
Alon Halevy
University of Washington
FQAS 2002
Outline

The structure chasm:



Crossing the chasm at the U. of Washington:




Old problem
Reasons for renewed interest (Semantic Web).
Getting people to structure their data (Darwin)
Large-scale data sharing by Peer-data
management (Piazza)
What can you do with a corpus of 1 million
schemas? (Let’s build it and see).
Challenges going forward.
The Structure Chasm
The U-World
The S-World
 Authoring: easy, learned  Authoring: need to
in grade-school.
design structure first.
Learned in college.
 Querying: I need to
 Querying: easy,
know how you
keywords, I don’t need
structured your data.
to know the database.
 Data sharing:
 Data sharing: put our
negotiations, possibly
documents into the
committee work.
same corpus/search
engine.
 Results: approximate.
 Results: need to be
For human viewing.
precise; may affect
bank accounts.
Why Should We Care?

The chasm is limiting the use of S-World
technology:



Losing potential customers, applications.
Being ignored by friends and family.
People often end up using more
lightweight solutions:


Causes loss of functionality
No easy migration path back to the S-World.
Creating The Semantic Web

We want a web of structured data,
where searches are more meaningful:




It’s a gigantic distributed database.
Content authors are not DB/KR specialists.
The SW has not taken off yet precisely
because of the chasm.
Claim: We do not know how to build
large-scale data sharing systems.
Crossing the Structure Chasm




Goal: import some of the nice properties
of the U-world into the S-world.
Make authoring, querying and sharing of data
easier.
No illusions: The S-world will always be harder
than the U-world.
Some non-solutions:


(from KR people): more expressive representation
languages.
(from DB people): didn’t XML solve this problem?
Pause
Crossing the Chasm with Revere
The Corpus
Entice people to structure their data
Annotated
HTML
Corpusbased
Design
Tools
Schema
Mapping
Peer 1
Schem
a
Storage
ma
he g
Sc ppin
Ma
Peer 3
Sc
M a hem
pp a
i ng
Storage
Ma
p
pi n
gs
Enable sharing of data without
central control Peer 2
Schema
ma
ma
Darwin
Sc
he
Sc
he
HTML
U2S
Content
Annotation
Tool
Statistics over
Structure
Corpus of
Structured Data
Schema
Query
over Peer4
Schema
Storage
Peer 4
Schema
ma
Sche g
in
p
M ap
Storage
Results from
All Mapped
Peers' Stored
Data
Piazza
Tools for
facilitating
authoring and
sharing of data
Crossing the Chasm w/ Revere

Key components:




Darwin: Get people to structure their data.
Piazza: Enable people to share their data.
Statistics over structures: Import the main
technique of the U-world into the S-world.
Goal: create infrastructure for building
semantic web applications:

First case study: creating a SW from data
that is already on people’s web pages.
Outline

The structure chasm:



Old problem
Reasons for renewed interest (Semantic Web).
Crossing the chasm at the U. of Washington:



Getting people to structure their data (Darwin)
Large-scale data sharing by Peer-data
management (Piazza)
What can you do with a corpus of 1 million
schemas? (Let’s build it and see).
Darwin: an evolutionary
approach to the semantic web
Joint work with: Etzioni, Gribble, Levy, McDowell, Vlasseva

Two challenges:



Can we create conditions to entice people to
create semantic content?
Can a database evolve rather than being created
in the traditional fashion?
Goal: create a semantic web from data that is
already on web pages: events, contact info,…


Large number of very heterogeneous web pages.
Wrapper technology does not apply.
Accessing at query-time not scalable.
Key Ideas of Darwin

Make it easy: tool for annotating HTML pages


Immediate gratification:




No need to replicate data.
A set of applications that provide immediate
benefit (calendar, phone book)
Illustrate that even partial data is useful.
Defer checking integrity constraints
Start local, reach out to others later.
Darwin and the Chasm


Addresses first step: getting data into
structured form.
Challenges:



How to entice people to create content
How to evolve a database/knowledge
base?
How to do this in a scalable fashion.
Outline

The structure chasm:



Old problem
Reasons for renewed interest (Semantic Web).
Crossing the chasm at the U. of Washington:



Getting people to structure their data (Darwin)
Large-scale data sharing by Peer-data
management (Piazza)
What can you do with a corpus of 1 million
schemas? (Let’s build it and see).
Large-Scale Data Sharing
(With Ives, Mork, Suciu, Tatarinov)


Goal: to share structured data across
multiple autonomous sites.
Current solution: data integration



Query a set of data sources through a
mediated schema.
Use XML as a data sharing format, and
XQuery.
Information Manifold (96), Tukwila (99),
Nimble Technology (www.nimble.com).
Data Integration Systems
mybooks.com Mediated Schema
Books
Internet
Inventory
Orders
WAN
MorganKaufman
PrenticeHall
...
Shipping
Internet
East
West
Orders
Reviews
Internet
FedEx
Customer
Reviews
UPS
NYTimes
...
alt.books.
reviews
Limitations of Data Integration

The mediated schema:





Creating it is hard, often infeasible.
Mapping to it may involve repetitive work.
Querying it can be hard for users familiar
with their own schema.
Note: much better than warehousing.
Goal: share data without a single
mediated schema.
Peer Data-Management


PDMS: a network of peers
Peers can:





Export base data
Provide views on base data
Serve as logical mediators for other peers
A peer can be both a server and a client.
Semantic relationships are specified locally
(between small sets of peers).
Advantages of PDMS




No need for a central mediated schema.
Can map data opportunistically, as is most
convenient.
Queries are posed using the peer’s schema.
Answers come from anywhere in the system.
Relationship to peer-to-peer file sharing:


Data has rich semantics
Probably not as dynamic in membership.
Example PDMS
911 Dispatch
Hospitals
Center
(9DC)
(H)
LH:CritBed(bed, hosp, room, PID, status) 
H:CritBed(bed, hosp, room),
H:Patient(PID, bed, status)
First
Hospitals
Hospital (FH)
(H)
Fire
Services (FS)
Lakeview
Hospital (LH)
Portland
Fire District (PFD)
First
Hospital (FH)
Lakeview
Hospital (LH)
Vancouver Fire
District (VFD)
...
Station 3 Station 19 Station 12 Station 32
Ad-hoc Additions to a PDMS
Earthquake
Command
Center (ECC)
911 Dispatch
Center (9DC)
Medical
Aid (MA)
Search &
Rescue (SR)
Fire
Services (FS)
Hospitals
(H)
Emergency
Workers (EW)
Portland
Fire District (PFD)
First
Hospital (FH)
Lakeview
Hospital (LH)
Vancouver Fire
District (VFD)
...
Station 3 Station 19 Station 12 Station 32
National
Guard
Washington
State
PDMS Research Directions

Schema mediation:




Languages for specifying mappings.
Algorithms for answering queries.
Easy generation of mappings.
Efficiency and optimization:



Avoiding redundant paths, following best ones.
Propagating updates efficiently (w/ Mork, Gribble).
Distributed indexing of views (Dalvi, Suciu).
Schema Mediation in PDMS


The formalism for the semantic glue.
From data integration, we have:



Global-as-view (GAV): mediated schema is defined
as views over the sources [query composition].
Local-as-view (LAV): sources are defined as views
over mediated schema [answering q’s u/views]
GLAV: a combination of both:



Qsource = Qschema
Qsource  Qschema
Query answering is understood for a two-tier
network: a mediator over multiple sources.
Hospitals
(H)
LH:CritBed(bed, hosp, room, PID, status) 
H:CritBed(bed, hosp, room),
H:Patient(PID, bed, status)
First
Hospital (FH)
Lakeview
Hospital (LH)
Mediation: the Relational Case

A mediation language that uses GLAV
locally.



Precise conditions for when global
query answering in a PDMS is
tractable/decidable.
A query answering algorithm that
combines chains of query composition
and answering queries using views.
See ICDE-03 paper for details.
Mediation: the XML Case

Mediation language:



XQuery is inappropriate.
Our language allows incremental
specification of mappings. Uses subset of
XQuery.
Query answering algorithm:

New techniques for answering queries
using views:


Challenge: nesting in XML structure.
Implementation: based on XML.
Additional Mediation Issues

Mapping composition:



Given A-B and B-C mappings, is there an A-C
mapping that doesn’t lose information?
Yes, and no. Even when yes, it may be infinite. [w/
Madhavan].
Basic framework and properties of mappings:


KR community needs to consider mappings as
first-class citizens.
See [AAAI-02].
Piazza and the Chasm

Enable data ad-hoc large-scale data
sharing.


No need for central control or schema.
Open issues:




Optimization (follow only good paths?)
Annotations on mappings?
Intelligent data placement.
Update propagation.
Outline

The structure chasm:



Old problem
Reasons for renewed interest (Semantic Web).
Crossing the chasm at the U. of Washington:



Getting people to structure their data (Darwin)
Large-scale data sharing by Peer-data
management (Piazza)
What can you do with a corpus of 1 million
schemas? (Let’s build it and see).
Corpus Based Tools

Information retrieval works by:



Large corpora of text
Statistics over word occurrences in texts.
Can we do the same in the S-World?


Create a corpus of schemas.
Use it to build tools that facilitate
authoring, querying and sharing data.
The Corpus

Contents:


Schemas, ontologies, meta-data, data,
queries.
Sample statistics:



How often does a word appear as a
relation name?
When it does, what tend to be the
attribute names?
What other tables are there? What are the
foreign keys?
Sample Tools

Auto-complete:


Schema matcher:


I start creating a schema, and the tools suggests a
completion (perhaps I start only with data, not
schema).
I can map two between two schemas by relating
them both to the corpus.
Query reformulator:

I ask a query using my terminology, and the tools
reformulates it to a particular database schema.
Why are we Optimistic?

Because of our work on LSD and GLUE
[w/ Doan, Domingos, Madhavan]:



We computed classifiers for attributes of
schemas.
Classifiers are a particular kind of statistic.
This is a huge community project:


We need your help
Or at least, your schemas
Summary

Takeaway questions:




How can we entice people to structure data?
Can we generalize Peer-to-peer systems to
structured data?
What can we do with a corpus of schemas, and
how can we build it?
For more details,
www.cs.washington.edu/homes/alon

[CIDR-03], [AAAI-02], [ICDE-03], [WWW-02],
[SIGMOD-01].
Descargar

Crossing the Structure Chasm