DB & IR: Both Sides Now
Gerhard Weikum
[email protected]
http://www.mpi-inf.mpg.de/~weikum/
in collaboration with
Georgiana Ifrim, Gjergji Kasneci, Josiane Parreira, Maya Ramanath,
Ralf Schenkel, Fabian Suchanek, Martin Theobald
DB and IR: Two Parallel Universes
                        Database Systems            Information Retrieval
canonical application:  accounting                  libraries
data type:              numbers, short strings      text
foundation:             algebraic / logic based     probabilistic / statistics based
search paradigm:        Boolean retrieval           ranked retrieval
                        (exact queries,             (vague queries,
                        result sets/bags)           result lists)
market leaders:         Oracle, IBM DB2,            Google, Yahoo!, MSN,
                        MS SQL Server, etc.         Verity, Fast, etc.
parallel universes forever?
Gerhard Weikum June 14, 2007
2/41
Why DB&IR Now? – Application Needs
Simplify life for application areas like:
• Global health-care management for monitoring epidemics
• News archives for journalists, press agencies, etc.
• Product catalogs for houses, cars, vacation places, etc.
• Customer support & CRM in insurances, telcom, retail, software, etc.
• Bulletin boards for social communities
• Enterprise search for projects, skills, know-how, etc.
• Personalized & collaborative search in digital libraries, Web, etc.
• Comprehensive archive of blogs with time-travel search
Typical data:
Disease (DId, Name, Category, Pathogen …)
UMLS-Categories ( … )
Patient (… Age, HId, Date, Report, TreatedDId) Hospital (HId, Address …)
Typical query:
symptoms of tropical virus diseases and reported anomalies
with young patients in central Europe in the last two weeks
Why DB&IR Now? – Platform Desiderata
[Figure: search paradigm vs. data type]
• Unstructured search (keywords) on unstructured data (documents): IR Systems, Search Engines
• Structured search (SQL, XQuery) on structured data (records): DB Systems
• Unstructured search on structured data: Keyword Search on Relational Graphs
  (IIT Bombay, UCSD, MSR, Hebrew U, CU Hong Kong, Duke U, ...)
• Structured search on unstructured data: Querying entities & relations from IE
  (MSR Beijing, UW Seattle, IBM Almaden, UIUC, MPI, …)
• Center: Integrated DB&IR Platform
Platform desiderata (from app developer's viewpoint):
• Flexible ranking on text, categorical, numerical attributes
  • cope with "too many answers" and "no answers"
• Ontologies (dimensions, facets) for products, locations, orgs, etc.
  • for query rewriting (relaxation, strengthening)
• Complex queries combining text & structured attributes
  • XPath/XQuery Full-Text with ranking
• High update rate concurrently with high query load
Why DB&IR Forever?
Turn the Web, Web2.0, and Web3.0 into the world‘s
most comprehensive knowledge base („semantic DB“) !
• Data enrichment at very large scale
• Text and speech are key sources of
knowledge production (publications, patents, conferences, meetings, ...)
                          2000         2007
indexed Web               2 Bio.       20 Bio.
Flickr photos             ---          100 Mio.
digital photos            ?            150 Bio.
Wikipedia                 8 000        1.8 Mio.
OECD researchers          7.4 Mio.     8.4 Mio.
patents world-wide        ?            60 Mio.
US Library of Congress    115 Mio.     134 Mio.
Google Scholar            ---          500 Mio.
(Bio. = billion, Mio. = million)
Outline
• Past : Matter, Antimatter, and Wormholes
• Present : XML and Graph IR
• Future : From Data to Knowledge
Parallel Universes: A Closer Look
Matter
• user = programmer
• query = precise spec. of info request
• interaction via API
• strength: indexing, QP
• weakness: user model
• eval. measure: efficiency
  (throughput, response time, TPC-H, XMark, …)
Antimatter
• user = your kids
• query = approximation of user's real info needs
• interaction process via GUI
• strength: ranking model
• weakness: interoperability
• eval. measure: effectiveness
  (precision, recall, F1, MAP, NDCG, TREC & INEX benchmarks, …)
[Figure: timeline 1990, 1995, 2000, 2005 of work connecting DB (top) and IR (bottom)]
• Web Query Languages: W3QS, WebOQL, Araneus …
• Uncertain & Prob. Relations: Prob. DB (Cavallo&Pittarelli), Prob. Tuples (Barbara et al.), Mystiq, Trio …
• Semistructured Data: Lore, Xyleme …
• XPath, XPath Full-Text
• VAGUE (Motro)
• WHIRL (Cohen)
• 1st Gen. XML IR: XXL, XIRQL, Elixir, JuruXML
• 2nd Gen. XML IR: XRank, Timber, TIJAH, XSearch, FleXPath, CoXML, TopX
• MarkLogic, Fast …
• INEX
• Deep Web Search
• Prob. Datalog (Fuhr et al.)
• Digital Libraries, Struct. Docs, Multimedia IR
• Proximal Nodes (Baeza-Yates et al.)
• Faceted Search: Flamenco …
• Graph IR
• Web Entity Search: Libra, Avatar, ExDB …
WHIRL: IR over Relations [W.W. Cohen: SIGMOD'98]
Add text-similarity selection and join to relational algebra
Example: Select * From Movies M, Reviews R
         Where M.Plot ~ "fight" And M.Year > 1990 And R.Rating > 3
         And M.Title ~ R.Title And M.Plot ~ R.Comment
Movies
Title     Plot                                                               …   Year
Matrix    In the near future … computer hacker Neo … fight training …           1999
Hero      In ancient China … fights … sword fight … fights Broken Sword …       2002
Shrek 2   In Far Far Away … our lovely hero fights with cat killer …            2004
Reviews
Title                  Comment                                               …   Rating
Matrix 1               … cool fights … new techniques …                          4
Matrix Reloaded        … fights … and more fights … fairly boring …              1
Matrix Eigenvalues     … matrix spectrum … orthonormal …                         5
Ying xiong aka. Hero   … fight for peace … sword fight … dramatic colors …       5
Scoring and ranking:
$x_j \sim tf(\text{word } j \text{ in } x) \cdot idf(\text{word } j)$ with dampening & normalization
$s(\langle x,y \rangle, q: A \sim B) = \mathrm{cosine}(x.A, y.B)$
$s(\langle x,y \rangle, q_1 \wedge \dots \wedge q_m) = \prod_{i=1}^{m} s(\langle x,y \rangle, q_i)$
• DB&IR for query-time data integration
• More recent work: MinorThird, Spider, DBLife, etc.
• But scoring models fairly ad hoc
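The WHIRL-style scoring above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Cohen's implementation: the tokenizer (lower-casing and stripping a trailing "s"), the dampened tf and idf formulas, and the abridged table contents are all choices made here for the example. Conjunctions are scored as the product of the cosine similarities of their conditions, and the conditions on Year and Rating act as ordinary Boolean filters.

```python
import math
from collections import Counter

def tokenize(text):
    # Crude normalization (an assumption of this sketch): lower-case and
    # strip a trailing "s" so that "fights" matches "fight".
    return [w.rstrip("s") for w in text.lower().split()]

def build_idf(texts):
    # idf over all text fields; log(1 + n/df) is one common dampened variant.
    n = len(texts)
    df = Counter(w for t in texts for w in set(tokenize(t)))
    return {w: math.log(1 + n / c) for w, c in df.items()}

def vec(text, idf):
    # Length-normalized tf*idf vector with logarithmically dampened tf.
    tf = Counter(tokenize(text))
    v = {w: (1 + math.log(c)) * idf.get(w, 0.0) for w, c in tf.items()}
    norm = math.sqrt(sum(x * x for x in v.values()))
    return {w: x / norm for w, x in v.items()} if norm > 0 else {}

def cosine(u, v):
    # Both vectors are already normalized, so the dot product is the cosine.
    return sum(x * v.get(w, 0.0) for w, x in u.items())

def whirl_join(movies, reviews, keyword):
    texts = [m[f] for m in movies for f in ("Title", "Plot")]
    texts += [r[f] for r in reviews for f in ("Title", "Comment")]
    idf = build_idf(texts)
    q = vec(keyword, idf)
    results = []
    for m in movies:
        if not m["Year"] > 1990:          # exact (Boolean) condition
            continue
        for r in reviews:
            if not r["Rating"] > 3:       # exact (Boolean) condition
                continue
            plot = vec(m["Plot"], idf)
            # Product of the three similarity conditions of the query.
            score = (cosine(plot, q)
                     * cosine(vec(m["Title"], idf), vec(r["Title"], idf))
                     * cosine(plot, vec(r["Comment"], idf)))
            if score > 0:
                results.append((score, m["Title"], r["Title"]))
    return sorted(results, reverse=True)

# Abridged versions of the example tables (text shortened for illustration).
movies = [
    {"Title": "Matrix", "Plot": "In the near future computer hacker Neo fight training", "Year": 1999},
    {"Title": "Hero", "Plot": "In ancient China sword fight", "Year": 2002},
    {"Title": "Shrek 2", "Plot": "In Far Far Away our lovely hero fights with cat killer", "Year": 2004},
]
reviews = [
    {"Title": "Matrix 1", "Comment": "cool fights new techniques", "Rating": 4},
    {"Title": "Matrix Reloaded", "Comment": "fights and more fights fairly boring", "Rating": 1},
    {"Title": "Matrix Eigenvalues", "Comment": "matrix spectrum orthonormal", "Rating": 5},
    {"Title": "Ying xiong aka. Hero", "Comment": "fight for peace sword fight dramatic colors", "Rating": 5},
]

for score, movie, review in whirl_join(movies, reviews, "fight"):
    print(f"{score:.3f}  {movie}  ~  {review}")
```

Because the score is a product, a pair drops out as soon as one similarity condition is zero, which is why the "Matrix Eigenvalues" review matches on title but not on plot.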
XXL: Early XML IR
[Anja Theobald, GW: Adding Relevance to XML, WebDB'00]
Union of heterogeneous sources without global schema
Similarity-aware XPath:
Which professors from Saarbruecken (SB) are teaching IR and have research projects on XML?
//~Professor [//* = "~SB"]
   [//~Course [//* = "~IR"] ]
   [//~Research [//* = "~XML"] ]
[Figure: two XML trees from different sources.
 Professor: Name: Gerhard Weikum; Address (City: SB, Country: Germany); Teaching (Course: Title: IR, Description: Information retrieval ..., Syllabus, Literature: Book, Article); Research (Project: Title: Intelligent Search of Heterogeneous XML Data, Funding: EU).
 Lecturer: Name: Ralf Schenkel; Address: Max-Planck Institute for Informatics, Germany; Activities (Seminar: Contents: Ranked retrieval …, Literature: …; Scientific: Name: INEX task coordinator (Initiative for the Evaluation of XML …), Sponsor: EU; Other: …).]
XXL: Early XML IR
[Anja Theobald, GW: Adding Relevance to XML, WebDB'00]
Motivation: Union of heterogeneous sources has no schema
Similarity-aware XPath:
Which professors from Saarbruecken (SB) are teaching IR and have research projects on XML?
//~Professor [//* = "~Saarbruecken"]
   [//~Course [//* = "~IR"] ]
   [//~Research [//* = "~XML"] ]
Scoring and ranking:
• tf*idf for content condition
• ontological similarity for relaxed tag condition
• score aggregation with probabilistic independence
query expansion model: disjunction of tags
Ontological similarity:
• Wu&Palmer: |path| through lca(x,y)
• Dice coeff.: 2 #(x,y) / (#x + #y) on Web
[Figure: the Professor and Lecturer XML trees overlaid with an ontology graph around "professor": alchemist, primadonna, magician, artist, director, wizard, investigator, intellectual, researcher, scientist, scholar, mentor, academic, academician, faculty member, lecturer, teacher; edge labels RELATED (0.48), HYPONYM (0.749)]
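The scoring ingredients listed for XXL can be illustrated with a small Python sketch. It is not the XXL implementation: the function names, the expansion threshold, and the toy ontology weights are assumptions made here (the values 0.749 and 0.48 are borrowed from the edge labels in the figure, but which concept pairs they belong to is not recoverable from the slide). The Dice coefficient follows the formula on the slide, and "probabilistic independence" is read as product for conjunctions and noisy-or for disjunctions.

```python
def dice(count_x, count_y, count_xy):
    # Dice coefficient 2 * #(x,y) / (#x + #y) from (co-)occurrence counts.
    total = count_x + count_y
    return 2.0 * count_xy / total if total > 0 else 0.0

def expand_tag(tag, ontology, threshold):
    # Relax a tag condition into a weighted disjunction of similar tags.
    # `ontology` maps a tag to {related tag: similarity}; hypothetical format.
    expansion = {tag: 1.0}
    for other, sim in ontology.get(tag, {}).items():
        if sim >= threshold:
            expansion[other] = sim
    return expansion

def and_score(scores):
    # Conjunction of conditions under independence: product of scores.
    result = 1.0
    for s in scores:
        result *= s
    return result

def or_score(scores):
    # Disjunction under independence: 1 - prod(1 - s).
    miss = 1.0
    for s in scores:
        miss *= 1.0 - s
    return 1.0 - miss

# Illustrative ontology fragment with made-up pairings of the weights.
ontology = {"professor": {"lecturer": 0.749, "magician": 0.48}}
print(expand_tag("professor", ontology, threshold=0.6))
print(dice(10, 30, 5), and_score([0.5, 0.5]), or_score([0.5, 0.5]))
```

A threshold on the similarity is one simple way to keep the expansion from drifting into unrelated concepts such as "magician".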
The Past: Lessons Learned
• DB&IR: added flexible ranking to (semi-)structured querying
  to cope with schema and instance diversity
• but ranking seems "ad hoc" and
  not consistently good in benchmarks
• to win benchmark: tuning needed,
  but tuning is easier if ranking is principled!
• ontologies are mixed blessing:
  quality diverse, concept similarity subtle,
  danger of topic drift
• ontology-based query expansion
  (into large disjunctions)
  poses efficiency challenge:
  // ~Professor [...]  →
  // { Professor, Researcher, Lecturer, Scientist, Scholar, Academic, ... }[...]
[Figure: precision-recall plot; WordNet concept graph with nodes entity, substance, solid, food, produce, edible fruit, pome, apple, Golden Delicious, element, gold]
Outline
✓ Past : Matter, Antimatter, and Wormholes
• Present : XML and Graph IR
• Future : From Data to Knowledge
TopX: 2nd Generation XML IR
[Martin Theobald, Ralf Schenkel, GW: VLDB’05, VLDB Journal]
• Exploit tags & structure for better precision
• Can relax tag names & structure for better recall
• Principled ranking by probabilistic IR (Okapi BM25 for XML)
• Efficient top-k query processing (using improved TA)
• Robust ontology integration (self-throttling to avoid topic drift)
• Efficient query expansion (on demand, by extended TA)
• Relevance feedback for automatic query rewriting
”Semantic“ XPath Full-Text query:
/Article
[ftcontains(//Person, ”Max Planck“)]
[ftcontains(//Work, ”quantum physics“)]
//Children[@Gender = ”female“]//Birthdates
supported by TopX engine: http://infao5501.ag5.mpi-sb.mpg.de:8080/topx/
http://topx.sourceforge.net
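The top-k processing mentioned above builds on the threshold algorithm (TA) family. The following Python sketch shows the basic TA idea with summation as the aggregation function; it is not TopX's improved variant, and the function name, the in-memory dictionaries standing in for index lists, and the toy scores are assumptions of this example.

```python
import heapq

def threshold_algorithm(lists, k):
    """Basic TA over per-term score lists (dict: doc -> score), sum aggregation."""
    # Sorted access: every list is scanned in descending score order.
    sorted_lists = [sorted(l.items(), key=lambda kv: -kv[1]) for l in lists]
    seen = {}  # doc -> full aggregated score
    max_depth = max((len(s) for s in sorted_lists), default=0)
    for depth in range(max_depth):
        threshold = 0.0
        for s in sorted_lists:
            if depth < len(s):
                doc, score = s[depth]
                threshold += score
                if doc not in seen:
                    # Random access: look up the doc's score in all lists.
                    seen[doc] = sum(l.get(doc, 0.0) for l in lists)
        top = heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])
        # Stop once the k-th best score reaches the best possible score
        # of any document not yet seen.
        if len(top) == k and top[-1][1] >= threshold:
            return top
    return heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])

lists = [
    {"d1": 0.9, "d2": 0.8, "d3": 0.1},
    {"d2": 0.7, "d3": 0.6, "d1": 0.2},
]
print(threshold_algorithm(lists, 1))
```

In this toy run the scan stops after two rounds of sorted access, before the tails of the lists are read; that early termination is the point of the method.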
Commercial Break
[Martin Theobald, Ralf Schenkel, GW: VLDB’05]
TopX demo
today 3:30 – 5:30
Principled Ranking by Probabilistic IR
"God does not play dice." (Einstein) IR does.
binary features, conditional independence of features [Robertson & Sparck-Jones 1976]
related to but different from statistical language models
odds for item d with terms d_i being relevant for query q = {q_1, …, q_m}:
$$ s(d,q) = \frac{P[d \in R(q) \mid \text{contents of } d]}{P[d \notin R(q) \mid \text{contents of } d]} = \frac{P[R \mid d]}{P[\bar{R} \mid d]} $$
$$ \sim \prod_{i=1}^{m} \frac{P[d_i \mid R]}{P[d_i \mid \bar{R}]} \sim \sum_{i \in q \cap d} \left( \log \frac{p_i}{1 - p_i} + \log \frac{1 - q_i}{q_i} \right) $$
with $p_i = P[d_i \mid R]$ and $q_i = P[d_i \mid \bar{R}]$
Relationship to tf*idf:
$$ \sim \sum_{i} \log \left( \frac{tf(i,d)}{\sum_k tf(k,d)} \cdot \frac{\sum_k df(k)}{df(i)} \right) $$
• led to Okapi BM25 (wins TREC tasks)
• adapted and extended to XML in TopX, ...
Now estimate p_i and q_i values from
• relevance feedback,
• pseudo-relevance feedback,
• corpus statistics
by MLE (with statistical smoothing)
and store precomputed p_i, q_i in index:
$\hat{p}_i = (\#\,\text{rel. docs}) \,/\, \#\,\text{docs}$
$\hat{p}_i = \frac{tf(i,d)}{\sum_k tf(k,d)}$
$q_i \approx P[d_i \mid \text{corpus}]$
$\hat{q}_i = \frac{df(i)}{\sum_k df(k)}$
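As a worked illustration of the log-odds sum, here is a minimal Python sketch of binary-independence scoring. The helper names are hypothetical, and two choices are assumptions of this sketch rather than anything stated on the slide: q_i is estimated from document frequencies with add-0.5 smoothing, and p_i defaults to 0.5 when no relevance feedback is available.

```python
import math

def estimate_q(df_i, n_docs):
    # Smoothed estimate of q_i = P[term i | non-relevant], using corpus
    # statistics as a stand-in for the non-relevant set (assumption).
    return (df_i + 0.5) / (n_docs + 1.0)

def term_weight(p_i, q_i):
    # log(p_i / (1 - p_i)) + log((1 - q_i) / q_i)
    return math.log(p_i / (1.0 - p_i)) + math.log((1.0 - q_i) / q_i)

def bim_score(doc_terms, query_terms, p, q):
    # Sum the term weights over the terms shared by query and document.
    return sum(term_weight(p[t], q[t]) for t in query_terms if t in doc_terms)

# Toy numbers for illustration only.
p = {"planck": 0.5, "physics": 0.5}      # no feedback: p_i = 0.5
q = {"planck": 0.1, "physics": 0.4}
doc = {"max", "planck", "quantum"}
print(bim_score(doc, ["planck", "physics"], p, q))
```

With p_i = 0.5 the first logarithm vanishes and the weight reduces to log((1 - q_i) / q_i), which behaves like an idf value: rare terms get large weights.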
Probabilistic Ranking for SQL
[S. Chaudhuri, G. Das, V. Hristidis, GW: TODS‘06]
SQL queries that return many answers need ranking
Examples:
• Houses (Id, City, Price, #Rooms, View, Pool, SchoolDistrict, …)
Select * From Houses Where View = ”Lake“ And City In (”Redmond“, ”Bellevue“)
• Movies (Id, Title, Genre, Country, Era, Format, Director, Actor1, Actor2, …)
Select * From Movies Where Genre = ”Romance“ And Era = ”90s“
odds for tuple d with attributes XY relevant for query q: X_1 = x_1 ∧ … ∧ X_m = x_m:
$$ s(d,q) = \frac{P[R \mid d]}{P[\bar{R} \mid d]} \sim \frac{P[d \mid R]}{P[d \mid \bar{R}]} = \frac{P[XY \mid R]}{P[XY \mid \bar{R}]} \sim \frac{P[Y \mid R]}{P[Y]} \cdot \frac{1}{P[X \mid Y]} $$
Estimate prob's, exploiting workload W: $P[Y \mid R] \approx P[Y \mid XW]$
Example: frequent queries
• … Where Genre = ”Romance“ And Actor1 = ”Hugh Grant“
• … Where Actor1 = ”Hugh Grant“ And Actor2 = ”Julia Roberts“
boosts HG and JR movies in ranking for Genre = ”Romance“ And Era = ”90s“
From Tables and Trees to Graphs
[BANKS, Discover, DBExplorer, KUPS, SphereSearch, BLINKS]
Schema-agnostic keyword search over multiple tables:
graph of tuples with foreign-key relationships as edges
Example:
Conferences (CId, Title, Location, Year)
Journals (JId, Title)
CPublications (PId, Title, CId)
JPublications (PId, Title, Vol, No, Year)
Authors (PId, Person)
Editors (CId, Person)
Select * From * Where * Contains ”Gray, DeWitt, XML, Performance“ And Year > 95
Related use cases:
Result is connected tree with nodes that
contain
• XML
beyond trees
as many query keywords as possible • RDF graphs
• ER graphs (e.g. from IE)
Ranking:
1
•
social
networks


s ( tree , q )    
nodeScore ( n , q )  ( 1   )   1  
edgeScore ( e ) 
nodes n
edges e


with nodeScore based on tf*idf or prob. IR
and edgeScore reflecting importance of relationships (or confidence, authority, etc.)
Top-k querying: compute best trees, e.g. Steiner trees (NP-hard)
Gerhard Weikum June 14, 2007
18/41
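A small Python sketch of ranking candidate answer trees follows. It assumes one particular reading of the ranking formula, namely a weighted sum of the total node score and the inverse of one plus the total edge score, so that edge scores act as costs and compact trees are preferred. The function name, the default alpha, and the candidate trees are hypothetical.

```python
def tree_score(node_scores, edge_scores, alpha=0.5):
    # alpha * sum of node scores + (1 - alpha) / (1 + sum of edge scores);
    # this form is an assumption of the sketch, not a confirmed formula.
    return alpha * sum(node_scores) + (1.0 - alpha) / (1.0 + sum(edge_scores))

# Two hypothetical candidate trees for the same keyword query:
# same keyword matches, but the second needs more connecting edges.
candidates = {
    "compact tree": ([0.5, 0.5], [1.0]),
    "sprawling tree": ([0.5, 0.5], [1.0, 1.0, 1.0]),
}
ranking = sorted(candidates,
                 key=lambda name: tree_score(*candidates[name]),
                 reverse=True)
print(ranking)
```

Enumerating good candidate trees is the hard part in practice (the Steiner tree problem is NP-hard); the sketch only covers the scoring of trees that have already been found.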
The Present: Observations & Opportunities
• Probabilistic IR and statistical language models
  yield principled ranking and high effectiveness
  (related to prob. relational models (Suciu, Getoor, …) but different)
• Structural similarity and ranking
  based on tree edit distance (FleXPath, Timber, …)
  [Figure: alternative tree structures over movie, actor, plot, director nodes]
• Aim for comprehensive XML ranking model
  capturing content, structure, ontologies
• Aim to generate structure skeleton
  in XPath query from user feedback:
  "life physicist Max Planck"  →
  //article[//person "Max Planck"]
     [//category "physicist"]
     //biography
• Good progress on performance
  but still many open efficiency issues
Outline
✓ Past : Matter, Antimatter, and Wormholes
✓ Present : XML and Graph IR
• Future : From Data to Knowledge
Knowledge Queries
Turn the Web, Web2.0, and Web3.0 into the world‘s
most comprehensive knowledge base („semantic DB“) !
Answer „knowledge queries“ such as:
proteins that inhibit both protease and some other enzyme
neutron stars with X-ray bursts > 10^40 erg s^-1 & black holes in 10''
differences in Rembetiko music from Greece and from Turkey
connection between Thomas Mann and Goethe
market impact of Web2.0 technology in December 2006
sympathy or antipathy for Germany from May to August 2006
Nobel laureate who survived both world wars and his children
drama with three women making a prophecy
to a British nobleman that he will become king
Three Roads to Knowledge
• Handcrafted High-Quality Knowledge Bases
(Semantic-Web-style ontologies, encyclopedias, etc.)
• Large-scale Information Extraction & Harvesting:
(using pattern matching, NLP, statistical learning, etc.
for product search, Web entity/object search, ...)
• Social Wisdom from Web 2.0 Communities
(social tagging, folksonomies, human computing,
e.g.: del.icio.us, flickr, answers.yahoo, iknow.baidu, ...)
High-Quality Knowledge Sources
• universal „common-sense“ ontologies:
• SUMO (Suggested Upper Merged Ontology): 60 000 OWL axioms
• Cyc: 5 Mio. facts (OpenCyc: 2 Mio. facts)
• domain-specific ontologies:
• UMLS (Unified Medical Language System): 1 Mio. biomedical concepts
135 categories, 54 relations (e.g. virus causes disease | symptom)
• GeneOntology, etc.
• thesauri and concept networks:
• WordNet: 200 000 concepts (word senses) and hypernym/hyponym relations
• can be cast into OWL-lite (or typed graph with statistical weights)
• lexical sources:
• Wikipedia (1.8 Mio. articles, 40 Mio. links, 100 languages) etc.
• hand-tagged natural-language corpora:
• TEI (Text Encoding Initiative) markup of historic encyclopedia
• FrameNet: sentences classified into frames with semantic roles
growing with strong momentum
High-Quality Knowledge Sources
General-purpose thesauri and concept networks: WordNet family
can be cast into
• OWL-lite or into
• graph, with weights for relation strengths
  (derived from co-occurrence statistics)
enzyme -- (any of several complex proteins that are produced by cells and
act as catalysts in specific biochemical reactions)
=> protein -- (any of a large group of nitrogenous organic compounds
that are essential constituents of living cells; ...)
=> macromolecule, supermolecule
...
=> organic compound -- (any compound of carbon
and another element or a radical)
...
=> catalyst, accelerator -- ((chemistry) a substance that initiates or
accelerates a chemical reaction
without itself being affected)
=> activator -- ((biology) any agency bringing about activation; ...)
High-Quality Knowledge Sources
Wikipedia and other lexical sources
Exploit Hand-Crafted Knowledge
Wikipedia, WordNet, and other lexical sources
{{Infobox_Scientist
| name = Max Planck
| birth_date = [[April 23]], [[1858]]
| birth_place = [[Kiel]], [[Germany]]
| death_date = [[October 4]], [[1947]]
| death_place = [[Göttingen]], [[Germany]]
| residence = [[Germany]]
| nationality = [[Germany|German]]
| field = [[Physicist]]
| work_institution = [[University of Kiel]]</br>
[[Humboldt-Universität zu Berlin]]</br>
[[Georg-August-Universität Göttingen]]
| alma_mater = [[Ludwig-Maximilians-Universität München]]
| doctoral_advisor = [[Philipp von Jolly]]
| doctoral_students =
[[Gustav Ludwig Hertz]]</br>
…
| known_for = [[Planck's constant]],
[[Quantum mechanics|quantum theory]]
| prizes = [[Nobel Prize in Physics]] (1918)
…
YAGO: Yet Another Great Ontology
[F. Suchanek, G. Kasneci, GW: WWW 2007]
• Turn Wikipedia into explicit knowledge base (semantic DB)
• Exploit hand-crafted categories and templates
• Represent facts as explicit knowledge triples:
relation (entity1, entity2)
entity1 --relation--> entity2
(in 1st-order logic, compatible with RDF, OWL-lite, XML, etc.)
• Map (and disambiguate) relations into WordNet concept DAG
Examples:
Max_Planck --bornIn--> Kiel
Kiel --isInstanceOf--> City
YAGO Knowledge Representation
Knowledge Base    # Facts
KnowItAll         30 000
SUMO              60 000
WordNet           200 000
OpenCyc           300 000
Cyc               5 000 000
YAGO              6 000 000
Accuracy: 97%
[Figure: YAGO knowledge graph in three layers.
 concepts: Entity with subclasses Person and Location; Scientist subclass of Person; Biologist and Physicist subclasses of Scientist; City and Country subclasses of Location.
 individuals: Max_Planck instanceOf Physicist; Kiel instanceOf City; Max_Planck bornIn Kiel, bornOn April 23, 1858, diedOn October 4, 1947, hasWon Nobel Prize, FatherOf Erwin_Planck.
 words: "Max Planck", "Max Karl Ernst Ludwig Planck", "Dr. Planck" means Max_Planck.]
Online access and download at http://www.mpi-inf.mpg.de/~suchanek/yago/
NAGA: Graph IR on YAGO [G. Kasneci et al.: WWW‘07]
Graph-based search on YAGO-style knowledge bases
with built-in ranking based on confidence and informativeness
→ statistical language model for result graphs
conjunctive queries:
  $x bornIn Kiel
  $x isa scientist
queries with regular expressions:
  $x hasFirstName | hasLastName Ling
  $x isa scientist
  $x (coAuthor | advisor)* Beng Chin Ooi
  $x worksFor $y
  $y locatedIn* Zhejiang
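To make the conjunctive query form concrete, here is a minimal Python sketch that evaluates triple patterns with variables over a small fact set. It is not NAGA's query processor: it covers only exact matching (no regular expressions over relations, no ranking), the function names are hypothetical, and apart from the Max Planck facts taken from the slides the triples are made up for illustration.

```python
def match(pattern, triple, binding):
    # Try to extend `binding` so that `pattern` matches `triple`.
    # Terms starting with "$" are variables; everything else is a constant.
    new_binding = dict(binding)
    for p, t in zip(pattern, triple):
        if p.startswith("$"):
            if new_binding.setdefault(p, t) != t:
                return None   # variable already bound to something else
        elif p != t:
            return None       # constant mismatch
    return new_binding

def conjunctive_query(triples, patterns):
    # Join the patterns one by one, carrying variable bindings along.
    bindings = [{}]
    for pattern in patterns:
        extended = []
        for b in bindings:
            for t in triples:
                nb = match(pattern, t, b)
                if nb is not None:
                    extended.append(nb)
        bindings = extended
    return bindings

triples = [
    ("Max_Planck", "bornIn", "Kiel"),
    ("Max_Planck", "isa", "scientist"),
    ("Some_Painter", "bornIn", "Kiel"),   # hypothetical entity
    ("Some_Painter", "isa", "artist"),    # hypothetical fact
]
query = [("$x", "bornIn", "Kiel"), ("$x", "isa", "scientist")]
print(conjunctive_query(triples, query))
```

A ranking step along the lines of the confidence and informativeness factors would then order the resulting bindings; that part is left out here.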
Ranking Factors
Confidence:
Prefer results that are likely to be correct
• Certainty of IE
• Authenticity and Authority of Sources
Example:
  bornIn (Max Planck, Kiel) from "Max Planck was born in Kiel" (Wikipedia)
  livesIn (Elvis Presley, Mars) from "They believe Elvis hides on Mars" (Martian Bloggeria)
Informativeness:
Prefer results that are likely important
May prefer results that are likely new to user
• Frequency in answer
• Frequency in corpus (e.g. Web)
• Frequency in query log
Example:
  q: isa (Einstein, $y)
     isa (Einstein, scientist)
     isa (Einstein, vegetarian)
  q: isa ($x, vegetarian)
     isa (Einstein, vegetarian)
     isa (Al Nobody, vegetarian)
Compactness:
Prefer results that are tightly connected
• Size of answer graph
[Figure: answer graph over Einstein, Tom Cruise, Bohr, vegetarian, Nobel Prize, 1962 with edges isa, won, bornIn, diedIn]
Information Extraction (IE): Text to Records
Person            BirthDate     BirthPlace   ...
Max Planck        4/23, 1858    Kiel
Albert Einstein   3/14, 1879    Ulm
Mahatma Gandhi    10/2, 1869    Porbandar

Person            ScientificResult
Max Planck        Quantum Theory

Constant            Value             Dimension
Planck's constant   6.626 × 10^-34    Js

Person            Collaborator
Max Planck        Albert Einstein
Max Planck        Niels Bohr
combine NLP, pattern matching, lexicons, statistical learning
Knowledge Acquisition from the Web
Learn Semantic Relations from Entire Corpora at Large Scale
(as exhaustively as possible but with high accuracy)
Examples:
• all cities, all basketball players, all composers
• headquarters of companies, CEOs of companies, synonyms of proteins
• birthdates of people, capitals of countries, rivers in cities
• which musician plays which instruments
• who discovered or invented what
• which enzyme catalyzes which biochemical reaction
Existing approaches and tools
(Snowball [Gravano et al. 2000], KnowItAll [Etzioni et al. 2004], …):
almost-unsupervised pattern matching and learning:
seeds (known facts) → patterns (in text) → (extraction) rule → (new) facts
Methods for Web-Scale Fact Extraction
seeds → text → rules → new facts
Example:
seeds                     text                         rules
city (Seattle)            in downtown Seattle          in downtown X
city (Seattle)            Seattle and other towns      X and other towns
city (Las Vegas)          Las Vegas and other towns    X and other towns
plays (Zappa, guitar)     playing guitar: … Zappa      playing Y: … X
plays (Davis, trumpet)    Davis … blows trumpet        X … blows Y
new facts and new rules:
in downtown Beijing → city(Beijing) → old center of Beijing → old center of X
Coltrane blows sax → plays(Coltrane, sax) → sax player Coltrane → Y player X
Assessment of facts & generation of rules based on statistics
Rules can be more sophisticated:
playing NN: (ADJ|ADV)* NP & class(NN)=instrument & class(head(NP))=person
→ plays(head(NP), NN)
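The seeds, text, rules, new facts loop can be sketched as follows in Python for the unary city example. This is a toy illustration, not Snowball or KnowItAll: snippets are matched as whole strings, the `<X>` placeholder and function names are inventions of this sketch, and the statistical assessment of facts and rules that real systems depend on is left out entirely.

```python
import re

PLACEHOLDER = "<X>"

def induce_patterns(seeds, snippets):
    # Turn every snippet that mentions a known entity into a pattern
    # by replacing the entity with a placeholder.
    return {s.replace(e, PLACEHOLDER)
            for e in seeds for s in snippets if e in s}

def apply_patterns(patterns, snippets):
    # Match each pattern against whole snippets; the text in the
    # placeholder position becomes a candidate entity.
    found = set()
    for p in patterns:
        prefix, suffix = p.split(PLACEHOLDER, 1)
        regex = re.compile(re.escape(prefix) + r"(.+)" + re.escape(suffix))
        for s in snippets:
            m = regex.fullmatch(s)
            if m:
                found.add(m.group(1))
    return found

snippets = [
    "in downtown Seattle",
    "Seattle and other towns",
    "Las Vegas and other towns",
    "in downtown Beijing",
    "old center of Beijing",
]
cities = {"Seattle"}                                  # seed fact: city(Seattle)
patterns = induce_patterns(cities, snippets)          # round 1 rules
cities |= apply_patterns(patterns, snippets)          # round 1 facts
more_patterns = induce_patterns(cities, snippets)     # round 2 rules
print(sorted(cities))
print(sorted(more_patterns))
```

The second round picks up "old center of <X>" because Beijing was learned in the first round. Without a confidence filter, one bad pattern would pollute every later round, which is why fact and rule assessment matters.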
Performance of Web-IE
State-of-the-art precision/recall results:
relation       precision   recall   corpus      systems
countries      80%         90%      Web         KnowItAll
cities         80%         ???      Web         KnowItAll
scientists     60%         ???      Web         KnowItAll
headquarters   90%         50%      News        Snowball, LEILA
birthdates     80%         70%      Wikipedia   LEILA
instanceOf     40%         20%      Web         Text2Onto, LEILA
Open IE        80%         ???      Web         TextRunner
precision value-chain: entities 80%, attributes 70%, facts 60%, events 50%
Anecdotal evidence:
invented (A.G. Bell, telephone)
married (Hillary Clinton, Bill Clinton)
isa (yoga, relaxation technique)
isa (zearalenone, mycotoxin)
contains (chocolate, theobromine)
contains (Singapore sling, gin)
invented (Johannes Kepler, logarithm tables)
married (Segolene Royal, Francois Hollande)
isa (yoga, excellent way)
isa (your day, good one)
contains (chocolate, raisins)
plays (the liver, central role)
makes (everybody, mistakes)
Beyond Surface Learning with LEILA
Learning to Extract Information by Linguistic Analysis [F.Suchanek, G.Ifrim, GW: KDD‘06]
Limitation of surface patterns:
who discovered or invented what
“Tesla’s work formed the basis of AC electric power”
“Al Gore funded more work for a better basis of the Internet”
Almost-unsupervised Statistical Learning with Dependency Parsing
(Cologne, Rhine), (Cairo, Nile), …
(Cairo, Rhine), (Rome, 0911), (, [0..9]*), …
Cologne lies on the banks of the Rhine
People in Cairo like wine from the Rhine valley
Paris was founded on an island in the Seine → (Paris, Seine)
[Figure: dependency parses of the three sentences with constituent labels (NP, VP, PP) and link labels (Ss, Sp, Pv, MVp, Mp, Js, Jp, Ds, Dg, DMc, Os, AN)]
LEILA outperforms other Web-IE methods
in terms of precision, recall, F1, but:
• dependency parser is slow
• one relation at a time
IE Efficiency and Accuracy Tradeoffs
[see also tutorials by Cohen, Doan/Ramakrishnan/Vaithyanathan, Agichtein/Sarawagi]
IE is cool, but what‘s in it for DB folks?
• precision vs. recall: two-stage processing (filter pipeline)
  1) recall-oriented harvesting
  2) precision-oriented scrutinizing
• preprocessing
  • indexing: NLP trees & graphs, N-grams, PoS-tag patterns ?
  • exploit ontologies? exploit usage logs ?
• turn crawl&extract into set-oriented query processing
  • candidate finding
  • efficient phrase, pattern, and proximity queries
  • optimizing entire text-mining workflows [Ipeirotis et al.: SIGMOD'06]
The Future: Challenges
• Generalize YAGO approach (Wikipedia + WordNet)
• Methods for comprehensive, highly accurate
mappings across many knowledge sources
• cross-lingual, cross-temporal
• scalable in size, diversity, number of sources
• Pursue DB support towards efficient IE (and NLP)
• Achieve Web-scale IE throughput that can
• sustain rate of new content production (e.g. blogs)
• with > 90% accuracy and Wikipedia-like coverage
• Integrate handcrafted knowledge with NLP/ML-based IE
• Incorporate social tagging and human computing
Outline
✓ Past : Matter, Antimatter, and Wormholes
✓ Present : XML and Graph IR
✓ Future : From Data to Knowledge
Major Trends in DB and IR
Database Systems
Information Retrieval
malleable schema (later)
record linkage
deep NLP, adding structure
info extraction
graph mining
entity-relationship graph IR
dataspaces
ontologies
Web objects
statistical language models
data uncertainty
ranking
programmability search as Web Service
Web 2.0
Web 2.0
Conclusion
• DB&IR integration agenda:
• models − ranking, ontologies, prob. SQL ?, graph IR ?
• languages and APIs − XQuery Full-Text++ ?
• systems − drop SQL, go light-weight ?
− combine with P2P, Deep Web, ... ?
• Rethink progress measures and experimental methodology
• Address killer app(s) and grand challenge(s):
• from data to knowledge (Web, products, enterprises)
• integrate knowledge bases, info extraction, social wisdom
• cope with uncertainty; ranking as first-class principle
• Bridge cultural differences between DB and IR:
• co-locate SIGIR and SIGMOD
DB&IR: Both Sides Now
Joni Mitchell (1969): Both Sides Now
…
I've looked at life from both sides now,
From up and down, and still somehow
It's life's illusions I recall.
I really don't know life at all.
Thank You !