Harvesting and Organizing
Knowledge from the Web
Gerhard Weikum
[email protected]
http://www.mpi-inf.mpg.de/~weikum/
In collaboration with Giorgiana Ifrim, Gjergji Kasneci,
Josiane Parreira, Maya Ramanath, Ralf Schenkel,
Fabian Suchanek, Martin Theobald
Vision
Opportunity:
Turn the Web (and Web 2.0 and Web 3.0 ...) into the
world‘s most comprehensive knowledge base (semantic DB)
Challenge: seize opportunity and make it happen!
Approach: combine and exploit synergies of
• hand-crafted, high-quality knowledge sources
 Semantic Web
• automatic knowledge extraction
 Statistical Web
• social networks and human computing
 Social Web
Gerhard Weikum ADBIS Oct 1, 2007
2/53
Proof of Relevance
Vannevar Bush:
As We May Think,
1945.
There is a growing mountain of research.
…
A memex is a device in which an individual stores all his books,
records, and communications, and which is mechanized so that
it may be consulted with exceeding speed and flexibility.
It is an enlarged intimate supplement to his memory.
Gerhard Weikum ADBIS Oct 1, 2007
3/53
Proof of Relevance
Tim Berners-Lee: In the Semantic Web information is
given well-defined meaning.
Jim Gray:
… system can answer questions
about the text as precisely and quickly
as a human expert.
Brewster Kahle: The goal of universal access to
our cultural heritage is within our grasp.
Jimmy Wales:
Our big-picture vision is to share
knowledge with all of humanity.
Al Gore:
The future will be better tomorrow.
Gerhard Weikum ADBIS Oct 1, 2007
4/53
Proof of Relevance ?
To know that we know what we know,
and that we do not know what we do not know,
that is true knowledge.
You cannot open a book without learning something.
A journey of a thousand miles begins with a single step.
Confucius, 551-479 BC
Ignorance is less remote from the truth
than prejudice.
When science, art, literature, and philosophy are simply the
manifestation of personality,
they can make a man's name live for thousands of years.
Sentences are like sharp nails, which force truth upon our memories.
Denis Diderot, 1713-1784
Gerhard Weikum ADBIS Oct 1, 2007
5/53
Why Google and Wikipedia Are Not Enough
Turn the Web, Web2.0, and Web3.0 into the world‘s
most comprehensive knowledge base („semantic DB/graph“) !
Answer „knowledge queries“ such as:
proteins that inhibit both protease and some other enzyme
neutron stars with Xray bursts > 1040 erg s-1 & black holes in 10‘‘
differences in Rembetiko music from Greece and from Turkey
connection between Thomas Mann and Goethe
market impact of Web2.0 technology in December 2006
sympathy or antipathy for Germany from May to August 2006
Nobel laureate who survived both world wars and his children
drama with three women making a prophecy
to a British nobleman that he will become king
Gerhard Weikum ADBIS Oct 1, 2007
6/53
Outline
 Introduction: Search for Knowledge
• Harvesting Knowledge
• Leibniz Approach
• Planck Approach
• Darwin Approach
• Conclusion
Gerhard Weikum ADBIS Oct 1, 2007
7/53
Three Roads to Knowledge
Leibniz Approach:
Handcrafted High-Quality Knowledge Sources
(Semantic Web)
Planck Approach:
Large-scale Information Extraction & Harvesting
(Statistical Web)
Darwin Approach:
Social Wisdom from Web 2.0 Communities
(Social Web)
Gerhard Weikum ADBIS Oct 1, 2007
8/53
Leibniz Approach (Semantic Web)
Gottfried Wilhelm Leibniz
(1646 - 1716)
Handcrafted High-Quality Knowledge:
• Ontologies and other Lexical Sources
• Build on Rigorous Knowledge Atoms
(„Characteristica Universalis“)
Gerhard Weikum ADBIS Oct 1, 2007
9/53
High-Quality Knowledge Sources
General-purpose ontologies for Semantic Web: SUMO, Cyc, etc.
Gerhard Weikum ADBIS Oct 1, 2007
10/53
High-Quality Knowledge Sources
General-purpose thesauri and concept networks: WordNet family
woman, adult female – (an adult female person)
=> amazon, virago – (a large strong and aggressive woman)
=> donna -- (an Italian woman of rank)
=> geisha, geisha girl -- (...)
=> lady (a polite name for any woman)
...
=> wife – (a married woman, a man‘s partner in marriage)
=> witch – (a being, usually female, imagined to
have special powers derived from the devil)
Gerhard Weikum ADBIS Oct 1, 2007
11/53
High-Quality Knowledge Sources
General-purpose thesauri and concept networks: WordNet family
200 000 concepts and relations;
can be cast into
• description logics or
• graph,proteins
with weights
relation
enzyme -- (any of several complex
that arefor
produced
bystrengths
cells and
from co-occurrence
act as catalysts in(derived
specific biochemical
reactions)statistics)
=> protein -- (any of a large group of nitrogenous organic compounds
that are essential constituents of living cells; ...)
=> macromolecule, supermolecule
...
=> organic compound -- (any compound of carbon
and another element or a radical)
...
=> catalyst, accelerator -- ((chemistry) a substance that initiates or
accelerates a chemical reaction
without itself being affected)
=> activator -- ((biology) any agency bringing about activation; ...)
Gerhard Weikum ADBIS Oct 1, 2007
12/53
High-Quality Knowledge Sources
Domain ontologies (UMLS, GeneOntology, etc.)
• 1 Mio. biomedical concepts, 135 categories,
• 54 relationships (e.g. virus causes (disease | symptom) )
Gerhard Weikum ADBIS Oct 1, 2007
13/53
High-Quality Knowledge Sources
Wikipedia and other lexical sources
• 2 Mio. articles
• 40 Mio. hyperlinks
• many 1000‘s of categories and lists
• more than 100 languages
• growing very fast
Gerhard Weikum ADBIS Oct 1, 2007
14/53
Exploit Hand-Crafted Knowledge
Wikipedia, WordNet, and other lexical sources
{{Infobox_Scientist
| name = Max Planck
| birth_date = [[April 23]], [[1858]]
| birth_place = [[Kiel]], [[Germany]]
| death_date = [[October 4]], [[1947]]
| death_place = [[Göttingen]], [[Germany]]
| residence = [[Germany]]
| nationality = [[Germany|German]]
| field = [[Physicist]]
| work_institution = [[University of Kiel]]</br>
[[Humboldt-Universität zu Berlin]]</br>
[[Georg-August-Universität Göttingen]]
| alma_mater = [[Ludwig-Maximilians-Universität München]]
| doctoral_advisor = [[Philipp von Jolly]]
| doctoral_students =
[[Gustav Ludwig Hertz]]</br>
…
| known_for = [[Planck's constant]],
[[Quantum mechanics|quantum theory]]
| prizes = [[Nobel Prize in Physics]] (1918)
…
Gerhard Weikum ADBIS Oct 1, 2007
15/53
YAGO: Yet Another Great Ontology
[F. Suchanek, G. Kasneci, G. Weikum: WWW‘07]
• Turn Wikipedia into explicit knowledge base (semantic DB)
• Exploit hand-crafted categories and templates
• Represent facts as explicit knowledge triples:
relation (entity1, entity2)
entity1
relation
entity2
(in FOL, compatible with RDF, OWL-lite, XML, etc.)
• Map (and disambiguate) relations into WordNet concept DAG
Examples:
Max_Planck
bornIn
Kiel
Kiel
Gerhard Weikum ADBIS Oct 1, 2007
isInstanceOf
City
16/53
YAGO Knowledge Representation
Knowledge Base
# Facts
subclass
KnowItAll
30
000
SUMO
60 000
WordNet
200 000
Person
OpenCyc
300 000
subclass
Cyc
5 000 000
Scientist
YAGOsubclass
6 000 000
Accuracy  97%
Entity
subclass
subclass
subclass
Biologist
concepts
Location
subclass
City
Country
Physicist
instanceOf
instanceOf
Erwin_Planck
Nobel Prize
hasWon
October 4, 1947
bornIn Kiel
FatherOf
diedOn
individuals
Max_Planck
bornOn
means
means
“Max Planck”
“Max Karl Ernst
Ludwig Planck”
April 23, 1858
means
“Dr.
Planck”
words
Online access and download at http://www.mpi-inf.mpg.de/~suchanek/yago/
Gerhard Weikum ADBIS Oct 1, 2007
17/53
YAGO Disambiguation & Uncertainty
capture
confidence
value
for each fact
Entity
subclass
1.0 subclass 1.0
Person
subclass
Celebrity
0.7
subclass 1.0
subclass 1.0
Mythological
Figure
City
instanceOf
0.4
Location
0.8
instanceOf
subclass 1.0
Country
0.9
instanceOf
1.0
instanceOf
Paris
Hilton
Paris(Myth.)
Paris(France)
means
locatedIn
France
0.95
0.7
means 0.1
means 0.05
“Paris”
means 0.2 means 0.9
“La Grande
Nation”
“France”
additional harvesting of relations from natural-language texts
by info-extraction tools
Gerhard Weikum ADBIS Oct 1, 2007
18/53
NAGA: Graph IR on YAGO [G. Kasneci et al.: WWW‘07]
Graph-based search on YAGO-style knowledge bases
with built-in ranking based on statistical language model
discovery queries
Kiel
bornIn $x
isa
scientist
hasWon Nobel
prize
hasSon
diedOn $x
$a
>
$b
diedOn $y
connectedness queries
German
novelist
isa
*
Thomas Mann
Goethe
queries with regular expressions
Ling
hasFirstName | hasLastName
(coAuthor
| advisor)*
Beng Chin Ooi
$x
isa
scientist
worksFor
$y
locatedIn*
Gerhard Weikum ADBIS Oct 1, 2007
Zhejiang
19/53
NAGA: Searching Knowledge
q: Fisher isa scientist
Fisher isa $x
[email protected] = Ronald_Fisher
[email protected] = scientist_109871938
$X = alumnus_109165182
[email protected] = Irving_Fisher
[email protected] = scientist_109871938
mathematician_109635652 —subClassOf—> scientist_109871938
$X = social_scientist_109927304
Alumni_of_Gonville_and_Caius_College,[email protected]
—subClassOf—>
alumnus_10916
= James_Fisher
"Fisher" —familyNameOf—> Ronald_Fisher
[email protected] = scientist_10981938
Ronald_Fisher —type—> Alumni_of_Gonville_and_Caius_College,_Cambridge
$X = ornithologist_109711173
Ronald_Fisher —type—> 20th_century_mathematicians
[email protected] = Ronald_Fisher
"scientist" —means—> scientist_109871938
[email protected] = scientist_109871938
$X = theorist_110008610
[email protected] = Ronald_Fisher
[email protected] = scientist_109871938
$X = colleague_109301221
[email protected] = Ronald_Fisher
[email protected] = scientist_109871938
$X = organism_100003226
…
Gerhard Weikum ADBIS Oct 1, 2007
20/53
NAGA: Searching & Ranking Knowledge
q: Fisher isa scientist
Fisher isa $x
[email protected] = Ronald_Fisher
[email protected] = scientist_109871938
$X = mathematician_109635652
[email protected] = Ronald_Fisher
[email protected] = scientist_109871938
$X = statistician_109958989
[email protected] = Ronald_Fisher
[email protected]
= scientist_109871938
Score: 7.184462521168058E-13 mathematician_109635652
—subClassOf—>
scientist
$X
=
president_109787431
"Fisher" —familyNameOf—> Ronald_Fisher
Ronald_Fisher —type—> 20th_century_mathematicians
[email protected] = Ronald_Fisher
"scientist" —means—> scientist_109871938
[email protected] = scientist_109871938
20th_century_mathematicians —subClassOf—> mathematician_109635652
$X = geneticist_109475749
[email protected] = Ronald_Fisher
[email protected] = scientist_109871938
$X = scientist_109871938
…
Online access at http://www.mpi-inf.mpg.de/~kasneci/naga/
Gerhard Weikum ADBIS Oct 1, 2007
21/53
Ranking Factors
bornIn (Max Planck, Kiel) from
„Max Planck was born in Kiel“
(Wikipedia)
Confidence:
Prefer results that are likely to be correct
 Certainty of IE
 Authenticity and Authority
livesIn (Elvis Presley, Mars) from
„They believe Elvis hides on Mars“
of Sources
(Martian Bloggeria)
Informativeness:
q: isa (Einstein, $y)
Prefer results that are likely important
May prefer results that are likely new to user
 Frequency in answer
 Frequency in corpus (e.g. Web)
 Frequency in query log
Compactness:
Prefer results that are
tightly connected
 Size of answer graph
isa
isa (Einstein, scientist)
isa (Einstein, vegetarian)
q: isa ($x, vegetarian)
isa (Einstein, vegetarian)
isa (Al Nobody, vegetarian)
vegetarian
Tom
isa Cruise
Einstein
won
Nobel Prize
Gerhard Weikum ADBIS Oct 1, 2007
bornIn
1962
won
Bohr
diedIn
22/53
Summary of Leibniz Approach
Hand-crafted knowledge sources are great assets,
but expensive, partial, and isolated
Great mileage even from informal & semiformal sources
Connecting & reconciling different sources gives added value
(and sometimes is not even that hard)
Challenge:
• Develop methods for comprehensive, highly accurate
mappings across many knowledge sources
• Cross-lingual
• Cross-temporal
• Scalable
Gerhard Weikum ADBIS Oct 1, 2007
23/53
Planck Approach (Statistical Web)
Max Planck
(1858 - 1947)
Information Extraction & Harvesting:
• Gather Entities, Relations, Facts
• Live with Uncertainty
Gerhard Weikum ADBIS Oct 1, 2007
24/53
Information Extraction (IE): Text to Records
Person
BirthDate
Max Planck
4/23, 1858
Albert Einstein 3/14, 1879
Mahatma Gandhi 10/2, 1869
BirthPlace ...
Kiel
Ulm
Porbandar
Person
ScientificResult
Max Planck Quantum Theory
Constant
Value
Dimension
Planck‘s constant 6.2261023 Js
Person
Collaborator
Max Planck Albert Einstein
Max Planck Niels Bohr
combine NLP, pattern matching, lexicons, statistical learning
Gerhard Weikum ADBIS Oct 1, 2007
25/53
IE Technology: Rules, Patterns, Learning
For natural-language text and for heterogeneous sources:
• NLP techniques (parser, PoS tagging) for tokenization
• identify patterns (e.g. regular expressions) as features
• train statistical learners for segmentation and labeling
• use learned model to automatically tag newly seen input
Training data:
<location>
The WWW conference takes place in Banff in Canada.
<organization>
Today‘s keynote speaker is Dr. Berners-Lee from W3C.
<person>
The panel in Edinburgh, chaired by Ron Brachman from Yahoo!, …
<event>
…
<lecture>
NP NP
NN IN DT NP VB IN DT ADJ
NN
PP
NP
IN
CD
Ian Foster, father of the Grid, talks at the GES conference in Germany on 05/02/07.
<person>
<event>
Gerhard Weikum ADBIS Oct 1, 2007
<location>
<date>
26/53
Knowledge Acquisition from the Web
Learn Semantic Relations from Entire Corpora at Large Scale
(as exhaustively as possible but with high accuracy)
Examples:
• all cities, all basketball players, all composers
• headquarters of companies, CEOs of companies, synonyms of proteins
• birthdates of people, capitals of countries, rivers in cities
• which musician plays which instruments
• who discovered or invented what
• which enzyme catalyzes which biochemical reaction
Existing approaches and tools use
almost-unsupervised pattern matching and learning:
seeds (known facts)  patterns (in text)  (extraction) rule  (new) facts
Gerhard Weikum ADBIS Oct 1, 2007
27/53
Methods for Web-Scale Fact Extration
seeds

text

rules

new facts
Example:
Example:
city
in
in
city (Seattle)
(Seattle)
in downtown
downtown Seattle
Seattle
in downtown
downtown X
X
city
Seattle
X
city (Seattle)
(Seattle)
Seattle and
and other
other towns
towns
X and
and other
other towns
towns
city
Las VegasLas
andVegas
otherand
towns
and other
towns
city (Las
(Las Vegas)
Vegas)
otherXtowns
X and
other towns
plays
plays (Zappa,
(Zappa, guitar)
guitar) playing
playing guitar:
guitar: …
… Zappa
Zappa playing
playing Y:
Y: …
…X
X
plays
(Davis, trumpet)
trumpet) Davis
Davis …
… blows
blows trumpet
trumpet
X
… blows
Y
plays (Davis,
X…
blows Y
in downtown Delhi
Coltrane blows sax
city(Delhi)
old center of Delhi
plays(Coltrane, sax) sax player Coltrane
city(Delhi)
plays(C., sax)
old center of X
Y player X
Assessment of facts & generation of rules based on statistics
Rules can be more sophisticated:
playing NN: (ADJ|ADV)* NP & class(head(NP))=person
 plays (head(NP), NN)
Gerhard Weikum ADBIS Oct 1, 2007
28/53
Performance of Web-IE
State-of-the-art precision/recall results:
relation
precision recall
corpus
systems
countries
cities
scientists
CEOs
birthdates
instanceOf
80%
80%
60%
80%
80%
40%
Web
Web
Web
News
Wikipedia
Web
KnowItAll
KnowItAll
KnowItAll
Snowball, LEILA
LEILA
Text2Onto, LEILA
90%
???
???
50%
70%
20%
precision value-chain: entities 80%, attributes 70%, facts 60%, events 50%
Anecdotic evidence:
invented (A.G. Bell, telephone)
married (Hillary Clinton, Bill Clinton)
isa (yoga, relaxation technique)
isa (zearalenone, mycotoxin)
contains (chocolate, theobromine)
contains (Singapore sling, gin)
invented (Johannes Kepler, logarithm tables)
married (Segolene Royal, Francois Hollande)
isa (yoga, excellent way)
isa (your day, good one)
contains (chocolate, raisins)
plays (the liver, central role)
makes (everybody, mistakes)
Gerhard Weikum ADBIS Oct 1, 2007
29/53
Beyond Surface Learning with LEILA
Learning to Extract Information by Linguistic Analysis [F. Suchanek et al.: KDD’06]
Limitation of surface patterns:
who discovered or invented what
“Tesla’s work formed the basis of AC electric power”
“Al Gore funded more work for a better basis of the Internet”
Almost-unsupervised Statistical Learning with Dependency Parsing
(Cologne, Rhine), (Cairo, Nile), …
(Cairo, Rhine), (Rome, 0911), (, [0..9]*), …
NP
PP
NP VP
NPother
PP Web-IE
NP
NP
NP
LEILA
outperforms
Cologne lies on the banks of the Rhine People
in Cairo
like wine from
the Rhine valley
methods
in precision
and recall,
but dependency
parser is slow AN
Mp Js
Os
Ss
MVp
DMc
Mp
Dg
NP
VP
PP
NP
NP
PP NP
Jp
NP
Js
Sp
Mvp
Ds
Js
NP
VP
VP
PP
NP
NP
PP NP
NP
Paris was founded on an island in the Seine
Ss
Pv
MVp
Ds
DG
Js
(Paris, Seine)
Js
MVp
We visited Paris last summer. It has many museums along the banks of the Seine.
Gerhard Weikum ADBIS Oct 1, 2007
30/53
IE Efficiency and Accuracy Tradeoffs
[see also tutorials by Cohen, Doan/Ramakrishnan/Vaithyanathan, Agichtein/Sarawagi]
IE is cool, but what‘s in it for DB folks?
•
•
•
precision vs. recall: two-stage processing (filter pipeline)
1) recall-oriented harvesting
2) precision-oriented scrutinizing
preprocessing
• indexing: NLP trees & graphs, N-grams, PoS-tag patterns ?
• exploit ontologies? exploit usage logs ?
turn crawl&extract into set-oriented query processing
• candidate finding
• efficient phrase, pattern, and proximity queries
• optimizing entire text-mining workflows [Ipeirotis et al.: SIGMOD‘06]
Gerhard Weikum ADBIS Oct 1, 2007
31/53
Summary of Planck Approach
Human text (and speech) is diverse and produced at
higher rate than manual high-quality annotations
IE offers reasonably robust and scalable methods for
harvesting named entities and binary relations
? Deep NLP and advanced ML are computational bottleneck
? Disambiguation (entity matching, record linkage) needed
„Joe Hellerstein (UC Berkeley)“ = „Prof. Joseph M. Hellerstein, California)“
„Max Planck Institute“ = „MPI“ ≠ „MPI“ = „Message Passing Institute“
Challenge:
Achieve Web-scale IE throughput that can
• sustain rate of new content production (e.g. blogs)
(may need large-scale P2P/Grid)
• with > 90% accuracy and Wikipedia-like coverage
Gerhard Weikum ADBIS Oct 1, 2007
32/53
Darwin Approach (Social Web)
Charles Darwin
(1809 - 1882)
Social Wisdom & Natural Selection:
• Evolution of (Web 2.0) species
• Survival of the fittest
Gerhard Weikum ADBIS Oct 1, 2007
33/53
„Wisdom of Crowds“ at Work on Web 2.0
Information enrichment & knowledge extraction by humans:
• Collaborative Recommendations & QA
• Amazon (product ratings & reviews, recommended products)
• Netflix: movie DVD rentals  $ 1 Mio. Challenge
• answers.yahoo, iknow.baidu, etc.
• Social Tagging and Folksonomies
• del.icio.us: Web bookmarks and tags
• flickr: photo annotation, categorization, rating
• YouTube: same for video
• Human Computing in Game Form
• ESP and Google Image Labeler: image tagging
• Peekaboom: image segmenting and tagging
• Verbosity: facts from natural-language sentences
• Online Communities
• dblife.cs.wisc.edu for database research
• www.lt-world.org for language technology
• Yahoo! Groups, Myspace, Facebook, etc. etc.
Gerhard Weikum ADBIS Oct 1, 2007
34/53
Social Tagging: Example Flickr (1)
Gerhard Weikum ADBIS Oct 1, 2007
35/53
Social Tagging: Example Flickr (2)
Gerhard Weikum ADBIS Oct 1, 2007
36/53
Social Tagging: Example Flickr (3)
Gerhard Weikum ADBIS Oct 1, 2007
37/53
Social-Tagging Community
> 1 Mio. users
> 100 Mio. photos
> 1 Bio. tags
30% monthly growth
Gerhard Weikum ADBIS Oct 1, 2007
38/53
Source: www.flickr.com
ESP Game
[Luis von Ahn et al. 2004 ]
played against random, anonymous partner on Internet
taboo:
pyramid
Louvre
museum
Paris
art
• Game with a purpose
• Collects annotations (wisdom)
• Can exploit tag statistics (crowds)
• Attracts people, fun to play, some play hours
your
partner
suggested:
myfrom
labels:
• ESP
gamehas
collected
> 10 Mio. tags
> 20000 users
3
7
11
17
labels
labelspeople could tag all photos on
reflection
• 5000
the Web in 4 weeks
water
(human computing)
Congratulations!
You scored 1 point!
Mitterand
Mona Lisa
metro lignes 7, 1
14
Da Vinci code
Gerhard Weikum ADBIS Oct 1, 2007
39/53
More Human Computing
Verbosity [von Ahn 2006]:
• Collect common-knowledge facts (relation instances)
• 2 players: Narrator (N) and Guessor (G)
N gives stylized clues:
is a kind of …, is used for …, is typically near/in/on …, is the opposite of …
• random pairing for independence,
can build statistics over many games for same concept
Peekaboom, Phetch, etc.:
locating & tagging objects in images, finding images, etc.
• incentives to play ?
• game design for moving up the value-chain ?
Gerhard Weikum ADBIS Oct 1, 2007
40/53
Dark Side of Social Wisdom
• Spam (Web spam – not just for email anymore):
lucky online casino, easy MBA diploma, cheap V!-4-gra, etc.;
law suits about „appropriate Google rank“
• Truthiness:
degree to which something is truthy (not necessarily facty);
truthy := property of something you know from your guts
• Disputes:
editorial fights over critical Wikipedia articles;
Citizendium: new endeavor with "gentle expert oversight"
• Dishonesty, Bias, …
Gerhard Weikum ADBIS Oct 1, 2007
41/53
Gerhard Weikum ADBIS Oct 1, 2007
42/53
Gerhard Weikum ADBIS Oct 1, 2007
43/53
The Wisdom of Crowds: PageRank
PageRank (PR): links are endorsements & increase page authority,
authority is higher if links come from high-authority pages
P R ( q )    j( q )  ( 1   )  
PR ( p )  t ( p , q )
p IN ( q )
Social Ranking
with t ( p , q )  1 / outdegree( p)
and j ( q )  1 / N
equivalent to
principal eigenvector:
p n  1  An  n  p n  1
Authority (page q) =
stationary prob. of visiting q
random walk: uniformly random choice of links + random jumps;
add bias to transitions and jumps for personal PR, TrustRank, etc.
Gerhard Weikum ADBIS Oct 1, 2007
44/53
The Wisdom of Crowds: Beyond PR
users
tags
docs
Typed graphs: data items, users, friends, groups,
postings, ratings, queries, clicks, …
with weighted edges  spectral analysis of various graphs
Evolving over time
 tensor analysis
Gerhard Weikum ADBIS Oct 1, 2007
45/53
Decentralized Graph Analysis
Graph spectral analysis applied to:
• pages, sites, tags, users, groups, queries, clicks, opinions, etc. as nodes
• assessment and interaction relations as weighted edges
• can compute various notions of authority, reputation, trust, quality
global graph
local
subgraph 1
local subgraph 3
local subgraph 2
Decentralized computation in peer-to-peer network
with arbitrary, a-priori unknown overlaps of graph fragments
Gerhard Weikum ADBIS Oct 1, 2007
46/53
JXP Algorithm [J.X. Parreira, G. Weikum: WebDB’05, VLDB’06]
Decentralized, asynchronous, peer-to-peer algorithm
based on theory of Markov-chain aggregation (state lumping)
[P.J. Courtois 1977, C.D. Meyer 1988]
Theorem:
authority scores
from local computations
converge to global scores
Spearmans footrule distance at Top-100
• each peer aggregates non-local part of global graph into „world node“
• peers meet randomly,
• exchange data about their local computations, and
• iterate their local computations
1
0.8
Subset "Computers & Internet"
10595 nodes - 20 peers
Pages randomly distributed among peers
0.6
0.4
0.2
0
0
20
40
60
80
100
Number of Meetings
supported by Minerva system
http://www.mpi-inf.mpg.de/departments/d5/software/minerva/index.html
Gerhard Weikum ADBIS Oct 1, 2007
120
47/53
Summary of Darwin Approach
Social tagging and social networks (Web 2.0) are
potentially valuable knowledge sources
Games (human computing) are an interesting way
of enticing „knowledge input“ and collecting statistics
Spectral analysis is a highly versatile tool for rating & ranking
that can be extended and scaled by decentralized algorithms
Challenges:
• Design a game that intrigues serious scientists
to „semantically“ annotate their scholarly work
• Develop an analysis method that identifies the „best“ facts,
resilient to egoistic and malicious behaviors (incl. coalitions)
Gerhard Weikum ADBIS Oct 1, 2007
48/53
Outline
 Introduction: Search for Knowledge
 Harvesting Knowledge
• Leibniz Approach
• Planck Approach
• Darwin Approach
• Conclusion
Gerhard Weikum ADBIS Oct 1, 2007
49/53
Summary
Harvesting knowledge & organizing in semantic DB/graph for
• scholarly Web,
• digital libraries,
• enterprise know-how,
• online communities, etc.
Three roads to knowledge:
• Leibniz / Semantic Web: ontologies, encyclopedia, etc.
• Planck / Statistical Web: large-scale IE from text, speech, etc.
• Darwin / Social Web: wisdom of crowds, tagging, folksonomies
Not covered here: search and ranking
• graph IR (for ER graphs, RDF, cross-linked XML, etc.)
• new ranking models (e.g. statistical LM for graphs)
• efficient and scalable query processing
Gerhard Weikum ADBIS Oct 1, 2007
50/53
Major Challenges
• Generalize YAGO approach (Wikipedia + WordNet)
• Methods for comprehensive, highly accurate
mappings across many knowledge sources
• cross-lingual, cross-temporal
• scalable in size, diversity, number of sources
• Pursue DB support towards efficient IE (and NLP)
• Achieve Web-scale IE throughput that can
• sustain rate of new content production (e.g. blogs)
• with > 90% accuracy and Wikipedia-like coverage
• Integrate handcrafted knowledge with NLP/ML-based IE
• Incorporate social tagging and human computing
Gerhard Weikum ADBIS Oct 1, 2007
51/53
Potential Synergies among
Leibniz, Planck, and Darwin
knowledge core
bootstrap
Leibniz
Semantic Web
Planck
Statistical Web
emerge
communities
validate
Darwin
Social Web
Gerhard Weikum ADBIS Oct 1, 2007
statistics &
feedback
52/53
Thank you !
Gerhard Weikum ADBIS Oct 1, 2007
53/53
Descargar

Slide 1