Information Retrieval &
Web Information Access
ChengXiang (“Cheng”) Zhai
Department of Computer Science
Graduate School of Library & Information Science
Statistics, and Institute for Genomic Biology
University of Illinois, Urbana-Champaign
1
MIAS Tutorial Summer 2011
Introduction
• A subset of lectures given for CS410 “Text
Information Systems” at UIUC:
– http://times.cs.uiuc.edu/course/410s11/
• Tutorial to be given on Mon, Tue, Wed, and
Thu (Prof. Bing Liu will speak on Friday
afternoon)
2
MIAS Tutorial Summer 2011
Big Picture
[Diagram: Information Retrieval shown in the context of related areas — applications (Web, Bioinformatics, Library & Information Science, Databases, Software Engineering, Computer Systems) on one side, and foundations (Statistics, Optimization, Machine Learning, Pattern Recognition, Data Mining, Natural Language Processing, Computer Vision, Algorithms, Systems) on the other.]
3
MIAS Tutorial Summer 2011
Tutorial Outline
• Part 1: Information retrieval techniques
– 1.1 Overview of IR
– 1.2 Retrieval models
– 1.3 Evaluation
– 1.4 Retrieval systems
– 1.5 Information filtering
• Part 2: Text mining techniques
– 2.1 Overview of text mining
– 2.2 IR-style text mining
– 2.3 NLP-style text mining
– 2.4 ML-style text mining
• Part 3: Web search
– 3.1 Overview of web information management
– 3.2 Search engine technologies
– 3.3 Next-generation search engines
4
MIAS Tutorial Summer 2011
Part 1: Information Retrieval
Techniques
- 1.1 Overview of Information Retrieval
- 1.2 Information Retrieval Models
- 1.3 Evaluation
- 1.4 Information Retrieval Systems
- 1.5 Information Filtering
5
MIAS Tutorial Summer 2011
Part 1.1: Overview of Information
Retrieval
6
MIAS Tutorial Summer 2011
What is Information Retrieval (IR)?
• Narrow sense: text retrieval (TR)
– There exists a collection of text documents
– User gives a query to express the information need
– A retrieval system returns relevant documents to
users
– Known as “search technology” in industry
• Broad sense: information access
– May include non-textual information
– May include text categorization or summarization…
7
MIAS Tutorial Summer 2011
TR vs. Database Retrieval
• Information
– Unstructured/free text vs. structured data
– Ambiguous vs. well-defined semantics
• Query
– Ambiguous vs. well-defined semantics
– Incomplete vs. complete specification
• Answers
– Relevant documents vs. matched records
• TR is an empirically defined problem!
8
MIAS Tutorial Summer 2011
History of TR on One Slide
• Birth of TR
– 1945: V. Bush’s article “As we may think”
– 1957: H. P. Luhn’s idea of word counting and matching
• Indexing & Evaluation Methodology (1960’s)
– Smart system (G. Salton’s group)
– Cranfield test collection (C. Cleverdon’s group)
– Indexing: automatic can be as good as manual
• TR Models (1970’s & 1980’s) …
• Large-scale Evaluation & Applications (1990’s-Present)
– TREC (D. Harman & E. Voorhees, NIST)
– Web search (Google, Yahoo!, Bing)
– PubMed (Biomedical Literature Search)
MIAS Tutorial Summer 2011
9
Short vs. Long Term Info Need
• Short-term information need (Ad hoc retrieval)
– “Temporary need”, e.g., info about used cars
– Information source is relatively static
– User “pulls” information
– Application example: library search, Web search
• Long-term information need (Filtering)
– “Stable need”, e.g., new data mining algorithms
– Information source is dynamic
– System “pushes” information to user
– Applications: news filter
10
MIAS Tutorial Summer 2011
Importance of Ad hoc Retrieval
• Directly manages any existing large collection
of information
• There are many “ad hoc” information needs
• A long-term information need can be satisfied
through frequent ad hoc retrieval
• Basic techniques of ad hoc retrieval can be
used for filtering and other “non-retrieval”
tasks, such as automatic summarization.
11
MIAS Tutorial Summer 2011
Formal Formulation of TR
• Vocabulary V={w1, w2, …, wN} of language
• Query q = q1,…,qm, where qi ∈ V
• Document di = di1,…,dimi, where dij ∈ V
• Collection C = {d1, …, dk}
• Set of relevant documents R(q) ⊆ C
– Generally unknown and user-dependent
– Query is a “hint” on which doc is in R(q)
• Task = compute R’(q), an “approximate R(q)”
12
MIAS Tutorial Summer 2011
Computing R(q)
• Strategy 1: Document selection
– R(q) = {d∈C | f(d,q)=1}, where f(d,q) ∈ {0,1} is an
indicator function or classifier
– System must decide if a doc is relevant or not
(“absolute relevance”)
• Strategy 2: Document ranking
– R(q) = {d∈C | f(d,q)>θ}, where f(d,q) ∈ ℜ is a relevance
measure function; θ is a cutoff
– System must decide if one doc is more likely to be
relevant than another (“relative relevance”)
13
MIAS Tutorial Summer 2011
Document Selection vs. Ranking
[Figure: the true R(q) contains a mix of relevant (+) and non-relevant (-) docs. Document selection
(f(d,q)∈{0,1}) returns an unordered set R’(q) that inevitably mixes + and - docs. Document ranking
(real-valued f(d,q)) orders docs by score, e.g., 0.98 d1 +, 0.95 d2 +, 0.83 d3 -, 0.80 d4 +, 0.76 d5 -,
0.56 d6 -, 0.34 d7 -, 0.21 d8 +, 0.21 d9 -, and the user decides where to stop.]
MIAS Tutorial Summer 2011
14
Problems of Doc Selection
• The classifier is unlikely accurate
– “Over-constrained” query (terms are too specific):
no relevant documents found
– “Under-constrained” query (terms are too general):
over delivery
– It is extremely hard to find the right position
between these two extremes
• Even if it is accurate,
all relevant documents
are not equally relevant
• Relevance is a matter of degree!
15
MIAS Tutorial Summer 2011
Ranking is often preferred
• Relevance is a matter of degree
• A user can stop browsing anywhere, so the
boundary is controlled by the user
– High recall users would view more items
– High precision users would view only a few
• Theoretical justification: Probability Ranking
Principle [Robertson 77]
16
MIAS Tutorial Summer 2011
Probability Ranking Principle
[Robertson 77]
• As stated by Cooper
“If a reference retrieval system’s response to each request is a ranking of the
documents in the collections in order of decreasing probability of usefulness
to the user who submitted the request, where the probabilities are estimated
as accurately as possible on the basis of whatever data is made available to the
system for this purpose, then the overall effectiveness of the system to its
users will be the best that is obtainable on the basis of that data.”
• Robertson provides two formal justifications
• Assumptions: Independent relevance and
sequential browsing (not necessarily all hold
in reality)
17
MIAS Tutorial Summer 2011
According to the PRP, all we need is
“A relevance measure function f”
which satisfies
For all q, d1, d2,
f(q,d1) > f(q,d2) iff p(Rel|q,d1) >p(Rel|q,d2)
18
MIAS Tutorial Summer 2011
Part 1.2: Information Retrieval
Models
19
MIAS Tutorial Summer 2011
Model 1: Vector Space Model
20
MIAS Tutorial Summer 2011
Relevance = Similarity
• Assumptions
– Query and document are represented similarly
– A query can be regarded as a “document”
– Relevance(d,q) ≈ similarity(d,q)
• R(q) = {d∈C | f(d,q)>θ}, f(q,d) = Δ(Rep(q), Rep(d))
• Key issues
– How to represent query/document?
– How to define the similarity measure Δ?
21
MIAS Tutorial Summer 2011
Vector Space Model
• Represent a doc/query by a term vector
– Term: basic concept, e.g., word or phrase
– Each term defines one dimension
– N terms define a high-dimensional space
– Element of vector corresponds to term weight
– E.g., d=(x1,…,xN), xi is “importance” of term i
• Measure relevance by the distance between
the query vector and document vector in the
vector space
22
MIAS Tutorial Summer 2011
VS Model: illustration
[Figure: documents D1–D11 and the query plotted in a three-dimensional term space with axes
“Starbucks”, “Java”, and “Microsoft”; documents closest to the query vector are presumed most relevant.]
MIAS Tutorial Summer 2011
23
What the VS model doesn’t say
• How to define/select the “basic concept”
– Concepts are assumed to be orthogonal
• How to assign weights
– Weight in query indicates importance of term
– Weight in doc indicates how well the term
characterizes the doc
• How to define the similarity/distance measure
24
MIAS Tutorial Summer 2011
What’s a good “basic concept”?
• Orthogonal
– Linearly independent basis vectors
– “Non-overlapping” in meaning
• No ambiguity
• Weights can be assigned automatically and
hopefully accurately
• Many possibilities: Words, stemmed words,
phrases, entities, “latent concept”, …
25
MIAS Tutorial Summer 2011
How to Assign Weights?
• Very important!
• Why weighting
– Query side: Not all terms are equally important
– Doc side: Some terms carry more information about contents
• How?
– Two basic heuristics
• TF (Term Frequency) = Within-doc-frequency
• IDF (Inverse Document Frequency)
– TF normalization
26
MIAS Tutorial Summer 2011
TF Weighting
• Idea: A term is more important if it occurs more
frequently in a document
• Formulas: Let f(t,d) be the frequency count of term
t in doc d
– Raw TF: TF(t,d) = f(t,d)
– Maximum frequency normalization:
TF(t,d) = 0.5 +0.5*f(t,d)/MaxFreq(d)
– “Okapi/BM25 TF”:
TF(t,d) = k f(t,d)/(f(t,d)+k(1-b+b*doclen/avgdoclen))
• Normalization of TF is very important!
27
MIAS Tutorial Summer 2011
TF Normalization
• Why?
– Document length variation
– “Repeated occurrences” are less informative than
the “first occurrence”
• Two views of document length
– A doc is long because it uses more words
– A doc is long because it has more contents
• Generally penalize long doc, but avoid overpenalizing (pivoted normalization)
28
MIAS Tutorial Summer 2011
TF Normalization (cont.)
[Figure: normalized TF plotted against raw TF for different values of b]
“Pivoted normalization”: use the average doc length to regularize normalization,
dividing by 1 - b + b*doclen/avgdoclen, where b varies from 0 to 1.
Normalization may be affected by the similarity measure.
29
MIAS Tutorial Summer 2011
IDF Weighting
• Idea: A term is more discriminative if it occurs
only in fewer documents
• Formula:
IDF(t) = log((n+1)/k)
n – total number of docs
k -- # docs with term t (doc freq)
30
MIAS Tutorial Summer 2011
TF-IDF Weighting
• TF-IDF weighting : weight(t,d)=TF(t,d)*IDF(t)
– Common in doc → high TF → high weight
– Rare in collection → high IDF → high weight
• Imagine a word count profile, what kind of
terms would have high weights?
31
MIAS Tutorial Summer 2011
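To make the TF, IDF, and TF-IDF heuristics above concrete, here is a minimal Python sketch (function and variable names are illustrative, not from any particular IR system); it uses the maximum-frequency TF normalization and IDF(t)=log((n+1)/k) from the preceding slides.

import math
from collections import Counter

def tf_idf_weights(doc_tokens, doc_freq, n_docs, normalize="max"):
    """Compute TF-IDF weights for one tokenized document.
    doc_tokens: list of terms in the document
    doc_freq:   dict term -> number of docs containing the term (k)
    n_docs:     total number of docs in the collection (n)
    """
    counts = Counter(doc_tokens)                      # raw TF: f(t,d)
    max_freq = max(counts.values())
    weights = {}
    for t, f in counts.items():
        if normalize == "max":                        # TF(t,d) = 0.5 + 0.5*f(t,d)/MaxFreq(d)
            tf = 0.5 + 0.5 * f / max_freq
        else:                                         # raw TF
            tf = f
        idf = math.log((n_docs + 1) / doc_freq[t])    # IDF(t) = log((n+1)/k)
        weights[t] = tf * idf                         # weight(t,d) = TF(t,d) * IDF(t)
    return weights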
How to Measure Similarity?
Di = (wi1, …, wiN),  Q = (wq1, …, wqN);  w = 0 if a term is absent

Dot product:   sim(Q, Di) = Σ_{j=1..N} wqj · wij

Cosine:        sim(Q, Di) = Σ_{j=1..N} wqj · wij / ( sqrt(Σ_j wqj²) · sqrt(Σ_j wij²) )
               (= normalized dot product)
32
MIAS Tutorial Summer 2011
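A small sketch of the two similarity measures above, operating on sparse term-weight dictionaries such as those produced by the TF-IDF sketch earlier (illustrative code, not a toolkit API):

import math

def dot_product(q_weights, d_weights):
    # sim(Q,Di) = sum_j w_qj * w_ij; absent terms have weight 0
    return sum(w * d_weights.get(t, 0.0) for t, w in q_weights.items())

def cosine(q_weights, d_weights):
    # normalized dot product
    norm_q = math.sqrt(sum(w * w for w in q_weights.values()))
    norm_d = math.sqrt(sum(w * w for w in d_weights.values()))
    if norm_q == 0 or norm_d == 0:
        return 0.0
    return dot_product(q_weights, d_weights) / (norm_q * norm_d)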
VS Example: Raw TF & Dot Product
query = “information retrieval”
doc1: “information retrieval search engine information …”
doc2: “travel information … map travel …”
doc3: “government president congress …”

IDF (faked):  info 2.4, retrieval 4.5, travel 2.8, map 3.3, search 2.1,
engine 5.4, govern 2.2, president 3.2, congress 4.3

Raw TF (TF*IDF):
query:  info 1(2.4), retrieval 1(4.5)
doc1:   info 2(4.8), retrieval 1(4.5), search 1(2.1), engine 1(5.4)
doc2:   info 1(2.4), travel 2(5.6), map 1(3.3)
doc3:   govern 1(2.2), president 1(3.2), congress 1(4.3)

Sim(q,doc1) = 4.8*2.4 + 4.5*4.5
Sim(q,doc2) = 2.4*2.4
Sim(q,doc3) = 0
33
MIAS Tutorial Summer 2011
What Works the Best?
•Use single words
•Use stat. phrases
•Remove stop words
•Stemming
•Proximity
•…
(Singhal 2001)
34
MIAS Tutorial Summer 2011
Tokenization
• Normalize lexical units: Words with similar
meanings should be mapped to the same
indexing term
• Stemming: Mapping all inflectional forms of
words to the same root form, e.g.
– computer -> compute
– computation -> compute
– computing -> compute (but king->k?)
• Porter’s Stemmer is popular for English
35
MIAS Tutorial Summer 2011
Advantages of VS Model
• Empirically effective! (Top TREC performance)
• Intuitive
• Easy to implement
• Well-studied/Most evaluated
• Warning: Many variants of TF-IDF!
36
MIAS Tutorial Summer 2011
Disadvantages of VS Model
• Assume term independence
• Assume query and document to be the same
• Lack of “predictive adequacy”
– Arbitrary term weighting
– Arbitrary similarity measure
• Lots of parameter tuning!
37
MIAS Tutorial Summer 2011
Model 2: Language Models
38
MIAS Tutorial Summer 2011
What is a Statistical LM?
• A probability distribution over word sequences
– p(“Today is Wednesday”) ≈ 0.001
– p(“Today Wednesday is”) ≈ 0.0000000000001
– p(“The eigenvalue is positive”) ≈ 0.00001
• Context/topic dependent!
• Can also be regarded as a probabilistic
mechanism for “generating” text, thus also called
a “generative” model
39
MIAS Tutorial Summer 2011
The Simplest Language Model
(Unigram Model)
• Generate a piece of text by generating each
word independently
• Thus, p(w1 w2 ... wn)=p(w1)p(w2)…p(wn)
• Parameters: {p(wi)}, with p(w1)+…+p(wN)=1 (N is voc. size)
• Essentially a multinomial distribution over words
• A piece of text can be regarded as a sample
drawn according to this word distribution
40
MIAS Tutorial Summer 2011
Text Generation with Unigram LM
(Unigram) Language Model θ: p(w|θ)  --sampling-->  Document d

Topic 1 (Text mining): text 0.2, mining 0.1, association 0.01, clustering 0.02, …, food 0.00001, …
  => likely generates a “text mining paper”
Topic 2 (Health): food 0.25, nutrition 0.1, healthy 0.05, diet 0.02, …
  => likely generates a “food nutrition paper”

Given θ, p(d|θ) varies according to d
41
MIAS Tutorial Summer 2011
Estimation of Unigram LM
Document (total #words = 100):
  text 10, mining 5, association 3, database 3, algorithm 2, …, query 1, efficient 1, …

Estimated (unigram) language model p(w|θ):
  text 10/100, mining 5/100, association 3/100, database 3/100, …, query 1/100, …

How good is the estimated model θ?
It gives our document sample the highest prob,
but it doesn’t generalize well… More about this later…
42
MIAS Tutorial Summer 2011
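The maximum-likelihood estimate illustrated above is just relative counts; a minimal sketch:

from collections import Counter

def unigram_mle(doc_tokens):
    # p(w|theta) = count(w, d) / |d|   (maximum likelihood estimate)
    counts = Counter(doc_tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

# e.g., a 100-word document in which "text" occurs 10 times gives p("text"|theta) = 10/100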
More Sophisticated LMs
• N-gram language models
– In general, p(w1 w2 ... wn)=p(w1)p(w2|w1)…p(wn|w1 …wn-1)
– n-gram: conditioned only on the past n-1 words
– E.g., bigram: p(w1 ... wn)=p(w1)p(w2|w1) p(w3|w2) …p(wn|wn-1)
• Remote-dependence language models (e.g.,
Maximum Entropy model)
• Structured language models (e.g., probabilistic
context-free grammar)
• Will not be covered in detail in this tutorial. If
interested, read [Manning & Schutze 99]
43
MIAS Tutorial Summer 2011
Why Just Unigram Models?
• Difficulty in moving toward more complex
models
– They involve more parameters, so need more data
to estimate (A doc is an extremely small sample)
– They increase the computational complexity
significantly, both in time and space
• Capturing word order or structure may not add
so much value for “topical inference”
• But, using more sophisticated models can still
be expected to improve performance ...
44
MIAS Tutorial Summer 2011
Language Models for Retrieval
(Ponte & Croft 98)
Document
Language Model
…
Text mining
paper
text ?
mining ?
assocation ?
clustering ?
…
food ?
…
?
…
Food nutrition
paper
Query =
“data mining algorithms”
Which model would most
likely have generated
this query?
food ?
nutrition ?
healthy ?
diet ?
…
45
MIAS Tutorial Summer 2011
Ranking Docs by Query Likelihood
Doc LM
Query likelihood
d1
 d1
p(q| d1)
d2
 d2
p(q| d2)
q
p(q| dN)
dN
dN
46
MIAS Tutorial Summer 2011
Retrieval as
Language Model Estimation
• Document ranking based on query
likelihood
log p(q|d) = Σ_i log p(wi|d),   where q = w1 w2 … wn
(p(wi|d): the document language model)
• Retrieval problem ≈ estimation of p(wi|d)
• Smoothing is an important issue, and
distinguishes different approaches
47
MIAS Tutorial Summer 2011
A General Smoothing Scheme
• All smoothing methods try to
– discount the probability of words seen in a doc
– re-allocate the extra probability so that unseen
words will have a non-zero probability
• Most use a reference model (collection
language model) to discriminate unseen
words
            / p_seen(w|d)      if w is seen in d    (discounted ML estimate)
p(w|d) =   {
            \ αd · p(w|C)      otherwise            (p(w|C): collection language model)
48
Smoothing & TF-IDF Weighting
• Plugging the general smoothing scheme into the query likelihood
retrieval formula, we obtain

log p(q|d) = Σ_{wi∈d, wi∈q} log [ p_seen(wi|d) / (αd·p(wi|C)) ]  +  n·log αd  +  Σ_i log p(wi|C)

– The first sum provides TF weighting (through p_seen) and IDF weighting (through the p(wi|C) in the denominator)
– n·log αd provides doc length normalization (a long doc is expected to have a smaller αd)
– The last sum does not depend on d and is ignored for ranking
• Smoothing with p(w|C) ≈ TF-IDF weighting + doc length normalization
49
MIAS Tutorial Summer 2011
Derivation of the Query Likelihood
Retrieval Formula
Smoothed estimate (discounted ML for seen words, reference model for unseen words):

  p(w|d) = p_seen(w|d)        if w is seen in d
         = αd·p(w|C)          otherwise

  αd = ( 1 - Σ_{w seen in d} p_seen(w|d) ) / Σ_{w unseen in d} p(w|C)

log p(q|d) = Σ_{w∈V, c(w,q)>0} c(w,q)·log p(w|d)
           = Σ_{c(w,d)>0, c(w,q)>0} c(w,q)·log p_seen(w|d)  +  Σ_{c(w,d)=0, c(w,q)>0} c(w,q)·log αd·p(w|C)
           = Σ_{c(w,d)>0, c(w,q)>0} c(w,q)·log [ p_seen(w|d) / (αd·p(w|C)) ]  +  |q|·log αd  +  Σ_{c(w,q)>0} c(w,q)·log p(w|C)

Key rewriting step: split the sum over query words into words seen in d and words unseen in d.
Similar rewritings are very common when using LMs for IR…
50
MIAS Tutorial Summer 2011
Three Smoothing Methods
(Zhai & Lafferty 01)
• Simplified Jelinek-Mercer: Shrink uniformly
toward p(w|C)
p(w|d) = (1-λ)·p_ml(w|d) + λ·p(w|C)
• Dirichlet prior (Bayesian): Assume pseudo counts μ·p(w|C)
p(w|d) = ( c(w,d) + μ·p(w|C) ) / ( |d| + μ )
       = (|d|/(|d|+μ))·p_ml(w|d) + (μ/(|d|+μ))·p(w|C)
• Absolute discounting: Subtract a constant δ
p(w|d) = ( max(c(w,d)-δ, 0) + δ·|d|_u·p(w|C) ) / |d|     (|d|_u: # unique terms in d)
51
MIAS Tutorial Summer 2011
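A sketch of query-likelihood scoring with the Jelinek-Mercer and Dirichlet smoothing formulas above; the collection model and the parameter defaults (λ=0.5, μ=2000) are illustrative assumptions, not tuned values.

import math
from collections import Counter

def smoothed_prob(w, doc_counts, doc_len, coll_prob, method="dirichlet",
                  lam=0.5, mu=2000.0):
    c = doc_counts.get(w, 0)
    if method == "jm":      # Jelinek-Mercer: (1-lam)*p_ml(w|d) + lam*p(w|C)
        return (1 - lam) * (c / doc_len) + lam * coll_prob
    else:                   # Dirichlet: (c(w,d) + mu*p(w|C)) / (|d| + mu)
        return (c + mu * coll_prob) / (doc_len + mu)

def query_likelihood(query_tokens, doc_tokens, coll_model, **kwargs):
    # log p(q|d) = sum_i log p(w_i|d); unseen collection words get a tiny floor probability
    counts = Counter(doc_tokens)
    dl = len(doc_tokens)
    return sum(math.log(smoothed_prob(w, counts, dl, coll_model.get(w, 1e-9), **kwargs))
               for w in query_tokens)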
Comparison of Three Methods
Relative performance of JM, Dir. and AD (precision):

Query Type   JM      Dir     AD
Title        0.228   0.256   0.237
Long         0.278   0.276   0.260

[Bar chart: precision of the three methods for title queries vs. long queries]
52
MIAS Tutorial Summer 2011
The Need of Query-Modeling
(Dual-Role of Smoothing)
[Figure: sensitivity of retrieval performance to the smoothing parameter for keyword queries vs. verbose queries]
53
MIAS Tutorial Summer 2011
Another Reason for Smoothing
Query = “the algorithms for data mining”   (content words: algorithms, data, mining)

                the      algorithms   for      data      mining
pDML(w|d1):     0.04     0.001        0.02     0.002     0.003
pDML(w|d2):     0.02     0.001        0.01     0.003     0.004

p(“algorithms”|d1) = p(“algorithms”|d2)
p(“data”|d1) < p(“data”|d2)
p(“mining”|d1) < p(“mining”|d2)

Intuitively, d2 should have a higher score, but p(q|d1) > p(q|d2)…
So we should make p(“the”) and p(“for”) less different for all docs,
and smoothing helps achieve this goal…

After smoothing with p(w|d) = 0.1·pDML(w|d) + 0.9·p(w|REF):   p(q|d1) < p(q|d2)!

                    the       algorithms   for       data       mining
P(w|REF):           0.2       0.00001      0.2       0.00001    0.00001
Smoothed p(w|d1):   0.184     0.000109     0.182     0.000209   0.000309
Smoothed p(w|d2):   0.182     0.000109     0.181     0.000309   0.000409
54
MIAS Tutorial Summer 2011
Two-stage Smoothing
Stage 1 (Dirichlet prior, Bayesian): explain unseen words with the collection model p(w|C), parameter μ
Stage 2 (2-component mixture): explain noise in the query with a user background model p(w|U), parameter λ

P(w|d) = (1-λ) · ( c(w,d) + μ·p(w|C) ) / ( |d| + μ )  +  λ·p(w|U)

μ and λ can be automatically set through statistical estimation
55
MIAS Tutorial Summer 2011
Automatic 2-stage results ≈ Optimal 1-stage results
Average precision (3 DBs × 4 query types, 150 topics); SK, LK, SV, LV: different types of queries

Collection:    AP88-89                       WSJ87-92                      ZIFF1-2
Query:         SK      LK     SV     LV      SK      LK     SV     LV      SK     LK     SV     LV
Optimal-JM     20.3%   36.8%  18.8%  28.8%   19.4%   34.8%  17.2%  27.7%   17.9%  32.6%  15.6%  26.7%
Optimal-Dir    23.0%   37.6%  20.9%  29.8%   22.3%   35.3%  19.6%  28.2%   21.5%  32.6%  18.5%  27.9%
Auto-2stage    22.2%*  37.4%  20.4%  29.2%   21.8%*  35.8%  19.9%  28.8%*  20.0%  32.2%  18.1%  27.9%*
MIAS Tutorial Summer 2011
56
Feedback in Information
Retrieval
57
MIAS Tutorial Summer 2011
Feedback in IR
[Diagram: the query goes to the Retrieval Engine, which searches the document collection and returns
ranked results (d1 3.5, d2 2.4, …, dk 0.5, ...) to the user; judgments on the results (d1 +, d2 +, d3 +,
…, dk -) feed a “learn from examples” module that produces an updated query.]

• Relevance feedback: the user judges documents
• Pseudo feedback: assume the top 10 docs are relevant
• Implicit feedback: clickthrough-based judgments
58
MIAS Tutorial Summer 2011
Feedback in IR (cont.)
• An essential component in any IR method
• Relevance feedback is always desirable, but a user may
not be willing to provide explicit judgments
• Pseudo/automatic feedback is always possible, and
often improves performance on average through
– Exploiting word co-occurrences
– Enriching a query with additional related words
– Indirectly addressing issues such as ambiguous words
and synonyms
• Implicit feedback is a good compromise
59
MIAS Tutorial Summer 2011
Overview of Feedback Techniques
• Feedback as machine learning: many possibilities
– Standard ML: Given examples of relevant (and non-relevant) documents, learn
how to classify a new document as either “relevant” or “non-relevant”.
– “Modified” ML: Given a query and examples of relevant (and non-relevant)
documents, learn how to rank new documents based on relevance
– Challenges:
• Sparse data
• Censored sample
• How to deal with query?
– Modeling noise in pseudo feedback (as semi-supervised learning)
• Feedback as query expansion: traditional IR
– Step 1: Term selection
– Step 2: Query expansion
– Step 3: Query term re-weighting
• Traditional IR is still robust (Rocchio), but ML approaches can potentially
be more accurate
60
MIAS Tutorial Summer 2011
Relevance Feedback in VS
• Basic setting: Learn from examples
– Positive examples: docs known to be relevant
– Negative examples: docs known to be non-relevant
– How do you learn from this to improve performance?
• General method: Query modification
– Adding new (weighted) terms
– Adjusting weights of old terms
– Doing both
• The most well-known and effective approach is
Rocchio [Rocchio 1971]
61
MIAS Tutorial Summer 2011
Rocchio Feedback: Illustration
[Figure: in the vector space, the query q is moved toward the centroid of the relevant (+) documents
and away from the non-relevant (-) documents.]
62
MIAS Tutorial Summer 2011
Rocchio Feedback: Formula
q_new = α·q_original + (β/|D_r|)·Σ_{d∈D_r} d  −  (γ/|D_nr|)·Σ_{d∈D_nr} d

α, β, γ: parameters;  q_original: original query vector;  q_new: new query vector;
D_r: set of relevant docs;  D_nr: set of non-relevant docs
63
MIAS Tutorial Summer 2011
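A minimal sketch of the Rocchio update on sparse term-weight vectors; the default α, β, γ values are illustrative, not prescribed by the tutorial.

def rocchio(query_vec, rel_docs, nonrel_docs, alpha=1.0, beta=0.75, gamma=0.15):
    """query_vec and each doc: dict term -> weight; returns the updated query vector."""
    new_q = {t: alpha * w for t, w in query_vec.items()}
    for docs, coeff in ((rel_docs, beta), (nonrel_docs, -gamma)):
        if not docs:
            continue
        for d in docs:                      # add the (signed) centroid of the example docs
            for t, w in d.items():
                new_q[t] = new_q.get(t, 0.0) + coeff * w / len(docs)
    # keep only terms with positive weight (a common practical choice)
    return {t: w for t, w in new_q.items() if w > 0}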
Feedback in Probabilistic Models
Classic Prob. Model:   O(R=1|Q,D) ∝ P(D|Q,R=1) / P(D|Q,R=0)
  P(D|Q,R=1): Rel. doc model;   P(D|Q,R=0): NonRel. doc model

Query likelihood (“Language Model”):   O(R=1|Q,D) ≈ P(Q|D,R=1)
  P(Q|D,R=1): “Rel. query” model

Parameter estimation uses judged (query, doc, rel) tuples such as (q1,d1,1), (q1,d2,1), (q1,d3,1),
(q1,d4,0), (q1,d5,0), (q3,d1,1), (q4,d1,1), (q5,d1,1), (q6,d2,1), (q6,d3,0) to estimate
P(D|Q,R=1), P(D|Q,R=0), and P(Q|D,R=1).

Initial retrieval:
- query as rel doc; doc as rel query
- P(Q|D,R=1) is more accurate
Feedback:
- P(D|Q,R=1) can be improved for the current query and future doc
- P(Q|D,R=1) can also be improved, but for current doc and future query
64
MIAS Tutorial Summer 2011
Kullback-Leibler (KL) Divergence
Retrieval Model
• Unigram similarity model:

  Sim(d;q) = -D(θ_Q || θ_D) = Σ_w p(w|θ_Q)·log p(w|θ_D) − Σ_w p(w|θ_Q)·log p(w|θ_Q)
  (the second term is the query entropy, ignored for ranking)

• Retrieval ≈ estimation of θ_Q and θ_D:

  sim(q;d) = Σ_{wi∈d, p(wi|θ_Q)>0} p(wi|θ_Q)·log [ p_seen(wi|d) / (αd·p(wi|C)) ]  +  log αd

• Special case: θ_Q = empirical distribution of q recovers “query-likelihood”
65
MIAS Tutorial Summer 2011
Feedback as Model Interpolation
Document D → θ_D;  Query Q → θ_Q;  results ranked by D(θ_Q' || θ_D)

Feedback docs F = {d1, d2, …, dn} → (generative model) → θ_F

θ_Q' = (1-α)·θ_Q + α·θ_F

α=0: θ_Q' = θ_Q (no feedback);   α=1: θ_Q' = θ_F (full feedback)
66
MIAS Tutorial Summer 2011
Generative Mixture Model
Each word w in F={d1,…,dn} is generated either from the background P(w|C) (with probability λ,
“background words”) or from the topic model P(w|θ) (with probability 1-λ, “topic words”).

log p(F|θ) = Σ_i Σ_w c(w;di) · log[ (1-λ)·p(w|θ) + λ·p(w|C) ]

Maximum likelihood:  θ_F = argmax_θ log p(F|θ)

λ = noise in feedback documents
67
MIAS Tutorial Summer 2011
How to Estimate F?
Known background p(w|C), weight λ=0.7: the 0.2, a 0.1, we 0.01, to 0.02, …, text 0.0001, mining 0.00005, …
Unknown query topic p(w|θ_F)=? (“Text mining”), weight 1-λ=0.3: text =?, mining =?, association =?, word =?, …

Observed doc(s) → ML Estimator → p(w|θ_F)

Suppose we know the identity of each word ...
68
MIAS Tutorial Summer 2011
Can We Guess the Identity?
Identity (“hidden”) variable: zi ∈ {1 (background), 0 (topic)}

  the 1, paper 1, presents 1, a 1, text 0, mining 0, algorithm 0, the 1, paper 0, ...

Suppose the parameters are all known, what’s a reasonable guess of zi?
- depends on λ (why?)
- depends on p(w|C) and p(w|θ_F) (how?)

E-step:   p(zi=1|wi) = λ·p(wi|C) / ( λ·p(wi|C) + (1-λ)·p(wi|θ_F) )

M-step:   p_new(wi|θ_F) = c(wi,F)·(1 - p(zi=1|wi)) / Σ_wj c(wj,F)·(1 - p(zj=1|wj))

Initially, set p(w|θ_F) to some random value, then iterate …
69
MIAS Tutorial Summer 2011
An Example of EM Computation
E-step (augment the data by guessing the hidden variables):
  p^(n)(zi=1|wi) = λ·p(wi|C) / ( λ·p(wi|C) + (1-λ)·p^(n)(wi|θ_F) )

M-step (re-estimate parameters by maximum likelihood using the “augmented data”):
  p^(n+1)(wi|θ_F) = c(wi,F)·(1 - p^(n)(zi=1|wi)) / Σ_wj c(wj,F)·(1 - p^(n)(zj=1|wj))

Assume λ=0.5

Word     #   P(w|C)   Iter.1: P(w|θF) P(z=1)   Iter.2: P(w|θF) P(z=1)   Iter.3: P(w|θF) P(z=1)
The      4   0.5              0.25    0.67             0.20    0.71             0.18    0.74
Paper    2   0.3              0.25    0.55             0.14    0.68             0.10    0.75
Text     4   0.1              0.25    0.29             0.44    0.19             0.50    0.17
Mining   2   0.1              0.25    0.29             0.22    0.31             0.22    0.31
Log-Likelihood               -16.96                   -16.13                   -16.02
70
MIAS Tutorial Summer 2011
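A sketch of the E-step/M-step iteration above for the two-component (background + topic) mixture; run on the counts, p(w|C), and λ=0.5 of the example table, it produces the same kind of numbers.

def estimate_feedback_model(word_counts, p_background, lam=0.5, iters=3):
    """word_counts: dict w -> c(w,F); p_background: dict w -> p(w|C)."""
    p_topic = {w: 1.0 / len(word_counts) for w in word_counts}   # uniform init of p(w|theta_F)
    for _ in range(iters):
        # E-step: p(z=1|w) = lam*p(w|C) / (lam*p(w|C) + (1-lam)*p(w|theta_F))
        p_z1 = {w: lam * p_background[w] /
                   (lam * p_background[w] + (1 - lam) * p_topic[w])
                for w in word_counts}
        # M-step: p(w|theta_F) proportional to c(w,F) * (1 - p(z=1|w))
        mass = {w: word_counts[w] * (1.0 - p_z1[w]) for w in word_counts}
        norm = sum(mass.values())
        p_topic = {w: m / norm for w, m in mass.items()}
    return p_topic

# estimate_feedback_model({"the": 4, "paper": 2, "text": 4, "mining": 2},
#                         {"the": 0.5, "paper": 0.3, "text": 0.1, "mining": 0.1})
# pushes probability mass toward "text" and away from "the", as in the table above.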
Example of Feedback Query Model
Trec topic 412: “airport security”
=0.9
W
security
airport
beverage
alcohol
bomb
terrorist
author
license
bond
counter-terror
terror
newsnet
attack
operation
headline
Mixture model approach
p(W|  F )
0.0558
0.0546
0.0488
0.0474
0.0236
0.0217
0.0206
0.0188
0.0186
0.0173
0.0142
0.0129
0.0124
0.0121
0.0121
Web database
Top 10 docs
=0.7
W
the
security
airport
beverage
alcohol
to
of
and
author
bomb
terrorist
in
license
state
by
p(W|  F )
0.0405
0.0377
0.0342
0.0305
0.0304
0.0268
0.0241
0.0214
0.0156
0.0150
0.0137
0.0135
0.0127
0.0127
0.0125
71
MIAS Tutorial Summer 2011
Part 1.3 Evaluation in
Information Retrieval
72
MIAS Tutorial Summer 2011
Evaluation Criteria
• Effectiveness/Accuracy
– Precision, Recall
• Efficiency
– Space and time complexity
• Usability
– How useful for real user tasks?
73
MIAS Tutorial Summer 2011
Methodology: Cranfield Tradition
• Laboratory testing of system components
– Precision, Recall
– Comparative testing
• Test collections
– Set of documents
– Set of questions
– Relevance judgments
74
MIAS Tutorial Summer 2011
The Contingency Table
Doc \ Action         Retrieved                Not Retrieved
Relevant             Relevant Retrieved       Relevant Rejected
Not relevant         Irrelevant Retrieved     Irrelevant Rejected

Precision = Relevant Retrieved / Retrieved
Recall    = Relevant Retrieved / Relevant
75
MIAS Tutorial Summer 2011
How to measure a ranking?
• Compute the precision at every recall point
• Plot a precision-recall (PR) curve
[Figure: two precision-recall curves (precision on the y-axis, recall on the x-axis) that cross each other — which is better?]
76
MIAS Tutorial Summer 2011
Summarize a Ranking: MAP
• Given that n docs are retrieved
– Compute the precision (at rank) where each (new) relevant document
is retrieved => p(1),…,p(k), if we have k rel. docs
– E.g., if the first rel. doc is at the 2nd rank, then p(1)=1/2.
– If a relevant document never gets retrieved, we assume the precision
corresponding to that rel. doc to be zero
• Compute the average over all the relevant documents
– Average precision = (p(1)+…+p(k))/k
• This gives us an average precision, which captures both
precision and recall and is sensitive to the rank of each relevant
document
• Mean Average Precision (MAP)
– MAP = arithmetic mean of average precision over a set of topics
– gMAP = geometric mean of average precision over a set of topics (more
affected by difficult topics)
77
MIAS Tutorial Summer 2011
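A sketch of the average-precision computation described above; MAP is then the arithmetic mean over topics.

def average_precision(ranked_doc_ids, relevant_ids):
    """Average precision for one query; unretrieved relevant docs contribute 0."""
    hits, precisions = 0, []
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precisions.append(hits / rank)     # precision at this recall point
    return sum(precisions) / len(relevant_ids) if relevant_ids else 0.0

def mean_average_precision(runs):
    """runs: list of (ranked_doc_ids, relevant_ids) pairs, one per topic."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# The later TREC example D1+ D2+ D3- D4- D5+ D6- with 4 relevant docs in total:
# average_precision(["D1","D2","D3","D4","D5","D6"], {"D1","D2","D5","D9"}) == (1/1 + 2/2 + 3/5 + 0)/4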
Summarize a Ranking: NDCG
• What if relevance judgments are in a scale of [1,r]? r>2
• Cumulative Gain (CG) at rank n
– Let the ratings of the n documents be r1, r2, …, rn (in ranked order)
– CG = r1 + r2 + … + rn
• Discounted Cumulative Gain (DCG) at rank n
– DCG = r1 + r2/log2(2) + r3/log2(3) + … + rn/log2(n)
– We may use any base for the logarithm, e.g., base=b
– For rank positions above b, do not discount
• Normalized Discounted Cumulative Gain (NDCG) at rank n
– Normalize DCG at rank n by the DCG value at rank n of the ideal
ranking
– The ideal ranking would first return the documents with the highest
relevance level, then the next highest relevance level, etc.
• NDCG is now quite popular in evaluating Web search
MIAS Tutorial Summer 2011
78
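A sketch of DCG/NDCG at rank n as defined above (base-2 logarithm; the ideal ranking is obtained by sorting the ratings in decreasing order):

import math

def dcg(ratings, base=2):
    # DCG at rank n = r1 + r2/log_b(2) + r3/log_b(3) + ... + rn/log_b(n)
    return ratings[0] + sum(r / math.log(i, base)
                            for i, r in enumerate(ratings[1:], start=2))

def ndcg(ratings, base=2):
    # normalize by the DCG of the ideal (relevance-sorted) ranking of the same judged docs
    ideal_dcg = dcg(sorted(ratings, reverse=True), base)
    return dcg(ratings, base) / ideal_dcg if ideal_dcg > 0 else 0.0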
Other Measures
• Precision at k documents (e.g., [email protected]):
– more meaningful than MAP (why?)
– also called breakeven precision when k is the same as
the number of relevant documents
• Mean Reciprocal Rank (MRR):
– Same as MAP when there’s only 1 relevant document
– Reciprocal Rank = 1/Rank-of-the-relevant-doc
• F-Measure (F1): harmonic mean of precision and
recall
F_β = (β²+1)·P·R / (β²·P + R)        F_1 = 2·P·R / (P + R)

P: precision
R: recall
β: parameter (often set to 1)
MIAS Tutorial Summer 2011
79
Typical TREC Evaluation Result
[Figure: a typical TREC run summary: the precision-recall curve; recall (e.g., 3212 of 4728 relevant
docs retrieved, Recall=3212/4728); precision at 10 docs (about 5.5 of the top 10 docs are relevant);
breakeven precision (precision when precision=recall); and mean average precision (MAP).]

Example: ranked list D1 +, D2 +, D3 –, D4 –, D5 +, D6 –
Total # rel docs = 4; system returns 6 docs
Average Prec = (1/1 + 2/2 + 3/5 + 0)/4
MIAS Tutorial Summer 2011
80
What Query Averaging Hides
[Figure: precision-recall curves for individual queries (precision vs. recall, both from 0 to 1),
showing the wide per-query variation hidden by the averaged curve.]
Slide from Doug Oard’s presentation, originally from Ellen Voorhees’ presentation
81
MIAS Tutorial Summer 2011
Part 1.4 Information Retrieval
Systems
82
MIAS Tutorial Summer 2011
IR System Architecture
[Diagram: docs are INDEXED into document representations (Doc Rep); the user’s query, entered through
the INTERFACE, becomes a query representation (Query Rep); SEARCHING ranks Doc Rep against Query Rep
and returns results to the user; the user’s judgments drive FEEDBACK and QUERY MODIFICATION of the
query representation.]
83
MIAS Tutorial Summer 2011
Indexing
• Indexing = Convert documents to data
structures that enable fast search
• Inverted index is the dominating indexing
method (used by all search engines)
• Other indices (e.g., document index) may be
needed for feedback
84
MIAS Tutorial Summer 2011
Inverted Index
• Fast access to all docs containing a given
term (along with freq and pos information)
• For each term, we get a list of tuples (docID,
freq, pos).
• Given a query, we can fetch the lists for all
query terms and work on the involved
documents.
– Boolean query: set operation
– Natural language query: term weight summing
• More efficient than scanning docs (why?)
85
MIAS Tutorial Summer 2011
Inverted Index Example
Doc 1: “This is a sample document with one sample sentence”
Doc 2: “This is another sample document”

Dictionary                              Postings (doc id, freq)
Term      #docs   Total freq
This      2       2                     (1,1), (2,1)
is        2       2                     (1,1), (2,1)
sample    2       3                     (1,2), (2,1)
another   1       1                     (2,1)
…         …       …                     …
86
MIAS Tutorial Summer 2011
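A memory-based sketch of building the dictionary and postings shown above (fine for small collections; the sort-based construction for large collections is described later):

from collections import Counter, defaultdict

def build_inverted_index(docs):
    """docs: dict doc_id -> list of tokens.
    Returns dictionary {term: (#docs, total_freq)} and postings {term: [(doc_id, freq), ...]}."""
    postings = defaultdict(list)
    for doc_id, tokens in docs.items():
        for term, freq in Counter(tokens).items():
            postings[term].append((doc_id, freq))
    dictionary = {t: (len(pl), sum(f for _, f in pl)) for t, pl in postings.items()}
    return dictionary, postings

# e.g., build_inverted_index({1: "this is a sample document with one sample sentence".split(),
#                             2: "this is another sample document".split()})
# gives postings["sample"] == [(1, 2), (2, 1)] and dictionary["sample"] == (2, 3)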
Data Structures
for Inverted Index
• Dictionary: modest size
– Needs fast random access
– Preferred to be in memory
– Hash table, B-tree, trie, …
• Postings: huge
– Sequential access is expected
– Can stay on disk
– May contain docID, term freq., term pos, etc
– Compression is desirable
87
MIAS Tutorial Summer 2011
Inverted Index Compression
• Observations
– Inverted list is sorted (e.g., by docid or termfq)
– Small numbers tend to occur more frequently
• Implications
– “d-gap” (store difference): d1, d2-d1, d3-d2-d1,…
– Exploit skewed frequency distribution: fewer bits
for small (high frequency) integers
• Binary code, unary code, γ-code, δ-code
88
MIAS Tutorial Summer 2011
Integer Compression Methods
• In general, to exploit skewed distribution
• Binary: equal-length coding
• Unary: x≥1 is coded as x-1 one bits followed
by 0, e.g., 3=> 110; 5=>11110
• γ-code: x => unary code for 1+⌊log x⌋ followed
by uniform code for x-2^⌊log x⌋ in ⌊log x⌋ bits,
e.g., 3=>101, 5=>11001
• δ-code: same as γ-code, but replace the unary
prefix with a γ-code. E.g., 3=>1001, 5=>10101
89
MIAS Tutorial Summer 2011
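A sketch of d-gaps plus γ-coding for a sorted posting list, following the definitions above (bit strings are returned as text for readability):

import math

def gamma_code(x):
    # unary code for 1 + floor(log2 x), then x - 2^floor(log2 x) in floor(log2 x) bits
    assert x >= 1
    k = int(math.floor(math.log2(x)))
    unary = "1" * k + "0"                                         # unary code for k+1
    offset = format(x - (1 << k), "0{}b".format(k)) if k > 0 else ""
    return unary + offset

def compress_doc_ids(sorted_doc_ids):
    # store d-gaps (first id, then successive differences), each gamma-coded
    gaps = [sorted_doc_ids[0]] + [b - a for a, b in zip(sorted_doc_ids, sorted_doc_ids[1:])]
    return "".join(gamma_code(g) for g in gaps)

# gamma_code(3) == "101", gamma_code(5) == "11001", matching the examples above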
Constructing Inverted Index
• The main difficulty is to build a huge index
with limited memory
• Memory-based methods: not usable for large
collections
• Sort-based methods:
– Step 1: collect local (termID, docID, freq) tuples
– Step 2: sort local tuples (to make “runs”)
– Step 3: pair-wise merge runs
– Step 4: Output inverted file
90
MIAS Tutorial Summer 2011
Sort-based Inversion
[Figure: sort-based inversion pipeline. Each doc (doc1 … doc300) is parsed & counted into local
(termID, docID, freq) tuples sorted by doc-id, e.g., <1,1,3>, <2,1,2>, <3,1,1>, …; a “local” sort
produces runs sorted by term-id; the runs are then merge-sorted so that all postings for a term end
up together, e.g., <1,1,3>, <1,2,2>, <1,5,2>, …, <1,300,3>, <2,1,2>, …; a term lexicon (the=1, cold=2,
days=3, a=4, …) and a docID lexicon (doc1=1, doc2=2, doc3=3, …) map strings to integer ids.]
MIAS Tutorial Summer 2011
91
Searching
• Given a query, score documents efficiently
• Boolean query
– Fetch the inverted list for all query terms
– Perform set operations to get the subset of docs
that satisfy the Boolean condition
– E.g., Q1=“info” AND “security” , Q2=“info” OR “security”
• info: d1, d2, d3, d4
• security: d2, d4, d6
• Results: {d2,d4} (Q1) {d1,d2,d3,d4,d6} (Q2)
92
MIAS Tutorial Summer 2011
Ranking Documents
• Assumption: score(d,q) = f[g(w(d,q,t1),…,w(d,q,tn)), w(d), w(q)],
where the ti’s are the matched terms
• Maintain a score accumulator for each doc to
compute function g
• For each query term ti
– Fetch the inverted list {(d1,f1),…,(dn,fn)}
– For each entry (dj,fj), compute w(dj,q,ti), and
update the score accumulator for doc dj
• Adjust the score to compute f, and sort
93
MIAS Tutorial Summer 2011
Ranking Documents: Example
Query = “info security”
S(d,q) = g(t1)+…+g(tn)  [sum of freq of matched terms]
Postings — info: (d1,3), (d2,4), (d3,1), (d4,5);  security: (d2,3), (d4,1), (d5,3)

Accumulators:              d1   d2   d3   d4   d5
(start)                    0    0    0    0    0
info      (d1,3) =>        3    0    0    0    0
          (d2,4) =>        3    4    0    0    0
          (d3,1) =>        3    4    1    0    0
          (d4,5) =>        3    4    1    5    0
security  (d2,3) =>        3    7    1    5    0
          (d4,1) =>        3    7    1    6    0
          (d5,3) =>        3    7    1    6    3
94
MIAS Tutorial Summer 2011
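A sketch of the term-at-a-time accumulator scoring illustrated above, with g = sum of matched-term frequencies as in the example; in practice w(d,q,t) would be a TF-IDF or language-model weight.

from collections import defaultdict

def score_term_at_a_time(query_terms, inverted_lists, top_k=10):
    """inverted_lists: dict term -> list of (doc_id, freq) postings."""
    accumulators = defaultdict(float)          # one score accumulator per doc
    for t in query_terms:
        for doc_id, freq in inverted_lists.get(t, []):
            accumulators[doc_id] += freq       # here w(d,q,t) = raw term frequency
    return sorted(accumulators.items(), key=lambda x: -x[1])[:top_k]

# e.g., with info: [(1,3),(2,4),(3,1),(4,5)] and security: [(2,3),(4,1),(5,3)],
# the query "info security" yields d2=7, d4=6, d1=3, d5=3, d3=1 as in the example.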
Further Improving Efficiency
• Keep only the most promising accumulators
• Sort the inverted list in decreasing order of
weights and fetch only N entries with the
highest weights
• Pre-compute as much as possible
95
MIAS Tutorial Summer 2011
Open Source IR Toolkits
• Smart (Cornell)
• MG (RMIT & Melbourne, Australia; Waikato,
New Zealand),
• Lemur (CMU/Univ. of Massachusetts)
• Terrier (Glasgow)
• Lucene (Open Source)
96
MIAS Tutorial Summer 2011
Smart
• The most influential IR system/toolkit
• Developed at Cornell since 1960’s
• Vector space model with lots of weighting
options
• Written in C
• The Cornell/AT&T groups have used the Smart
system to achieve top TREC performance
97
MIAS Tutorial Summer 2011
MG
• A highly efficient toolkit for retrieval of text
and images
• Developed by people at Univ. of Waikato, Univ.
of Melbourne, and RMIT in 1990’s
• Written in C, running on Unix
• Vector space model with lots of compression
and speed up tricks
• People have used it to achieve good TREC
performance
98
MIAS Tutorial Summer 2011
Lemur/Indri
• An IR toolkit emphasizing language models
• Developed at CMU and Univ. of Massachusetts
in 2000’s
• Written in C++, highly extensible
• Vector space and probabilistic models
including language models
• Achieving good TREC performance with a
simple language model
99
MIAS Tutorial Summer 2011
Terrier
• A large-scale retrieval toolkit with lots of
applications (e.g., desktop search) and TREC
support
• Developed at University of Glasgow, UK
• Written in Java, open source
• “Divergence from randomness” retrieval
model and other modern retrieval formulas
100
MIAS Tutorial Summer 2011
Lucene
• Open Source IR toolkit
• Initially developed by Doug Cutting in Java
• Now has been ported to some other languages
• Good for building IR/Web applications
• Many applications have been built using
Lucene (e.g., Nutch Search Engine)
• Currently the retrieval algorithms have poor accuracy
101
MIAS Tutorial Summer 2011
Part 1.5: Information Filtering
102
MIAS Tutorial Summer 2011
Short vs. Long Term Info Need
• Short-term information need (Ad hoc retrieval)
– “Temporary need”, e.g., info about used cars
– Information source is relatively static
– User “pulls” information
– Application example: library search, Web search
• Long-term information need (Filtering)
– “Stable need”, e.g., new data mining algorithms
– Information source is dynamic
– System “pushes” information to user
– Applications: news filter
103
MIAS Tutorial Summer 2011
Examples of Information Filtering
• News filtering
• Email filtering
• Movie/book recommenders
• Literature recommenders
• And many others …
104
MIAS Tutorial Summer 2011
Content-based Filtering vs.
Collaborative Filtering
• Basic filtering question: Will user U like item
X?
• Two different ways of answering it
– Look at what U likes
=> characterize X => content-based filtering
– Look at who likes X
=> characterize U => collaborative filtering
• Can be combined
105
MIAS Tutorial Summer 2011
Content-based filtering is also called
1. “Adaptive Information Filtering” in TREC
2. “Selective Dissemination of Information (SDI)
in Library & Information Science
Collaborative filtering is also called
“Recommender Systems”
106
MIAS Tutorial Summer 2011
1. Adaptive Filtering
107
MIAS Tutorial Summer 2011
Adaptive Information Filtering
• Stable & long term interest, dynamic info source
• System must make a delivery decision
immediately as a document “arrives”
[Figure: a stream of arriving documents passes through the Filtering System, which matches each
document against the user’s stable interest (“my interest: …”) and delivers only the matching ones.]
108
MIAS Tutorial Summer 2011
AIF vs. Retrieval, & Categorization
• Like retrieval over a dynamic stream of docs,
but ranking is impossible
• Like online binary categorization, but with no
initial training data and with limited feedback
• Typically evaluated with a utility function
– Each delivered doc gets a utility value
– Good doc gets a positive value
– Bad doc gets a negative value
– E.g., Utility = 3* #good - 2 *#bad (linear utility)
109
MIAS Tutorial Summer 2011
A Typical AIF System
[Diagram: docs from the Doc Source stream into a Binary Classifier driven by the User Interest Profile
(initialized from the user’s profile text); accepted docs go to the user; the user’s feedback, the
accumulated docs, and the utility function drive Learning, which updates the profile.]
110
MIAS Tutorial Summer 2011
Three Basic Problems in AIF
• Making filtering decision (Binary classifier)
– Doc text, profile text  yes/no
• Initialization
– Initialize the filter based on only the profile text or
very few examples
• Learning from
– Limited relevance judgments (only on “yes” docs)
– Accumulated documents
• All trying to maximize the utility
111
MIAS Tutorial Summer 2011
Major Approaches to AIF
• “Extended” retrieval systems
– “Reuse” retrieval techniques to score documents
– Use a score threshold for filtering decision
– Learn to improve scoring with traditional feedback
– New approaches to threshold setting and learning
• “Modified” categorization systems
– Adapt to binary, unbalanced categorization
– New approaches to initialization
– Train with “censored” training examples
112
MIAS Tutorial Summer 2011
A General Vector-Space Approach
[Diagram: an incoming doc vector is Scored against the profile vector; Thresholding compares the score
with a threshold to make the yes/no delivery decision; delivered docs get Utility Evaluation, and the
resulting feedback information drives both Vector Learning (profile update) and Threshold Learning.]
113
MIAS Tutorial Summer 2011
Difficulties in Threshold Learning
Example scores: 36.5 (R), 33.4 (N), 32.1 (R)  |  θ=30.0  |  29.9 (?), 27.3 (?), …

• Censored data
• Little/none labeled data
• Scoring bias due to vector learning
• Exploration vs. Exploitation
114
MIAS Tutorial Summer 2011
Empirical Utility Optimization
• Basic idea
– Compute the utility on the training data for each
candidate threshold (score of a training doc)
– Choose the threshold that gives the maximum
utility
• Difficulty: Biased training sample!
– We can only get an upper bound for the true
optimal threshold.
• Solutions:
– Heuristic adjustment(lowering) of threshold
– Lead to “beta-gamma threshold learning”
115
MIAS Tutorial Summer 2011
Beta-Gamma Threshold Learning
[Zhai et al. 2000]
[Figure: utility as a function of the cutoff position (0, 1, 2, 3, …, K, …, N) in the ranked training
documents; θ_optimal is the cutoff maximizing training utility, θ_zero the cutoff where utility drops
to zero (“encourage exploration up to zero”).]

θ = α·θ_zero + (1-α)·θ_optimal
α = β + (1-β)·e^(-N·γ),   β, γ ∈ [0,1],   N = # training examples

The more examples, the less exploration (closer to optimal)
116
MIAS Tutorial Summer 2011
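A minimal sketch of the beta-gamma interpolation above; θ_zero and θ_optimal are assumed to have been obtained by scanning candidate cutoffs (scores of training documents) for the zero-utility and utility-maximizing positions.

import math

def beta_gamma_threshold(theta_zero, theta_optimal, n_examples, beta=0.1, gamma=0.05):
    # alpha = beta + (1-beta)*exp(-N*gamma); theta = alpha*theta_zero + (1-alpha)*theta_optimal
    # (beta and gamma values here are illustrative, not recommended settings)
    alpha = beta + (1 - beta) * math.exp(-n_examples * gamma)
    return alpha * theta_zero + (1 - alpha) * theta_optimal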
Beta-Gamma Threshold Learning
(cont.)
• Pros
– Explicitly addresses exploration-exploitation
tradeoff (“Safe” exploration)
– Arbitrary utility (with appropriate lower bound)
– Empirically effective
• Cons
– Purely heuristic
– Zero utility lower bound often too conservative
117
MIAS Tutorial Summer 2011
2. Collaborative Filtering
118
MIAS Tutorial Summer 2011
What is Collaborative Filtering
(CF)?
• Making filtering decisions for an individual
user based on the judgments of other users
• Inferring individual’s interest/preferences
from that of other similar users
• General idea
– Given a user u, find similar users {u1, …, um}
– Predict u’s preferences based on the preferences
of u1, …, um
119
MIAS Tutorial Summer 2011
CF: Assumptions
• Users with a common interest will have similar
preferences
• Users with similar preferences probably share
the same interest
• Examples
– “interest is IR” => “favor SIGIR papers”
– “favor SIGIR papers” => “interest is IR”
• Sufficiently large number of user preferences
are available
120
MIAS Tutorial Summer 2011
CF: Intuitions
• User similarity (Kevin Chang vs. Jiawei Han)
– If Kevin liked the paper, Jiawei will like the paper
– ? If Kevin liked the movie, Jiawei will like the movie
– Suppose Kevin and Jiawei viewed similar movies in
the past six months …
• Item similarity
– Since 90% of those who liked Star Wars also liked
Independence Day, and, you liked Star Wars
– You may also like Independence Day
The content of items “didn’t matter”!
121
MIAS Tutorial Summer 2011
Rating-based vs. Preference-based
• Rating-based: User’s preferences are
encoded using numerical ratings on items
– Complete ordering
– Absolute values can be meaningful
– But, values must be normalized to combine
• Preference-based: User’s preferences are
represented by partial ordering of items
– Partial ordering
– Easier to exploit implicit preferences
122
MIAS Tutorial Summer 2011
A Formal Framework for Rating
Objects: O = {o1, o2, …, oj, …, on};   Users: U = {u1, u2, …, ui, …, um}

[Figure: a partially observed user-object rating matrix Xij = f(ui,oj), with some known entries
(e.g., 3, 1.5, 2, 1, 2, 3) and unknown entries (?).]

The unknown function f: U x O → R
The task:
• Assume known f values for some (u,o)’s
• Predict f values for other (u,o)’s
• Essentially function approximation, like other learning problems
MIAS Tutorial Summer 2011
123
Where are the intuitions?
• Similar users have similar preferences
– If u  u’, then for all o’s, f(u,o)  f(u’,o)
• Similar objects have similar user preferences
– If o  o’, then for all u’s, f(u,o)  f(u,o’)
• In general, f is “locally constant”
– If u  u’ and o  o’, then f(u,o)  f(u’,o’)
– “Local smoothness” makes it possible to predict
unknown values by interpolation or extrapolation
• What does “local” mean?
124
MIAS Tutorial Summer 2011
Two Groups of Approaches
• Memory-based approaches
– f(u,o) = g(u)(o) ≈ g(u’)(o) if u ≈ u’   (g = preference function)
– Find “neighbors” of u and combine g(u’)(o)’s
• Model-based approaches
– Assume structures/model: object cluster, user
cluster, f’ defined on clusters
– f(u,o) = f’(cu, co)
– Estimation & Probabilistic inference
125
MIAS Tutorial Summer 2011
Memory-based Approaches
(Breese et al. 98)
• General ideas:
– Xij: rating of object j by user i
– ni: average rating of all objects by user i
– Normalized ratings: Vij = Xij - ni
– Memory-based prediction:
    v̂_aj = k · Σ_{i=1..m} w(a,i)·v_ij
    x̂_aj = v̂_aj + n_a
    k = 1 / Σ_{i=1..m} |w(a,i)|
• Specific approaches differ in w(a,i) -- the
distance/similarity between user a and i
126
MIAS Tutorial Summer 2011
User Similarity Measures
• Pearson correlation coefficient (sum over commonly rated items j):

  w_p(a,i) = Σ_j (x_aj − n_a)(x_ij − n_i) / sqrt( Σ_j (x_aj − n_a)² · Σ_j (x_ij − n_i)² )

• Cosine measure:

  w_c(a,i) = Σ_{j=1..n} x_aj·x_ij / ( sqrt(Σ_j x_aj²) · sqrt(Σ_j x_ij²) )
• Many other possibilities!
MIAS Tutorial Summer 2011
127
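A sketch of memory-based prediction with the Pearson weight above; ratings are dicts mapping item → rating, and missing ratings simply do not contribute (illustrative code, not from Breese et al.).

import math

def pearson(ratings_a, ratings_i):
    common = set(ratings_a) & set(ratings_i)        # commonly rated items
    if not common:
        return 0.0
    na = sum(ratings_a.values()) / len(ratings_a)   # average rating of user a
    ni = sum(ratings_i.values()) / len(ratings_i)
    num = sum((ratings_a[j] - na) * (ratings_i[j] - ni) for j in common)
    den = math.sqrt(sum((ratings_a[j] - na) ** 2 for j in common) *
                    sum((ratings_i[j] - ni) ** 2 for j in common))
    return num / den if den else 0.0

def predict_rating(user_a, item_j, all_ratings):
    """all_ratings: dict user -> {item: rating}; prediction x̂_aj = n_a + k * sum_i w(a,i)*(x_ij - n_i)."""
    ratings_a = all_ratings[user_a]
    na = sum(ratings_a.values()) / len(ratings_a)
    num = den = 0.0
    for user_i, ratings_i in all_ratings.items():
        if user_i == user_a or item_j not in ratings_i:
            continue
        w = pearson(ratings_a, ratings_i)
        ni = sum(ratings_i.values()) / len(ratings_i)
        num += w * (ratings_i[item_j] - ni)
        den += abs(w)
    return (na + num / den) if den else na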
Improving User Similarity
Measures (Breese et al. 98)
• Dealing with missing values: default
ratings
• Inverse User Frequency (IUF): similar to
IDF
• Case Amplification: use w(a,i)^p, e.g., p=2.5
128
MIAS Tutorial Summer 2011
Part 2: Text Mining Techniques
- 2.1 Overview of Text Mining
- 2.2 IR-Style Text Mining Techniques
- 2.3 NLP-Style Text Mining Techniques
- 2.4 ML-Style Text Mining Techniques
129
MIAS Tutorial Summer 2011
Part 2.1: Overview of Text
Mining
130
MIAS Tutorial Summer 2011
Two Different Definitions of Mining
• Goal-oriented (effectiveness driven)
– Any process that generates useful results that are
non-obvious is called “mining”.
– Keywords: “useful” + “non-obvious”
– Data isn’t necessarily massive
• Method-oriented (efficiency driven)
– Any process that involves extracting information
from massive data is called “mining”
– Keywords: “massive” + “pattern”
– Patterns aren’t necessarily useful
131
MIAS Tutorial Summer 2011
Most Data are Unstructured (Text)
or Semi-Structured…
• Email
• Insurance claims
• News articles
• Web pages
• Patent portfolios
• Customer complaint letters
• Contracts
• Transcripts of phone calls with customers
• Technical documents
• …
Text data mining has become more and more important…
(Adapted from J. Dorre et al. “Text Mining:
Finding Nuggets in Mountains of Textual Data”)
132
MIAS Tutorial Summer 2011
What is Text Mining?
• Data Mining View: Explore patterns in textual
data
– Find latent topics
– Find topical trends
– Find outliers and other hidden patterns
• Natural Language Processing View: Make
inferences based on partial understanding
natural language text
– Information extraction
– Question answering
133
MIAS Tutorial Summer 2011
Applications of Text Mining
• Direct applications
– Discovery-driven (Bioinformatics, Business Intelligence, etc):
We have specific questions; how can we exploit data mining to
answer the questions?
– Data-driven (WWW, literature, email, customer reviews, etc):
We have a lot of data; what can we do with it?
• Indirect applications
– Assist information access (e.g., discover latent topics to better
summarize search results)
– Assist information organization (e.g., discover hidden
structures)
134
MIAS Tutorial Summer 2011
Text Mining Methods
• Data Mining Style: View text as high dimensional data
– Frequent pattern finding
– Association analysis
– Outlier detection
• Information Retrieval Style: Fine granularity topical analysis
– Topic extraction
– Exploit term weighting and text similarity measures
– Question answering
• Natural Language Processing Style: Information Extraction
– Entity extraction
– Relation extraction
– Sentiment analysis
• Machine Learning Style: Unsupervised or semi-supervised learning
– Mixture models
– Dimension reduction
135
MIAS Tutorial Summer 2011
Part 2.2: IR-Style Techniques
for Text Mining
136
MIAS Tutorial Summer 2011
Some “Basic” IR Techniques
• Stemming
• Stop words
• Weighting of terms (e.g., TF-IDF)
• Vector/Unigram representation of text
• Text similarity (e.g., cosine, KL-div)
• Relevance/pseudo feedback (e.g., Rocchio)
They are not just for retrieval!
137
MIAS Tutorial Summer 2011
Generality of Basic Techniques
[Diagram: raw text is tokenized, stemmed, stop-word filtered, and term-weighted into a term-document
matrix (documents d1…dm × terms t1…tn with weights wij). From this single representation, term
similarity and document similarity support CLUSTERING, sentence selection supports SUMMARIZATION,
vector centroids support CATEGORIZATION, and the extracted terms support META-DATA/ANNOTATION.]
138
MIAS Tutorial Summer 2011
Text Categorization
• Pre-given categories and labeled document
examples (Categories may form hierarchy)
• Classify new documents
• A standard supervised learning problem
[Figure: a Categorization System assigns incoming documents to pre-given categories such as Sports, Business, Education, Science.]
139
MIAS Tutorial Summer 2011
“Retrieval-based” Categorization
• Treat each category as representing an
“information need”
• Treat examples in each category as “relevant
documents”
• Use feedback approaches to learn a good “query”
• Match all the learned queries to a new document
• A document gets the category(categories)
represented by the best matching query(queries)
140
MIAS Tutorial Summer 2011
Prototype-based Classifier
• Key elements (“retrieval techniques”)
– Prototype/document representation (e.g., term vector)
– Document-prototype distance measure (e.g., dot product)
– Prototype vector learning: Rocchio feedback
• Example
141
MIAS Tutorial Summer 2011
K-Nearest Neighbor Classifier
• Keep all training examples
• Find k examples that are most similar to the new
document (“neighbor” documents)
• Assign the category that is most common in these
neighbor documents (neighbors vote for the
category)
• Can be improved by considering the distance of a
neighbor (a closer neighbor has more influence)
• Technical elements (“retrieval techniques”)
– Document representation
– Document distance measure
142
MIAS Tutorial Summer 2011
Example of K-NN Classifier
[Figure: the same new document gets different categories depending on k (e.g., k=4 vs. k=1).]
143
MIAS Tutorial Summer 2011
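A sketch of the k-NN text classifier described above; sim can be, e.g., the cosine measure sketched in Part 1.2 (majority vote; a distance-weighted vote is an easy extension).

from collections import Counter

def knn_classify(new_doc_vec, training_docs, sim, k=4):
    """training_docs: list of (doc_vec, category); doc_vec: dict term -> weight;
    sim(a, b): document similarity function, e.g., cosine."""
    neighbors = sorted(training_docs, key=lambda dc: -sim(new_doc_vec, dc[0]))[:k]
    votes = Counter(cat for _, cat in neighbors)      # neighbors vote for the category
    return votes.most_common(1)[0][0]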
The Clustering Problem
• Discover “natural structure”
• Group similar objects together
• Object can be document, term, passages
• Example
144
MIAS Tutorial Summer 2011
Similarity-based Clustering
(as opposed to “model-based”)
• Define a similarity function to measure
similarity between two objects
• Gradually group similar objects together in a
bottom-up fashion
• Stop when some stopping criterion is met
• Variations: different ways to compute group
similarity based on individual object similarity
145
MIAS Tutorial Summer 2011
Similarity-induced Structure
146
MIAS Tutorial Summer 2011
How to Compute Group Similarity?
Three Popular Methods:
Given two groups g1 and g2,
Single-link algorithm: s(g1,g2)= similarity of the closest pair
complete-link algorithm: s(g1,g2)= similarity of the farthest pair
average-link algorithm: s(g1,g2)= average of similarity of all pairs
147
MIAS Tutorial Summer 2011
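A sketch of the three group-similarity definitions, given any pairwise similarity function on objects:

def group_similarity(g1, g2, sim, method="average"):
    """g1, g2: lists of objects; sim(a, b): pairwise similarity."""
    pair_sims = [sim(a, b) for a in g1 for b in g2]
    if method == "single":        # similarity of the closest pair
        return max(pair_sims)
    if method == "complete":      # similarity of the farthest pair
        return min(pair_sims)
    return sum(pair_sims) / len(pair_sims)   # average over all pairs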
Three Methods Illustrated
[Figure: two groups g1 and g2 of points; complete-link uses the farthest pair, single-link the closest pair, and average-link averages over all pairs.]
148
MIAS Tutorial Summer 2011
The Summarization Problem
• Essentially “semantic compression” of text
• Selection-based vs. generation-based
summary
• In general, we need a purpose for
summarization, but it’s hard to define it
149
MIAS Tutorial Summer 2011
“Retrieval-based” Summarization
• Observation: term vector ≈ summary?
• Basic approach
– Rank “sentences”, and select top N as a summary
• Methods for ranking sentences
– Based on term weights
– Based on position of sentences
– Based on the similarity of sentence and document
vector
150
MIAS Tutorial Summer 2011
Simple Discourse Analysis
[Figure: a document is split into sentence vectors (vector 1, vector 2, …, vector n); adjacent vectors
are compared by similarity to detect topic-shift boundaries.]
151
MIAS Tutorial Summer 2011
A Simple Summarization Method
[Figure: within each discourse segment, the sentence most similar to the document vector is selected;
the selected sentences (sentence 1, sentence 2, sentence 3, …) form the summary.]
152
MIAS Tutorial Summer 2011
Part 2.3: NLP-Style Text Mining
Techniques
Most of the following slides are from William Cohen’s IE tutorial
153
MIAS Tutorial Summer 2011
What is “Information Extraction”
As a family
of techniques:
Information Extraction =
segmentation + classification + association + clustering
October 14, 2002, 4:00 a.m. PT
For years, Microsoft Corporation CEO Bill
Gates railed against the economic
philosophy of open-source software with
Orwellian fervor, denouncing its communal
licensing as a "cancer" that stifled
technological innovation.
Today, Microsoft claims to "love" the opensource concept, by which software code is
made public to encourage improvement and
development by outside programmers.
Gates himself says Microsoft will gladly
disclose its crown jewels--the coveted code
behind the Windows operating system--to
select customers.
"We can be open source. We love the
concept of shared source," said Bill Veghte,
a Microsoft VP. "That's a super-important
shift for us in terms of code access.“
Extracted entities and associations:
* Microsoft Corporation — CEO — Bill Gates
* Microsoft — Gates
* Microsoft — Bill Veghte
* Microsoft — VP
* Free Software Foundation — founder — Richard Stallman
Richard Stallman, founder of the Free
Software Foundation, countered saying…
154
MIAS Tutorial Summer 2011
IE History
Pre-Web
•
Mostly news articles
– De Jong’s FRUMP [1982]
• Hand-built system to fill Schank-style “scripts” from news wire
– Message Understanding Conference (MUC) DARPA [’87-’95], TIPSTER [’92-’96]
• Early work dominated by hand-built models
– E.g. SRI’s FASTUS, hand-built FSMs.
– But by 1990’s, some machine learning: Lehnert, Cardie, Grishman and
then HMMs: Elkan [Leek ’97], BBN [Bikel et al ’98]
Web
• AAAI ’94 Spring Symposium on “Software Agents”
– Much discussion of ML applied to Web. Maes, Mitchell, Etzioni.
• Tom Mitchell’s WebKB, ‘96
– Build KB’s from the Web.
• Wrapper Induction
– Initially hand-built, then ML: [Soderland ’96], [Kushmerick ’97],…
– Citeseer; Cora; FlipDog; contEd courses, corpInfo, …
155
MIAS Tutorial Summer 2011
IE History
Biology
• Gene/protein entity extraction
• Protein/protein interaction facts
• Automated curation/integration of databases
– At CMU: SLIF (Murphy et al, subcellular information from
images + text in journal articles)
Email
• EPCA, PAL, RADAR, CALO: intelligent office assistant that
“understands” some part of email
– At CMU: web site update requests, office-space requests;
calendar scheduling requests; social network analysis of
email.
156
MIAS Tutorial Summer 2011
Landscape of IE Tasks: Complexity of word patterns

Closed set (e.g., U.S. states): “He was born in Alabama…”, “The big Wyoming sky…”
Regular set (e.g., U.S. phone numbers): “Phone: (413) 545-1323”, “The CALD main office can be
reached at 412-268-1299”
Complex pattern (e.g., U.S. postal addresses): “University of Arkansas, P.O. Box 140, Hope, AR 71802”;
“Headquarters: 1128 Main Street, 4th Floor, Cincinnati, Ohio 45210”
Ambiguous patterns, needing context and many sources of evidence (e.g., person names):
“…was among the six houses sold by Hope Feldman that year.”; “Pawel Opalinski, Software
Engineer at WhizBang Labs.”
157
MIAS Tutorial Summer 2011
Landscape of IE Tasks: Single Field/Record

“Jack Welch will retire as CEO of General Electric tomorrow. The top role
at the Connecticut company will be filled by Jeffrey Immelt.”

Single entity (“named entity” extraction): Person: Jack Welch; Person: Jeffrey Immelt; Location: Connecticut
Binary relationship: Relation: Person-Title (Person: Jack Welch, Title: CEO);
Relation: Company-Location (Company: General Electric, Location: Connecticut)
N-ary record: Relation: Succession (Company: General Electric, Title: CEO, Out: Jack Welsh, In: Jeffrey Immelt)
158
MIAS Tutorial Summer 2011
Landscape of IE Techniques (each illustrated on “Abraham Lincoln was born in Kentucky.”):

• Lexicons: test whether a candidate string is a member of a list (Alabama, Alaska, …, Wisconsin, Wyoming)
• Classify pre-segmented candidates: a classifier assigns a class to each candidate phrase
• Sliding window: a classifier labels each window of text, trying alternate window sizes
• Boundary models: classifiers detect the BEGIN and END boundaries of entity mentions
• Finite state machines: find the most likely state sequence for the token sequence
• Context free grammars: parse the sentence (NNP NNP V V P NP; NP, VP, PP, S) and label constituents

Any of these models can be used to capture words, formatting or both.
159
MIAS Tutorial Summer 2011
IE with Hidden Markov Models
Given a sequence of observations:
Yesterday Pedro Domingos spoke this example sentence.
and a trained HMM:
person name
location name
background
Find the most likely state sequence (Viterbi):  argmax_s P(s, o)
Yesterday Pedro Domingos spoke this example sentence.
Any words said to be generated by the designated “person name”
state extract as a person name:
Person name: Pedro Domingos
MIAS Tutorial Summer 2011
160
HMM for Segmentation
•
Simplest Model: One state per entity type
161
MIAS Tutorial Summer 2011
Discriminative Approaches
Yesterday Pedro Domingos spoke this example sentence.
Is this phrase (X) a name? Y=1 (yes);Y=0 (no)
Learn from many examples to predict Y from X
Maximum Entropy, Logistic Regression:

  p(Y|X) = (1/Z) · exp( Σ_{i=1..n} λi · fi(X,Y) )

  λi: parameters;  fi(X,Y): features (e.g., is the phrase capitalized?)
More sophisticated: Consider dependency between different labels
(e.g. Conditional Random Fields)
162
MIAS Tutorial Summer 2011
Part 2.4 Statistical Learning Style
Techniques for Text Mining
163
MIAS Tutorial Summer 2011
Comparative Text Mining (CTM)
Problem definition:
 Given a comparable set of text collections C1, C2, …, Ck
 Discover & analyze their common and unique properties: themes common to all collections,
plus C1-specific, C2-specific, …, Ck-specific themes
164
MIAS Tutorial Summer 2011
Example: Summarizing Customer
Reviews
Ideal results from comparative text mining over IBM / APPLE / DELL laptop reviews:

Common Themes   “IBM” specific      “APPLE” specific     “DELL” specific
Battery Life    Long, 4-3 hrs       Medium, 3-2 hrs      Short, 2-1 hrs
Hard disk       Large, 80-100 GB    Small, 5-10 GB       Medium, 20-50 GB
Speed           Slow, 100-200 Mhz   Very Fast, 3-4 Ghz   Moderate, 1-2 Ghz
165
MIAS Tutorial Summer 2011
A More Realistic Setup of CTM
Each theme is now a word distribution, with a common part and collection-specific parts
(IBM / APPLE / DELL laptop reviews):

Common word distr.                "IBM" specific        "APPLE" specific     "DELL" specific
Battery 0.129, Hours 0.080,       Long 0.120,           Reasonable 0.10,     Short 0.05,
Life 0.060, …                     4hours 0.010,         Medium 0.08,         Poor 0.01,
                                  3hours 0.008, …       2hours 0.002, …      1hours 0.005, …
Disk 0.015, IDE 0.010,            Large 0.100,          Small 0.050,         Medium 0.123,
Drive 0.005, …                    80GB 0.050, …         5GB 0.030, …         20GB 0.080, …
Pentium 0.113, Processor 0.050,   Slow 0.114,           Fast 0.151,          Moderate 0.116,
…                                 200Mhz 0.080, …       3Ghz 0.100, …        1Ghz 0.070, …

(the last three columns are the collection-specific word distributions)
MIAS Tutorial Summer 2011
166
Probabilistic Latent Semantic
Analysis/Indexing (PLSA/PLSI) [Hofmann 99]
• Mix k multinomial distributions to generate a
document
• Each document has a potentially different set
of mixing weights which captures the topic
coverage
• When generating words in a document, each
word may be generated using a DIFFERENT
multinomial distribution
• We may add a background distribution to
“attract” background words
167
MIAS Tutorial Summer 2011
PLSA as a Mixture Model
p_d(w) = \lambda_B p(w|\theta_B) + (1-\lambda_B) \sum_{j=1}^{k} \pi_{d,j} \, p(w|\theta_j)

\log p(d) = \sum_{w \in V} c(w,d) \log \Big[ \lambda_B p(w|\theta_B) + (1-\lambda_B) \sum_{j=1}^{k} \pi_{d,j} \, p(w|\theta_j) \Big]

“Generating” word w in doc d in the collection:
  Theme 1 (\theta_1): warning 0.3, system 0.2, …                 chosen with prob. \pi_{d,1}
  Theme 2 (\theta_2): aid 0.1, donation 0.05, support 0.02, …    chosen with prob. \pi_{d,2}
  …
  Theme k (\theta_k): statistics 0.2, loss 0.1, dead 0.05, …     chosen with prob. \pi_{d,k}
  Background (\theta_B): is 0.05, the 0.04, a 0.03, …            chosen with prob. \lambda_B (themes with prob. 1-\lambda_B)

Parameters:
  \lambda_B = noise level (manually set)
  \pi’s and \theta’s are estimated with Maximum Likelihood
168
MIAS Tutorial Summer 2011
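A compact EM sketch for the PLSA mixture above, assuming a fixed background distribution and a manually set noise level \lambda_B; variable names follow the slide (pi = \pi_{d,j} mixing weights, theta = \theta_j topic word distributions). This is only an illustration of the update equations, not the exact implementation behind any reported results:

import random
from collections import Counter

def plsa_em(docs, k, p_background, lambda_b=0.9, iters=50, seed=0):
    """EM for the PLSA mixture with a fixed background model.
    docs: list of token lists; p_background: dict word -> p(w | theta_B)."""
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    counts = [Counter(d) for d in docs]

    # Random init of topic word distributions theta_j and doc weights pi_{d,j}.
    theta = []
    for _ in range(k):
        t = {w: rng.random() + 1e-3 for w in vocab}
        z = sum(t.values())
        theta.append({w: v / z for w, v in t.items()})
    pi = [[1.0 / k] * k for _ in docs]

    for _ in range(iters):
        theta_acc = [dict.fromkeys(vocab, 1e-10) for _ in range(k)]
        for d, cnt in enumerate(counts):
            pi_acc = [1e-10] * k
            for w, c in cnt.items():
                # E-step: posterior over which component generated word w.
                bg = lambda_b * p_background.get(w, 1e-10)
                topics = [(1 - lambda_b) * pi[d][j] * theta[j][w] for j in range(k)]
                denom = bg + sum(topics)
                for j in range(k):
                    post = c * topics[j] / denom      # expected counts from topic j
                    pi_acc[j] += post
                    theta_acc[j][w] += post
            # M-step for pi_{d,j}.
            z = sum(pi_acc)
            pi[d] = [v / z for v in pi_acc]
        # M-step for theta_j.
        for j in range(k):
            z = sum(theta_acc[j].values())
            theta[j] = {w: v / z for w, v in theta_acc[j].items()}
    return pi, theta

# Toy usage with a uniform background over the vocabulary of two tiny "documents".
docs = ["warning system alert system".split(), "aid donation support aid".split()]
vocab = {w for d in docs for w in d}
bg = {w: 1.0 / len(vocab) for w in vocab}
pi, theta = plsa_em(docs, k=2, p_background=bg, lambda_b=0.5, iters=30)
print(pi)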
Cross-Collection Mixture Models
• Explicitly distinguish and model common themes and collection-specific themes
• Fit a mixture model with the text data
• Estimate parameters using EM
• Clusters are more meaningful
(Figure) Collections C1, C2, …, Cm share a background model \theta_B and k common themes; each common theme also has a collection-specific version per collection:
  Theme 1 in common: \theta_1;  specific to C1: \theta_{1,1},  to C2: \theta_{1,2},  …,  to Cm: \theta_{1,m}
  …
  Theme k in common: \theta_k;  specific to C1: \theta_{k,1},  to C2: \theta_{k,2},  …,  to Cm: \theta_{k,m}
169
MIAS Tutorial Summer 2011
Details of the Mixture Model
Account for noise (common non-informative words) with a background distribution \theta_B.

“Generating” word w in doc d in collection Ci:
  with prob. \lambda_B: draw w from the background \theta_B
  with prob. 1-\lambda_B: pick theme j with prob. \pi_{d,j}, then
    with prob. \lambda_C: draw w from the common distribution \theta_j
    with prob. 1-\lambda_C: draw w from the collection-specific distribution \theta_{j,i}

p_d(w | C_i) = \lambda_B p(w|\theta_B) + (1-\lambda_B) \sum_{j=1}^{k} \pi_{d,j} \big[ \lambda_C \, p(w|\theta_j) + (1-\lambda_C) \, p(w|\theta_{j,i}) \big]

Parameters:
  \lambda_B = noise level (manually set)
  \lambda_C = common-specific tradeoff (manually set)
  \pi’s and \theta’s are estimated with Maximum Likelihood
170
MIAS Tutorial Summer 2011
Comparing News Articles
Iraq War (30 articles) vs. Afghan War (26 articles)
The common theme indicates that “United Nations” is involved in both wars
Cluster 1
  Common theme: united 0.042, nations 0.04, …
  Iraq theme: n 0.03, Weapons 0.024, Inspections 0.023, …
  Afghan theme: Northern 0.04, alliance 0.04, kabul 0.03, taleban 0.025, aid 0.02, …
Cluster 2
  Common theme: killed 0.035, month 0.032, deaths 0.023, …
  Iraq theme: troops 0.016, hoon 0.015, sanches 0.012, …
  Afghan theme: taleban 0.026, rumsfeld 0.02, hotel 0.012, front 0.011, …
Cluster 3
  …
Collection-specific themes indicate different roles of “United Nations” in the two wars
171
MIAS Tutorial Summer 2011
Comparing Laptop Reviews
Top words serve as “labels” for common themes
(e.g., [sound, speakers], [battery, hours], [cd, drive])
These word distributions can be used to segment text and
add hyperlinks between documents
172
MIAS Tutorial Summer 2011
Additional Results of
Contextual Text Mining
• Spatiotemporal topic pattern analysis
• Theme evolution analysis
• Event impact analysis
• Sentiment summarization
• All results are from Qiaozhu Mei’s dissertation,
available at:
http://www.ideals.illinois.edu/handle/2142/14707
173
MIAS Tutorial Summer 2011
Spatiotemporal Patterns in Blog
Articles
• Query = “Hurricane Katrina”
• Topics in the results:
Government Response
bush 0.071
president 0.061
federal 0.051
government 0.047
fema 0.047
administrate 0.023
response 0.020
brown 0.019
blame 0.017
governor 0.014
New Orleans
city 0.063
orleans 0.054
new 0.034
louisiana 0.023
flood 0.022
evacuate 0.021
storm 0.017
resident 0.016
center 0.016
rescue 0.012
Oil Price
price 0.077
oil 0.064
gas 0.045
increase 0.020
product 0.020
fuel 0.018
company 0.018
energy 0.017
market 0.016
gasoline 0.012
Praying and Blessing
god 0.141
pray 0.047
prayer 0.041
love 0.030
life 0.025
bless 0.025
lord 0.017
jesus 0.016
will 0.013
faith 0.012
Aid and Donation
donate 0.120
relief 0.076
red 0.070
cross 0.065
help 0.050
victim 0.036
organize 0.022
effort 0.020
fund 0.019
volunteer 0.019
Personal
i 0.405
my 0.116
me 0.060
am 0.029
think 0.015
feel 0.012
know 0.011
something 0.007
guess 0.007
myself 0.006
Spatiotemporal patterns
174
MIAS Tutorial Summer 2011
Theme Life Cycles (“Hurricane Katrina”)
Oil Price
New Orleans
price 0.0772
oil 0.0643
gas 0.0454
increase 0.0210
product 0.0203
fuel 0.0188
company 0.0182
…
city 0.0634
orleans 0.0541
new 0.0342
louisiana 0.0235
flood 0.0227
evacuate 0.0211
storm 0.0177
…
175
MIAS Tutorial Summer 2011
Theme Snapshots (“Hurricane Katrina”)
Week 1: The theme is the strongest along the Gulf of Mexico
Week 2: The discussion moves towards the north and west
Week 3: The theme distributes more uniformly over the states
Week 4: The theme is again strong along the east coast and the Gulf of Mexico
Week 5: The theme fades out in most states
176
MIAS Tutorial Summer 2011
Theme Life Cycles (KDD Papers)
(Figure) Normalized strength of each theme over time (1999–2004), for themes such as Biology Data, Web Information, Time Series, Classification, Association Rule, Clustering, and Business.
Example theme word distributions:
  gene 0.0173, expressions 0.0096, probability 0.0081, microarray 0.0038, …
  marketing 0.0087, customer 0.0086, model 0.0079, business 0.0048, …
  rules 0.0142, association 0.0064, support 0.0053, …
177
MIAS Tutorial Summer 2011
Theme Evolution Graph: KDD
(Figure) Theme evolution graph over 1999–2004: earlier themes feed into later ones. Example theme word distributions along the graph:
  decision 0.006, tree 0.006, classifier 0.005, class 0.005, Bayes 0.005, …
  SVM 0.007, criteria 0.007, classification 0.006, linear 0.005, …
  web 0.009, classification 0.007, features 0.006, topic 0.005, …
  mixture 0.005, random 0.006, cluster 0.006, clustering 0.005, variables 0.005, …
  classification 0.015, text 0.013, unlabeled 0.012, document 0.008, labeled 0.008, learning 0.007, …
  information 0.012, web 0.010, social 0.008, retrieval 0.007, distance 0.005, networks 0.004, …
  topic 0.010, mixture 0.008, LDA 0.006, semantic 0.005, …
178
MIAS Tutorial Summer 2011
Aspect Sentiment Summarization
Query: “Da Vinci Code”
Topic 1: Movie
  Neutral: “... Ron Howards selection of Tom Hanks to play Robert Langdon.”
           “Directed by: Ron Howard. Writing credits: Akiva Goldsman ...”
           “After watching the movie I went online and some research on ...”
  Positive: “Tom Hanks stars in the movie, who can be mad at that?”
            “Tom Hanks, who is my favorite movie star, act the leading role.”
  Negative: “But the movie might get delayed, and even killed off if he loses.”
            “protesting ... will lose your faith by watching the movie.”
            “... so sick of people making such a big deal about a FICTION book and movie.”
Topic 2: Book
  Neutral: “I remembered when i first read the book, I finished the book in two days.”
           “I’m reading ‘Da Vinci Code’ now.”
           “Anybody is interested in it?”
  Positive: “Awesome book.”
            “So still a good book to past time.”
  Negative: “... so sick of people making such a big deal about a FICTION book and movie.”
            “This controversy book cause lots conflict in west society.”
179
MIAS Tutorial Summer 2011
Separate Theme Sentiment
Dynamics
(Figure) Sentiment dynamics over time for the themes “book” and “religious beliefs”
180
MIAS Tutorial Summer 2011
Event Impact Analysis: IR Research
(Figure) Impact of two events on the “retrieval models” theme in SIGIR papers.

Theme: retrieval models
  term 0.1599, relevance 0.0752, weight 0.0660, feedback 0.0372, independence 0.0311,
  model 0.0310, frequent 0.0233, probabilistic 0.0188, document 0.0173, …

Theme snapshots around the events:
  vector 0.0514, concept 0.0298, extend 0.0297, model 0.0291, space 0.0236, boolean 0.0151, function 0.0123, feedback 0.0077, …
  xml 0.0678, email 0.0197, model 0.0191, collect 0.0187, judgment 0.0102, rank 0.0097, subtopic 0.0079, …
  probabilist 0.0778, model 0.0432, logic 0.0404, ir 0.0338, boolean 0.0281, algebra 0.0200, estimate 0.0119, weight 0.0111, …
  model 0.1687, language 0.0753, estimate 0.0520, parameter 0.0281, distribution 0.0268, probable 0.0205, smooth 0.0198, markov 0.0137, likelihood 0.0059, …

Events marked on the timeline:
  1992: starting of the TREC conferences
  1998: publication of the paper “A language modeling approach to information retrieval”
MIAS Tutorial Summer 2011
181
Topic Evolution Graph (KDD Papers)
(Figure) The KDD theme evolution graph again, 1999–2004, with example theme word distributions:
  decision 0.006, tree 0.006, classifier 0.005, class 0.005, Bayes 0.005, …
  SVM 0.007, criteria 0.007, classification 0.006, linear 0.005, …
  web 0.009, classification 0.007, features 0.006, topic 0.005, …
  mixture 0.005, random 0.006, cluster 0.006, clustering 0.005, variables 0.005, …
  classification 0.015, text 0.013, unlabeled 0.012, document 0.008, labeled 0.008, learning 0.007, …
  information 0.012, web 0.010, social 0.008, retrieval 0.007, distance 0.005, networks 0.004, …
  topic 0.010, mixture 0.008, LDA 0.006, semantic 0.005, …
182
MIAS Tutorial Summer 2011
Part 3: Web Information Access
-3.1 Overview of web information management
-3.2 Search engine technologies
-3.3 Next-generation search engines
183
MIAS Tutorial Summer 2011
Part 3.1: Overview of Web
Information Management
184
MIAS Tutorial Summer 2011
General Challenges in
Web Information Management
• Handling the size of the Web
– How to ensure completeness of coverage?
– Efficiency issues
• Dealing with or tolerating errors and low
quality information
• Addressing the dynamics of the Web
– Some pages may disappear permanently
– New pages are constantly created
185
MIAS Tutorial Summer 2011
Tasks in Web Information Management
• Information Access
– Search (Search engines, e.g. Google)
– Navigation (Browsers, e.g. IE)
– Filtering (Recommender systems, e.g., Amazon)
• Information Organization
– Categorization (Web directories, e.g., Yahoo!)
– Clustering (Organize search results, e.g., Vivisimo)
• Information Discovery (Mining, e.g. ??)
186
MIAS Tutorial Summer 2011
So far, Web information access is
mainly through search engines…
There are MANY search engines on
the Web…
187
MIAS Tutorial Summer 2011
Typology of Web Search Engines
(Figure) A Web browser can access:
  – a Directory Service (e.g., Yahoo!), backed by manually maintained directories
  – a Web Index Engine (e.g., Google), backed by an index built by a spider/robot
  – a Meta Engine (e.g., Vivisimo), which queries other engines using engine profiles
MIAS Tutorial Summer 2011
188
Examples of Web Search Engines
(Screenshots) Examples of a Web browser, a directory service, meta engines, and autonomous engines
189
MIAS Tutorial Summer 2011
Part 3.2: Search Engine
Technologies
-Structured Text Retrieval Models
-Web Search Engines
190
MIAS Tutorial Summer 2011
Structured Text Retrieval
Models
191
MIAS Tutorial Summer 2011
“Free text” vs. “Structured text”
• So far, we’ve assumed “free text”
– Document = word sequence
– Query = word sequence
– Collection = a set of documents
– Minimal structure …
• But, we may have structures on text (e.g., title,
hyperlinks)
– Can we exploit the structures in retrieval?
– Sometimes, structures may be more important than
the text (moving closer to relational databases … )
192
MIAS Tutorial Summer 2011
Examples of Document Structures
• Intra-doc structures (=relations of
components)
– Natural components: title, author, abstract,
sections, references, …
– Annotations: named entities, subtopics, markups,
…
• Inter-doc structures (=relations between
documents)
– Topic hierarchy
– Hyperlinks/citations (hypertext)
193
MIAS Tutorial Summer 2011
Structured Text Collection
General question: How do we search such a collection?
(Figure) A structured text collection: a general topic organized into subtopic 1, …, subtopic k.
194
MIAS Tutorial Summer 2011
Challenges in
Structured Text Retrieval (STR)
• Challenge 1 (Information/Doc Unit): What is an appropriate information unit?
  – Document may no longer be the most natural unit
  – Components in a document or a whole Web site may be more appropriate
• Challenge 2 (Query): What is an appropriate query language?
  – Keyword (free text) query is no longer the only choice
  – Constraints on the structures can be posed (e.g., search in a particular field or matching a URL domain)
195
MIAS Tutorial Summer 2011
Challenges in STR (cont.)
• Challenge 3 (Retrieval Model): What is an
appropriate model for STR?
– Ranking vs. selection
• Boolean model may be more powerful with structures (e.g.,
domain constraints in Web search)
• But, ranking may still be desirable
– Multiple (potentially conflicting) preferences
• Text-based ranking preferences (traditional IR)
• Structure-based ranking preferences (e.g., PageRank)
• Structure-based constraints (traditional DB)
• How to combine them?
196
MIAS Tutorial Summer 2011
Challenges in STR (cont.)
• Challenge 4 (Learning): How to learn a user’s
information need over time or from examples?
– How to support relevance feedback?
– What about pseudo feedback?
– How to exploit structures in text mining
(unsupervised learning)
• Challenge 5 (Efficiency): How to index and
search a structured text collection efficiently?
– Time and space
– Extensibility
197
MIAS Tutorial Summer 2011
These challenges are mostly still
open questions….
We now look at some retrieval models that exploit intra-document and inter-document structures….
198
MIAS Tutorial Summer 2011
Exploiting Intra-document Structures
[Ogilvie & Callan 2003]
(Figure) Document D has parts D1 = Title, D2 = Abstract, D3 = Body-Part1, Body-Part2, …, Dk.
Intuitively, we want to combine all the parts, but give more weight to some parts.
Think about the query-likelihood model: select a part Dj, then generate a query word using Dj:

  p(Q | D, R) = \prod_{i=1}^{n} p(w_i | D, R) = \prod_{i=1}^{n} \sum_{j=1}^{k} s(D_j | D, R) \, p(w_i | D_j, R)

The “part selection” probability s(D_j | D, R) serves as the weight for Dj and can be trained using EM.
Anchor text can be treated as a “part” of a document.
199
MIAS Tutorial Summer 2011
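A small sketch of this weighted part-mixture query likelihood. The document parts, the fixed part weights s(D_j|D), and the Jelinek-Mercer-style smoothing with a toy collection model are all assumptions made for illustration (in [Ogilvie & Callan 2003] the part weights can instead be trained with EM):

import math
from collections import Counter

def smoothed_lm(tokens, p_coll, lam=0.2):
    """(1 - lam) * ML estimate + lam * collection model, for one document part."""
    cnt, n = Counter(tokens), max(len(tokens), 1)
    return lambda w: (1 - lam) * cnt[w] / n + lam * p_coll.get(w, 1e-6)

def log_p_query(query, parts, weights, p_coll):
    """log p(Q|D) = sum_i log sum_j s(D_j|D) * p(w_i|D_j)."""
    lms = {name: smoothed_lm(toks, p_coll) for name, toks in parts.items()}
    score = 0.0
    for w in query:
        score += math.log(sum(weights[name] * lms[name](w) for name in parts))
    return score

# Hypothetical document with three parts and hand-set part weights s(D_j|D).
parts = {
    "title":  "information retrieval models".split(),
    "body":   "this paper studies probabilistic models for text retrieval".split(),
    "anchor": "ir survey".split(),
}
weights = {"title": 0.5, "body": 0.3, "anchor": 0.2}        # assumed weights, sum to 1
p_coll = {w: 0.01 for part in parts.values() for w in part}  # toy collection model

print(log_p_query("retrieval models".split(), parts, weights, p_coll))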
Exploiting Inter-document Structures
• Document collection has links (e.g., Web,
citations of literature)
• Query: text query
• Results: ranked list of documents
• Challenge: how to exploit links to improve
ranking?
200
MIAS Tutorial Summer 2011
Exploiting Inter-Document Links
What does a link tell us?
  – Description (“anchor text”): “extra text”/summary for a doc
  – Links indicate the utility of a doc (hub pages and authority pages)
MIAS Tutorial Summer 2011
201
PageRank: Capturing Page
“Popularity”
[Page & Brin 98]
• Intuitions
  – Links are like citations in literature
  – A page that is cited often can be expected to be more useful in general
• PageRank is essentially “citation counting”, but improves over simple counting
  – Consider “indirect citations” (being cited by a highly cited paper counts a lot…)
  – Smoothing of citations (every page is assumed to have a non-zero citation count)
• PageRank can also be interpreted as random surfing (thus capturing popularity)
202
MIAS Tutorial Summer 2011
The PageRank Algorithm (Page et al. 98)
Random surfing model: at any page,
  – with prob. \alpha, randomly jump to a page
  – with prob. (1-\alpha), randomly pick an out-link to follow

Example graph over pages d1, d2, d3, d4 with “transition matrix” M (m_ij = prob. of going from d_i to d_j):

  M = [ 0    1/2   0    1/2
        1    0     0    0
        0    1     0    0
        1/2  1/2   0    0  ]

p(d_i) = (1-\alpha) \sum_{k=1}^{N} m_{ki} \, p(d_k) + \alpha \sum_{k=1}^{N} \frac{1}{N} p(d_k)
       = \sum_{k=1}^{N} \big[ \frac{\alpha}{N} + (1-\alpha) m_{ki} \big] p(d_k)

In matrix form:  p = (\alpha I + (1-\alpha) M)^T p,   with I_{ij} = 1/N

Initial value p(d) = 1/N, where N = # pages; iterate until convergence.
(This is the stationary “stable” distribution, so we ignore time.)
203
MIAS Tutorial Summer 2011
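A short power-iteration sketch of the PageRank update p(d_i) = \sum_k [\alpha/N + (1-\alpha) m_{ki}] p(d_k), run on a small hypothetical link graph (not necessarily the d1–d4 example above); pages with no out-links are treated as jumping uniformly, which is one way to handle the zero-outlink problem discussed on the next slide:

def pagerank(out_links, alpha=0.15, iters=100):
    """Power iteration for p(d_i) = sum_k [alpha/N + (1-alpha) m_ki] p(d_k)."""
    nodes = sorted(out_links)
    N = len(nodes)
    p = {d: 1.0 / N for d in nodes}                     # initial value p(d) = 1/N
    for _ in range(iters):
        new_p = {d: 0.0 for d in nodes}
        for k in nodes:
            outs = out_links[k]
            for d in nodes:
                # m_kd: prob. of following a link from k to d (uniform over out-links).
                m = (1.0 / len(outs)) if d in outs else 0.0
                if not outs:
                    m = 1.0 / N        # zero-outlink page: jump uniformly
                new_p[d] += (alpha / N + (1 - alpha) * m) * p[k]
        p = new_p
    return p

# Hypothetical 4-page graph (placeholder example).
graph = {"d1": {"d2", "d4"}, "d2": {"d1"}, "d3": {"d4"}, "d4": {"d1", "d3"}}
print(pagerank(graph))   # scores sum to ~1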
PageRank in Practice
• Interpretation of the damping factor \alpha (0.15):
  – Probability of a random jump
  – Smoothing the transition matrix (avoid zeros)

    p(d_i) = (1-\alpha) \sum_{k=1}^{N} m_{ki} \, p(d_k) + \alpha \sum_{k=1}^{N} \frac{1}{N} p(d_k),   i.e.   p = (\alpha I + (1-\alpha) M)^T p

• Normalization doesn’t affect ranking, leading to some variants. Let p'(d_i) = c \, p(d_i) for a constant c; then

    c \, p(d_i) = (1-\alpha) \sum_{k=1}^{N} m_{ki} \, c \, p(d_k) + \frac{\alpha}{N} \sum_{k=1}^{N} c \, p(d_k)
    \Rightarrow  p'(d_i) = (1-\alpha) \sum_{k=1}^{N} m_{ki} \, p'(d_k) + \frac{\alpha}{N} \sum_{k=1}^{N} p'(d_k)

  and one variant simply uses  p'(d_i) = (1-\alpha) \sum_{k=1}^{N} m_{ki} \, p'(d_k) + \frac{\alpha}{N}

• The zero-outlink problem: the p(d_i)’s don’t sum to 1
  – One possible solution: a page-specific damping factor (\alpha = 1.0 for a page with no out-link)
204
MIAS Tutorial Summer 2011
HITS: Capturing Authorities & Hubs
[Kleinberg 98]
• Intuitions
– Pages that are widely cited are good authorities
– Pages that cite many other pages are good hubs
• The key idea of HITS
– Good authorities are cited by good hubs
– Good hubs point to good authorities
– Iterative reinforcement…
205
MIAS Tutorial Summer 2011
The HITS Algorithm [Kleinberg 98]
Example graph over pages d1, d2, d3, d4 with “adjacency matrix” A (A_ij = 1 if d_i links to d_j)

Initial values: a(d_i) = h(d_i) = 1

Iterate:
  h(d_i) = \sum_{d_j \in OUT(d_i)} a(d_j)
  a(d_i) = \sum_{d_j \in IN(d_i)} h(d_j)

In matrix form:  h = A a;  a = A^T h   (so  h = A A^T h;  a = A^T A a)

Normalize after each iteration:  \sum_i a(d_i)^2 = \sum_i h(d_i)^2 = 1
206
MIAS Tutorial Summer 2011
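A matching sketch of the HITS iteration (authorities from in-link hubs, hubs from out-link authorities, then L2 normalization), on a small hypothetical link graph:

import math

def hits(links, iters=50):
    """links: dict node -> set of nodes it points to. Returns (authority, hub) scores."""
    nodes = sorted(set(links) | {d for outs in links.values() for d in outs})
    a = {d: 1.0 for d in nodes}          # authority scores
    h = {d: 1.0 for d in nodes}          # hub scores
    for _ in range(iters):
        # a(d_i) = sum over hubs that point to d_i
        a = {d: sum(h[s] for s in nodes if d in links.get(s, ())) for d in nodes}
        # h(d_i) = sum over authorities that d_i points to
        h = {d: sum(a[t] for t in links.get(d, ())) for d in nodes}
        # Normalize so that the sum of squares is 1.
        na = math.sqrt(sum(v * v for v in a.values())) or 1.0
        nh = math.sqrt(sum(v * v for v in h.values())) or 1.0
        a = {d: v / na for d, v in a.items()}
        h = {d: v / nh for d, v in h.items()}
    return a, h

# Hypothetical link graph: d2 and d4 point to d1 and d3.
links = {"d2": {"d1", "d3"}, "d4": {"d1", "d3"}, "d1": {"d3"}}
auth, hub = hits(links)
print(auth, hub)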
Web Search Engines
207
MIAS Tutorial Summer 2011
Basic Search Engine Technologies
(Diagram) The user issues a query through a browser and gets back results; behind the scenes:
  – Crawler: fetches pages from the Web and stores cached pages and host info. (concerns: coverage, freshness)
  – Indexer: builds the (inverted) index from the cached pages (concerns: error/spam handling)
  – Retriever: answers queries against the index (concerns: precision, efficiency!!!)
MIAS Tutorial Summer 2011
208
Component I: Crawler/Spider/Robot
• Building a “toy crawler” is easy (a sketch in Python follows this slide)
  – Start with a set of “seed pages” in a priority queue
  – Fetch pages from the web
  – Parse fetched pages for hyperlinks; add them to the queue
  – Follow the hyperlinks in the queue
• A real crawler is much more complicated…
  – Robustness (server failure, trap, etc.)
  – Crawling courtesy (server load balance, robot exclusion, etc.)
  – Handling file types (images, PDF files, etc.)
  – URL extensions (cgi script, internal references, etc.)
  – Recognize redundant pages (identical and duplicates)
  – Discover “hidden” URLs (e.g., truncated)
• Crawling strategy is a main research topic (i.e., which page to visit next?)
209
MIAS Tutorial Summer 2011
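The sketch of the "toy crawler" described above, using only Python's standard library; the seed URL and page limit are placeholders, and everything listed under "a real crawler" (robot exclusion, politeness, duplicate detection, …) is deliberately omitted:

import re
import urllib.request
from collections import deque
from urllib.parse import urljoin

def toy_crawl(seed_urls, max_pages=20):
    """Breadth-first toy crawler: fetch, extract hyperlinks, enqueue, repeat."""
    queue = deque(seed_urls)             # the slide's queue of pages to visit (FIFO here)
    seen, fetched = set(seed_urls), {}
    link_re = re.compile(r'href="(http[^"]+)"', re.IGNORECASE)
    while queue and len(fetched) < max_pages:
        url = queue.popleft()
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                html = resp.read().decode("utf-8", errors="ignore")
        except Exception:
            continue                     # robustness: skip fetch/parse failures
        fetched[url] = html
        for link in link_re.findall(html):
            link = urljoin(url, link)
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return fetched

# pages = toy_crawl(["https://example.org/"])   # placeholder seed URL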
Major Crawling Strategies
• Breadth-First (most common(?); balance server load)
• Parallel crawling
• Focused crawling
  – Targeting at a subset of pages (e.g., all pages about “automobiles”)
  – Typically given a query
• Incremental/repeated crawling
  – Can learn from the past experience
  – Probabilistic models are possible
The major challenge remains to maintain “freshness” and good coverage with minimum resource overhead
210
MIAS Tutorial Summer 2011
Component II: Indexer
• Standard IR techniques are the basis
  – Make basic indexing decisions (stop words, stemming, numbers, special symbols)
  – Ensure indexing efficiency (space and time)
  – Updating
• Major additional challenges
  – Scaling up!
  – Recognize spams/junks
  – How to support scoring with multiple features (PageRank, font information, structures, etc.)
• Google’s contributions:
  – Google file system: distributed file system
  – Big Table: column-based database
  – MapReduce: software framework for parallel computation
  – Hadoop: open source implementation of MapReduce (used in Yahoo!)
MIAS Tutorial Summer 2011
211
Google’s Solutions
(Diagram) URL queue/list → cached source pages (compressed) → inverted index and hypertext structure;
indexing uses many features, e.g., font, layout, …
212
MIAS Tutorial Summer 2011
Google’s Contributions
• Distributed File System (GFS)
• Column-based Database (Big Table)
• Parallel programming framework (MapReduce)
213
MIAS Tutorial Summer 2011
Google File System: Overview
• Motivation: Input data is large (whole Web, billions of
pages), can’t be stored on one machine
• Why not use the existing file systems?
– Network File System (NFS) has many deficiencies (network congestion, single-point failure)
– Google’s problems are different from anyone else’s
• GFS is designed for Google apps and
workloads.
– GFS demonstrates how to support large scale processing
workloads on commodity hardware
– Designed to tolerate frequent component failures.
– Optimized for huge files that are mostly appended and read.
– Go for simple solutions.
214
MIAS Tutorial Summer 2011
GFS Assumptions
• High Component failure rates
– Inexpensive commodity components fail all the
time.
• Modest number of huge files.
– Just a few million
– Each is 100 MB or larger: multi GB files typically
• Files are write-once, mostly appended to
  – Perhaps concurrently
• Large streaming reads.
215
MIAS Tutorial Summer 2011
GFS Design Decisions
• Files are stored as chunks
  – Fixed size (64 MB)
• Reliability through replication
  – Each chunk is replicated across 3+ chunkservers
• Single master to coordinate access and keep metadata
  – Simple centralized management
• No data caching
  – Little benefit due to large datasets, streaming reads
216
MIAS Tutorial Summer 2011
GFS Architecture
217
MIAS Tutorial Summer 2011
MapReduce
• Provide easy but general model for programmers to use
cluster resources
• Hide network communication (i.e. Remote Procedure Calls)
• Hide storage details, file chunks are automatically distributed
and replicated
• Provide transparent fault tolerance (Failed tasks are
automatically rescheduled on live nodes)
• High throughput and automatic load balancing (E.g.
scheduling tasks on nodes that already have data)
This slide and the following slides about MapReduce are from Behm & Shah’s presentation
http://www.ics.uci.edu/~abehm/class_reports/uci/2008-Spring_CS224/Behm-Shah_PageRank.ppt
218
MIAS Tutorial Summer 2011
MapReduce Flow
Input = a set of (Key, Value) pairs
  1. Split the input into Key-Value pairs; for each K-V pair call Map.
  2. Each Map call produces a new set of K-V pairs.
  3. Sort/shuffle: group the intermediate values by key.
  4. For each distinct key, call Reduce(K, V[ ]); each call produces one K-V pair.
  5. Output = a set of (Key, Value) pairs.
219
MIAS Tutorial Summer 2011
MapReduce WordCount Example
Input: a file containing words
  Hello World Bye World
  Hello Hadoop Bye Hadoop
  Bye Hadoop Hello Hadoop

Output: the number of occurrences of each word
  Bye 3
  Hadoop 4
  Hello 3
  World 2

How can we do this within the MapReduce framework?
Basic idea: parallelize on lines in the input file!
220
MIAS Tutorial Summer 2011
MapReduce WordCount Example
Input (line number, line)           Map output
  1, “Hello World Bye World”    →   <Hello,1> <World,1> <Bye,1> <World,1>
  2, “Hello Hadoop Bye Hadoop”  →   <Hello,1> <Hadoop,1> <Bye,1> <Hadoop,1>
  3, “Bye Hadoop Hello Hadoop”  →   <Bye,1> <Hadoop,1> <Hello,1> <Hadoop,1>

Map(K, V) {
  For each word w in V
    Collect(w, 1);
}
221
MIAS Tutorial Summer 2011
MapReduce WordCount Example
Reduce(K, V[ ]) {
  Int count = 0;
  For each v in V
    count += v;
  Collect(K, count);
}

Map output:
  <Hello,1> <World,1> <Bye,1> <World,1>
  <Hello,1> <Hadoop,1> <Bye,1> <Hadoop,1>
  <Bye,1> <Hadoop,1> <Hello,1> <Hadoop,1>

Internal grouping              Reduce output
  <Bye → 1, 1, 1>          →   <Bye, 3>
  <Hadoop → 1, 1, 1, 1>    →   <Hadoop, 4>
  <Hello → 1, 1, 1>        →   <Hello, 3>
  <World → 1, 1>           →   <World, 2>
222
MIAS Tutorial Summer 2011
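The Map and Reduce functions above are slide pseudo-code; the following is a small local simulation of the same flow (split → map → shuffle/sort by key → reduce) that reproduces the word counts. It only illustrates the programming model, not Hadoop/MapReduce itself:

from collections import defaultdict

def map_fn(key, value):
    """Emit (word, 1) for each word in the line, as in the slide's Map(K, V)."""
    return [(w, 1) for w in value.split()]

def reduce_fn(key, values):
    """Sum the counts for one word, as in the slide's Reduce(K, V[])."""
    return (key, sum(values))

def local_mapreduce(records, mapper, reducer):
    # Map phase.
    intermediate = []
    for k, v in records:
        intermediate.extend(mapper(k, v))
    # Shuffle/sort: group values by key (done by the framework in a real system).
    groups = defaultdict(list)
    for k, v in intermediate:
        groups[k].append(v)
    # Reduce phase.
    return [reducer(k, vs) for k, vs in sorted(groups.items())]

lines = [(1, "Hello World Bye World"),
         (2, "Hello Hadoop Bye Hadoop"),
         (3, "Bye Hadoop Hello Hadoop")]
print(local_mapreduce(lines, map_fn, reduce_fn))
# [('Bye', 3), ('Hadoop', 4), ('Hello', 3), ('World', 2)]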
Inverted Indexing with MapReduce
D1: java resource java class          D2: java travel resource          D3: …

Map output:
  D1: (java, (D1, 2)), (resource, (D1, 1)), (class, (D1, 1))
  D2: (java, (D2, 1)), (travel, (D2, 1)), (resource, (D2, 1))

Built-in shuffle and sort: aggregate values by keys

Reduce output:
  java     → {(D1, 2), (D2, 1)}
  resource → {(D1, 1), (D2, 1)}
  class    → {(D1, 1)}
  travel   → {(D2, 1)}
  …

Slide adapted from Jimmy Lin’s presentation
MIAS Tutorial Summer 2011
223
Inverted Indexing: Pseudo-Code
Slide adapted from Jimmy Lin’s presentation
224
MIAS Tutorial Summer 2011
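The pseudo-code referred to on this slide is an image in the original deck; the sketch below shows one plausible mapper/reducer pair for inverted indexing, following the preceding example (emit (term, (docID, tf)) pairs, then collect a postings list per term). It is an illustrative assumption, not necessarily Jimmy Lin's exact pseudo-code:

from collections import Counter, defaultdict

def index_map(doc_id, text):
    """Emit (term, (doc_id, tf)) pairs for one document."""
    return [(term, (doc_id, tf)) for term, tf in Counter(text.split()).items()]

def index_reduce(term, postings):
    """Collect the postings list for one term, sorted by document id."""
    return (term, sorted(postings))

docs = [("D1", "java resource java class"),
        ("D2", "java travel resource")]

# Map, then shuffle/sort by key, then reduce (the framework handles the middle step).
pairs = [p for doc in docs for p in index_map(*doc)]
grouped = defaultdict(list)
for term, posting in pairs:
    grouped[term].append(posting)
index = dict(index_reduce(t, ps) for t, ps in grouped.items())
print(index["java"])      # [('D1', 2), ('D2', 1)]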
Component III: Retriever
• Standard IR models apply but aren’t sufficient
  – Different information need (home page finding vs. topic-driven)
  – Documents have additional information (hyperlinks, markups, URL)
  – Information is often redundant and the quality varies a lot
  – Server-side feedback is often not feasible
• Major extensions
  – Exploiting links (anchor text, link-based scoring)
  – Exploiting layout/markups (font, title field, etc.)
  – Spelling correction
  – Spam filtering
  – Redundancy elimination
• In general, rely on machine learning to combine all kinds of features
225
MIAS Tutorial Summer 2011
Effective Web Retrieval Heuristics
• High accuracy in home page finding can be achieved by
  – Matching the query with the title
  – Matching the query with the anchor text
  – Plus URL-based or link-based scoring (e.g., PageRank)
• Imposing a conjunctive (“and”) interpretation of the query is often appropriate
  – Queries are generally very short (all words are necessary)
  – The size of the Web makes it likely that at least one page would match all the query words
• Combine multiple features using machine learning
226
MIAS Tutorial Summer 2011
Home/Entry Page Finding Evaluation Results
(TREC 2001)
Runs exploiting anchor text, structure or links, e.g., unigram query likelihood + link/URL prior,
i.e., p(Q|D) p(D)  [Kraaij et al. SIGIR 2002]:
    MRR 0.774   %top10 88.3   %fail 4.8
    MRR 0.772   %top10 87.6   %fail 4.8
Runs without these features:
    MRR 0.382   %top10 62.1   %fail 11.7
    MRR 0.340   %top10 60.7   %fail 15.9
Query example: Haas Business School
227
MIAS Tutorial Summer 2011
Named Page Finding Evaluation Results
(TREC 2002)
(Figure) MRR results: Okapi/BM25 + anchor text, and Dirichlet prior + title and anchor text (Lemur)
[Ogilvie & Callan SIGIR 2003], compared against the best content-only run.
Query example: America’s century farms
MIAS Tutorial Summer 2011
228
How can we combine many
features? (Learning to Rank)
• General idea:
– Given a query-doc pair (Q,D), define various kinds
of features Xi(Q,D)
– Examples of feature: the number of overlapping
terms, BM25 score of Q and D, p(Q|D), PageRank of
D, p(Q|Di), where Di may be anchor text or big font
text, “does the URL contain ‘~’?”….
– Hypothesize p(R=1|Q,D) = s(X1(Q,D), …, Xn(Q,D), \lambda), where \lambda is a set of parameters
– Learn \lambda by fitting the function s with training data, i.e., 3-tuples like (D, Q, 1) (D is relevant to Q) or (D, Q, 0) (D is non-relevant to Q)
229
MIAS Tutorial Summer 2011
Regression-Based Approaches
(Fuhr 98, Cooper 92, Gey 94)
Logistic Regression: the Xi(Q,D) are features; the \beta’s are parameters

  \log \frac{P(R=1|Q,D)}{1 - P(R=1|Q,D)} = \beta_0 + \sum_{i=1}^{n} \beta_i X_i
  \quad\Leftrightarrow\quad
  P(R=1|Q,D) = \frac{1}{1 + \exp(-\beta_0 - \sum_{i=1}^{n} \beta_i X_i)}

Estimate the \beta’s by maximizing the likelihood of the training data. Example with two judged documents:

            X1(Q,D) = BM25    X2(Q,D) = PageRank    X3(Q,D) = BM25Anchor
  D1 (R=1)  0.7               0.11                  0.65
  D2 (R=0)  0.3               0.05                  0.4

  \beta^* = \arg\max_\beta \; p(\{(Q,D_1,1), (Q,D_2,0)\})
          = \arg\max_\beta \; \frac{1}{1+\exp(-\beta_0 - 0.7\beta_1 - 0.11\beta_2 - 0.65\beta_3)} \cdot \Big(1 - \frac{1}{1+\exp(-\beta_0 - 0.3\beta_1 - 0.05\beta_2 - 0.4\beta_3)}\Big)

In general, maximize  p(\{(Q_1,D_{11},R_{11}), (Q_1,D_{12},R_{12}), \ldots, (Q_n,D_{m1},R_{m1}), \ldots\}).

Once the \beta’s are known, we can take the Xi(Q,D) computed for a new query and a new document and generate a score for D w.r.t. Q.
230
MIAS Tutorial Summer 2011
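A small sketch of fitting the logistic model above by gradient ascent on the training likelihood, reusing the two example feature vectors from the slide; the learning rate, epoch count, and the new test feature vector are toy choices:

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def p_relevant(x, beta):
    """P(R=1|Q,D) = 1 / (1 + exp(-(beta_0 + sum_i beta_i * x_i)))."""
    return sigmoid(beta[0] + sum(b * xi for b, xi in zip(beta[1:], x)))

def fit(examples, lr=0.5, epochs=2000):
    """Maximize the training likelihood by gradient ascent on beta."""
    n_features = len(examples[0][0])
    beta = [0.0] * (n_features + 1)
    for _ in range(epochs):
        for x, r in examples:
            err = r - p_relevant(x, beta)        # gradient of the log-likelihood
            beta[0] += lr * err
            for i, xi in enumerate(x):
                beta[i + 1] += lr * err * xi
    return beta

# Feature vectors (BM25, PageRank, BM25 on anchor text) from the slide's example.
train = [([0.7, 0.11, 0.65], 1),   # D1, relevant
         ([0.3, 0.05, 0.40], 0)]   # D2, non-relevant
beta = fit(train)
# Score a new (query, document) pair by its features.
print(p_relevant([0.5, 0.08, 0.5], beta))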
Features used in early logistic
regression work
X1 = (1/M) \sum_{j=1}^{M} \log QAF_{t_j}     Average Absolute Query Frequency over all the M matched terms
X2 = QL                                      Query Length
X3 = (1/M) \sum_{j=1}^{M} \log DAF_{t_j}     Average Absolute Document Frequency
X4 = DL                                      Document Length
X5 = (1/M) \sum_{j=1}^{M} \log IDF_{t_j}     Average Inverse Document Frequency, with IDF_{t_j} = (N - n_{t_j}) / n_{t_j}
X6 = \log M                                  M = number of terms in common between query and document
This model wasn’t very successful because it tries to “re-invent” the
Vector Space model instead of leveraging it as the current models do
231
MIAS Tutorial Summer 2011
Machine Learning Approaches: Pros & Cons
• Advantages
  – Principled way to combine multiple features (helps improve accuracy and combat web spams)
  – May re-use all the past relevance judgments (self-improving)
• Problems
  – Performance is very sensitive to the selection of features
  – Not much guidance on feature generation (rely on traditional retrieval models)
• In practice, they are adopted in all current Web search engines
232
MIAS Tutorial Summer 2011
Part 3.3 Next-Generation
Search Engines
233
MIAS Tutorial Summer 2011
Next Generation Search Engines
• More specialized/customized (vertical search engines)
  – Special group of users (community engines, e.g., Citeseer)
  – Personalized (better understanding of users)
  – Special genre/domain (better understanding of documents)
• Learning over time (evolving)
• Integration of search, navigation, and recommendation/filtering (full-fledged information management)
• Beyond search to support tasks (e.g., shopping)
• Many opportunities for innovations!
234
MIAS Tutorial Summer 2011
The Data-User-Service (DUS) Triangle
(Diagram) Three dimensions:
  Users: lawyers, scientists, UIUC employees, online shoppers, …
  Data: Web pages, news articles, blog articles, literature, email, …
  Services: search, browsing, mining, task support, …
235
MIAS Tutorial Summer 2011
Millions of Ways to
Connect the DUS Triangle!
(Diagram) Example connections of users, data, and services:
  Everyone + Web pages → Web Search
  Scientists + literature → Literature Assistant
  UIUC employees + organization docs → Enterprise Search
  Online shoppers + blog articles, product reviews → Opinion Advisor
  Customer service people + customer emails → Customer Rel. Man.
with services spanning search, browsing, alert, mining, and task/decision support.
MIAS Tutorial Summer 2011
236
Where Do We Want to Be?
(Diagram) Moving from the current keyword search engine toward task support and full-fledged text information management, along two axes:
  User understanding: keyword queries → search history → personalization → complete user model (user modeling)
  Content representation: bag of words → entities-relations → large-scale semantic/knowledge representation (content analysis, vertical search engines)
with information access progressing from search to access, mining, and task support.
237
MIAS Tutorial Summer 2011
Limitations of the
Current Search Engines
• Limited query language
  – Syntactic querying (sense disambiguation?)
  – Can’t express multiple search criteria (readability?)
• Limited understanding of document contents
  – Bag of words & keyword matching (sense disambiguation?)
• Heuristic query-document matching: mostly TF-IDF weighting
  – No guarantee for optimality
  – Machine learning can combine many features, but content matching remains the most important component in scoring
• Lack of user/context modeling
  – Using the same query, different users would get the same results (poor user modeling)
  – The same user may use the same query to find different information at different times
• Inadequate support for multiple-mode information access
  – Passive search support: A user must take initiative (no recommendation)
  – Static navigation support: No dynamically generated links
  – Consider more integration of search, recommendation, and navigation
• Lack of interaction
• Lack of task support
238
MIAS Tutorial Summer 2011
Characteristics of the
Next-Generation Search Engines
• Better support for query formulation
  – Allow querying from any task context
  – Query by examples
  – Automatic query generation (recommendation)
• Better search accuracy
  – More accurate information need understanding (more personalization and context modeling)
  – More accurate document content understanding (more powerful content analysis, sense disambiguation, sentiment analysis, …)
• More complex retrieval criteria
  – Consider multiple utility aspects of information items (e.g., readability, quality, communication cost)
  – Consider collective value of information items (context-sensitive ranking)
• Better result presentation
  – Better organization of search results to facilitate navigation
  – Better summarization
• More effective and robust retrieval models
  – Automatic parameter tuning
• More interactive search
• More task support
239
MIAS Tutorial Summer 2011
Looking Ahead….
• More user modeling
  – Personalized search
  – Community search engines (collaborative information access)
• More content analysis and domain modeling
  – Vertical search engines
  – More in-depth (domain-specific) natural language understanding
  – Text mining
• More accurate retrieval models (life-time learning)
• Going beyond search
  – Towards full-fledged information access: integration of search, recommendation, and navigation
  – Towards task support: putting information access in the context of a task
240
MIAS Tutorial Summer 2011
Check out cs410 website
http://times.cs.uiuc.edu/course/410s11/
for assignments and additional lectures
241
MIAS Tutorial Summer 2011