Dragon Star Program Course: Information Retrieval
Natural Language Processing for IR
ChengXiang Zhai (翟成祥)
Department of Computer Science
Graduate School of Library & Information Science
Institute for Genomic Biology, Statistics
University of Illinois, Urbana-Champaign
http://www-faculty.cs.uiuc.edu/~czhai, [email protected]
2008 © ChengXiang Zhai
Dragon Star Lecture at Beijing University, June 21-30, 2008
Elements of Text Info Management Technologies
[Diagram: natural language content analysis turns raw text into units that support information access (search, retrieval, filtering, summarization, visualization) and information organization (categorization, clustering, extraction), which together feed knowledge acquisition (mining) and downstream applications.]
Overview of NLP for IR
• Most work has focused on improving the indexing
unit
– Syntactic phrases
– Word sense disambiguation
– Anaphora resolution
• Dependency parsing
• Sentiment retrieval
Improving the Indexing Unit
The following slides are taken/adapted from Jimmy Lin’s presentation
http://umiacs.umd.edu/~resnik/ling647_sp2006/slides/jimmy_ir_lecture.ppt
The Central Problem in IR
[Diagram: the information seeker has concepts in mind and expresses them as query terms; authors have concepts in mind and express them as document terms. Do these represent the same concepts?]
Why is IR hard?
• IR is hard because natural language is so rich
(among other reasons)
• What are the issues?
– Tokenization
– Morphological Variation
– Synonymy
– Polysemy
– Paraphrase
– Ambiguity
– Anaphora
Possible Solutions
• Vary the unit of indexing
– Strings and segments
– Tokens and words
– Phrases and entities
– Senses and concepts
• Manipulate queries and results
– Term expansion
– Post-processing of results
Tokenization
• What’s a word?
– First try: words are separated by spaces
The cat on the mat.
the, cat, on, the, mat
– What about clitics?
I’m not saying that I don’t want John’s input on this.
• What about languages without spaces?
天主教教宗若望保祿二世因感冒再度住進醫院。
天主教 教宗 若望保祿二世 因 感冒 再度 住進 醫院。
Word-Level Issues
• Morphological variation = different forms of the same concept
  – Inflectional morphology: same part of speech
    break, broke, broken; sing, sang, sung; etc.
  – Derivational morphology: different parts of speech
    destroy, destruction; invent, invention, reinvention; etc.
• Synonymy = different words, same meaning
  {dog, canine, doggy, puppy, etc.} → concept of dog
• Polysemy = same word, different meanings (e.g., root)
Paraphrase
• Language provides different ways of saying the
same thing
Who killed Abraham Lincoln?
(1) John Wilkes Booth killed Abraham Lincoln.
(2) John Wilkes Booth altered history with a bullet. He will forever be
known as the man who ended Abraham Lincoln’s life.
When did Wilt Chamberlain score 100 points?
(1) Wilt Chamberlain scored 100 points on March 2, 1962 against the
New York Knicks.
(2) On December 8, 1961, Wilt Chamberlain scored 78 points in a triple
overtime game. It was a new NBA record, but Warriors coach Frank
McGuire didn’t expect it to last long, saying, “He’ll get 100 points
someday.” McGuire’s prediction came true just a few months later in
a game against the New York Knicks on March 2.
Ambiguity
• What exactly do you mean?
I saw the man on the hill with the telescope.
Who has the telescope?
Time flies like an arrow.
Say what?
Visiting relatives can be annoying.
Who’s visiting?
• Why don’t we have problems (most of the time)?
Ambiguity in Action
• Different documents with the same keywords may have different meanings…

What do frogs eat?  (keywords: frogs, eat)
(1) Adult frogs eat mainly insects and other small animals, including earthworms, minnows, and spiders.
(2) Alligators eat many kinds of small animals that live in or near the water, including fish, snakes, frogs, turtles, small mammals, and birds.
(3) Some bats catch fish with their claws, and a few species eat lizards, rodents, small birds, tree frogs, and other bats.

What is the largest volcano in the Solar System?  (keywords: largest, volcano, solar, system)
(1) Mars boasts many extreme geographic features; for example, Olympus Mons is the largest volcano in the solar system.
(2) The Galileo probe's mission to Jupiter, the largest planet in the Solar system, included amazing photographs of the volcanoes on Io, one of its four most famous moons.
(3) Even the largest volcanoes found on Earth are puny in comparison to others found around our own cosmic backyard, the Solar System.
Anaphora
• Language provides different ways of referring to the
same entity
Who killed Abraham Lincoln?
(1) John Wilkes Booth killed Abraham Lincoln.
(2) John Wilkes Booth altered history with a bullet. He will forever be
known as the man who ended Abraham Lincoln’s life.
When did Wilt Chamberlain score 100 points?
(1) Wilt Chamberlain scored 100 points on March 2, 1962 against the
New York Knicks.
(2) On December 8, 1961, Wilt Chamberlain scored 78 points in a triple
overtime game. It was a new NBA record, but Warriors coach Frank
McGuire didn’t expect it to last long, saying, “He’ll get 100 points
someday.” McGuire’s prediction came true just a few months later in
a game against the New York Knicks on March 2.
More Anaphora
• Terminology
  – Anaphor = an expression that refers to another
  – Anaphora = the phenomenon
• Other different types of referring expressions:
Fujitsu and NEC said they were still investigating, and that
knowledge of more such bids could emerge... Other major
Japanese computer companies contacted yesterday said
they have never made such bids.
The hotel recently went through a $200 million restoration…
original artworks include an impressive collection of Greek
statues in the lobby.
• Anaphora resolution can be hard!
The city council denied the demonstrators a permit because…
they feared violence.
they advocated violence.
What can we do?
• Here are some of the problems:
– Tokenization
– Morphological variation, synonymy, polysemy
– Paraphrase, ambiguity
– Anaphora
• General approaches:
– Vary the unit of indexing
– Manipulate queries and results
What do we index?
• In information retrieval, we are after the concepts
represented in the documents
• … but we can only index strings
• So what’s the best unit of indexing?
The Tokenization Problem
• In many languages, words are not separated by
spaces…
• Tokenization = separating a string into “words”
• Simple greedy approach:
– Start with a list of every possible term (e.g., from a
dictionary)
– Look for the longest word in the unsegmented string
– Take longest matching term as the next word and
repeat
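A minimal sketch of this greedy longest-match (maximum-matching) segmenter; the toy dictionary and the example string are illustrative only, not a real lexicon.

```python
def greedy_segment(text, dictionary, max_len=6):
    """Greedy longest-match segmentation: repeatedly take the longest
    dictionary entry that prefixes the remaining string; fall back to a
    single character when nothing matches."""
    tokens = []
    i = 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += length
                break
    return tokens

# Toy dictionary and string for illustration.
toy_dict = {"天主教", "教宗", "感冒", "再度", "住進", "醫院"}
print(greedy_segment("天主教教宗因感冒再度住進醫院", toy_dict))
```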
Probabilistic Segmentation
• For an input character string: c1 c2 c3 … cn
• Try all possible partitions, e.g.:
  (c1 c2) (c3) (c4) … (cn)
  (c1 c2 c3) (c4) … (cn)
  (c1) (c2 c3 c4) … (cn)
  …
• Choose the highest-probability partition
  – E.g., compute P(c1 c2 c3) using a language model
• Challenges: search, probability estimation
Indexing N-Grams
• Consider a Chinese document: c1 c2 c3 c4 c5 … cn
• Don't segment (you could be wrong!)
• Instead, treat every character bigram as a term:
  c1c2, c2c3, c3c4, c4c5, …, cn-1cn
• Break up queries the same way
• Works at least as well as trying to segment correctly!
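A small sketch of this bigram indexing: every overlapping character pair becomes a term, and query strings are broken up the same way (the example string is illustrative).

```python
def char_bigrams(text):
    """Index every overlapping character pair as a term;
    a single character falls back to a unigram."""
    chars = [c for c in text if not c.isspace()]
    if len(chars) < 2:
        return chars
    return [chars[i] + chars[i + 1] for i in range(len(chars) - 1)]

print(char_bigrams("天主教教宗"))  # ['天主', '主教', '教教', '教宗']
```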
Morphological Variation
• Handling morphology: related concepts have
different forms
– Inflectional morphology: same part of speech
dogs = dog + PLURAL
broke = break + PAST
– Derivational morphology: different parts of speech
  destruction = destroy + ion
  researcher = research + er
• Different morphological processes:
  – Prefixing
  – Suffixing
  – Infixing
  – Reduplication
Stemming
• Dealing with morphological variation: index stems
instead of words
– Stem: a word equivalence class that preserves the
central concept
• How much to stem?
– organization → organize → organ?
– resubmission → resubmit / submission → submit?
– reconstructionism?
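A quick illustration of how a stemmer conflates forms, using NLTK's Porter stemmer (assuming NLTK is installed); the word list is illustrative and the exact outputs depend on the stemmer chosen.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["organization", "organize", "organ",
             "resubmission", "submission", "submit"]:
    # Porter stemming conflates some of these forms but not others.
    print(word, "->", stemmer.stem(word))
```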
Does Stemming Work?
• Generally, yes! (in English)
– Helps more for longer queries
– Lots of work done in this area
Donna Harman (1991) How Effective is Suffixing? Journal of the
American Society for Information Science, 42(1):7-15.
Robert Krovetz. (1993) Viewing Morphology as an Inference Process.
Proceedings of SIGIR 1993.
David A. Hull. (1996) Stemming Algorithms: A Case Study for Detailed
Evaluation. Journal of the American Society for Information Science,
47(1):70-84.
And others…
Stemming in Other Languages
• Arabic makes frequent use of infixes
the root ktb
maktab (office),
kitaab (book),
kutub (books),
kataba (he wrote),
naktubu (we write),
etc.
• What’s the most effective stemming strategy in
Arabic? Open research question…
Words = wrong indexing unit!
• Synonymy = different words, same meaning
  {dog, canine, doggy, puppy, etc.} → concept of dog
• Polysemy = same word, different meanings
  Bank: financial institution or side of a river?
  Crane: bird or construction equipment?
• It’d be nice if we could index concepts!
– Word sense: a coherent cluster in semantic space
– Indexing word senses achieves the effect of
conceptual indexing
Indexing Word Senses
• How does indexing word senses solve the
synonym/polysemy problem?
{dog, canine, doggy, puppy, etc.} → concept 112986
I deposited my check in the bank.  bank → concept 76529
I saw the sailboat from the bank.  bank → concept 53107
• Okay, so where do we get the word senses?
– WordNet
– Automatically find “clusters” of words that describe
the same concepts
– Other methods also have been tried…
Word Sense Disambiguation
• Given a word in context, automatically determine the
sense (concept)
– This is the Word Sense Disambiguation (WSD)
problem
• Context is the key:
– For each ambiguous word, note the surrounding
words
bank {river, sailboat, water, etc.} → side of a river
bank {check, money, account, etc.} → financial institution
– Learn a classifier from a collection of examples
– Use the classifier to determine the senses of words in
the documents
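A toy sketch of the context-based idea above: score each sense of an ambiguous word by the overlap between its typical context words and the words around the occurrence. The sense inventories below are illustrative, not drawn from WordNet or any real classifier.

```python
# Illustrative sense inventories for "bank" (not a real sense repository).
SENSES = {
    "bank/finance": {"check", "money", "account", "deposit", "loan"},
    "bank/river": {"river", "sailboat", "water", "shore", "fishing"},
}

def disambiguate(context_words, senses=SENSES):
    """Pick the sense whose context-word set overlaps the occurrence's
    context the most (ties broken arbitrarily)."""
    context = set(w.lower() for w in context_words)
    return max(senses, key=lambda s: len(senses[s] & context))

print(disambiguate("I deposited my check in the bank".split()))
print(disambiguate("I saw the sailboat from the bank".split()))
```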
Does it work?
• Nope!
Ellen M. Voorhees. (1993) Using WordNet to Disambiguate Word
Senses for Text Retrieval. Proceedings of SIGIR 1993.
Mark Sanderson. (1994) Word-Sense Disambiguation and Information
Retrieval. Proceedings of SIGIR 1994
And others…
• Examples of limited success….
Hinrich Schütze and Jan O. Pedersen. (1995) Information Retrieval
Based on Word Senses. Proceedings of the 4th Annual Symposium
on Document Analysis and Information Retrieval.
Rada Mihalcea and Dan Moldovan. (2000) Semantic Indexing Using
WordNet Senses. Proceedings of ACL 2000 Workshop on Recent
Advances in NLP and IR.
Why Disambiguation Hurts
• Bag-of-words techniques already disambiguate
– Context for each term is established in the query
• WSD is hard!
– Many words are highly polysemous, e.g., interest
– Granularity of senses is often domain/application
specific
• WSD tries to improve precision
– But incorrect sense assignments would hurt recall
– Slight gains in precision do not offset large drops in
recall
An Alternate Approach
• Indexing word senses “freezes” concepts at index
time
• What if we expanded query terms at query time
instead?
dog AND cat → (dog OR canine) AND (cat OR feline)
• Two approaches
– Manual thesaurus, e.g., WordNet, UMLS, etc.
– Automatically-derived thesaurus, e.g., co-occurrence
statistics
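A small sketch of query-time synonym expansion using a hand-built thesaurus; the synonym lists are illustrative, and in practice they would come from WordNet, UMLS, or co-occurrence statistics.

```python
# Illustrative thesaurus entries; real entries would come from a
# manual or automatically-derived thesaurus.
THESAURUS = {
    "dog": ["canine"],
    "cat": ["feline"],
}

def expand_query(terms, thesaurus=THESAURUS):
    """Turn each query term into an OR-group of the term plus its synonyms,
    AND-ed together: dog AND cat -> (dog OR canine) AND (cat OR feline)."""
    groups = []
    for term in terms:
        alternatives = [term] + thesaurus.get(term, [])
        groups.append("(" + " OR ".join(alternatives) + ")")
    return " AND ".join(groups)

print(expand_query(["dog", "cat"]))
```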
Does it work?
• Yes… if done “carefully”
• User should be involved in the process
– Otherwise, poor choice of terms can hurt performance
Handling Anaphora
• Anaphora resolution: finding what the anaphor refers
to (i.e., the antecedent)
John Wilkes Booth altered history with a bullet. He will forever
be known as the man who ended Abraham Lincoln’s life.
He = John Wilkes Booth
• Most common example: pronominal anaphora
resolution
– Simplest method works pretty well: find previous noun
phrase matching in gender and number
Expanding Anaphors
• When indexing, replace anaphors with their
antecedents
• Does it work?
– Somewhat
– … but can be computationally expensive
– … helps more if you want to retrieve sub-document
segments
Beyond Word-Level Indexing
• Words are the wrong unit to index…
• Many multi-word combinations identify entities
– Persons: George W. Bush, Dr. Jones
– Organizations: Red Cross, United Way
– Corporations: Hewlett Packard, Kraft Foods
– Locations: Easter Island, New York City
• Entities often have finer-grained structures
  Professor Stephen W. Hawking → [title] [first name] [middle initial] [last name]
  Cambridge, Massachusetts → [city], [state]
Indexing Named Entities
• Why would we want to index named entities?
• Index named entities as special tokens
In reality, at the time of [Edison PERSON]'s [1879 DATE] patent, the light bulb had been in existence for some five decades….
• And treat special tokens like query terms
  Who patented the light bulb? → patent light bulb PERSON
  When was the light bulb patented? → patent light bulb DATE
• Works pretty well for question answering, but not evaluated for regular topic retrieval (?)
John Prager, Eric Brown, and Anni Coden. (2000) Question-Answering by Predictive Annotation. Proceedings of SIGIR 2000.
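A sketch of indexing named entities as special tokens, assuming some NER component has already tagged the entity spans; the tiny `ENTITIES` table below stands in for a real tagger and is purely illustrative.

```python
# Stand-in for the output of a real named-entity tagger (illustrative).
ENTITIES = {"edison": "PERSON", "1879": "DATE"}

def index_tokens(text, entities=ENTITIES):
    """Emit the ordinary word plus a special type token (PERSON, DATE, ...)
    for each recognized entity, so queries can match on either one."""
    tokens = []
    for raw in text.lower().split():
        word = raw.strip(".,!?").removesuffix("'s")
        tokens.append(word)
        if word in entities:
            tokens.append(entities[word])
    return tokens

print(index_tokens("at the time of Edison's 1879 patent"))
```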
Indexing Phrases
• Motivation: “bank terminology” vs “terminology bank”
• Two types of phrases
– Those that make sense, e.g., “school bus”, “hot dog”
– Those that don’t, e.g., bigrams in Chinese
• Treat multi-word tokens as index terms
• Three sources of evidence:
– Dictionary lookup
– Linguistic analysis
– Statistical analysis (e.g., co-occurrence)
Known Phrases
• Compile a term list that includes phrases
– Technical terminology can be very helpful
• Index any phrase that occurs in the list
• Most effective in a limited domain
– Otherwise hard to capture most useful phrases
Syntactic Phrases
• Parsing = automatically assign structure to a
sentence
[Parse tree for "The quick brown fox jumped over the lazy black dog": a noun phrase (Det Adj Adj Noun), a verb, and a prepositional phrase (Prep + noun phrase: Det Adj Adj Noun).]
• “Walk” the tree and extract phrases
– Index all noun phrases
– Index subjects and verbs
– Index verbs and objects
– etc.
Syntactic Variations
• What does linguistic analysis buy?
– Coordinations
  lung and breast cancer → lung cancer, breast cancer
– Substitutions
  inflammatory sinonasal disease → inflammatory disease, sinonasal disease
– Permutations
  addition of calcium → calcium addition
Statistical Analysis
• Automatically discover phrases based on co-occurrence probabilities
  P("kick the bucket") = P("kick") × P("the") × P("bucket")?
• If terms are not independent, they may form a
phrase
• Use this method to automatically learn a phrase
dictionary
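A sketch of this co-occurrence test: compare the observed probability of an adjacent word pair against what independence would predict (pointwise mutual information). The corpus, threshold, and minimum count are illustrative, and stop words would normally be filtered first.

```python
import math
from collections import Counter

def pmi_phrases(tokens, threshold=1.0, min_count=2):
    """Flag adjacent word pairs whose joint probability greatly exceeds the
    independence prediction: log( P(w1 w2) / (P(w1) P(w2)) )."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    phrases = []
    for (w1, w2), c in bigrams.items():
        pmi = math.log((c / (n - 1)) / ((unigrams[w1] / n) * (unigrams[w2] / n)))
        if pmi > threshold and c >= min_count:
            phrases.append((w1, w2, round(pmi, 2)))
    return phrases

corpus = "new york is big i love new york new york never sleeps".split()
print(pmi_phrases(corpus))  # ("new", "york") stands out as a phrase
```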
Does Phrasal Indexing Work?
• Yes…
• But the gains are so small they’re not worth the cost
• Primary drawback: too slow!
Sample Phrase-Indexing Results [Zhai 97]
Using a single type of phrase to supplement single words helps,
but using all phrases to supplement single words doesn't help as much.
What about ambiguity?
• Different documents with the same keywords may have different meanings…

What do frogs eat?  (keywords: frogs, eat)
(1) Adult frogs eat mainly insects and other small animals, including earthworms, minnows, and spiders.
(2) Alligators eat many kinds of small animals that live in or near the water, including fish, snakes, frogs, turtles, small mammals, and birds.
(3) Some bats catch fish with their claws, and a few species eat lizards, rodents, small birds, tree frogs, and other bats.

What is the largest volcano in the Solar System?  (keywords: largest, volcano, solar, system)
(1) Mars boasts many extreme geographic features; for example, Olympus Mons is the largest volcano in the solar system.
(2) The Galileo probe's mission to Jupiter, the largest planet in the Solar system, included amazing photographs of the volcanoes on Io, one of its four most famous moons.
(3) Even the largest volcanoes found on Earth are puny in comparison to others found around our own cosmic backyard, the Solar System.
Indexing Relations
• Instead of terms, index syntactic relations between
entities in the text
Adult frogs eat mainly insects and other small animals,
including earthworms, minnows, and spiders.
< frogs subject-of eat >
< insects object-of eat >
< animals object-of eat >
< adult modifies frogs >
< small modifies animals >
Alligators eat many kinds of small animals that live in or near the water,
including fish, snakes, frogs, turtles, small mammals, and birds.
< alligators subject-of eat >
< kinds object-of animals >
< small modifies animals >
From the relations, it is clear who’s eating whom!
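A small sketch of relation indexing, assuming a parser has already produced (word, relation, word) triples; the triples below are written by hand for illustration rather than taken from a real parser.

```python
# Hand-written stand-ins for parser output (illustrative only).
frog_doc = [("frogs", "subject-of", "eat"), ("insects", "object-of", "eat"),
            ("animals", "object-of", "eat"), ("adult", "modifies", "frogs")]
gator_doc = [("alligators", "subject-of", "eat"), ("frogs", "object-of", "eat")]

def relation_terms(triples):
    """Turn each relation triple into a single index term, so that
    'frogs as eater' and 'frogs as food' become different terms."""
    return {f"{w1}|{rel}|{w2}" for w1, rel, w2 in triples}

query = {"frogs|subject-of|eat"}   # "What do frogs eat?"
for name, doc in [("frog_doc", frog_doc), ("gator_doc", gator_doc)]:
    print(name, "matches" if query & relation_terms(doc) else "no match")
```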
Are syntactic relations enough?
• Consider this example:
John broke the window.
The window broke.
< John subject-of break >
< window subject-of break>
“John” and “window” are both subjects…
But John is the person doing the breaking (or “agent”),
and the window is the thing being broken (or “theme”)
• Syntax sometimes isn’t enough… we need
semantics (or meaning)!
• Semantics, for example, allows us to relate the
following two fragments:
The barbarians destroyed the city…
The destruction of the city by the barbarians…
event: destroy
agent: barbarians
theme: city
Semantic Roles
• Semantic roles are invariant with respect to syntactic
expression
Mary loaded the truck with hay.
Hay was loaded onto the truck by Mary.
event: load
agent: Mary
material: hay
destination: truck
• The idea:
– Identify semantic roles
– Index “frame structures” with filled slots
– Retrieve answers based on semantic-level matching
Does it work?
• No, not really…
• Why not?
– Syntactic and semantic analysis is difficult: errors offset whatever gain is obtained
– As with WSD, these techniques are precision-enhancers… recall usually takes a dive
– It's slow!
Alternative Approach
• Sophisticated linguistic analysis is slow!
  – Unnecessary processing can be avoided by query-time analysis
• Two-stage retrieval
  – Use standard document retrieval techniques to fetch a candidate set of documents
  – Use passage retrieval techniques to choose a few promising passages (e.g., paragraphs)
  – Apply sophisticated linguistic techniques to pinpoint the answer
• Passage retrieval
  – Find "good" passages within documents
  – Key idea: locate areas where lots of query terms appear close together
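A minimal sketch of the passage-scoring idea: slide a fixed-size window over the document and score each window by how many distinct query terms it contains. The window size, scoring rule, and example document are illustrative.

```python
def best_passages(doc_tokens, query_terms, window=30, top_k=3):
    """Score each fixed-size window by the number of distinct query terms
    it covers, and return the top-scoring (score, start offset) pairs."""
    query = set(t.lower() for t in query_terms)
    scored = []
    for start in range(0, max(1, len(doc_tokens) - window + 1)):
        window_terms = set(t.lower() for t in doc_tokens[start:start + window])
        scored.append((len(query & window_terms), start))
    scored.sort(reverse=True)
    return scored[:top_k]

doc = "adult frogs eat mainly insects and other small animals".split()
print(best_passages(doc, ["frogs", "eat"], window=5))
```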
Summary
• IR is hard because language is rich and complex
(among other reasons)
• Two general approaches to the problem
– Attempt to find the best unit of indexing
– Try to fix things at query time
• It is hard to predict a priori what techniques work
– Questions must be answered experimentally
• Words are really the wrong thing to index
– But there isn’t really a better alternative…
What You Should Know
• Lots of effort has been made to improve the indexing unit
• Most results, however, are not promising
Dependence Language Model
for Information Retrieval
Jianfeng Gao, Jian-Yun Nie,
Guangyuan Wu, Guihong Cao,
Dependence Language Model for
Information Retrieval, SIGIR 2004
Reference
• Structure and performance of a dependency language model. Ciprian Chelba, David Engle, et al. Eurospeech 1997.
• Parsing English with a Link Grammar. Daniel D. K. Sleator and Davy Temperley. Technical Report CMU-CS-91-196, 1991.
Why do we use the independence assumption?
• The independence assumption is one of the assumptions widely adopted in probabilistic retrieval theory.
• Why?
  – It makes retrieval models simpler.
  – It makes retrieval operations tractable.
• The shortcoming of the independence assumption
  – The independence assumption does not hold in textual data.
Latest ideas on the dependence assumption
• Bigram
  – Some language modeling approaches try to incorporate word dependence by using bigrams.
  – Shortcomings:
    • Word dependencies exist not only between adjacent words but also between more distant words.
    • Some adjacent words are not actually related.
  – The bigram language model shows only marginally better effectiveness than the unigram model.
• Bi-term
  – The bi-term language model is similar to the bigram model except that the word-order constraint is relaxed.
  – "information retrieval" and "retrieval of information" are assigned the same probability of generating the query.
Structure and performance of a
dependency language model
Introduction
• This paper presents a maximum entropy language model that incorporates both syntax and semantics via a dependency grammar.
• Dependency grammar: expresses the relations between words as a directed graph, which can incorporate the predictive power of words that lie outside of bigram or trigram range.
Introduction
• Why do we use N-grams?
  – Assume $S = w_0, w_1, w_2, \ldots, w_n$ and
    $P(S) = P(w_0)\,P(w_1 \mid w_0) \cdots P(w_n \mid w_0 \ldots w_{n-1})$
  – If we want to record $P(w_n \mid w_0 \ldots w_{n-1})$ directly, we need to store on the order of $|V|^{n}(|V|-1)$ independent parameters
• The drawback of N-grams
  – N-grams blindly discard relevant words that lie N or more positions in the past.
Structure of the model
• Develop an expression for the joint probability $P(S, K)$, where $K$ denotes the linkages in the sentence.
• Then we get
  $P(S) = \sum_K P(S, K)$
• Assume that the sum is dominated by a single term,
  $K^* = \arg\max_K P(S, K)$,
  so that
  $P(S) \approx P(S, K^*)$
A dependency language model of IR
• For a query $Q = (q_1 \ldots q_m)$, we want to rank documents by $P(Q \mid D)$
  – Previous work: assume independence between query terms:
    $P(Q \mid D) = \prod_{i=1 \ldots m} P(q_i \mid D)$
  – New work: assume that the term dependencies in a query form a linkage $L$:
    $P(Q \mid D) = \sum_L P(Q, L \mid D) = \sum_L P(L \mid D)\,P(Q \mid L, D)$
A dependency language model of IR
$P(Q \mid D) = \sum_L P(Q, L \mid D) = \sum_L P(L \mid D)\,P(Q \mid L, D)$
• Assume that the sum over all possible $L$'s is dominated by a single term $L^*$:
  $P(Q \mid D) \approx P(L \mid D)\,P(Q \mid L, D)$, such that $L = \arg\max_L P(L \mid D)$
• Assume that each term is dependent on exactly one related query term generated previously.
  [Figure: an example linkage over query terms $q_h$, $q_i$, $q_j$]
A dependency language model of IR
$P(Q \mid D) \approx P(L \mid D)\,P(Q \mid L, D)$, such that $L = \arg\max_L P(L \mid D)$
$\begin{aligned}
P(Q \mid L, D) &= P(q_h \mid D) \prod_{(i,j) \in L} P(q_j \mid q_i, L, D) \\
&= P(q_h \mid D) \prod_{(i,j) \in L} \frac{P(q_j, q_i \mid L, D)}{P(q_i \mid L, D)} \\
&= P(q_h \mid D) \prod_{(i,j) \in L} \frac{P(q_j, q_i \mid L, D)}{P(q_i \mid L, D)\,P(q_j \mid L, D)}\,P(q_j \mid L, D)
\end{aligned}$
[Figure: an example linkage over query terms $q_h$, $q_i$, $q_j$]
A dependency language model of IR
• Assume $P(q_j \mid L, D) = P(q_j \mid D)$
  – The generation of a single term is independent of $L$
$\begin{aligned}
P(Q \mid L, D) &= P(q_h \mid D) \prod_{j \neq h} P(q_j \mid D) \prod_{(i,j) \in L} \frac{P(q_i, q_j \mid L, D)}{P(q_i \mid D)\,P(q_j \mid D)} \\
&= \prod_{i=1 \ldots m} P(q_i \mid D) \prod_{(i,j) \in L} \frac{P(q_i, q_j \mid L, D)}{P(q_i \mid D)\,P(q_j \mid D)}
\end{aligned}$
• By this assumption, we would have arrived at the same result by starting from any term, so $L$ can be represented as an undirected graph.
A dependency language model of IR
$P(Q \mid L, D) = \prod_{i=1 \ldots m} P(q_i \mid D) \prod_{(i,j) \in L} \frac{P(q_i, q_j \mid L, D)}{P(q_i \mid D)\,P(q_j \mid D)}$
$P(Q \mid D) \approx P(L \mid D)\,P(Q \mid L, D)$, such that $L = \arg\max_L P(L \mid D)$
Taking the log:
$\log P(Q \mid D) = \log P(L \mid D) + \sum_{i=1 \ldots m} \log P(q_i \mid D) + \sum_{(i,j) \in L} MI(q_i, q_j \mid L, D)$
where
$MI(q_i, q_j \mid L, D) = \log \frac{P(q_i, q_j \mid L, D)}{P(q_i \mid D)\,P(q_j \mid D)}$
Parameter Estimation
• Estimating $P(L \mid D)$ in
  $\log P(Q \mid D) = \log P(L \mid D) + \sum_{i=1 \ldots m} \log P(q_i \mid D) + \sum_{(i,j) \in L} MI(q_i, q_j \mid L, D)$
  – Assume that the linkages are independent:
    $P(L \mid D) = \prod_{l \in L} P(l \mid D)$
  – Then count the relative frequency of a link $l$ between $q_i$ and $q_j$, given that they appear in the same sentence of the training data:
    $F(R \mid q_i, q_j) = \frac{C(q_i, q_j, R)}{C(q_i, q_j)}$
    where $C(q_i, q_j, R)$ counts sentences in which $q_i$ and $q_j$ have a link and $C(q_i, q_j)$ counts sentences in which they co-occur; this score is used for $P(l \mid Q)$.
Parameter Estimation
$P(l \mid Q) \approx F(R \mid q_i, q_j)$
$L = \arg\max_L P(L \mid Q) = \arg\max_L \prod_{(i,j) \in L} F(R \mid q_i, q_j)$
$P(L \mid D) = \prod_{l \in L} P(l \mid D) \approx P(L \mid Q, D) = \prod_{(i,j) \in L} F(R \mid q_i, q_j)$   (assumption: $P(L \mid Q) \approx P(L \mid D)$)
$F(R \mid q_i, q_j) = (1-\lambda)\,F_D(R \mid q_i, q_j) + \lambda\,F_C(R \mid q_i, q_j)$
Parameter Estimation
• Estimating $P(q_i \mid D)$
  – The document language model is smoothed with a Dirichlet prior:
    $P'(q_i \mid D) = (1-\alpha)\,P(q_i \mid D) + \alpha\,P(q_i \mid C)$
    where $\alpha$ is a constant discount, $P(q_i \mid C)$ is the collection model estimated from the collection counts $C_C(q_i)$, and $P(q_i \mid D)$ is the Dirichlet-smoothed document model estimated from the document counts $C_D(q_i)$ and the collection counts.
Parameter Estimation
• Estimating $MI(q_i, q_j \mid L, D)$
  $MI(q_i, q_j \mid L, D) = \log \frac{P(q_i, q_j \mid L, D)}{P(q_i \mid L, D)\,P(q_j \mid L, D)}
  \approx \log \frac{C_D(q_i, q_j, R)/N}{\left(C_D(q_i, *, R)/N\right)\left(C_D(*, q_j, R)/N\right)}
  = \log \frac{C_D(q_i, q_j, R)\,N}{C_D(q_i, *, R)\,C_D(*, q_j, R)}$
  where $N = C_D(*, *, R)$
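A small sketch of this MI estimate from link counts, where `c_ij` plays the role of $C_D(q_i, q_j, R)$, `c_i_star` and `c_star_j` the marginal link counts, and `n` the total $C_D(*, *, R)$; the counts in the example are made up.

```python
import math

def mutual_information(c_ij, c_i_star, c_star_j, n):
    """MI(q_i, q_j | L, D) ~ log( C_D(q_i, q_j, R) * N /
                                  (C_D(q_i, *, R) * C_D(*, q_j, R)) )."""
    return math.log((c_ij * n) / (c_i_star * c_star_j))

# Made-up counts for illustration only.
print(mutual_information(c_ij=8, c_i_star=20, c_star_j=15, n=300))
```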
Experimental Setting
• Words were stemmed and stop words were removed.
• Queries are TREC topics 202 to 250, run on TREC disks 2 and 3.
The flow of the experiment
[Flowchart: for a query and a document —
(1) find the linkage of the query by maximizing $P(l \mid Q)$ to get the best linkage $L$;
(2) from the training data, count link frequencies to get $F_D(R \mid q_i, q_j)$ and $F_C(R \mid q_i, q_j)$, combine them for weight computation, and obtain $P(L \mid D)$;
(3) count $C_C(q_i)$ and $C_D(q_i)$ to get $P(q_i \mid D)$;
(4) count $C_D(q_i, *, R)$, $C_D(*, q_j, R)$, $C_D(q_i, q_j, R)$, and $C_D(*, *, R)$ to get $MI(q_i, q_j \mid L, D)$;
(5) combine the three components to rank documents.]
Results: BM & UG
• BM: binary independence retrieval
• UG: unigram language model approach
• UG achieves performance similar to, or worse than, that of BM.
Results: DM
• DM: dependency model
• The improvement of DM over UG is statistically significant.
Results: BG
• BG: bigram language model
• BG is slightly worse than DM in five out of six TREC collections but substantially outperforms UG in all collections.
Results: BT1 & BT2
• BT: bi-term language model
  $P_{BT1}(q_i \mid q_{i-1}, D) = \frac{1}{2}\left(P_{BG}(q_i \mid q_{i-1}, D) + P_{BG}(q_{i-1} \mid q_i, D)\right)$
  $P_{BT2}(q_i \mid q_{i-1}, D) = \frac{C_D(q_{i-1}, q_i) + C_D(q_i, q_{i-1})}{2 \cdot \min\{C_D(q_{i-1}), C_D(q_i)\}}$
Conclusion
• This paper introduces the linkage of a query as a hidden variable.
• Each term is generated in turn, depending on other related terms, according to the linkage.
  – This approach covers several language model approaches as special cases.
• The experimental results show that this approach substantially outperforms the unigram, bigram, and classical probabilistic retrieval models.
Opinion Retrieval from Blogs
Wei Zhang1
[email protected]
Clement Yu1
[email protected]
Weiyi Meng2
[email protected]
1 Department of Computer Science, University of Illinois at Chicago
2 Department of Computer Science, Binghamton University
CIKM 2007
Outline
• Overview of the opinion retrieval
• Topic retrieval
• Opinion identification
• Ranking documents by opinion similarity
• Experimental results
Overview of the Opinion Retrieval
• Opinion retrieval
• Given a query, find documents that have
subjective opinions about the query
A query “book”
Relevant: “This is a very good book.”
Irrelevant: “This book has 123 pages.”
Overview of the Opinion Retrieval
• Introduced at TREC 2006 Blog Track
• 14 groups, 57 submitted runs in TREC 2006
• 20 groups, 104 runs in TREC 2007 (ongoing)
• Key problems
• Opinion features
• Query-related opinions
• Rank the retrieved documents
Our Algorithm
[Pipeline: query + document set → retrieved documents → opinionative documents → query-related opinionative documents]
Topic Retrieval
• Retrieve query-relevant documents
• No opinion involved
• Features
• Phrase recognition
• Query expansion
• Two document-query similarities
Topic Retrieval – Phrase Recognition
• Semantic relationship among the words
• For phrase similarity calculation purpose
• 4 types
  • Proper noun: "University of Lisbon"
  • Dictionary phrase: "computer science"
  • Simple phrase: "white car"
  • Complex phrase: "small white car"
Topic Retrieval – Query Expansion
• Find the synonyms
  • "wto" → "world trade organization"
  • Same importance
• Add additional terms
  • "wto" → negotiate, agreements, tariffs, …
Topic Retrieval - Similarity
• Sim(Query, Doc) = <Sim_P, Sim_T>
• Phrase similarity
• Having or not having a phrase
• Sim_P = sum ( idf(P_i) )
• Term similarity
• Sum of the Okapi scores of all the query terms
• Document ranking
  • D1 is ranked higher than D2 if
    (Sim_P1 > Sim_P2) OR (Sim_P1 = Sim_P2 AND Sim_T1 > Sim_T2)
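A sketch of this two-level ranking rule: sort primarily by phrase similarity and break ties by term similarity. The document scores below are made up for illustration.

```python
# Each document carries (Sim_P, Sim_T); illustrative scores only.
docs = [("D1", (4.2, 11.0)), ("D2", (4.2, 13.5)), ("D3", (2.0, 20.0))]

# Sort by phrase similarity first, then by term (Okapi) similarity.
ranked = sorted(docs, key=lambda d: (d[1][0], d[1][1]), reverse=True)
print([name for name, _ in ranked])  # D2 before D1; D3 last despite high Sim_T
```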
Opinion Identification
[Pipeline: subjective training data + objective training data → feature selection → SVM classifier; retrieved documents (from topic retrieval) → SVM classifier → opinionative documents (to opinion ranking)]
Opinion Identification – Training Data
• Subjective training data
• Review web sites
• Documents having opinionative phrases
• Objective training data
• Dictionary entries
• Documents not having opinionative phrases
Opinion Identification – Feature Selection
• The words expressing opinions
• Pearson’s Chi-square test
• Test of the independence between subjectivity
label and words via contingency table
• Count the number of sentences
• Unigrams and bigrams
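A sketch of the chi-square test on a 2x2 contingency table of (word present / absent) x (subjective / objective) sentence counts; the counts and the example word are made up.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson's chi-square for the 2x2 table
         [[a, b],   a = subjective sentences containing the word,
          [c, d]]   b = subjective sentences without it, etc."""
    n = a + b + c + d
    numerator = n * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator

# Made-up sentence counts for the word "terrible".
print(chi_square_2x2(a=90, b=910, c=10, d=990))
```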
Opinion Identification – Classifier
• A support vector machine (SVM) classifier
  [Training: subjective and objective sentences → feature vector representation (over the selected features) → training → SVM classifier]
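A minimal sketch of this training step using scikit-learn (assuming it is available); the two tiny sentence lists stand in for the review-site and dictionary training data, and the feature extraction is simplified to raw unigram/bigram counts.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# Illustrative stand-ins for the subjective / objective training sentences.
subjective = ["this is a very good book", "i really hate the ending"]
objective = ["this book has 123 pages", "the book was published in 1999"]

vectorizer = CountVectorizer(ngram_range=(1, 2))  # unigram + bigram features
X = vectorizer.fit_transform(subjective + objective)
y = [1] * len(subjective) + [0] * len(objective)

classifier = LinearSVC()
classifier.fit(X, y)
print(classifier.predict(vectorizer.transform(["what a wonderful story"])))
```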
Opinion Identification – Classifier
• Apply the SVM classifier
[Applying the classifier: each sentence of a document is labeled by the SVM as subjective or objective (sentence 1 → objective, sentence 2 → subjective, …, sentence n → objective)]
Opinion Similarity - Query-Related Opinions
• Find the query-related opinions
  [An opinionative sentence is considered query-related if a query term occurs within a text window around it in the document]
Opinion Similarity – Similarity 1
• Assumption 1: higher topic relevance → higher rank
• OSim_ir = Sim(Query, Doc)
Opinion Similarity – Similarity 2
• Assumption 2: more query-related opinions → higher rank
• OSim_stcc: total number of query-related opinionative sentences
• OSim_stcs: total score of query-related opinionative sentences
Opinion Similarity – Similarity 3
• A linear combination of 1 and 2
• a * Osim_ir + (1-a) * OSim_stcc
• b * Osim_ir + (1-b) * OSim_stcs
Opinion Similarity – Experimental Results
• TREC 2006 Blog Track data
  • 50 queries, 3.2 million blog documents
• UIC at TREC 2006 Blog Track
  • Title-only queries: ranked first
  • 28% – 32% higher than the best TREC 2006 scores
• Good things learned
  • More training data
  • Combined similarity function
Conclusions
• Designed and implemented an opinion retrieval
system. IR + text classification for opinion retrieval
• The best known retrieval effectiveness on TREC
2006 blog data
• Extend to polarity classification:
positive/negative/mixed
• Plan to improve feature selection
What You Should Know
• Language models offer more opportunities for studying the weighting of phrase indexing.
• Opinion finding is a hot area now. A major challenge is how to combine a sentiment score with a topic score.
Future Research Directions in NLP for IR
• It’s time to revisit many old issues:
– We have better NLP techniques now
– We have better probabilistic retrieval models now
– Possible topics: phrase indexing, word sense
disambiguation, TF-effect of NLP,…
• Language models seem to offer better opportunities for adopting NLP results (e.g., dependency LM)
• Sentiment retrieval
• NLP for helping difficult topics
Roadmap
• This lecture: NLP for IR (understanding documents)
• Next lecture: Topic models for text mining