SIMS 290-2:
Applied Natural Language Processing
Marti Hearst
Oct 23, 2006
(Slides developed by Preslav Nakov)
Today
Feature selection
TF.IDF Term Weighting
Weka Input File Format
Features for Text Categorization
Linguistic features
Words
– lowercase? (should we convert everything to lowercase?)
– normalized? (e.g. “texts” → “text”)
Phrases
Word-level n-grams
Character-level n-grams
Punctuation
Part of Speech
Non-linguistic features
document formatting
informative character sequences (e.g. &lt;)
When Do We Need Feature Selection?
If the algorithm cannot handle all possible features
– e.g. language identification for 100 languages using all words
– text classification using n-grams
Good features can result in higher accuracy
What if we just keep all features?
– Even the unreliable features can be helpful.
– But we need to weight them:
In the extreme case, the bad features can have a weight of 0 (or very close), which is… a form of feature selection!
Why Feature Selection?
Not all features are equally good!
Bad features: best to remove
– Infrequent
  unlikely to be seen again
  co-occurrence with a class can be due to chance
– Too frequent
  mostly function words
– Uniform across all categories
Good features: should be kept
– Co-occur with a particular category
– Do not co-occur with other categories
The rest: good to keep
Types Of Feature Selection?
Feature selection reduces the number of features
Usually:
– Eliminating features
– Weighting features
– Normalizing features
Sometimes by transforming parameters
– e.g. Latent Semantic Indexing using Singular Value Decomposition
Method may depend on problem type
– For classification and filtering, may want to use information from example documents to guide selection.
Feature Selection
Task-independent methods
Document Frequency (DF)
Term Strength (TS)
Task-dependent methods
Information Gain (IG)
Mutual Information (MI)
χ² statistic (CHI)
Empirically compared by Yang & Pedersen (1997)
Yang & Pedersen Experiments
Compared feature selection methods for text
categorization
5 feature selection methods:
– DF, MI, CHI, (IG, TS)
– Features were just words, not phrases
2 classifiers:
– kNN: k-Nearest Neighbor
– LLSF: Linear Least Squares Fit
2 data collections:
– Reuters-22173
– OHSUMED: subset of MEDLINE (1990 & 1991 used)
Document Frequency (DF)
DF: number of documents a term appears in
Based on Zipf’s Law
Remove the rare terms (seen 1-2 times):
Spurious
Unreliable – can be just noise
Unlikely to appear in new documents
Plus
Easy to compute
Task independent: do not need to know the classes
Minus
Ad hoc criterion
For some applications, rare terms can be good
discriminators (e.g., in IR)
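A minimal sketch of DF-based filtering, assuming the corpus is already tokenized (the toy documents and the threshold are illustrative):

from collections import Counter

def df_filter(docs, min_df=3):
    """Keep terms that appear in at least min_df documents; by default drops terms seen in only 1-2 documents."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))          # count each term at most once per document
    return {t for t, n in df.items() if n >= min_df}

docs = [["the", "jaguar", "runs"], ["the", "car", "runs"], ["the", "cat"]]
print(df_filter(docs, min_df=2))     # the terms seen in at least 2 documents: 'the' and 'runs'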
Stop Word Removal
Common words from a predefined list
Mostly from closed-class categories:
– unlikely to have a new word added
– include: auxiliaries, conjunctions, determiners, prepositions,
pronouns, articles
But also some open-class words like numerals
Bad discriminators
uniformly spread across all classes
can be safely removed from the vocabulary
– Is this always a good idea? (e.g. author identification)
χ² statistic (CHI)
χ² statistic (pronounced “kai square”)
A commonly used method of comparing proportions.
Measures the lack of independence between a term and a category (Yang & Pedersen)
χ² statistic (CHI)
Is “jaguar” a good predictor for the “auto” class?

                Term = jaguar   Term ≠ jaguar
Class = auto          2              500
Class ≠ auto          3             9500

We want to compare:
the observed distribution above; and
the null hypothesis: that jaguar and auto are independent
χ² statistic (CHI)
Under the null hypothesis (jaguar and auto independent):
How many co-occurrences of jaguar and auto do we expect?
If independent: Pr(j,a) = Pr(j) × Pr(a)
So, there would be: N × Pr(j,a), i.e. N × Pr(j) × Pr(a)
Pr(j) = (2+3)/N
Pr(a) = (2+500)/N
N = 2+3+500+9500 = 10005
Which = N × (5/N) × (502/N) = 2510/N = 2510/10005 ≈ 0.25

                Term = jaguar   Term ≠ jaguar
Class = auto          2              500
Class ≠ auto          3             9500
χ² statistic (CHI)
Under the null hypothesis (jaguar and auto independent):
How many co-occurrences of jaguar and auto do we expect?

observed: fo, with expected: fe in parentheses
                Term = jaguar   Term ≠ jaguar
Class = auto       2 (0.25)          500
Class ≠ auto          3             9500
χ² statistic (CHI)
Under the null hypothesis (jaguar and auto independent):
How many co-occurrences of jaguar and auto do we expect?

observed: fo, with expected: fe in parentheses
                Term = jaguar    Term ≠ jaguar
Class = auto       2 (0.25)        500 (502)
Class ≠ auto       3 (4.75)       9500 (9498)
χ² statistic (CHI)
χ² is interested in (fo – fe)²/fe summed over all table entries:

χ²(j,a) = Σ (O – E)²/E
        = (2 – 0.25)²/0.25 + (3 – 4.75)²/4.75 + (500 – 502)²/502 + (9500 – 9498)²/9498
        ≈ 12.9   (p < .001)

The null hypothesis is rejected with confidence .999, since 12.9 > 10.83 (the value for .999 confidence).

observed: fo, with expected: fe in parentheses
                Term = jaguar    Term ≠ jaguar
Class = auto       2 (0.25)        500 (502)
Class ≠ auto       3 (4.75)       9500 (9498)
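The same computation in Python, as a minimal sketch (the counts are the ones from the table above; everything else is illustrative):

# chi-square for the 2x2 jaguar/auto table, via expected counts under independence
observed = {("auto", "jaguar"): 2,     ("auto", "no_jaguar"): 500,
            ("no_auto", "jaguar"): 3,  ("no_auto", "no_jaguar"): 9500}
N = sum(observed.values())                                  # 10005

row_totals = {"auto": 2 + 500, "no_auto": 3 + 9500}         # class marginals
col_totals = {"jaguar": 2 + 3, "no_jaguar": 500 + 9500}     # term marginals

chi2 = 0.0
for (cls, term), fo in observed.items():
    fe = row_totals[cls] * col_totals[term] / N             # expected count under independence
    chi2 += (fo - fe) ** 2 / fe
print(round(chi2, 1))   # 12.8; the slide gets 12.9 because it rounds the expected counts first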
χ² statistic (CHI)
There is a simpler formula for χ²:

χ²(t,c) = N (AD – CB)² / ((A+C)(B+D)(A+B)(C+D))

A = #(t,c)     B = #(t,¬c)
C = #(¬t,c)    D = #(¬t,¬c)
N = A + B + C + D
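As a sketch, the closed form in Python, checked against the jaguar/auto counts (the function name is arbitrary):

def chi2_closed_form(A, B, C, D):
    """Chi-square for a 2x2 term/class table: A=#(t,c), B=#(t,~c), C=#(~t,c), D=#(~t,~c)."""
    N = A + B + C + D
    return N * (A * D - C * B) ** 2 / ((A + C) * (B + D) * (A + B) * (C + D))

print(round(chi2_closed_form(A=2, B=500, C=3, D=9500), 1))   # 12.8, matching the direct computation above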
χ² statistic (CHI)
How to use χ² for multiple categories?
Compute χ² for each category and then combine:
To require a feature to discriminate well across all categories, take the expected value of χ²:

χ²_avg(t) = Σ_i Pr(c_i) · χ²(t, c_i)

Or, to weight for a single category, take the maximum:

χ²_max(t) = max_i χ²(t, c_i)
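A brief sketch of the two combination rules; the per-category scores and class priors below are hypothetical toy numbers:

def combine_chi2(scores, priors):
    """scores: {class: chi2(t, c)} for one term; priors: {class: Pr(c)}.
    Returns (chi2_avg, chi2_max) as defined above."""
    avg = sum(priors[c] * s for c, s in scores.items())   # expected value over classes
    return avg, max(scores.values())

scores = {"auto": 12.8, "sports": 0.4, "politics": 0.1}   # toy per-category chi-square scores for one term
priors = {"auto": 0.2, "sports": 0.3, "politics": 0.5}
print(combine_chi2(scores, priors))                       # roughly (2.73, 12.8)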
χ² statistic (CHI)
Pluses
normalized and thus comparable across terms
χ²(t,c) is 0 when t and c are independent
can be compared to the χ² distribution, 1 degree of freedom
Minuses
unreliable for low-frequency terms
Information Gain
A measure of importance of the feature for predicting
the presence of the class.
Has an information theoretic justification
Defined as:
The number of “bits of information” gained by
knowing the term is present or absent
Based on Information Theory
– We won’t go into this in detail here.
Information Gain (IG)
IG: number of bits of information gained by knowing the term is present or absent

IG(t) = H(c) – [ Pr(t)·H(c|t) + Pr(¬t)·H(c|¬t) ]

H(c): entropy of the class variable
H(c|t): specific conditional entropy, given the term is present
H(c|¬t): specific conditional entropy, given the term is absent
t is the term being scored, c_i ranges over the classes
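A minimal sketch of this definition, assuming a 0/1 term-presence vector and a parallel list of class labels (both hypothetical inputs):

import math
from collections import Counter

def entropy(labels):
    """H over the empirical class distribution of `labels`."""
    n = len(labels)
    return -sum((k / n) * math.log2(k / n) for k in Counter(labels).values())

def info_gain(term_present, labels):
    """IG(t) = H(c) - [ Pr(t)*H(c|t) + Pr(not t)*H(c|not t) ]."""
    with_t = [c for p, c in zip(term_present, labels) if p]
    without_t = [c for p, c in zip(term_present, labels) if not p]
    p_t = len(with_t) / len(labels)
    return entropy(labels) - (p_t * entropy(with_t) + (1 - p_t) * entropy(without_t))

print(info_gain([1, 1, 0, 0], ["auto", "auto", "sport", "sport"]))   # 1.0 bit: the term predicts the class perfectly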
Mutual Information (MI)
The probability of seeing x and y together
vs
The probability of seeing x anywhere times the probability of seeing y anywhere (independently).
MI = log ( P(x,y) / P(x)P(y) )
   = log(P(x,y)) – log(P(x)P(y))
From Bayes’ law: P(x,y) = P(x|y)P(y)
   = log(P(x|y)P(y)) – log(P(x)P(y))
MI = log(P(x|y)) – log(P(x))
Mutual Information (MI)
Rare terms get higher scores
Approximation (does not use term absence):

I(t,c) ≈ log( A·N / ((A+C)(A+B)) )

A = #(t,c)     B = #(t,¬c)
C = #(¬t,c)    D = #(¬t,¬c)
N = A + B + C + D
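The approximation as a short Python function, tried on the jaguar/auto counts from earlier (natural log; the function name is arbitrary):

import math

def mi_approx(A, B, C, N):
    """I(t,c) ~ log( A*N / ((A+C)(A+B)) ); D enters only through N."""
    return math.log((A * N) / ((A + C) * (A + B)))

print(round(mi_approx(A=2, B=500, C=3, N=10005), 2))   # 2.08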
Using Mutual Information
Compute MI for each category and then combine
If we want to discriminate well across all categories, take the expected value of MI:

MI_avg(t) = Σ_i Pr(c_i) · I(t, c_i)

To discriminate well for a single category, take the maximum:

MI_max(t) = max_i I(t, c_i)
Mutual Information
Pluses
I(t,c) is 0 when t and c are independent
Has a sound information-theoretic interpretation
Minuses
Small numbers produce unreliable results
Does not use term absence
[Figure from Yang & Pedersen ’97: comparison of feature selection methods (CHI max, IG, DF, Term Strength, Mutual Information)]
Feature Comparison
DF, IG and CHI are good and strongly correlated
thus using DF is good, cheap and task independent
can be used when IG and CHI are too expensive
MI is bad
favors rare terms (which are typically bad)
Term Weighting
In the study just shown, terms were (mainly)
treated as binary features
If a term occurred in a document, it was
assigned 1
Else 0
Often it is useful to weight the selected features
Standard technique: tf.idf
TF.IDF Term Weighting
TF: term frequency
definition: TF = tij
– frequency of term i in document j
purpose: makes the frequent words for the document
more important
IDF: inverse document frequency
definition: IDF = log(N/ni)
– ni : number of documents containing term i
– N : total number of documents
purpose: makes rare words across documents more
important
TF.IDF (for term i in document j)
definition: tij × log(N/ni)
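A compact sketch of tf.idf over a toy tokenized corpus (the corpus and the use of natural log are illustrative choices):

import math
from collections import Counter

def tfidf(docs):
    """docs: list of tokenized documents. Returns one {term: tf*idf} dict per document."""
    N = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))                 # n_i: number of documents containing term i
    weighted = []
    for doc in docs:
        tf = Counter(doc)                   # t_ij: frequency of term i in document j
        weighted.append({t: tf[t] * math.log(N / df[t]) for t in tf})
    return weighted

docs = [["jaguar", "car", "car"], ["car", "insurance"], ["jaguar", "cat"]]
print(tfidf(docs)[0])   # terms frequent in the document but rare across documents get the largest weights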
Term Normalization
Combine different words into a single representation
Stemming/morphological analysis
– bought, buy, buys -> buy
General word categories (see the sketch below)
– $23.45, 5.30 Yen -> MONEY
– 1984, 10,000 -> DATE, NUM
– PERSON
– ORGANIZATION
(Covered in Information Extraction segment)
Generalize with lexical hierarchies
– WordNet, MeSH
(Covered later in the semester)
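A small, purely illustrative sketch of the MONEY/DATE/NUM style of token normalization (the regular expressions here are hypothetical and deliberately rough):

import re

NUMBER = re.compile(r"^\d[\d,]*(\.\d+)?$")           # 10,000   23.45
YEAR = re.compile(r"^1[0-9]{3}$|^20[0-9]{2}$")       # 1984, 2006

def normalize(token):
    if token.startswith("$") or token.lower() in {"yen", "usd"}:
        return "MONEY"
    if YEAR.match(token):
        return "DATE"
    if NUMBER.match(token):
        return "NUM"
    return token.lower()

print([normalize(t) for t in ["$23.45", "1984", "10,000", "Texts"]])   # ['MONEY', 'DATE', 'NUM', 'texts']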
What Do People Do In Practice?
1. Feature selection
infrequent term removal
infrequent across the whole collection (i.e. DF)
seen in a single document
most frequent term removal (i.e. stop words)
2. Normalization:
1. Stemming. (often)
2. Word classes (sometimes)
3. Feature weighting: TF.IDF or IDF
4. Dimensionality reduction (sometimes)
Weka
Java-based tool for large-scale machine-learning
problems
Tailored towards text analysis
http://weka.sourceforge.net/wekadoc/
Weka Input Format
Expects a particular input file format
Called ARFF: Attribute-Relation File Format
Consists of a Header and a Data section
http://weka.sourceforge.net/wekadoc/index.php/en:ARFF_(3.4.6)
WEKA File Format: ARFF
@relation heart-disease-simplified

@attribute age numeric                                                        ← numeric attribute
@attribute sex { female, male}                                                ← nominal attribute
@attribute chest_pain_type { typ_angina, asympt, non_anginal, atyp_angina}
@attribute cholesterol numeric
@attribute exercise_induced_angina { no, yes}
@attribute class { present, not_present}
(other attribute types: string, date)

@data
63,male,typ_angina,233,no,not_present
67,male,asympt,286,yes,present
67,male,asympt,229,yes,present
38,female,non_anginal,?,no,not_present                                        ← “?” marks a missing value
...

http://weka.sourceforge.net/wekadoc/index.php/en:ARFF_(3.4.6)
Slide adapted from Eibe Frank's
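A minimal sketch of writing a file like this from Python with plain string formatting (no Weka-specific library is assumed; "?" encodes a missing value):

header = """@relation heart-disease-simplified
@attribute age numeric
@attribute sex { female, male}
@attribute cholesterol numeric
@attribute class { present, not_present}
@data
"""
rows = [(63, "male", 233, "not_present"),
        (38, "female", "?", "not_present")]   # "?" = missing cholesterol value

with open("heart.arff", "w") as f:
    f.write(header)
    for row in rows:
        f.write(",".join(str(v) for v in row) + "\n")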
WEKA Sparse File Format
Value 0 is not represented explicitly
Same header (i.e. @relation and @attribute tags)
but the @data section is different
Instead of
@data
0, X, 0, Y, "class A"
0, 0, W, 0, "class B"
We have
@data
{1 X, 3 Y, 4 "class A"}
{2 W, 4 "class B"}
This saves LOTS of space for text applications.
Why?
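A sketch of emitting sparse @data lines from a full value vector (the helper name and the quoting rule for values containing spaces are assumptions for illustration):

def sparse_data_line(values):
    """values: all attribute values for one instance, in attribute order.
    Keeps only non-zero entries, written as 0-based 'index value' pairs."""
    def fmt(v):
        s = str(v)
        return '"%s"' % s if " " in s else s    # quote nominal values that contain spaces
    pairs = ["%d %s" % (i, fmt(v)) for i, v in enumerate(values) if v != 0]
    return "{" + ", ".join(pairs) + "}"

print(sparse_data_line([0, "X", 0, "Y", "class A"]))   # {1 X, 3 Y, 4 "class A"}
print(sparse_data_line([0, 0, "W", 0, "class B"]))     # {2 W, 4 "class B"}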
Next Time
Wed: Guest lecture by Peter Jackson:
Pure and Applied Research in NLP: The Good, the
Bad, and the Lucky.
Following week:
Text Categorization Algorithms
How to use Weka