Latent Semantic Analysis
Keith Trnka
Latent Semantic Indexing
• Application of Latent Semantic Analysis (LSA) to
Information Retrieval
• Motivations
– Unreliable evidence
– synonymy: many words refer to the same object
Affects recall
– polysemy: a single word may have multiple meanings
Affects precision
LSI Motivation (example)
Source: Deerwester et al. 1990
LSA Solution
• Terms are overly noisy
– relying on raw terms is analogous to overfitting the Term-by-Document matrix
• Terms and documents should be represented by vectors in
a “latent” semantic space
• LSI essentially infers knowledge from the co-occurrence of terms
• Assume “errors” (sparse data, non-co-occurrences) are
normal and account for them
LSA Methods
• Start with a Term-by-Document matrix (A, like fig. 15.5 in Manning and Schütze)
• Optionally weight cells (e.g., tf-idf or log-entropy weighting)
• Apply Singular Value Decomposition:
– t = # of terms
– d = # of documents
– n = min(t, d)
A_{t×d} = T_{t×n} S_{n×n} (D_{d×n})^T
• Approximate using k (semantic) dimensions:
A_{t×d} ≈ T_{t×k} S_{k×k} (D_{d×k})^T
LSA Methods (cont’d)
• The truncation is chosen so that the Euclidean (Frobenius) distance
to A is minimized (hence, a least-squares method)
• Each row of T measures how strongly a term associates with each
semantic dimension
• Likewise for D
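A minimal sketch of these steps in Python with NumPy; the toy counts, the choice of k = 2, and all variable names are illustrative assumptions, not from the slides:

    import numpy as np

    # Toy Term-by-Document matrix A: t = 4 terms, d = 3 documents (raw counts).
    A = np.array([
        [2.0, 0.0, 1.0],
        [1.0, 1.0, 0.0],
        [0.0, 2.0, 1.0],
        [1.0, 0.0, 2.0],
    ])

    # Thin SVD: A = T diag(S) D^T, with n = min(t, d) = 3.
    T, S, Dt = np.linalg.svd(A, full_matrices=False)

    # Keep only k semantic dimensions (k = 2 here, chosen arbitrarily).
    k = 2
    T_k = T[:, :k]        # each row: a term's coordinates in the latent space
    S_k = np.diag(S[:k])  # top-k singular values
    D_k = Dt[:k, :].T     # each row: a document's coordinates

    # Rank-k approximation: minimizes the Euclidean (Frobenius) distance to A.
    A_k = T_k @ S_k @ D_k.T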
LSA Application
• Querying for Information Retrieval: the query is a pseudo-document:
– a weighted sum, over all terms in the query, of the rows of T
– compare its similarity to all documents in D using cosine similarity
(see the sketch below)
• Document similarity: vector comparison of D
• Term similarity: vector comparison of T
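Continuing the sketch above, a query can be folded into the same k-dimensional space and scored against the document vectors; the query itself is an invented toy input:

    # Raw query vector over the same t terms (here the query contains
    # terms 0 and 3 once each).
    q = np.array([1.0, 0.0, 0.0, 1.0])

    # Fold in: a weighted sum of the rows of T, rescaled by the inverse
    # singular values (the folding-in construction from Deerwester et al.).
    q_k = q @ T_k @ np.linalg.inv(S_k)

    def cosine(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

    # Rank documents by cosine similarity to the pseudo-document.
    scores = [cosine(q_k, d_row) for d_row in D_k]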
LSA Application (cont’d)
• Choosing k is difficult
– commonly k = 100, 150, 300 or so
– overfitting (superfluous dimensions) vs. underfitting (not enough
dimensions)
• What are the k semantic dimensions?
– undefined
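Since the slides leave the choice of k open, here is one common heuristic (an assumption here, not from the slides): keep the smallest k whose singular values cover a fixed share of the spectrum's energy, reusing S from the sketch above:

    # Cumulative share of squared singular values ("energy").
    energy = np.cumsum(S**2) / np.sum(S**2)
    k = int(np.searchsorted(energy, 0.90)) + 1  # smallest k covering 90%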
LSI Performance
Source: Dumais 1997
Considerations of LSA
• Conceptually high recall: query and document terms may be disjoint
yet still match in the latent space
• Polysemes not handled well
• LSI: Unsupervised/completely automatic
• Language independent
• CL-LSI: Cross-Language LSI
– weakly trained: needs only an initial set of parallel (dual-language) documents
• Computational complexity is high
– optimization: random sampling methods
• Formal Linear Algebra foundations
• Models language acquisition in children
Fun Applications of LSA
• Synonyms on TOEFL
– Train on a corpus of newspapers, Grolier’s encyclopedia, and
children’s reading material
– Use term similarity (sketched below), with random guessing for
unknown terms
– Best results at k = 400
• The model got 51.5 correct, or 64.4% (52.5% corrected for guessing).
By comparison a large sample of applicants to U. S. colleges from
non-English speaking countries who took tests containing these items
averaged 51.6 items correct, or 64.5% (52.7% corrected for guessing).
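A sketch of the synonym test as the slides describe it: pick the alternative whose term vector lies closest to the stem word. The vocabulary mapping is a hypothetical stand-in for a real vocabulary (the real runs used a large corpus and k = 400):

    # Hypothetical term -> row index mapping into T_k (toy vocabulary).
    term_index = {"levied": 0, "imposed": 1, "believed": 2, "requested": 3}

    def synonym_choice(stem, alternatives):
        # Random guessing for unknown terms would go here, per the slides.
        stem_vec = T_k[term_index[stem]]
        return max(alternatives,
                   key=lambda w: cosine(stem_vec, T_k[term_index[w]]))

    best = synonym_choice("levied", ["imposed", "believed", "requested"])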
Subject-Matter Knowledge
• Introduction to Psychology multiple-choice tests
– from New Mexico State University and the University of Colorado
– LSA scored significantly below the student average, although it
received a passing grade
– LSA has trouble with knowledge not present in its training corpus
Text Coherence/Comprehension
• Kintsch and colleagues
– Earlier approach: convert a text to a logical representation and
test it for coherence; slow and done by hand
– LSA approach: compute the cosine between each sentence or paragraph
(a window of discourse) and the next one (sketched below)
– Predicted comprehension with r = .93
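A minimal sketch of the coherence measure under the same toy model: represent each window of discourse as the folded-in sum of its terms’ vectors, then take the cosine between each window and the next. Tokenization is naive, and each window is assumed to contain at least one known term:

    def window_vector(tokens):
        # Bag-of-words count vector over the t known terms, folded in.
        v = np.zeros(A.shape[0])
        for tok in tokens:
            if tok in term_index:
                v[term_index[tok]] += 1.0
        return v @ T_k @ np.linalg.inv(S_k)

    def coherence(windows):
        # Cosine between each window of discourse and the next one.
        vecs = [window_vector(w.split()) for w in windows]
        return [cosine(a, b) for a, b in zip(vecs, vecs[1:])]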
Essay Grading
• Latent Semantic Analysis has a wide variety of useful applications
• Useful in cognitive psychology research
• Useful as a generic IR technique
• Possibly useful for lazy TAs?
References (Papers, etc.)
• (Deerwester, Dumais, et al.)
– Using LSI for Information Retrieval, Information Filtering and Other
Things (Dumais 1997)
– Indexing by Latent Semantic Analysis (Deerwester et al. 1990)
– Automatic Cross-Language IR Using LSI (Dumais et al. 1996)
• (Landauer et al.)
– An Introduction to Latent Semantic Analysis (1998)
– A Solution to Plato’s Problem… (1997)
– How Well Can Passage Meaning be Derived without Using Word Order?
… (1997)
References (Books)
• NLP Related
– Foundations of Statistical NLP
Manning and Schütze 2003
• SVD Related
– Linear Algebra
Strang 1998
– Matrix Computations (3rd ed.)
Golub and Van Loan 1996
LSA Example/The End
