Latent Semantic Analysis
Keith Trnka

Latent Semantic Indexing
• Application of Latent Semantic Analysis (LSA) to Information Retrieval
• Motivations – unreliable evidence:
  – Synonymy: many words refer to the same object; affects recall
  – Polysemy: many words have multiple meanings; affects precision

LSI Motivation (example)
Source: Deerwester et al. 1990

LSA Solution
• Terms are overly noisy – analogous to overfitting in the Term-by-Document matrix
• Terms and documents should instead be represented as vectors in a "latent" semantic space
• LSI essentially infers knowledge from the co-occurrence of terms
• Assume "errors" (sparse data, non-co-occurrences) are normal and account for them

LSA Methods
• Start with a Term-by-Document matrix (A, like fig. 15.5)
• Optionally weight cells
• Apply Singular Value Decomposition, where:
  – t = # of terms
  – d = # of documents
  – n = min(t, d)

  A_{t×d} = T_{t×n} · S_{n×n} · (D_{d×n})^T

• Approximate using k (semantic) dimensions:

  Â_{t×d} = T_{t×k} · S_{k×k} · (D_{d×k})^T

LSA Methods (cont'd)
• The rank-k approximation minimizes the Euclidean distance to A (hence, a least squares method)
• Each row of T measures the similarity of one term to each semantic dimension
• Likewise, each row of D does for one document

LSA Application
• Querying for Information Retrieval: the query is a pseudo-document:
  – weighted sum, over all terms in the query, of the corresponding rows of T
  – compare to all documents in D using the cosine similarity measure
• Document similarity: vector comparison of rows of D
• Term similarity: vector comparison of rows of T

LSA Application (cont'd)
• Choosing k is difficult:
  – commonly k = 100, 150, 300 or so
  – overfitting (superfluous dimensions) vs. underfitting (not enough dimensions)
  – performance as a function of k is not well characterized
• What are the k semantic dimensions?
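The decomposition and querying steps above can be sketched with NumPy. The tiny term-by-document matrix, the choice k = 2, and the S_k^{-1} scaling in the fold-in step are illustrative assumptions, not details from the slides.

```python
# Minimal LSI sketch: SVD of a term-by-document matrix, rank-k
# truncation, and querying via a folded-in pseudo-document.
import numpy as np

# Term-by-document matrix A (t terms x d documents); raw counts here,
# though the slides note that cells are often weighted first.
A = np.array([
    [1, 1, 0, 0],   # "ship"
    [0, 1, 1, 0],   # "boat"
    [1, 0, 0, 0],   # "ocean"
    [0, 0, 1, 1],   # "vote"
    [0, 0, 1, 1],   # "election"
], dtype=float)

# Full SVD: A = T S D^T, with n = min(t, d) singular values.
T, s, Dt = np.linalg.svd(A, full_matrices=False)

# Truncate to k semantic dimensions: A_hat = T_k S_k D_k^T is the
# best rank-k approximation in the least-squares sense.
k = 2
T_k, S_k, D_k = T[:, :k], np.diag(s[:k]), Dt[:k, :].T
A_hat = T_k @ S_k @ D_k.T

# Fold a query into the k-dim space: sum the rows of T_k for its
# terms (scaled by S_k^{-1}), then rank documents by cosine
# similarity against the rows of D_k.
q = np.array([1, 0, 1, 0, 0], dtype=float)   # query: "ship ocean"
q_k = q @ T_k @ np.linalg.inv(S_k)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

sims = [cosine(q_k, doc) for doc in D_k]   # one score per document
best = int(np.argmax(sims))
# The query's term vector equals document 0's column, so folding-in
# reproduces doc 0's latent vector exactly and best == 0.
```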
LSI Performance
Source: Dumais 1997
[figure: retrieval performance as a function of k]

Considerations of LSA
• Conceptually high recall: query and document terms may be disjoint
• Polysemes are not handled well
• LSI is unsupervised/completely automatic
• Language independent
• CL-LSI: Cross-Language LSI – weakly trained
• Computational complexity is high
  – optimization: random sampling methods
• Formal linear algebra foundations
• Models language acquisition in children

Fun Applications of LSA
• Synonyms on the TOEFL
  – Train on a corpus of newspapers, Grolier's encyclopedia, and children's reading material
  – Use term similarity, with random guessing for unknown terms
  – Best results at k = 400
  – "The model got 51.5 correct, or 64.4% (52.5% corrected for guessing). By comparison, a large sample of applicants to U.S. colleges from non-English speaking countries who took tests containing these items averaged 51.6 items correct, or 64.5% (52.7% corrected for guessing)."

Subject-Matter Knowledge
• Introduction to Psychology multiple-choice tests
  – from New Mexico State University and the University of Colorado, Boulder
  – LSA scored significantly below average, although it received a passing grade
  – LSA has trouble with knowledge not in the corpus

Text Coherence/Comprehension
• Kintsch and colleagues:
  – convert a text to a logical form and test it for coherence
  – slow, done by hand
• LSA:
  – compute the cosine distance between each sentence or paragraph (window of discourse) and the next one
  – predicted comprehension with r = .93

Essay Grading

Summary
• Latent Semantic Analysis has a wide variety of useful applications
• Useful in cognitive psychology research
• Useful as a generic IR technique
• Possibly useful for lazy TAs?

References (Papers, etc.)
• http://www-nlp.stanford.edu/fsnlp/ir/fsnlp-slides-ir.pdf (Manning and Schütze)
• http://www.cs.utk.edu/~lsi/
  – Using LSI for Information Retrieval, Information Filtering and Other Things (Dumais 1997)
  – Indexing by Latent Semantic Analysis (Deerwester et al. 1990)
  – Automatic Cross-Language IR Using LSI (Dumais et al. 1996)
• http://lsa.colorado.edu/ (Landauer et al.)
  – An Introduction to Latent Semantic Analysis (1998)
  – A Solution to Plato's Problem… (1997)
  – How Well Can Passage Meaning be Derived without Using Word Order? … (1997)

References (Books)
• NLP related:
  – Foundations of Statistical NLP, Manning and Schütze, 2003
• SVD related:
  – Linear Algebra, Strang, 1998
  – Matrix Computations (3rd ed.), Golub and Van Loan, 1996

LSA Example/The End
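As a small end-to-end illustration of one application above, the text-coherence measure from the Kintsch slide can be mimicked by averaging the cosine similarity between adjacent sentence vectors in the latent space. The sentence vectors and the `coherence` helper below are hypothetical toy inputs, not the published experiment.

```python
# Hypothetical sketch of the LSA coherence measure: embed each
# sentence in the latent space, then average the cosine similarity
# between each sentence vector and the next one.
import numpy as np

def coherence(sentence_vectors):
    """Mean cosine similarity between consecutive latent sentence
    vectors (higher = more coherent text)."""
    sims = []
    for u, v in zip(sentence_vectors, sentence_vectors[1:]):
        sims.append(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
    return float(np.mean(sims))

# Toy latent vectors (k = 2): the first text stays on one topic,
# the second jumps between topics mid-way.
on_topic = [np.array([1.0, 0.1]), np.array([0.9, 0.2]), np.array([1.1, 0.0])]
off_topic = [np.array([1.0, 0.1]), np.array([0.1, 1.0]), np.array([1.1, 0.0])]

assert coherence(on_topic) > coherence(off_topic)
```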
