Text Mining: Finding Nuggets in Mountains of Textual Data
Jochen Dörre, Peter Gerstl, and Roland Seiffert
Presented By: Jake Happs, 4.11.01
Overview
• Reasons for Text Mining
• Special Tasks in Mining Text
• Disambiguating Proper Names
• Application Types
• Customer Intelligence
Reasons for Text Mining
• Corporate Knowledge “Ore”
• Exploiting the Knowledge in Text
• The Value of Mining Text
• Typical Applications
Corporate Knowledge “Ore”
• Email
• Insurance claims
• News articles
• Web pages
• Patent portfolios
• Customer complaint letters
• Contracts
• Transcripts of phone calls with customers
• Technical documents
Exploiting Textual Knowledge
• Knowledge Discovery
• Knowledge Management
Value of Text Mining
• Rapid digestion of large corporate
documents, faster than human knowledge
brokers
• Objective and customizable analysis
• Automation of routine tasks
Typical Applications
• Summarizing documents
• Monitoring relations among people, places,
and organizations
• Organizing documents by content
• Organizing indices for search and retrieval
• Retrieving documents by content
Special Tasks in Mining Text
• Interpreting Natural Language
• Comparison with Data Mining
• Extracting Terminology and Relations
• Classifying Documents
Interpreting Natural Language
• Extracting terminology
• Extracting relations
• Summarizing documents
• Extracting models
Comparison of Procedures
Data Mining
• Identify data sets.
• Select features
manually.
• Prepare data.
• Analyze distribution.
Text Mining
• Identify documents.
• Extract features.
• Select features by
algorithm.
• Prepare data.
• Analyze distribution.
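The extra text mining steps (extract features, then select them by algorithm) can be sketched as follows. This is a minimal illustration, not the method of any particular product: the tokenizer, the document-frequency threshold `min_df`, and the sample documents are all assumptions made for the example.

```python
import re
from collections import Counter

def extract_features(text):
    """Extract candidate features: lowercase alphabetic tokens (assumed tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())

def select_features(docs, min_df=2):
    """Select features by algorithm: keep terms found in at least min_df documents.

    Document frequency is only one possible criterion; it is an assumption here.
    """
    df = Counter()
    for doc in docs:
        df.update(set(extract_features(doc)))
    return sorted(term for term, n in df.items() if n >= min_df)

def vectorize(text, vocabulary):
    """Build a sparse feature vector: only nonzero counts are stored."""
    counts = Counter(extract_features(text))
    return {term: counts[term] for term in vocabulary if counts[term] > 0}

# Hypothetical sample documents.
docs = [
    "The printer jams on every page.",
    "Printer driver crashes after update.",
    "Refund requested after late delivery.",
]
vocabulary = select_features(docs)
vectors = [vectorize(d, vocabulary) for d in docs]
```

Storing only nonzero counts reflects the sparsely populated feature vectors that make manual feature selection impractical for text.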
Terminology and Relations
• What Terminology Is
• Classes of Terms
• Instances of Relations
• Canonical Forms
What Terminology Is
• Function words
• General-purpose content words and phrases
• Technical content words and phrases
• Relations
Classes of Terminology
• Proper names
• Technical phrases
• Abbreviations and acronyms
Instances of Relations
• Facts
• Dates
• Currency values
• Percentages
• Other measurements
Canonical Forms
• Numbers convert to normal form.
• Dates convert to normal form.
• Inflected forms convert to common form.
• Alternative names convert to explicit form.
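A rough sketch of the four conversions listed above. The accepted date formats, the `LEMMAS` table, and the `ALIASES` table are hypothetical stand-ins chosen for illustration; a real system would rely on much larger dictionaries and morphological analysis.

```python
from datetime import datetime

# Assumed input styles for dates; real text contains many more.
DATE_FORMATS = ("%B %d, %Y", "%d %B %Y", "%m/%d/%Y")

# Hypothetical lookup tables, for illustration only.
LEMMAS = {"mining": "mine", "mined": "mine", "mines": "mine"}
ALIASES = {"IBM": "International Business Machines"}

def canonical_number(text):
    """Convert a number such as '1,250.50' to a plain float."""
    return float(text.replace(",", ""))

def canonical_date(text):
    """Convert a date in one of the assumed formats to ISO 8601, else None."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None

def canonical_word(word):
    """Map an inflected form to its common form via the lookup table."""
    lowered = word.lower()
    return LEMMAS.get(lowered, lowered)

def canonical_name(name):
    """Map an alternative name to its explicit form, if one is known."""
    return ALIASES.get(name, name)
```

For example, `canonical_date("April 11, 2001")` and `canonical_date("11 April 2001")` both yield the same ISO string, so the two spellings count as one term.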
Classifying Documents
• Hierarchical clustering
• Binary relational clustering
• Supervised learning
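As an illustration of the first approach, here is a small agglomerative (bottom-up) clustering sketch. The Jaccard word-overlap similarity, the single-link merge rule, and the stopping `threshold` are assumptions chosen for brevity, not details taken from the paper.

```python
def jaccard(a, b):
    """Word-overlap similarity between two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def agglomerate(docs, threshold=0.2):
    """Single-link agglomerative clustering.

    Starts with one cluster per document and repeatedly merges the two most
    similar clusters until no pair reaches the (assumed) threshold.
    Returns clusters as lists of document indices.
    """
    token_sets = [set(d.lower().split()) for d in docs]
    clusters = [[i] for i in range(len(docs))]
    while len(clusters) > 1:
        best, pair = -1.0, None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                sim = max(jaccard(token_sets[i], token_sets[j])
                          for i in clusters[x] for j in clusters[y])
                if sim > best:
                    best, pair = sim, (x, y)
        if best < threshold:
            break
        x, y = pair
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters

# Hypothetical sample documents.
docs = [
    "printer jams paper",
    "printer jams again",
    "refund late delivery",
    "late delivery refund please",
]
clusters = agglomerate(docs)
```

Recording each merge instead of stopping at a threshold would give the full cluster hierarchy.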
Disambiguating Proper Names
• Principles of Nominator Design
• The Process in Nominator
Principles of Nominator Design
• Apply heuristics to strings, instead of
interpreting semantics.
• The unit of context for extraction is a
document.
• The unit of context for aggregation is a
corpus.
• The heuristics represent English naming
conventions.
Extracting Proper Names
• Tokenize the words in a document.
• Build list of candidate names in document.
• Break candidates into smaller names.
• Group names into equivalence classes.
• Aggregate classes from multiple documents.
Candidate Names
• Extract all sequences of capitalized tokens.
• Exclude adjectives of provenance (e.g., Mr., Dr.).
• Exclude certain non-name acronyms (e.g., M.D.,
Ph.D.).
• Include numerals, unless following a preposition,
comma, date, or number.
• Ignore words in section titles.
• Exclude initial adverbs in sentences.
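The first two rules can be sketched as a simple string heuristic. This is not Nominator's actual code: the `TITLES` list and the tokenizer are assumptions, and the rules for acronyms, numerals, section titles, and initial adverbs are left out.

```python
import re

# Assumed sample of excluded tokens; a real list would be much longer.
TITLES = {"Mr.", "Mrs.", "Ms.", "Dr."}

# Match a known title first, then words, then single punctuation marks.
_TITLE_PATTERN = "|".join(re.escape(t) for t in sorted(TITLES))
_TOKEN_RE = re.compile(rf"(?:{_TITLE_PATTERN})|\w+|[^\w\s]")

def candidate_names(text):
    """Return maximal sequences of capitalized tokens, skipping excluded titles."""
    candidates, current = [], []
    for tok in _TOKEN_RE.findall(text):
        if tok[0].isupper() and tok not in TITLES:
            current.append(tok)
        else:
            # A lowercase word, title, or punctuation mark ends the sequence.
            if current:
                candidates.append(" ".join(current))
            current = []
    if current:
        candidates.append(" ".join(current))
    return candidates

names = candidate_names("Dr. Roland Seiffert met Peter Gerstl at IBM in Germany.")
```

Capitalized sentence-initial words still come through as candidates in this sketch; the slides handle those later, when equivalence classes are built.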
Splitting Candidates
• Apply heuristics to conjunctions,
prepositions, and possessives.
• Reconstruct shared words.
Building Equivalence Classes
• Discard non-recurring initial words of
sentences.
• Unify variants with heuristics.
• Pick canonical name for each class.
• Categorize each class with heuristics.
• Map canonical name to variants.
• Map variants to canonical name.
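A minimal sketch of unifying variants and building the two maps. The unification rule (a name joins a class when all of its words occur in the class's canonical name) and the choice of the longest name as canonical are assumptions for illustration. Categorization and the sentence-initial filter are omitted.

```python
def build_equivalence_classes(names):
    """Group name variants and return (canonical -> variants, variant -> canonical)."""
    classes = []
    # Visit longer names first so that the longest variant becomes canonical.
    for name in sorted(set(names), key=lambda n: (-len(n.split()), n)):
        words = set(name.split())
        for cls in classes:
            # Assumed heuristic: a subset of the canonical name's words is a variant.
            if words <= set(cls["canonical"].split()):
                cls["variants"].add(name)
                break
        else:
            classes.append({"canonical": name, "variants": {name}})
    canonical_to_variants = {c["canonical"]: sorted(c["variants"]) for c in classes}
    variant_to_canonical = {v: c["canonical"] for c in classes for v in c["variants"]}
    return canonical_to_variants, variant_to_canonical

c2v, v2c = build_equivalence_classes(
    ["Roland Seiffert", "Seiffert", "Peter Gerstl", "Gerstl", "IBM"]
)
```

Keeping both maps allows lookup in either direction: from a canonical name to every spelling found, and from any spelling back to its canonical name.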
Aggregating Classes
• Merge classes that share a variant in
separate documents.
• Both type and spelling of variant must
agree.
• Replace uncertain categories with certain
ones.
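The merge rules above might look like this in outline. The dictionary layout (`type`, `variants`, `certain`) is a hypothetical representation, and the single pass does not handle a class that bridges two already-merged classes.

```python
def aggregate(doc_classes):
    """Merge per-document classes that share a variant of the same type."""
    merged = []
    for cls in doc_classes:
        for target in merged:
            same_type = cls["type"] == target["type"]
            shared = cls["variants"] & target["variants"]  # exact spelling match
            if same_type and shared:
                target["variants"] |= cls["variants"]
                # A certain category replaces an uncertain one.
                target["certain"] = target["certain"] or cls["certain"]
                break
        else:
            merged.append({
                "type": cls["type"],
                "variants": set(cls["variants"]),
                "certain": cls["certain"],
            })
    return merged

# Hypothetical classes taken from three separate documents.
doc_classes = [
    {"type": "PERSON", "variants": {"Washington", "President Washington"}, "certain": False},
    {"type": "PERSON", "variants": {"George Washington", "Washington"}, "certain": True},
    {"type": "PLACE", "variants": {"Washington", "Washington D.C."}, "certain": True},
]
corpus_classes = aggregate(doc_classes)
```

The PLACE class stays separate even though it shares the spelling "Washington", because its type does not agree.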
Application Types
• Knowledge Discovery (Clustering)
• Information Distillation (Categorization)
Knowledge Discovery
Information Distillation
Customer Intelligence
• Goals
• Process
Customer Intelligence Goals
• What do customers want and need?
• What do customers think of the company?
Customer Intelligence Process
• Corpus of communications with customers
• Cluster the documents to identify issues.
• Characterize the clusters to identify the
conditions for problems.
• Assign new messages to appropriate
clusters.
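The last step, assigning a new message to an existing cluster, can be sketched with term-count centroids and cosine similarity. The cluster labels and messages are invented for the example, and cosine similarity is an assumed choice of measure.

```python
import math
from collections import Counter

def tokens(text):
    """Whitespace tokenizer (an assumption; real text needs more care)."""
    return text.lower().split()

def centroid(messages):
    """Summed term counts of a cluster; its scale does not affect cosine similarity."""
    total = Counter()
    for message in messages:
        total.update(tokens(message))
    return total

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def assign(message, clusters):
    """Return the label of the cluster whose centroid is most similar to the message."""
    vec = Counter(tokens(message))
    centroids = {label: centroid(msgs) for label, msgs in clusters.items()}
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))

# Hypothetical clusters found in earlier customer communications.
clusters = {
    "billing": ["invoice charged twice", "wrong invoice amount"],
    "shipping": ["package arrived late", "late delivery of package"],
}
label = assign("my invoice is wrong", clusters)
```

The most frequent terms of each centroid, for example `centroid(msgs).most_common(3)`, give a crude characterization of what a cluster is about.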
Summary
• Reasons for Text Mining
• Special Tasks in Mining Text
• Disambiguating Proper Names
• Customer Intelligence
Exam Question #1
• Name an example of each of the two main
classes of applications of text mining.
– Knowledge Discovery: Discovering a common
customer complaint among much feedback.
– Information Distillation: Filtering future
comments into pre-defined categories.
Exam Question #2
• How does the procedure for text mining
differ from the procedure for data mining?
– Adds feature extraction function
– Not feasible to have humans select features
– Highly dimensional, sparsely populated feature
vectors
Exam Question #3
• In the Nominator program of IBM’s
Intelligent Miner for Text, an objective of
the design is to enable rapid extraction of
names from large amounts of text. How
does this decision affect the ability of the
program to interpret the semantics of text?
– Does not perform in-depth syntactic or
semantic analyses of texts
Questions & Answers