Introduction to
Natural Language Processing and Text Mining
The basic building blocks
Sudeshna Sarkar
Computer Science & Engineering Department
Indian Institute of Technology Kharagpur
What is speech and language processing?
Computational Linguistics deals with the modeling of natural
language from a computational perspective.
Natural Language Processing
Process information contained in natural language text / speech
Getting computers to perform useful tasks involving human language
– Enabling human-machine communication
– Improving human-human communication
– Doing stuff with language objects
Can machines understand human language?
What does one mean by ‘understand’?
Understanding is the ultimate goal. However, one doesn’t need to
fully understand to be useful.
Natural Language Processing
What is it?
We’re going to study what goes into getting computers to
perform useful and interesting tasks involving human
languages.
We will be secondarily concerned with the insights that such
computational work gives us into human processing of
language.
Importance of studying NLP
A hallmark of human intelligence.
Text is the largest repository of human knowledge and
is growing quickly.
emails, news articles, web pages, scientific articles,
insurance claims, customer complaint letters, transcripts of
phone calls, technical documents, government documents,
patent portfolios, court decisions, contracts, ……
Are we reading any faster than before?
How do we keep up?
Goals of NLP
Scientific Goal
Identify the computational machinery needed for an
agent to exhibit various forms of linguistic behaviour
Engineering Goal
Design, implement, and test systems that process
natural languages for practical applications
Computer Speech and Language
Goals can be very ambitious
True text understanding
Good quality translation
Or goals can be practical
Web search engines
Question Answering
Machine Translation services on the Web
Speech synthesis
Voice recognition
Conversational Agents
Natural language technology not yet perfected
But still good enough for several useful applications
Text Mining
Text mining
deriving high quality information from text.
Text mining usually involves
– the process of structuring the input text
– deriving patterns within the structured data
– evaluation and interpretation of the output.
'High quality' in text mining usually refers to some combination of
relevance, novelty, and interestingness.
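The three steps listed above (structuring, pattern derivation, interpretation) can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the corpus `DOCS`, the helper names, and the choice of document frequency as the "pattern" are all invented here, not taken from any particular text mining system.

```python
import re
from collections import Counter

# Hypothetical toy corpus; any list of strings would do.
DOCS = [
    "The battery life of this phone is great.",
    "Great phone, but the battery drains fast.",
    "The camera is great and the screen is sharp.",
]

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def structure(docs):
    """Step 1: turn each raw document into a bag of words (a Counter)."""
    return [Counter(tokenize(d)) for d in docs]

def document_frequency(bags):
    """Step 2: derive a simple pattern, the number of documents containing each word."""
    df = Counter()
    for bag in bags:
        df.update(bag.keys())
    return df

bags = structure(DOCS)
df = document_frequency(bags)

# Step 3: inspect the output; here, the words that occur in at least two documents.
common = sorted(w for w, n in df.items() if n >= 2)
print(common)
```

Note that function words such as "the" and "is" dominate the result. Judging which patterns are actually relevant or interesting is exactly the evaluation step, and real systems typically filter or reweight such words.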
Typical text mining tasks include
text categorization, text clustering
concept/entity extraction
generation of taxonomies
sentiment analysis
document summarization
entity relation modeling
Big Applications
These kinds of applications require a tremendous
amount of knowledge of language.
Consider the following interaction with HAL the
computer from 2001: A Space Odyssey
Dave: Open the pod bay doors, Hal.
HAL: I’m sorry Dave, I’m afraid I can’t do that.
What’s needed?
Speech recognition and synthesis
Knowledge of the English words involved
What they mean
How they combine (bay, vs. pod bay)
How groups of words clump
What the clumps mean
It is polite to respond, even if you’re planning to kill someone.
It is polite to pretend to want to be cooperative (I’m afraid, I can’t)
Real Example
What is the Fed’s current position on interest rates?
What or who is the “Fed”?
What does it mean for it to have a position?
How does “current” modify that?
NLP has an AI aspect to it.
We’re often dealing with ill-defined problems
We don’t often come up with perfect solutions/algorithms
We can’t let either of those facts get in our way
Basic algorithm and data structure analysis
Ability to program
Some exposure to logic
Exposure to basic concepts in probability
Interest in Language
Commercial World
Lots of exciting stuff going on…
Some samples…
Machine translation
Question answering
Buzz analysis
Web Q/A
Current web-based Q/A is limited to returning simple
fact-like (factoid) answers (names, dates, places, etc).
Multi-document summarization can be used to
address more complex kinds of questions.
Circa 2002:
What’s going on with the Hubble?
NewsBlaster Example
The U.S. orbiter Columbia has touched down at the Kennedy Space Center
after an 11-day mission to upgrade the Hubble observatory. The
astronauts on Columbia gave the space telescope new solar wings, a
better central power unit and the most advanced optical camera. The
astronauts added an experimental refrigeration system that will revive a
disabled infrared camera. ''Unbelievable that we got everything we set
out to do accomplished,'' shuttle commander Scott Altman said. Hubble
is scheduled for one more servicing mission in 2004.
Weblog Analytics
Text mining weblogs, discussion forums, user groups,
and other forms of user generated media.
Product marketing information
Political opinion tracking
Social network analysis
Buzz analysis (what’s hot, what topics are people talking
about right now).
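One simple way to approximate buzz analysis is to compare how often a term appears in recent posts against how often it appeared earlier. The sketch below is an assumption-laden toy: the function names, the add-one smoothing, and the `min_count` threshold are illustrative choices, not a description of any deployed system.

```python
import re
from collections import Counter

def term_counts(posts):
    """Count lowercase alphabetic tokens across a list of posts."""
    counts = Counter()
    for post in posts:
        counts.update(re.findall(r"[a-z]+", post.lower()))
    return counts

def buzz_terms(earlier, recent, min_count=2):
    """Rank terms by recent count divided by (earlier count + 1).

    The +1 avoids division by zero for terms that are new in the recent posts.
    Terms seen fewer than min_count times recently are ignored.
    """
    before = term_counts(earlier)
    now = term_counts(recent)
    scored = {t: c / (before[t] + 1) for t, c in now.items() if c >= min_count}
    # Highest score first; ties are broken alphabetically.
    return sorted(scored, key=lambda t: (-scored[t], t))

# Hypothetical posts, invented for illustration.
earlier_posts = ["the shuttle launch was delayed", "the weather is bad"]
recent_posts = [
    "hubble upgrade complete",
    "the hubble has new solar wings",
    "the hubble camera works",
]
print(buzz_terms(earlier_posts, recent_posts))
```

A term that was already common earlier (here "the") scores low even when it is frequent now, which is the point of comparing the two time windows.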
Google/Arabic Translation
Forms of Natural Language
The input/output of an NLP system can be:
written text: newspaper articles, letters, manuals, prose, …
speech: read speech (radio, TV, dictations), conversational speech,
commands, …
To process written text, we need:
knowledge about the language
discourse information,
real world knowledge
To process spoken language, we need additionally
speech recognition
speech synthesis
Components of NLP
Natural Language Understanding
Mapping the given input in the natural language into a useful representation.
Different levels of analysis required:
morphological analysis,
syntactic analysis,
semantic analysis,
discourse analysis, …
Natural Language Generation
Producing output in the natural language from some internal representation.
Different levels of synthesis required:
– deep planning (what to say),
– syntactic generation
Natural language understanding
Uncovering the mappings between the linear sequence of words (or
phonemes) and the meaning that it encodes.
Representing this meaning in a useful (usually symbolic) representation.
By definition - heavily dependent on the target task
Words and structures mean different things in different contexts
The required target representation is different for different tasks.
Why is NLU hard?
The mapping between words, their linguistic structure and the meaning that they encode is
extremely complex and difficult to model and decompose.
Natural language is very ambiguous
The goal of understanding is itself task dependent and very complex.
Why is NL understanding hard?
Natural language is extremely rich in form and structure, and
very ambiguous.
How to represent meaning,
Which structures map to which meaning structures.
Ambiguity: one input can mean many different things
Lexical (word level) ambiguity -- different meanings of words
Syntactic ambiguity -- different ways to parse the sentence
Interpreting partial information -- how to interpret pronouns
Contextual information -- context of the sentence may affect the meaning
of that sentence.
Many inputs can mean the same thing.
Interaction among components of the input.
Noisy input (e.g. speech)
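To make the cost of lexical ambiguity concrete, the sketch below enumerates every combination of word senses for a short sentence. The sense inventory `SENSES` is hypothetical and tiny, and the code makes no attempt to rule out combinations that are ungrammatical or implausible; that filtering is what syntactic, semantic, and contextual analysis would have to do.

```python
from itertools import product

# Hypothetical sense inventory, invented for illustration only.
SENSES = {
    "saw": ["see (past tense)", "cut with a saw"],
    "duck": ["bird", "lower the head"],
}

def readings(sentence):
    """Return every combination of senses for the words in the sentence.

    Words missing from SENSES are treated as unambiguous.
    """
    words = sentence.lower().split()
    options = [SENSES.get(w, [w]) for w in words]
    return list(product(*options))

all_readings = readings("I saw her duck")
for reading in all_readings:
    print(reading)
print(len(all_readings))
```

With two ambiguous words of two senses each, the sentence already has four candidate readings. The number of candidates is the product of the per-word sense counts, so it grows quickly with sentence length.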
Linguistic Levels of Analysis
Phonology: sounds / letters / pronunciation. Concerns how
words are related to the sounds that realize them.
Morphology: the structure of words and the laws
concerning the formation of new words from pieces
Syntax: how sequences of words are structured, e.g., the
structures of sentences and the ways individual words are
connected within them
Semantics: concerns what words mean and how these
meanings combine in sentences to form sentence meaning.
The study of context-independent meaning.
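As a small illustration of the morphological level, the sketch below splits a word into a stem and a suffix. It is deliberately naive: the suffix list and the minimum stem length of three letters are assumptions made for this example, and real morphological analyzers rely on lexicons and spelling rules instead.

```python
# Hypothetical suffix list for English inflection; far from complete.
SUFFIXES = ["ing", "ed", "s"]

def split_suffix(word, min_stem=3):
    """Split a word into (stem, suffix), trying the longest suffix first.

    The word is returned unchanged, with an empty suffix, when no suffix
    matches or when the remaining stem would be shorter than min_stem.
    """
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)], suffix
    return word, ""

for w in ["walking", "walked", "doors", "bus"]:
    print(w, split_suffix(w))
```

A rule this simple also makes mistakes: "string" is split into "str" plus "ing". Errors like this are one reason morphological analysis needs knowledge about the words of the language, not just surface patterns.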
Linguistic Levels of Analysis
Pragmatics: concerns how sentences are used in different
situations and how use affects the interpretation of the sentence.
Discourse: concerns how the immediately preceding
sentences affect the interpretation of the next sentence.
For example, interpreting pronouns and interpreting the
temporal aspects of the information.
World Knowledge – includes general knowledge about the
world. What each language user must know about the
other’s beliefs and goals.
Knowledge needed
Speech recognition and synthesis
Dictionaries (how words are pronounced)
Phonetics (how to recognize/produce each sound of the language)
Natural language understanding
Knowledge of the natural language words involved
– What they mean
– How they combine
Knowledge of syntactic structure
Dialog and pragmatic knowledge

LING 180 Intro to Computer Speech and Language Lecture 1