Natural Language Processing
Lecture 1
Sudeshna Sarkar
26 July 2007
Notes adapted from Martin's NLP slides
Text Books
• Daniel Jurafsky and James H. Martin, "Speech and Language Processing", Prentice Hall, 2000.

Other References
• James Allen, "Natural Language Understanding", Second Edition, Pearson.
• Christopher D. Manning and Hinrich Schutze, "Foundations of Statistical Natural Language Processing", The MIT Press, 1999.
Final Project
• This will be a research-oriented project. The goal is to have a paper suitable for a conference submission.
• These will preferably be done in groups.
Natural Language Processing
What is it?
• We're going to study what goes into getting computers to perform useful and interesting tasks involving human languages.
• We will be secondarily concerned with the insights that such computational work gives us into human processing of language.
Why Should You Care?
Two trends:
1. An enormous amount of knowledge is now available in machine-readable form as natural language text.
2. Conversational agents are becoming an important form of human-computer communication.
Major Topics
• Words
• Syntax
• Meaning
• Dialog and Discourse
• Applications
Applications
• First, what makes an application a language processing application (as opposed to any other piece of software)?
• An application that requires the use of knowledge about human languages.
  • Example: Is Unix wc (word count) a language processing application?
Applications
• Word count?
  • When it counts words: Yes
    • To count words you need to know what a word is. That's knowledge of language. (A small sketch follows below.)
  • When it counts lines and bytes: No
    • Lines and bytes are computer artifacts, not linguistic entities.
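A minimal sketch of the distinction above (illustration only, not from the slides): bytes and lines need no linguistic knowledge, while counting words already forces a decision about what a word is.

# Counting bytes/lines vs. counting words.
text = "Dave, open the pod bay doors.\nHAL can't do that."

num_bytes = len(text.encode("utf-8"))   # computer artifact: no linguistics needed
num_lines = text.count("\n") + 1        # computer artifact: no linguistics needed

# "Words" already require linguistic decisions: is "doors." one word?
# Is "can't" one word or two? A whitespace split is only one possible answer.
naive_words = text.split()

print(num_bytes, num_lines, len(naive_words))   # prints: 48 2 10

A different tokenizer (say, one that separates punctuation or splits contractions) would give a different word count for the same bytes; that choice is exactly where knowledge of language enters.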
Big Applications
• Question answering
• Conversational agents
• Summarization
• Machine translation
Big Applications
• These kinds of applications require a tremendous amount of knowledge of language.
• Consider the following interaction with HAL, the computer from 2001: A Space Odyssey.
HAL
• Dave: Open the pod bay doors, Hal.
• HAL: I'm sorry Dave, I'm afraid I can't do that.
What's needed?
• Speech recognition and synthesis
• Knowledge of the English words involved
  • What they mean
  • How they combine ("bay" vs. "pod bay")
• How groups of words clump
  • What the clumps mean
What's needed?
• Dialog
  • It is polite to respond, even if you're planning to kill someone.
  • It is polite to pretend to want to be cooperative ("I'm afraid I can't…").
Real Example
What is the Fed's current position on interest rates?
• What or who is the "Fed"?
• What does it mean for it to have a position?
• How does "current" modify that?
Caveat
NLP has an AI aspect to it.
• We're often dealing with ill-defined problems.
• We don't often come up with perfect solutions/algorithms.
• We can't let either of those facts get in our way.
Preparation
• Basic algorithm and data structure analysis
• Ability to program
• Some exposure to logic
• Exposure to basic concepts in probability
• Familiarity with linguistics, psychology, and philosophy
• Ability to write well in English
Topics: Linguistics
• Word-level processing
• Syntactic processing
• Lexical and compositional semantics
• Discourse and dialog processing
Topics: Techniques
• Finite-state methods
• Context-free methods
• Augmented grammars
  • Unification
  • Logic
• Probabilistic versions
• Supervised machine learning
Topics: Applications
• Small
  • Spelling correction
  • Word-sense disambiguation
• Medium
  • Named entity recognition
  • Information retrieval
• Large
  • Question answering
  • Conversational agents
  • Machine translation
Commercial World
• Lots of exciting stuff going on…
• Some samples…
  • Machine translation
  • Question answering
  • Buzz analysis
Google/Arabic (screenshot)

Google/Arabic Translation (screenshot)

Web Q/A (screenshot)
Summarization
• Current web-based Q/A is limited to returning simple fact-like (factoid) answers (names, dates, places, etc.).
• Multi-document summarization can be used to address more complex kinds of questions.
• Circa 2002: What's going on with the Hubble?
NewsBlaster Example
The U.S. orbiter Columbia has touched down at the Kennedy Space
Center after an 11-day mission to upgrade the Hubble
observatory. The astronauts on Columbia gave the space
telescope new solar wings, a better central power unit and the
most advanced optical camera. The astronauts added an
experimental refrigeration system that will revive a disabled
infrared camera. "Unbelievable that we got everything we set out
to do accomplished," shuttle commander Scott Altman said.
Hubble is scheduled for one more servicing mission in 2004.
Weblog Analytics
• Text mining weblogs, discussion forums, user groups, and other forms of user-generated media.
  • Product marketing information
  • Political opinion tracking
  • Social network analysis
  • Buzz analysis (what's hot, what topics are people talking about right now)
Web Analytics (screenshot)

Umbria (screenshot)
Forms of Natural Language
• The input/output of an NLP system can be:
  • Written text: newspaper articles, letters, manuals, prose, …
  • Speech: read speech (radio, TV, dictation), conversational speech, commands, …
• To process written text, we need:
  • lexical, syntactic, and semantic knowledge about the language
  • discourse information
  • real-world knowledge
• To process spoken language, we additionally need:
  • speech recognition
  • speech synthesis
Components of NLP
• Natural Language Understanding
  • Mapping the given input in natural language into a useful representation.
  • Different levels of analysis required: morphological analysis, syntactic analysis, semantic analysis, discourse analysis, …
• Natural Language Generation
  • Producing output in natural language from some internal representation.
  • Different levels of synthesis required: deep planning (what to say), syntactic generation.
Natural Language Understanding
• Uncovering the mappings between the linear sequence of words (or phonemes) and the meaning that it encodes.
• Representing this meaning in a useful (usually symbolic) representation.
• By definition, heavily dependent on the target task:
  • Words and structures mean different things in different contexts.
  • The required target representation is different for different tasks.

Why is NLU hard?
• The mapping between words, their linguistic structure, and the meaning they encode is extremely complex and difficult to model and decompose.
• Natural language is very ambiguous.
• The goal of understanding is itself task-dependent and very complex.
Why is NL Understanding Hard?
• Natural language is extremely rich in form and structure, and very ambiguous.
  • How to represent meaning?
  • Which structures map to which meaning structures?
• Ambiguity: one input can mean many different things.
  • Lexical (word-level) ambiguity -- different meanings of words
  • Syntactic ambiguity -- different ways to parse the sentence
  • Interpreting partial information -- how to interpret pronouns
  • Contextual information -- the context of the sentence may affect its meaning
• Many inputs can mean the same thing.
• Interaction among components of the input.
• Noisy input (e.g. speech).
Knowledge of Language
• Phonology – concerns how words are related to the sounds that realize them.
• Morphology – concerns how words are constructed from more basic meaning units called morphemes. A morpheme is the primitive unit of meaning in a language.
• Syntax – concerns how words can be put together to form correct sentences, and determines what structural role each word plays in the sentence and what phrases are subparts of other phrases.
• Semantics – concerns what words mean and how these meanings combine in sentences to form sentence meaning. The study of context-independent meaning.
Knowledge of Language
• Pragmatics – concerns how sentences are used in different situations and how use affects the interpretation of the sentence.
• Discourse – concerns how the immediately preceding sentences affect the interpretation of the next sentence. For example, interpreting pronouns and interpreting the temporal aspects of the information.
• World Knowledge – includes general knowledge about the world, such as what each language user must know about the other's beliefs and goals.
Ambiguity
"At last, a computer that understands you like your mother."
-- 1985 McDonnell-Douglas ad
Different interpretations:
1. The computer understands you as well as your mother understands you.
2. The computer understands that you like your mother.
3. The computer understands you as well as it understands your mother.
Speech: "… a computer that understands your lie cured mother …"
Why is NLP difficult?
• Because natural language is highly ambiguous.
  • Syntactic ambiguity
    • "The president spoke to the nation about the problem of drug use in the schools from one coast to the other." has 720 parses.
    • Ex:
      • "to the other" can attach to any of the previous NPs (ex. "the problem"), or the head verb → 6 places
      • "from one coast" has 5 places to attach
      • …
Why is NLP difficult?
• Word category ambiguity
  • book --> verb? or noun?
• Word sense ambiguity
  • bank --> financial institution? building? or river side?
• Words can mean more than their sum of parts
  • make up a story
• Fictitious worlds
  • People on Mars can fly.
• Defining scope
  • People like ice-cream.
  • Does this mean that all (or some?) people like ice cream?
• Language is changing and evolving
  • I'll email you my answer.
  • This new S.U.V. has a compartment for your mobile phone.
  • Googling, …
Resolve Ambiguities
• We will introduce models and algorithms to resolve ambiguities at different levels.
  • Part-of-speech tagging -- deciding whether "duck" is a verb or a noun (a small tagging sketch follows below).
  • Word-sense disambiguation -- deciding whether "make" is create or cook.
  • Lexical disambiguation -- part-of-speech and word-sense disambiguation are two important kinds of lexical disambiguation.
  • Syntactic ambiguity -- "her duck" is an example of syntactic ambiguity, and can be addressed by probabilistic parsing.
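As a hedged illustration of the kind of tool discussed above (this assumes the NLTK library with its tokenizer and tagger resources installed; it is not code from the course):

# Assumes NLTK plus its tokenizer and tagger data are installed,
# e.g. nltk.download('punkt') and nltk.download('averaged_perceptron_tagger').
import nltk

tokens = nltk.word_tokenize("I made her duck")
print(nltk.pos_tag(tokens))
# A typical run tags 'duck' as a noun (NN), even though the verb reading
# ("caused her to duck") is also possible; exact tags may vary by NLTK version.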
Resolve Ambiguities (cont.)
"I made her duck" has (at least) two parses, shown here as bracketed trees (a sketch that draws them follows below):
• (S (NP I) (VP (V made) (NP her) (NP duck)))
• (S (NP I) (VP (V made) (NP (DET her) (N duck))))
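A small sketch (assuming NLTK is available; not the slides' own code) that builds the two bracketed parses above and prints them as ASCII trees:

# Build the two constituency trees from their bracketed strings and draw them.
from nltk import Tree

parse_a = Tree.fromstring("(S (NP I) (VP (V made) (NP her) (NP duck)))")
parse_b = Tree.fromstring("(S (NP I) (VP (V made) (NP (DET her) (N duck))))")

for tree in (parse_a, parse_b):
    tree.pretty_print()   # renders each tree as console ASCII art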
Dealing with Ambiguity
• Three approaches:
  • Tightly coupled interaction among processing levels; knowledge from other levels can help decide among choices at ambiguous levels.
  • Pipeline processing that ignores ambiguity as it occurs and hopes that other levels can eliminate incorrect structures.
    • The "syntax proposes / semantics disposes" approach
  • Probabilistic approaches based on making the most likely choices.
Models to Represent Linguistic Knowledge
• Different formalisms (models) are used to represent the required linguistic knowledge.
• State machines -- FSAs, HMMs, ATNs, RTNs (a tiny FSA sketch follows below).
• Formal rule systems -- context-free grammars, unification grammars, probabilistic CFGs.
• Logic-based formalisms -- first-order predicate logic, some higher-order logic.
• Models of uncertainty -- Bayesian probability theory.
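A tiny sketch of the simplest formalism in that list, a finite-state acceptor for the classic "sheep talk" toy language (baa!, baaa!, …); this is an illustrative sketch, not course code.

# A finite-state acceptor as a transition table plus a set of accepting states.
TRANSITIONS = {          # (state, symbol) -> next state
    (0, "b"): 1,
    (1, "a"): 2,
    (2, "a"): 3,
    (3, "a"): 3,         # loop: any number of additional a's
    (3, "!"): 4,
}
ACCEPTING = {4}

def accepts(word: str) -> bool:
    """Run the FSA over the word; accept if it ends in an accepting state."""
    state = 0
    for symbol in word:
        if (state, symbol) not in TRANSITIONS:
            return False
        state = TRANSITIONS[(state, symbol)]
    return state in ACCEPTING

print(accepts("baa!"), accepts("baaaaa!"), accepts("ba!"))  # True True False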
Algorithms to Manipulate Linguistic Knowledge
• We will use algorithms to manipulate the models of linguistic knowledge to produce the desired behavior.
• Most of the algorithms we will study are transducers and parsers.
  • These algorithms construct some structure based on their input.
• Since language is ambiguous at all levels, these algorithms are never simple processes.
• Most of the algorithms we will use fall into two categories (a dynamic-programming sketch follows below):
  • state space search
  • dynamic programming
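A minimal sketch of dynamic programming on strings (an illustration, not the course's own code): minimum edit distance, a classic NLP use of dynamic programming, e.g. for spelling correction. Unit costs are assumed for insertion, deletion, and substitution.

def min_edit_distance(source: str, target: str) -> int:
    n, m = len(source), len(target)
    # dist[i][j] = edit distance between source[:i] and target[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i                      # delete all of source[:i]
    for j in range(m + 1):
        dist[0][j] = j                      # insert all of target[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,        # deletion
                             dist[i][j - 1] + 1,        # insertion
                             dist[i - 1][j - 1] + sub)  # substitution / match
    return dist[n][m]

print(min_edit_distance("intention", "execution"))  # 5 with unit costs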
Language and Intelligence
Turing Test (participants: Computer, Human, Human Judge)
• The human judge asks tele-typed questions to the computer and the human.
• The computer's job is to act like a human.
• The human's job is to convince the judge that he is not a machine.
• The computer is judged "intelligent" if it can fool the judge.
• Judgment of intelligence is linked to appropriate answers to questions from the system.
NLP - an Inter-disciplinary Field
• NLP borrows techniques and insights from several disciplines.
• Linguistics: How do words form phrases and sentences? What constrains the possible meanings of a sentence?
• Computational Linguistics: How is the structure of sentences identified? How can knowledge and reasoning be modeled?
• Computer Science: Algorithms for automata and parsers.
• Engineering: Stochastic techniques for ambiguity resolution.
• Psychology: What linguistic constructions are easy or difficult for people to learn to use?
• Philosophy: What is meaning, and how do words and sentences acquire it?
Some Buzz-Words
• NLP – Natural Language Processing
• CL – Computational Linguistics
• SP – Speech Processing
• HLT – Human Language Technology
• NLE – Natural Language Engineering
• SNLP – Statistical Natural Language Processing
• Other areas:
  • Speech Generation, Text Generation, Speech Understanding, Information Retrieval, Dialogue Processing, Inference, Spelling Correction, Grammar Correction, Text Summarization, Text Categorization, …
Some NLP Applications
• Machine Translation – translation between two natural languages.
  • Babel Fish translation system, Systran
• Information Retrieval – web search (uni-lingual or multi-lingual).
• Query Answering/Dialogue – natural language interface with a database system, or a dialogue system.
• Report Generation – generation of reports such as weather reports.
• Other Applications – grammar checking, spell checking, spelling correction.
The Big Picture

Source Language         | Target Language
Speech Signal           | Speech Signal
Speech Recognition      | Speech Synthesis
Source Text Analysis    | Target Text Generation
The Reductionist Approach

Source Language Analysis   | Target Language Generation
Text Normalization         | Text Rendering
Morphological Analysis     | Morphological Synthesis
POS Tagging                | Phrase Generation
Parsing                    | Role Ordering
Semantic Analysis          | Lexical Choice
Discourse Analysis         | Discourse Planning
Natural Language Understanding
Pipeline (a small code sketch follows below):
• Words
• → Morphological Analysis → morphologically analyzed words (another step: POS tagging)
• → Syntactic Analysis → syntactic structure
• → Semantic Analysis → context-independent meaning representation
• → Discourse Processing → final meaning representation
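A minimal sketch of how the stages above compose (every stage function here is a hypothetical stub, not a real analyzer): each stage consumes the previous stage's output and produces a richer representation.

def morphological_analysis(words):
    # stub: in a real system, split each word into stem + affixes
    return [{"word": w, "stem": w, "affixes": []} for w in words]

def syntactic_analysis(analyzed_words):
    # stub: in a real system, build a parse tree; here just a flat bracketing
    return ("S", analyzed_words)

def semantic_analysis(tree):
    # stub: map the tree to a context-independent meaning representation
    return {"predicate": "stub", "tree": tree}

def discourse_processing(meaning, context):
    # stub: resolve pronouns etc. against the discourse context
    return {"final_meaning": meaning, "context": context}

words = "I made her duck".split()
result = discourse_processing(
    semantic_analysis(syntactic_analysis(morphological_analysis(words))),
    context={})
print(result)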
Natural Language Generation
• Meaning representation
• → Utterance Planning → meaning representations for sentences
• → Sentence Planning and Lexical Choice → syntactic structures of sentences with lexical choices
• → Sentence Generation → morphologically analyzed words
• → Morphological Generation → words
Natural Language Generation
• NLG is the process of constructing natural language outputs from non-linguistic inputs.
  • The reverse process of NL understanding.
• An NLG system may have two main parts:
  • Discourse Planning -- deciding what will be generated.
  • Surface Realization -- realizing a sentence from its internal representation.
• Lexical Choice -- selecting the correct words to describe the concepts.
Machine Translation
• Machine Translation -- converting a text in language A into the corresponding text in language B (or speech).
• Different machine translation architectures:
  • interlingua-based systems
  • transfer-based systems
• How do we acquire the required knowledge resources, such as mapping rules and bilingual dictionaries? By hand, or automatically from corpora.
• Example-Based Machine Translation acquires the required knowledge (some of it or all of it) from corpora.
Some statistics (old)
• Business e-mail sent per day in the US: 2.1 billion
• First-class mail per year: 107 billion
• Text on the Internet:
  • (2/99): > 6 TB
  • Current: ?
  • Indexed: 16% (Lawrence and Giles, Nature 400, 1999)
  • Dialog (www.dialog.com): 9 TB
  • Average college library: 1 TB
Languages
• Languages: 39,000 languages and dialects (22,000 dialects in India alone)
• Top languages:
  • Chinese/Mandarin (885M), Spanish (332M), English (322M), Bengali (189M), Hindi (182M), Portuguese (170M), Russian (170M), Japanese (125M)
  • Source: www.sil.org/ethnologue, www.nytimes.com
• Internet: English (128M), Japanese (19.7M), German (14M), Spanish (9.4M), French (9.3M), Chinese (7.0M)
• Usage: English (1999: 54%, 2001: 51%, 2003: 46%, 2005: 43%)
  • Source: www.computereconomics.com