Natural Language Processing
Julia Hirschberg
COMS 4705
Fall 2010
What is Natural Language Processing?
• Software that can recognize, analyze and generate
text and speech
• AKA computational linguistics
• At Columbia:
Michael Collins, CS, parsing, machine translation
Mona Diab, CCLS, semantics
Nizar Habash, CCLS, morphology, machine translation
Julia Hirschberg, CS, spoken language processing
Kathy McKeown, CS, summarization, generation
Becky Passonneau, CCLS, dialogue systems, reference
Owen Rambow, CCLS, syntax, parsing
Why is NLP hard? Some Headlines…
Something Went Wrong In Jet Crash, Expert Says
Police Begin Campaign To Run Down Jaywalkers
Drunk Gets Nine Months In Violin Case
Farmer Bill Dies In House
Iraqi Head Seeks Arms
Enraged Cow Injures Farmer With Ax
Stud Tires Out
Eye Drops Off Shelf
Teacher Strikes Idle Kids
Squad Helps Dog Bite Victim
What will we learn about in this course?
• Morphology: the way words are formed
• Syntax: the way words are grouped together into
larger constituents and phrases and the way these
phrases can be ordered
• Semantics: the context-independent ‘meaning’ of
• Pragmatics: the context-dependent ‘meaning’ of
• Goal: What is a speaker/writer meaning to
• Stud tires out: Is ‘stud’ an adjective or a noun?
Is ‘tires’ a noun or a verb?
• Internet search: ‘union activities in New York’
– What to look for?
• Union/unions; activities/activity
• Active? Action? Actor? Actual? Academic?
• New vs. New York, York vs. yorkie
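The search questions above can be made concrete with a small sketch. The suffix rules and the helper names (naive_stem, same_term, crude_prefix_match) are hypothetical and only for illustration; a real system would use a proper morphological analyzer or stemmer. The sketch shows why matching Union/unions is useful and why matching on a shared prefix goes too far.

```python
def naive_stem(word: str) -> str:
    """Strip a couple of English plural suffixes. A toy, not a real stemmer."""
    word = word.lower()
    # Try the longer suffix first so "activities" becomes "activity", not "activitie".
    for suffix, replacement in (("ies", "y"), ("s", "")):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)] + replacement
    return word


def same_term(query_word: str, doc_word: str) -> bool:
    """Treat two words as the same search term if their toy stems agree."""
    return naive_stem(query_word) == naive_stem(doc_word)


def crude_prefix_match(a: str, b: str, n: int = 3) -> bool:
    """Over-eager matching: compare only the first n letters."""
    return a.lower()[:n] == b.lower()[:n]


print(same_term("Union", "unions"))               # True
print(same_term("activities", "activity"))        # True
print(same_term("York", "yorkie"))                # False
print(crude_prefix_match("activities", "actual")) # True, an unwanted match
```

Even this toy version leaves the "New" vs. "New York" problem untouched: deciding that two tokens form one name needs more than word-level rules.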
• Constituent Structure:
– Teacher Strikes Idle Kids
– Enraged Cow Injures Farmer With Ax
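One way to see the structural ambiguity in the second headline is to count its parses. The sketch below is a minimal CKY-style parse counter over a tiny hand-written grammar; the grammar, the lexicon, and the name count_parses are assumptions made for this illustration, not material from the course.

```python
from collections import defaultdict

# Hypothetical toy lexicon: each word maps to its possible categories.
LEXICON = {
    "enraged": {"Adj"},
    "cow": {"N"},
    "injures": {"V"},
    "farmer": {"NP"},
    "with": {"P"},
    "ax": {"NP"},
}

# Hypothetical binary rules, written as (parent, left child, right child).
RULES = [
    ("S", "NP", "VP"),
    ("NP", "Adj", "N"),
    ("NP", "NP", "PP"),   # "farmer with ax": the farmer has the ax
    ("VP", "V", "NP"),
    ("VP", "VP", "PP"),   # "injures ... with ax": the ax is the instrument
    ("PP", "P", "NP"),
]


def count_parses(words, start="S"):
    """Count distinct parse trees for `words` with a CKY-style chart."""
    n = len(words)
    chart = defaultdict(int)  # (i, j, category) -> number of trees over words[i:j]
    for i, word in enumerate(words):
        for category in LEXICON.get(word, ()):
            chart[i, i + 1, category] += 1
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):  # split point between the two children
                for parent, left, right in RULES:
                    chart[i, j, parent] += chart[i, k, left] * chart[k, j, right]
    return chart[0, n, start]


print(count_parses("enraged cow injures farmer with ax".split()))  # 2
print(count_parses("enraged cow injures farmer".split()))          # 1
```

The two trees differ only in where "with ax" attaches: to the noun phrase "farmer" or to the verb phrase headed by "injures".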
• Word Order and Position and Meaning
John hit Bill.
Bill was hit by John.
Bill, John hit.
Who John hit was Bill.
I said John hit Bill.
John hits Bill.
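As a rough illustration of how word order relates to meaning, the sketch below maps the active and passive sentences above to the same "who did what to whom" structure. The two patterns and the name hit_roles are hypothetical; they cover only these two constructions and a single verb.

```python
import re

# Hypothetical patterns for two of the constructions listed above.
ACTIVE = re.compile(r"^(\w+) hit (\w+)\.$")
PASSIVE = re.compile(r"^(\w+) was hit by (\w+)\.$")


def hit_roles(sentence):
    """Return the agent and patient of 'hit', or None if no pattern applies."""
    match = PASSIVE.match(sentence)
    if match:
        # In the passive, the grammatical subject is the one being hit.
        return {"agent": match.group(2), "patient": match.group(1)}
    match = ACTIVE.match(sentence)
    if match:
        return {"agent": match.group(1), "patient": match.group(2)}
    return None


print(hit_roles("John hit Bill."))         # {'agent': 'John', 'patient': 'Bill'}
print(hit_roles("Bill was hit by John."))  # {'agent': 'John', 'patient': 'Bill'}
print(hit_roles("Bill, John hit."))        # None
```

The last call returns None because the fronted-object order has no pattern here; each further construction in the list would need its own rule.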
• Word meaning – semantic roles
– John picked up a bad cold.
– John picked up a large rock.
– John picked up Radio Netherlands on his radio.
• Is meaning compositional?
– Squad helps dog bite victim
– Enraged cow injures farmer with ax
• Going Home, a play in one act (thanks to Bonnie
– Scene 1: Pennsylvania Station, NY
• Bonnie: Long Beach?
• Passerby: Downstairs, LIRR Station.
– Scene 2: Ticket Counter, LIRR Station
• Bonnie: Long Beach?
• Clerk: $4.50.
– Scene 3: Information Booth, LIRR Station
• Bonnie: Long Beach?
• Clerk: 4:19, Track 17.
– Scene 4: On the train, vicinity of Forest Hills
• Bonnie: Long Beach?
• Conductor: Change at Jamaica.
– Scene 5: On the next train, vicinity of Lynbrook
• Bonnie: Long Beach?
• Conductor: Right after Island Park.
• Rule-based
– Symbolic Parsers and morphological analyzers
– Finite state automata
• Probabilistic/statistical
– Learned from observation of (labeled) data
– Predicting new data based on old
– Machine learning
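The two families of approaches can be contrasted in a short sketch. The first half is a hand-written finite state automaton that accepts strings of the form "baa!", "baaa!", and so on; the second half is a most-frequent-tag baseline whose behavior comes entirely from labeled examples. The transition table, the tiny training set, and all function names are invented for illustration.

```python
from collections import Counter, defaultdict

# Rule-based: a deterministic finite state automaton written by hand.
# It accepts 'b', then two or more 'a's, then '!'.
TRANSITIONS = {
    (0, "b"): 1,
    (1, "a"): 2,
    (2, "a"): 3,
    (3, "a"): 3,
    (3, "!"): 4,
}
ACCEPTING = {4}


def fsa_accepts(text):
    """Run the automaton over `text`; reject on any missing transition."""
    state = 0
    for symbol in text:
        state = TRANSITIONS.get((state, symbol))
        if state is None:
            return False
    return state in ACCEPTING


# Statistical: learn each word's most frequent tag from labeled data.
# Hypothetical hand-made (word, tag) observations.
LABELED = [
    ("stud", "NOUN"),
    ("tires", "NOUN"), ("tires", "NOUN"), ("tires", "VERB"),
    ("out", "ADV"),
]


def train_most_frequent_tag(labeled):
    """Map every observed word to the tag it carried most often."""
    counts = defaultdict(Counter)
    for word, tag in labeled:
        counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag_words(words, model, default="NOUN"):
    """Predict tags for new words; unseen words get the default tag."""
    return [(word, model.get(word, default)) for word in words]


print(fsa_accepts("baaa!"))  # True
print(fsa_accepts("ba!"))    # False
model = train_most_frequent_tag(LABELED)
print(tag_words(["stud", "tires", "out"], model))
# [('stud', 'NOUN'), ('tires', 'NOUN'), ('out', 'ADV')]
```

The baseline tags "tires" as a noun because that is what it saw most often, whichever reading the headline intends; the automaton changes only when someone edits its table, while the tagger changes whenever the data does.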
Current Real-World Applications
• Search: very large corpora, e.g. Google
• Question answering: e.g. IBM’s Watson on Jeopardy!,
DARPA who/what/where…, Ask Jeeves
• Translating between one language and another:
e.g. Google Translate, Babelfish
• Summarizing very large amounts of text or
speech: e.g. your email, the news, voicemail
• Sentiment analysis: restaurant or movie reviews
• Dialogue systems: e.g. Amtrak’s ‘Julie’
• Julia Hirschberg
CEPSR 705, [email protected]
Focus: Spoken Language Processing
Lab: The Speech Lab, CEPSR 7LW3-A
• Deceptive speech
• Charismatic speech
• Emotional speech: anger, uncertainty
• Speech summarization: Broadcast News
• Spoken Dialogue Systems: Games Corpus
• ‘Translating Prosody’: English – Mandarin
• Text2Scene Synthesis
Course Details
• Teaching Assistants:
– Mohamed Altantawy
• Email: [email protected]
• Office Hours: CEPSR 7LW1 (Speech Lab), W 5-6,
Th 5:30-6:30
• Will manage CVN course
– Wei Yun Ma
• Email: [email protected]
• Office Hours: CEPSR 725, Tu 10-12
• Text: Daniel Jurafsky and James H. Martin,
Speech and Language Processing, second edition
– Note errata available on website
• Check CourseWorks for additional information on
the class, homework assignments, and posting questions
• Assignments:
– 3 homework assignments: Question-answering, text
classification, delightful surprise
– Midterm and final exams
– Five ‘free’ late days for homework; after that, 10%
off per late day. Late days cannot be used on HW1.
– You will need a CS account
Recorded Lecture Availability
• For on-campus students
– On CVN website
HW1: 10%
HW2: 20%
HW3: 20%
Midterm: 15%
Final: 25%
Class participation: 10%
Academic Integrity
Copying or paraphrasing someone's work (code
included), or permitting your own work to be copied
or paraphrased, even if only in part, is forbidden, and
will result in an automatic grade of 0 for the entire
assignment or exam in which the copying or
paraphrasing was done. Your grade should reflect
your own work. If you expect to have trouble
completing an assignment, please talk to the instructor
or a TA before the due date. Everyone:
read/write-protect your homework files at all times.
For Next Class
• Look at syllabus – ask questions about anything
you don’t understand
• Read Chapters 1-2 of J&M