Speech and Language Processing:
Where have we been and where are we going?
Kenneth Ward Church
AT&T Labs-Research
[email protected]
www.research.att.com/~kwc

Abbreviated version of Eurospeech Plenary
+ Annotated Bibliography
+ Some recent work
Where have we been?
How To Cook A Demo
(After Dinner Talk at TMI-1992 & Invited Talk at TMI-2002)
• Great fun!
• Effective demos (message for the After Dinner Talk):
  – Theater, theater, theater
  – Production quality matters
  – Entertainment >> evaluation
  – Strategic vision >> technical correctness
• Success/Catastrophe (message for the After Breakfast Talk):
  – Warning: demos can be too effective
  – Dangerous to raise unrealistic expectations
Let’s go to the video tape!
(Lesson: manage expectations)
• Lots of predictions
  – Entertaining in retrospect
  – Nevertheless, many of these people went on to very successful careers: president of MIT, Microsoft exec, etc.
1. Machine Translation (1950s) [video]
  – Classic example of a demo → embarrassment in retrospect
2. Translating telephone (late 1980s) [video]
  – Pierre Isabelle pulled a similar demo because it was so effective
  – The limitations of the technology were hard to explain to the public
    • Though well understood by the research community
3. Apple (~1990) [video]
  – Still having trouble setting appropriate expectations
  – Factoid: the day of this demo, speech recognition deployed at scale in the AT&T network, with significant lasting impact, but little media
4. Andy Rooney (~1990): reset expectations [video]
Outline: Where have we been and where are we going?
1. Consistent progress over decades ⇐
   • Moore’s Law, Speech Coding, Error Rate
   [Demonstrate consistent progress over time: Managing Expectations]
2. History repeats itself [Oscillations]
   • Empiricism: 1950s
   • Rationalism: 1970s
   • Empiricism: 1990s
   • Rationalism: 2010s (?)
3. Discontinuities: Fundamental changes that invalidate fundamental assumptions [Disruptive Discontinuities]
   • Petabytes: $2,000,000 → $2,000
   • Can demand keep up with supply?
   • If not → Tech meltdown
   • New priorities: Search >> Compression & Dictation
Charles Wayne’s Challenge:
Demonstrate Consistent Progress Over Time
[Managing Expectations]
• Controversial in 1980s
  – But not in 1990s
  – Though, grumbling
• Benefits
  1. Agreement on what to do
  2. Limits endless discussion
  3. Helps sell the field
     • Manage expectations
     • Fund raising
• Risks (similar to benefits)
  1. All our eggs are in one basket (lack of diversity)
  2. Not enough discussion
     • Hard to change course
  3. Methodology → Burden
[Figure: hockey-stick business case: revenue ($) vs. time (t), with last year (2002), this year (2003), next year (2004)]
Moore’s Law: Ideal Answer
Where have we been and where are we going?
Error Rate
[Borrowed slide: Audrey Le (NIST). Figure: recognition error rate vs. date (15 years)]
Moore’s Law Time Constant:
• 10x improvement per decade
• Limited by R&D Investment
• (Not Physics)
Milestones in Speech and Multimodal Technology Research
[Borrowed slide]
• 1962: Small Vocabulary, Acoustic Phonetics-based; Isolated Words. Technologies: filter-bank analysis; time normalization; dynamic programming.
• 1967–1972: Medium Vocabulary, Template-based; Isolated Words; Connected Digits; Continuous Speech. Technologies: pattern recognition; LPC analysis; clustering algorithms; level building.
• 1977–1982: Large Vocabulary, Statistical-based; Connected Words; Continuous Speech. Technologies: hidden Markov models; stochastic language modeling.
• 1987–1992: Large Vocabulary; Syntax, Semantics; Continuous Speech; Speech Understanding. Technologies: stochastic language understanding; finite-state machines; statistical learning.
• 1997–2002: Very Large Vocabulary; Semantics, Multimodal Dialog, TTS; Spoken dialog; Multiple modalities. Technologies: concatenative synthesis; machine learning; mixed-initiative dialog.

Consistent improvement over time, but unlike Moore’s Law, hard to extrapolate (predict the future).
Speech-Related Technologies
Where will the field go in 10 years? (Niels Ole Bernsen, ed.)
• 2003: Useful speech recognition-based language tutor
• 2003: Useful portable spoken sentence translation systems
• 2003: First pro-active spoken dialogue with situation awareness
• 2004: Satisfactory spoken car navigation systems
• 2005: Small-vocabulary (> 1000 words) spoken conversational systems
• 2006: Multiple-purpose personal assistants (spoken dialog, animated characters)
• 2006: Task-oriented spoken translation systems for the web
• 2006: Useful speech summarization systems in top languages
• 2008: Useful meeting summarization systems
• 2010: Medium-size vocabulary conversational systems
Where have we been and where are we going?
[Figure: the hockey-stick revenue curve ($ vs. t, 2002–2004). Extrapolation/prediction is applicable to consistent progress over time, but not applicable to hockey-stick expectations (manage expectations)]
Where have we been and where are we going?
1. Consistent progress over decades
   • Moore’s Law, Speech Coding, Error Rate
2. History repeats itself ⇐
   • Empiricism: 1950s
   • Rationalism: 1970s
   • Empiricism: 1990s
   • Rationalism: 2010s (?)
3. Discontinuities: Fundamental changes that invalidate fundamental assumptions
   • Petabytes: $2,000,000 → $2,000
   • Can demand keep up with supply?
   • If not → Tech meltdown
   • New priorities: Search >> Compression & Dictation
It has been claimed that recent progress was made possible by Empiricism.
Progress (or Oscillating Fads)?
• 1950s: Empiricism was at its peak
  – Dominating a broad set of fields
    • Ranging from psychology (Behaviorism)
    • To electrical engineering (Information Theory)
  – Psycholinguistics: word frequency norms (correlated with reaction time, errors)
    • Word association norms (priming): bread and butter, doctor/nurse
  – Linguistics/psycholinguistics: focus on distribution (correlate of meaning)
    • Firth: “You shall know a word by the company it keeps”
    • Collocations: strong tea v. powerful computers
• 1970s: Rationalism was at its peak
  – with Chomsky’s criticism of ngrams in Syntactic Structures (1957)
  – and Minsky and Papert’s criticism of neural networks in Perceptrons (1969)
• 1990s: Revival of Empiricism
  – Availability of massive amounts of data (popular argument, even before the web)
    • “More data is better data”
    • Quantity >> Quality (balance)
  – Pragmatic focus:
    • What can we do with all this data?
    • Better to do something than nothing at all
  – Empirical methods (and focus on evaluation): Speech → Language
• 2010s: Revival of Rationalism (?)
[Callouts: Periodic signals are continuous and support extrapolation/prediction. Progress? Consistent progress? Extrapolation/Prediction: Applicable?]
Speech → Language
Has the pendulum swung too far?
• What happened between TMI-1992 and TMI-2002 (if anything)?
• Have empirical methods become too popular? [Plays well at Machine Translation conferences]
  – Has too much happened since TMI-1992?
• I worry that the pendulum has swung so far that
  – We are no longer training students for the possibility
    • that the pendulum might swing the other way
• We ought to be preparing students with a broad education including:
  – Statistics and Machine Learning
  – as well as Linguistic Theory
• History repeats itself: [Mark Twain; a bad idea then and still a bad idea now]
  – 1950s: empiricism
  – 1970s: rationalism (empiricist methodology became too burdensome)
  – 1990s: empiricism
  – 2010s: rationalism (empiricist methodology is burdensome, again)
Rationalism v. Empiricism
• Well-known advocates: Chomsky, Minsky v. Shannon, Skinner, Firth, Harris
• Model: Competence Model v. Noisy Channel Model
• Contexts of Interest: Phrase-Structure v. N-Grams
• Goals: All and Only v. Minimize Prediction Error (Entropy); Explanatory v. Descriptive; Theoretical v. Applied
• Linguistic Generalizations: Agreement & Wh-movement v. Collocations & Word Associations
• Parsing Strategies: Principle-Based, CKY (Chart), ATNs, Unification v. Forward-Backward (HMMs), Inside-Outside (PCFGs)
• Applications: Understanding (who did what to whom) v. Recognition (Noisy Channel Applications)
Revival of Empiricism:
A Personal Perspective
• As a student at MIT, I was solidly opposed to empiricism
  – But that changed soon after moving to AT&T Bell Labs (1983)
• Letter-to-Sound Rules (speech synthesis) [Letter-to-sound rules → Dict]
  – Names (~1985): Letter stats → Etymology → Pronunciation [video]
• Part of Speech Tagging (1988)
• Word Associations (Hanks) [Lexicography]
  – Corpus-based lexicography: Empirical, but not statistical [Case-based reasoning: the best inference is table lookup]
    • Collocations: strong tea v. powerful computers
    • Word Associations: bread and butter, doctor/nurse
  – Contribution: adding stats [Statistics]
    • Mutual info → collocations & word associations (sketched below)
    • Pr(doctor…nurse) >> Pr(doctor) Pr(nurse)
• Good-Turing Smoothing (Gale):
  – Estimate the probability of something you haven’t seen (whales)
• Aligning Parallel Corpora: inspired by Machine Translation (MT)
• Word Sense Disambiguation (river bank v. money bank)
  – Bilingual → Monolingual (Yarowsky)
• Even if IBM’s stat-based approach fails for Machine Translation → lasting benefit (tools, linguistic resources, academic contributions to machine learning) [Played well at TMI-2002]
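The mutual-information statistic behind these word associations fits in a few lines. A minimal sketch, assuming a pre-tokenized corpus and a 5-word association window (the window size and count threshold are placeholder choices, not the original AP-corpus setup):

```python
import math
from collections import Counter

def word_associations(tokens, window=5, min_count=5):
    """Score word pairs by pointwise mutual information:
    PMI(x, y) = log2( Pr(x, y) / (Pr(x) Pr(y)) ).
    Pairs like (doctor, nurse) score high because they co-occur
    far more often than independence would predict."""
    unigrams = Counter(tokens)
    pairs = Counter()
    for i, x in enumerate(tokens):
        for y in tokens[i + 1 : i + 1 + window]:
            pairs[(x, y)] += 1  # y follows x within the window
    n = len(tokens)
    scores = {}
    for (x, y), nxy in pairs.items():
        if unigrams[x] >= min_count and unigrams[y] >= min_count:
            p_xy = nxy / n
            p_x, p_y = unigrams[x] / n, unigrams[y] / n
            scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores
```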
Speech → Language
Shannon’s Noisy Channel Model
• I → Noisy Channel → O
• I′ ≈ argmax_I Pr(I|O) = argmax_I Pr(I) Pr(O|I)
  – Pr(I): Language Model; Pr(O|I): Channel Model (a toy decoder is sketched below)

Trigram Language Model example (“We need to resolve all of the important issues”):
Word      | Rank | More likely alternatives
We        | 9    | The This One Two A Three Please In
need      | 7    | are will the would also do
to        | 1    |
resolve   | 85   | have know do…
all       | 9    | The This One Two A Three Please In
of        | 2    | The This One Two A Three Please In
the       | 1    |
important | 657  | document question first…
issues    | 14   | thing point to

[Application-independent channel model]
Application                         | Input      | Output
Speech Recognition                  | writer     | rider
OCR (Optical Character Recognition) | all        | a1l
Spelling Correction                 | government | goverment
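To make the argmax concrete, here is a minimal noisy-channel spelling corrector. It is a sketch, not the models discussed in the talk: the unigram "language model" counts and the edit-distance-based "channel model" decay are toy placeholders.

```python
import math

# Toy language model: Pr(I) from hypothetical corpus counts.
LM_COUNTS = {"government": 900, "goverment": 0, "covenant": 50}
TOTAL = sum(LM_COUNTS.values()) + len(LM_COUNTS)  # add-one smoothing

def pr_lm(word):
    return (LM_COUNTS.get(word, 0) + 1) / TOTAL

def edit_distance(a, b):
    """Levenshtein distance via one-row dynamic programming."""
    d = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, d[0] = d[0], i
        for j, cb in enumerate(b, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (ca != cb))
    return d[-1]

def pr_channel(observed, intended):
    """Toy channel model Pr(O|I): probability decays with edit distance."""
    return math.exp(-2.0 * edit_distance(observed, intended))

def correct(observed, candidates):
    """I' = argmax_I Pr(I) * Pr(O|I)."""
    return max(candidates, key=lambda i: pr_lm(i) * pr_channel(observed, i))

print(correct("goverment", LM_COUNTS))  # -> government
```

The same decoder structure applies to speech, OCR, and the other rows of the table above; only the channel model changes.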
Speech → Language
Using (Abusing) Shannon’s Noisy Channel Model:
Part of Speech Tagging and Machine Translation
• Speech
  – Words → Noisy Channel → Acoustics
• OCR
  – Words → Noisy Channel → Optics
• Spelling Correction
  – Words → Noisy Channel → Typos
• Part of Speech Tagging (POS):
  – POS → Noisy Channel → Words
• Machine Translation: “Made in America”
  – English → Noisy Channel → French
(Didn’t have the guts to use this slide at Eurospeech in Geneva)
Where have we been and where are we going?
1. Consistent progress over decades
   • Moore’s Law, Speech Coding, Error Rate
2. History repeats itself
   • Empiricism: 1950s
   • Rationalism: 1970s
   • Empiricism: 1990s
   • Rationalism: 2010s (?)
3. Discontinuities: Fundamental changes that invalidate fundamental assumptions ⇐
   • Petabytes: $2,000,000 → $2,000
   • Can demand keep up with supply?
   • If not → Tech meltdown
   • New priorities: Search >> Compression & Dictation
Meeting Demand for Petabytes
• Moore’s Law → More and More Supply
  – Disks, Memory, Network Bandwidth, everything…
  – Petabytes are coming: $2,000,000 (today) → $2,000 (in 10 years)
• Can demand keep up? [Disruptive Discontinuity]
  – If not, revenues will collapse → tech meltdown
  – Much worse than the Dot-Bomb…
• Ans1: no problem
  – Demand has always kept up
  – Pundits have never been able to explain why
    • Thomas J. Watson (1943): “I think there is a world market for maybe five computers” (www.wikipedia.org/wiki/Thomas+J.+Watson)
  – But if you build it, they will come
• Ans2: big problem (prices for PCs & Networks are collapsing)
  – Demand is everything
  – Anyone (even a dot-com) can build a network,
  – But the challenge is to sell it
  – Need a killer app (more minutes on the network)
How much is a Petabyte? (10^15 bytes)
• Question from execs:
  – How do I explain to a lay audience
    • How much is a petabyte
    • And why everyone will buy lots of them
• Wrong answer:
  – 10^6 is a million (a floppy disk/email msg)
  – 10^9 is a billion (a billion here, a billion there…)
  – 10^12 is a trillion (the US debt)
  – 10^15 is a zillion (an unimaginably large #)
How much is a Petabyte?
Some more wrong answers
• Goal: create demand for a petabyte/lifetime
  – ≈ 10^15 bytes / 100 years ≈ 18 megabytes/minute (arithmetic checked below)
  – Text: 18,000 pages/min
  – Speech: 317 telephone channels for 100 years per capita
• Text won’t do it
  – Speech probably won’t either, but it is closer
  – DVD video will (1.8 gigabytes/hour ≈ 1.6 petabytes/lifetime), but
    • Too much opportunity for compression
    • Not enough demand for Picture Phone (privacy concerns)
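A quick back-of-envelope check on these figures. The ~1 kB/s per telephone channel is an assumption (8 kbit/s coded speech) chosen because it reproduces the slide's 317-channel figure; uncompressed 64 kbit/s telephony would need far fewer channels.

```python
# Back-of-envelope: what does a petabyte per lifetime imply?
PB = 10**15                              # bytes
minutes = 100 * 365.25 * 24 * 60         # minutes in 100 years
print(PB / minutes / 1e6)                # ~19 MB/minute (slide rounds to 18)

# Telephone channels needed to fill a PB in 100 years,
# assuming ~1,000 bytes/sec per channel (8 kbit/s coded speech):
seconds = 100 * 365.25 * 24 * 3600
print(PB / (1000 * seconds))             # ~317 channels
```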
Digital Immortality:
Gordon Bell & Jim Gray (2000)
Estimated Lifetime Storage Requirements
[Text won’t consume a PB/person; speech won’t either (but it’s closer)]

Data-types                          | Per day | Per lifetime
email, papers, text                 | 0.5 MB  | 15 GB
photos                              | 2 MB    | 150 GB
speech                              | 40 MB   | 1.2 TB
music                               | 60 MB   | 5.0 TB
video-lite (200 Kb/s)               | 1 GB    | 100 TB
DVD video (4.3 Mb/s = 1.8 GB/hour)  | 20 GB   | 1 PB
Future of Tech Industry Depends On…
• Supply running into a (physical) limit [Not Likely]
  – Moore’s Law breaking down
  – And little progress on compression
• Demand keeping up [Not Optimistic]
  – If we build it, they will come…
• Bell & Gray underestimating demand by a lot
  – Everyone wanting lots and lots of speech
  – Everyone wanting lots of video
  – A miracle (the fat lady might sing…) [Not Likely]
• Big progress on searching speech & video [Best Bet!]
Bait and Switch Strategy
www.elsnet.org
• Bait: public Internet
  – Large, sexy, available, rich hypertext structure
• Switch: as large as the web is
  – There are larger & more valuable private repositories
    • Private Intranets & telephone networks
  – Exclusivity → Value
    • No one cares about data that everyone can have
    • Just as Groucho Marx doesn’t want to be in a club that…
• Strategy: Use the public Internet to develop, test and socialize new ways to extract value from large linguistic repositories
  – Value to society: Port solutions to private repositories
Switch: How Large is Large?
• Web → Renewed Excitement
  – Large, rich hypertext structure & publicly available
  – Ngram freqs → Google = 1000 * BNC [1 TB (ngram freqs) or 1 PB (Gray)?]
    • Google: 100 Billion Words
    • British National Corpus (BNC): 100 Million Words
• It is often said that the web is the largest repository, but…
  – Changes to copyright laws could unlock vast resources: www.lexisnexis.com
• Private Intranets and telephone networks >> Public Web
  – American Telephone Network (FCC): 1 line/person
    • Usage: 1 hour/day/line
    • Assume 1 sec ≈ 1 word → 10 Google collections/day (see the sketch below)
  – Currently, Intranets (data) ≈ telephones (voice)
    • But data is growing faster than voice
  – AT&T networks: 1 PB/day
    • Worldwide networks: tens of PB/day
[A lot of speech, but not a PB per capita]
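The "10 Google collections/day" claim follows from the figures on the slide. A sketch of the arithmetic; the ~280M line count (1 line/person, US, circa 2003) is an assumption, not a number from the talk:

```python
# Rough check: US telephone network vs. Google's ~100B-word collection.
lines = 280e6                  # ~1 line/person (assumed 2003 US population)
words_per_line_per_day = 3600  # 1 hour/day at ~1 word/sec
words_per_day = lines * words_per_line_per_day
print(words_per_day / 100e9)   # ~10 Google collections per day
```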
Bait: Use Web to Establish Excitement:
More data is better data
• Shocking at TMI-1992 (Bob Mercer), but less so a decade later (Eric Brill)
  – Many researchers are finding that performance improves with corpus size, over the full range of sizes that are available.
• EMNLP-2002 Best Paper (& CL): Using the Web to Overcome Data Sparseness, Keller et al.
  – For many tasks:
    • Language modelling
    • Predicting psycholinguistic judgements
  – Larger corpora (100B Google) >> Smaller corpora (100M BNC)
• My spin: The rising tide of data will lift all boats!
  1. TREC Question Answering
  2. Acquiring lexical resources from data
  3. My research on adaptation
[Google is displacing the BNC just as PCs displaced Crays: larger market share → more $$ for R&D → better Moore’s Law time constant]
The rising tide of data will lift all boats!
TREC Question Answering & Google:
What is the highest point on Earth?
The rising tide of data will lift all boats!
Acquiring Lexical Resources from Data:
Dictionaries, Ontologies, WordNets, Language Models, etc.
http://labs1.google.com/sets

England   | Japan      | Cat       | cat
France    | China      | Dog       | more
Germany   | India      | Horse     | ls
Italy     | Indonesia  | Fish      | rm
Ireland   | Malaysia   | Bird      | mv
Spain     | Korea      | Rabbit    | cd
Scotland  | Taiwan     | Cattle    | cp
Belgium   | Thailand   | Rat       | mkdir
Canada    | Singapore  | Livestock | man
Austria   | Australia  | Mouse     | tail
Australia | Bangladesh | Human     | pwd

(Each column: a seed word and the set Google Sets induces from it.)
Rising Tide of Data Lifts all Boats
Bait: use public web to create & socialize new ideas
• More data → better results
  – TREC Question Answering
    • Remarkable performance: Google and not much else
      – Norvig (ACL-02)
      – AskMSR (SIGIR-02)
  – Lexical Acquisition
    • Google Sets
      – Hanks and I tried similar things, but with tiny corpora, which we called large
Switch: port these ideas to private repositories
Recent work
The Chance of Two Noriegas is Closer to p/2 than p²:
Implications for Language Modeling, Information Retrieval and Gzip
• Standard independence models (Binomial, Multinomial, Poisson):
  – Chance of 1st Noriega is p
  – Chance of 2nd is also p
• Repetition is very common
  – Ngrams/words (and their variant forms) appear in bursts
  – Noriega appears several times in a doc, or not at all
• Adaptation & Contagious probability distributions
• Discourse structure (e.g., text cohesion, given/new):
  – 1st Noriega in a document is marked (more surprising)
  – 2nd is unmarked (less surprising)
• Empirically, we find the first Noriega is surprising (p ≈ 6/1000)
  – But the chance of two is not surprising (closer to p/2 than p²)
• Finding a rare word like Noriega is like lightning
  – We might not expect lightning to strike twice in a doc
  – But it happens all the time, especially for good keywords
• Documents ≠ Random Bags of Words
Three Applications & Independence Assumptions:
No Quantity Discounts
• Compression: Huffman Coding
  – |encoding(s)| = ceil(−log2 Pr(s))
  – Two Noriegas consume twice as much space as one
    • |encoding(s s)| = |encoding(s)| + |encoding(s)|
  – No quantity discount
    • Independence is the worst case: any dependencies → less H (space)
• Information Retrieval [Log tf smoothing]
  – Score(query, doc) = Σ_{term ∈ doc} tf(term, doc) · idf(term) (sketched below)
    • idf(term): inverse doc freq: −log2 Pr(term) = −log2 df(term)/D
    • tf(term, doc): number of instances of term in doc
  – Two Noriegas are twice as surprising as one (2·idf v. idf)
  – No quantity discount: any dependencies → less surprise
• Speech Recognition, OCR, Spelling Correction
  – I → Noisy Channel → O
  – Pr(I) Pr(O|I)
  – Pr(I) = Pr(w_1, w_2, …, w_n) ≈ ∏_k Pr(w_k | w_{k-2}, w_{k-1})
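A minimal sketch of the tf·idf score above, with the "no quantity discount" point visible in code: raw tf makes four Noriegas four times as surprising as one, while the log-tf variant from the margin note introduces a discount. The toy collection is a placeholder.

```python
import math

def idf(term, docs):
    """Inverse document frequency: -log2( df(term) / D )."""
    df = sum(term in doc for doc in docs)
    return -math.log2(df / len(docs))

def score(query, doc, docs, log_tf=False):
    """Score(query, doc) = sum over query terms of tf * idf.
    Raw tf: no quantity discount. Log-tf smoothing: a discount."""
    total = 0.0
    for term in query:
        tf = doc.count(term)
        if tf:
            weight = 1 + math.log2(tf) if log_tf else tf
            total += weight * idf(term, docs)
    return total

docs = [["noriega"] * 4 + ["panama"], ["panama", "canal"], ["weather"]]
q = ["noriega"]
print(score(q, docs[0], docs))               # 4 * idf: no discount
print(score(q, docs[0], docs, log_tf=True))  # (1 + log2 4) * idf = 3 * idf
```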
Interestingness Metrics:
Deviations from Independence
• Poisson (and other independence assumptions)
  – Not bad for meaningless random strings
• Deviations from Poisson are clues for hidden variables (sketched below)
  – Meaning, content, genre, topic, author, etc.
• Analogous to mutual information (Hanks)
  – Pr(doctor…nurse) >> Pr(doctor) Pr(nurse)
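One way to operationalize "deviation from Poisson" as an interestingness score. This is a sketch under simple assumptions (documents as token lists; a per-document Poisson rate fit from the total count), not the paper's estimator:

```python
import math

def poisson_surprise(word, docs):
    """Compare the observed document frequency with what a Poisson
    model predicts from the word's total count. With rate
    lam = total/D per document, Poisson gives Pr(k >= 1) = 1 - exp(-lam).
    Bursty content words occur in fewer documents than predicted,
    so the ratio below is >> 1 for interesting words."""
    D = len(docs)
    total = sum(doc.count(word) for doc in docs)
    df = sum(word in doc for doc in docs)
    lam = total / D
    expected_df = D * (1 - math.exp(-lam))
    return expected_df / df if df else float("inf")
```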
Poisson Mixtures: More Poissons → Better Fit
(Interpretation: each Poisson is conditional on hidden variables: meaning, content, genre, topic, author, etc.)
Adaptation: Three Approaches
1. Cache-based adaptation (sketched below)
   Pr(w | …) = λ · Pr_local(w | …) + (1 − λ) · Pr_global(w | …)
2. Parametric models
   – Poisson, Two Poisson, Mixtures (negative binomial)
   Pr(k ≥ 2 | k ≥ 1) = (1 − Pr(1) − Pr(0)) / (1 − Pr(0))
3. Non-parametric
   – Pr(+adapt1) ≡ Pr(test | hist)
   – Pr(+adapt2) ≡ Pr(k ≥ 2 | k ≥ 1)
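A minimal sketch of the cache-based interpolation in approach 1. The global unigram counts and λ = 0.1 are placeholder choices for illustration, not values from the talk:

```python
from collections import Counter

class CacheLM:
    """Interpolate a local cache (words seen so far in the document)
    with a global model:
    Pr(w) = lam * Pr_local(w) + (1 - lam) * Pr_global(w)."""
    def __init__(self, global_counts, lam=0.1):
        self.global_counts = Counter(global_counts)
        self.global_total = sum(self.global_counts.values())
        self.cache = Counter()
        self.lam = lam

    def observe(self, word):
        self.cache[word] += 1  # update the cache as we read the doc

    def prob(self, word):
        p_local = self.cache[word] / max(1, sum(self.cache.values()))
        p_global = self.global_counts[word] / self.global_total
        return self.lam * p_local + (1 - self.lam) * p_global

lm = CacheLM({"the": 1000, "noriega": 1})
before = lm.prob("noriega")
lm.observe("noriega")
print(before, lm.prob("noriega"))  # probability jumps after 1st mention
```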
Positive & Negative Adaptation
• Adaptation:
– How do probabilities change as we read a doc?
• Intuition: If a word w has been seen recently
1. +adapt: prob of w (and its friends) goes way up
2. −adapt: prob of many other words goes down a little
• Pr(+adapt) >> Pr(prior) > Pr(−adapt)
Adaptation: Method 1
• Split each document into two equal pieces:
  – Hist: 1st half of doc
  – Test: 2nd half of doc
• Task:
  – Given hist
  – Predict test
• Compute a contingency table for each word

Documents containing "hostages" in 1990 AP News:
         | test | ¬test
hist     | 638  | 505
¬hist    | 557  | 76,787
Adaptation: Method 1
• Contingency table for word w (documents containing "hostages"):
           | w ∈ test | w ∉ test
  w ∈ hist | a        | b
  w ∉ hist | c        | d
• Notation
  – D = a + b + c + d (library)
  – df = a + b + c (doc freq)
• Prior: Pr(w ∈ test) = (a + c) / D
• +adapt: Pr(w ∈ test | w ∈ hist) = a / (a + b)
• −adapt: Pr(w ∈ test | w ∉ hist) = c / (c + d)
(a sketch of this computation follows)

Pr(+adapt) >> Pr(prior) > Pr(−adapt):
+adapt | prior | −adapt | source
0.56   | 0.014 | 0.0069 | AP 1987
0.56   | 0.015 | 0.0072 | AP 1990
0.59   | 0.013 | 0.0057 | AP 1991
0.39   | 0.004 | 0.0030 | AP 1993
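A sketch of Method 1 over a toy corpus, with documents as token lists split at the midpoint as described above:

```python
def adaptation(word, docs):
    """Split each doc in half; cross-tabulate the word's presence in
    the two halves; estimate prior, +adapt, and -adapt."""
    a = b = c = d = 0
    for doc in docs:
        hist, test = doc[: len(doc) // 2], doc[len(doc) // 2 :]
        in_hist, in_test = word in hist, word in test
        if in_hist and in_test: a += 1
        elif in_hist:           b += 1
        elif in_test:           c += 1
        else:                   d += 1
    D = a + b + c + d
    prior = (a + c) / D
    pos = a / (a + b) if a + b else 0.0  # Pr(w in test | w in hist)
    neg = c / (c + d) if c + d else 0.0  # Pr(w in test | w not in hist)
    return pos, prior, neg  # expect pos >> prior > neg for keywords
```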
Priming, Neighborhoods and Query Expansion
• Priming: doctor/nurse
  – Doctor in hist → Pr(Nurse in test) ↑
• Find docs near hist (IR sense)
  – Neighborhood ≡ set of words in docs near hist (query expansion)
• Partition vocabulary into three sets:
  1. Hist: word in hist
  2. Near: word in neighborhood − hist
  3. Other: none of the above

        | w ∈ test | w ∉ test
  hist  | a        | b
  near  | e        | f
  other | g        | h

• Prior: Pr(w ∈ test) = (a + e + g) / D
• +adapt: Pr(w ∈ test | w ∈ hist) = a / (a + b)
• Near: Pr(w ∈ test | w ∈ near) = e / (e + f)
• Other: Pr(w ∈ test | w ∈ other) = g / (g + h)
Adaptation: Hist >> Near >> Prior
• Magnitude is huge
  – p/2 >> p²
  – Two Noriegas are not much more surprising than one
  – Huge quantity discounts
• Shape: Given/new
  – 1st mention: marked
    • Surprising (low prob)
    • Depends on freq
  – 2nd: unmarked
    • Less surprising
    • Independent of freq
  – Priming:
    • “a little bit” marked
Adaptation is Lexical
• Lexical: adaptation is
  – Stronger for good keywords (Kennedy)
  – Than random strings, function words (except), etc.
• Content ≠ low frequency

+adapt | prior | −adapt | source | word
0.27   | 0.012 | 0.0091 | AP90   | Kennedy
0.40   | 0.015 | 0.0084 | AP91   | Kennedy
0.32   | 0.014 | 0.0094 | AP93   | Kennedy
0.049  | 0.016 | 0.016  | AP90   | except
0.048  | 0.014 | 0.014  | AP91   | except
0.048  | 0.012 | 0.012  | AP93   | except
Adaptation: Method 2
• Pr(+adapt2) = Pr(k ≥ 2 | k ≥ 1) = df_2 / df_1 (sketched below)
• df_k(w) ≡ number of documents that mention word w at least k times
• df_1(w) ≡ standard definition of document freq (df)
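Method 2 needs only document frequencies at two thresholds. A minimal sketch over the same toy document representation:

```python
def df_k(word, docs, k):
    """Number of documents mentioning word at least k times."""
    return sum(doc.count(word) >= k for doc in docs)

def adapt2(word, docs):
    """Pr(+adapt2) = Pr(k >= 2 | k >= 1) = df_2 / df_1."""
    df1 = df_k(word, docs, 1)
    return df_k(word, docs, 2) / df1 if df1 else 0.0
```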
Pr(+adapt1) ≈ Pr(+adapt2)
Within factors of 2-3 (as opposed to 10-1000)
[Figure: curves labeled “3rd mention” and “Priming”]
Adaptation helps more than it hurts
• Examples of big winners (boilerplate) [Hist is a great clue]
  – Lists of major cities and their temperatures
  – Lists of major currencies and their prices
  – Lists of commodities and their prices
  – Lists of senators and how they voted
• Examples of big losers [Hist is misleading]
  – Summary articles
  – Articles that were garbled in transmission
Recent Work (with Kyoji Umemura)
• Applications: Japanese Morphology (text → words)
  – Standard methods: dictionary-based
  – Challenge: OOV (out of vocabulary)
  – Good keywords (OOV) adapt more than meaningless fragments
    • Poisson model: not bad for meaningless random strings
    • Adaptation (deviations from Poisson): great clues for hidden variables
      – OOV, good keywords, technical terminology, meaning, content, genre, author, etc.
  – Extend the dictionary method to also look for substrings that adapt a lot
• Practical procedure for counting df_k(s) for all substrings s in a large corpus (trigrams → million-grams); see the sketch below
  – Suffix array: standard method for computing freq and loc for all s
  – Yamamoto & Church (2001): count df for all s in a large corpus
    • df (and many other ngram stats) for million-grams
    • Although there are too many substrings s to work with (O(n²))
      – They can be grouped into a manageable number of equivalence classes (O(n))
      – Where all substrings in a class share the same stats
  – Umemura (unpublished): generalize the method for df_k
    • Adaptation for million-grams
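The suffix-array machinery referenced above can be sketched compactly: sort all suffixes once, and the frequency of any substring is the width of a binary-search range over the sorted order. This is a naive illustration (quadratic-space build, materialized suffixes), not the linear-space equivalence-class method of Yamamoto & Church:

```python
import bisect

def suffix_array(text):
    """Sorted starting positions of all suffixes (simple, O(n^2 log n))."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def freq(s, text, sa):
    """Occurrences of substring s = width of the block of suffixes
    that start with s in sorted order."""
    suffixes = [text[i:] for i in sa]  # naive; real code searches sa in place
    lo = bisect.bisect_left(suffixes, s)
    hi = bisect.bisect_right(suffixes, s + "\uffff")  # past all s-prefixed suffixes
    return hi - lo

text = "to be or not to be"
sa = suffix_array(text)
print(freq("to be", text, sa))  # 2
```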
Adaptation Conclusions
1. Large magnitude (p/2 >> p²); big quantity discounts
2. Distinctive shape
   • 1st mention depends on freq
   • 2nd does not
   • Priming: between 1st mention and 2nd
3. Lexical:
   – Independence assumptions aren’t bad for meaningless random strings, function words, common first names, etc.
   – More adaptation for content words (good keywords, OOV)
Rising Tide of Data Lifts all Boats
Bait: use public web to create & socialize new ideas
• More data → better results
  – TREC Question Answering
    • Remarkable performance: Google and not much else
      – Norvig (ACL-02)
      – AskMSR (SIGIR-02)
  – Lexical Acquisition
    • Google Sets
      – We tried similar things, but with tiny corpora, which we called large
  – Adaptation
    • Deviations from independence assumptions (Poisson, mutual info) are clues for hidden variables (content)
Switch: port these ideas to private repositories
Recommendations
Bait and Switch Strategy
• Strategy: Use the public Internet to develop, test and socialize new ways to extract value from large linguistic repositories
  – Value to society: Port solutions to private repositories
• Research papers:
  – Keep up the good work! [Bait]
  – There is already considerable interest in evaluation of new ideas on corpora (public repositories)
  – There will be more interest in [Switch]
    • How well methods port to new corpora
    • How well performance scales with size
      – Hopefully corpus size helps
• But of course, all the data in the world
  – Will not solve all the world’s problems
  – Need to understand when more data will help
    • And when it is better to do something else
      – Revival of Rationalism (Linguistics)
More Recommendations
Bait and Switch Strategy
• Infrastructure
  – In addition to traditional public repositories (large) [Bait]
    • Web data, data collection efforts such as LDC
  – We ought to think more about private repositories (even larger) [Switch]
    • Most of us do not keep voice mail for long
      – But I have been using Scanmail to copy my voice mail to email
      – And like many, I keep email online for a long time
    • Private repositories would be much larger if
      – It was more convenient to capture private data
      – and there was obvious value in doing so
    • Currently, tools for public repositories (e.g., Google)
      – are better than comparable tools for private data (e.g., searching email)
    • Better search tools (email, speech & video) → Larger private repositories
• New priorities (consume space) → new killer apps
  – Search (consumes space) >> Dictation (data entry) & Compression
How did I find the videos at the beginning of this talk?
Summary:
Where have we been and where are we going?
[More realistic expectations]
• 1970s: Hot debate: knowledge v. data intensive methods
  – People think about what they can afford to think about
  – Data was expensive
    • Only the richest industrial labs could play
    • Beyond the reach of most universities
    • Victor Zue dreams of having an hour of speech online (with annotations)
• 1990s: Revival of Empiricism: More data is better data! [Demonstrate consistent progress over time; Oscillations]
  – Everyone can afford to play (but still expensive)
  – Linguistic Data Consortium (LDC) → Web
  – Evaluation, evaluation, evaluation → demonstrates consistent progress over time, but not as convincingly as Moore’s Law
  – Data intensive: method of choice
    • Pendulum swings (too) far
    • Is this progress, or is the pendulum about to swing back the other way?
• 2010s: Petabytes everywhere (be careful what you ask for) [Disruptive Discontinuities]
  – Big problem: Supply >> Demand → tech meltdown (??)
  – No problem: Demand has always kept up → new killer apps
    • Search (consumes space) >> dictation (data entry) & compression
    • Video >> Speech >> Text
    • Don’t see how to consume a PB per capita
Virtual Integration:
So many places to look; so little time
• Go to any work center and reps will be using lots of systems
  – Sales, provisioning, maintenance, care
• Users want integration (one stop shopping)
  – But large systems integration projects are expensive and risky
• Virtual integration: benefits of integration without the costs
  – Rapid cycle times: hours rather than years
• Why are reps using so many systems? Typical investigation:
  – Log into many systems, and hope you can find something
  – Tedious, expensive, often unsuccessful
• Typical scenario:
  – Customer calls care and expects us to find their records quickly
  – They don’t know how our databases are organized
  – May not know product(s), primary key(s), spelling of their name in our database(s)
Fighting conflicts: Data Quality
• Databases are bound to be out-of-sync
  – No single authority (database of record: DBOR)
    • Multiple inconsistent views with different owners
  – Human errors (manual data entry) [To err is human…]
  – Lack of proper database transaction semantics
    • Suppliers, customers, partners, competitors & regulators don’t share 2-phase commit + serializability
• Robustness: what is the correct value of a disputed item?
  – Bureaucracy (central planning) → legislate truth
    • Undemocratic and fragile: Consistency ≠ Correctness
  – Elections (voting) → truth emerges from consensus
    • Diversity of opinion (good) v. Inconsistent databases (bad)
• Our premise: Data Publishing >> hoarding [Free Press: necessary for democracy]
  – Data Warehouse ≠ Roach Motel
  – Access to the data → data quality
  – Publish or perish: If it isn’t looked at → it isn’t any good
Where have we been and where are we going?
1. Consistent progress over decades
   • Moore’s Law, Speech Coding, Error Rate
   • Time constant limited by: physics and/or R&D investment
2. History repeats itself:
   • Mark Twain; a bad idea then and still a bad idea now
   • Empiricism: 1950s
   • Rationalism: 1970s
   • Empiricism: 1990s
   • Rationalism: 2010s (?)
3. Discontinuities:
   • Fundamental changes that invalidate fundamental assumptions
   • Petabytes: $2,000,000 → $2,000
   • Can demand keep up with supply?
   • If not → Tech meltdown
   • New priorities: data entry → create demand for petabytes
     – New Killer Apps: Search (creates demand) >> Compression & Dictation
Backup
Revival of Empiricism:
A Personal Perspective
• As a student at MIT, I was solidly opposed to empiricism
  – But that changed soon after moving to AT&T Bell Labs (1983)
• Letter-to-Sound Rules (speech synthesis) [Letter-to-sound rules → Dict]
  – Names (~1985): Letter stats → Etymology → Pronunciation [video]
  – NetTalk: Neural Nets [video]
    • Demo: great theater → unrealistic expectations
    • Self-organizing systems v. empiricism
    • I did it, I did it, I did it, but…
• Part of Speech Tagging (1988)
• Word Associations (Hanks) [Lexicography; pre case-based reasoning: the best inference is table lookup]
  – Collocations: strong tea v. powerful computers
  – Word Associations: bread and butter, doctor/nurse
  – Corpus-based lexicography: Empirical, but not statistical
  – Contribution: adding stats [Statistics]
    • Mutual info → collocations & word associations
• Good-Turing Smoothing (Gale): estimate the probability of something you haven’t seen (whales)
• Aligning Parallel Corpora: inspired by Machine Translation (MT)
• Word Sense Disambiguation (river bank v. money bank)
  – Bilingual → Monolingual (Yarowsky)
• Even if IBM’s stat-based approach fails for Machine Translation → lasting benefit (tools, linguistic resources, academic contributions to machine learning) [Played well at TMI-2002]