Speech and Language Modeling
Shaz Husain
Albert Kalim
Kevin Leung
Nathan Liang
Voice Recognition
The field of Computer Science that
deals with designing computer
systems that can recognize spoken
words.
Voice Recognition implies only that
the computer can take dictation, not
that it understands what is being
said.
Voice Recognition (continued)
A number of voice recognition systems
are available on the market. The most
powerful can recognize thousands of
words.
However, they generally require an
extended training session during which
the computer system becomes
accustomed to a particular voice and
accent. Such systems are said to be
speaker dependent.
Voice Recognition (continued)
Many systems also require that the speaker
speak slowly and distinctly and separate each
word with a short pause. These systems are
called discrete speech systems.
Recently, great strides have been made in
continuous speech systems -- voice recognition
systems that allow you to speak naturally. There
are now several continuous-speech systems
available for personal computers.
Voice Recognition (continued)
Because of their limitations and high cost, voice
recognition systems have traditionally been used
only in a few specialized situations. For example,
such systems are useful in instances when the
user is unable to use a keyboard to enter data
because his or her hands are occupied or
disabled. Instead of typing commands, the user
can simply speak into a headset.
Increasingly, however, as the cost decreases and
performance improves, speech recognition
systems are entering the mainstream and are
being used as an alternative to keyboards.
Natural Language Processing
Comprehending human languages falls under a different
field of computer science called natural language
processing.
Natural Language: human language. English, French, and
Mandarin are natural languages. Computer languages, such
as FORTRAN and C, are not.
Probably the single most challenging problem in Computer
Science is to develop computers that can understand
natural languages. So far, the complete solution to this
problem has proved elusive, although a great deal of
progress has been made.
Proteus Project
At New York University, members of the
Proteus Project have been doing Natural
Language Processing (NLP) research
since the 1960s.
Basic Research: Grammars and Parsers,
Translation Models, Domain-Specific
Language, Bitext Maps and Alignment,
Evaluation Methodologies, Paraphrasing,
and Predicate-Argument Structure.
Proteus Project:
Grammars and Parsers
Grammars are models of linguistic structure. Parsers are algorithms that infer
linguistic structure, given a grammar and a linguistic expression.
Given a grammar, we can design a parser to infer structure from linguistic data. Also,
given some parsed data, we can learn a grammar.
Example of Research Applications: Apple Pie Parser for English. For example, "I love
an apple pie." will be parsed as
(S (NP (PRP I))
   (VP (VBP love)
       (NP (DT an)
           (NN apple)
           (NN pie)))
   (. -PERIOD-))
Web-based application: http://complingone.georgetown.edu/~linguist/applepie.html
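The bracketed output above is a nested structure, so it can be read back into a tree with a few lines of code. The following is a minimal sketch, not part of the Apple Pie Parser; the function names parse_bracketed and leaves are hypothetical and chosen only for illustration.

```python
def parse_bracketed(text):
    """Parse a bracketed parse string into nested lists: [label, child, ...]."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '('")
        pos += 1
        label = tokens[pos]          # constituent or part-of-speech label
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())      # nested constituent
            else:
                children.append(tokens[pos])  # a word
                pos += 1
        pos += 1                     # consume ')'
        return [label] + children

    return read()


def leaves(tree):
    """Collect the words at the leaves of the tree, left to right."""
    words = []
    for child in tree[1:]:
        if isinstance(child, list):
            words.extend(leaves(child))
        else:
            words.append(child)
    return words


tree = parse_bracketed(
    "(S (NP (PRP I)) (VP (VBP love) (NP (DT an) (NN apple) (NN pie))) (. -PERIOD-))"
)
print(tree[0])
print(leaves(tree))
```

Here the root label is the sentence symbol S, and the leaves recover the original word sequence.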
Proteus Project:
Translation Models
Translation models describe the abstract/mathematical
relationship between two or more languages.
Also called models of translational equivalence because
the main thing that they aim to predict is whether
expressions in different languages have equivalent
meanings.
A good translation model is the key to many trans-lingual
applications, the most famous of which is machine
translation.
Proteus Project:
Domain-specific Language
Sentences in different domains of
discourse are structurally different.
For example, imperative sentences are
common in computer manuals, but not in
annual company reports. It would be
useful to characterize these differences in
a systematic way.
Proteus Project:
Bitext Maps and Alignment
A "bitext" consists of two texts that are
mutual translations.
 A bitext map is a description of the
correspondence relation between
elements of the two halves of a bitext.
 Finding such a map is the first step to
building translation models. It is also the
first step in applications like automatic
detection of omissions in translations.
Proteus Project:
Evaluation Methodologies
There are many correct ways to say almost anything, and
many shades of meaning. This "ambiguity" of natural
languages makes the evaluation of NLP systems difficult
enough to be a research topic in itself.
Proteus Project has invented new evaluation methods in
two areas of NLP where evaluation is notoriously difficult:
translation modeling and word sense disambiguation. An
example of research applications: General Text Matcher
(GTM). GTM measures the similarity between texts.
Simple Applet for GTM: http://nlp.cs.nyu.edu/call_gtm.html
Proteus Project:
Paraphrasing
A paraphrase relation exists between two phrases which
convey the same information.
The recognition of paraphrases is an essential part of many
natural language applications: if we want to process text
reporting fact "X", we need to know all the alternative ways
in which "X" can be expressed.
Capturing paraphrases by hand is an almost overwhelming
task because they are so common and many are domain
specific.
Therefore, the Proteus Project has begun to develop procedures
which learn paraphrases from text. The basic idea is that
they look for news stories from the same day which report
the same event, and then examine the different ways in
which the same fact gets reported.
Proteus Project:
Predicate-Argument Structure
An analysis of sentences in terms of
predicates and arguments.
 It is a "deeper" level of linguistic
analysis than constituent structure
or simple dependency structure, in
particular one that regularizes over
nearly equivalent surface strings.
Language Modeling
A bad language model
Language Modeling: Introduction
Language modeling
– One of the basic tasks in building a speech
recognition system
– helps a speech recognizer figure out how
likely a word sequence is, independent
of the acoustics.
– lets the recognizer make the right guess
when two different sentences sound the
same.
Basics of Language Modeling
Language modeling has been studied
under two different points of view.
– First, as a problem of grammar inference:
• the model has to discriminate the sentences which
belong to the language from those which do not
– Second, as a problem of probability estimation.
• If the model is used for recognition, the decision is
usually based on the maximum a posteriori (MAP) rule. The
best sentence L is chosen so that the probability of
the sentence, knowing the observations O, is
maximized.
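The maximum a posteriori rule can be sketched in code. By Bayes' rule, P(L|O) is proportional to P(O|L) × P(L), so the recognizer can add an acoustic log score to a language model log score and pick the largest. This is a minimal sketch: the function name map_decode and all scores are hypothetical, chosen only to show two sentences that sound nearly alike.

```python
import math


def map_decode(candidates, acoustic_logprob, lm_logprob):
    """Return the candidate L that maximizes log P(O|L) + log P(L)."""
    return max(candidates, key=lambda s: acoustic_logprob[s] + lm_logprob[s])


candidates = ["and nothing but the truth", "and nuts sing on the roof"]

# Hypothetical acoustic scores: the sentences sound alike, so they are close.
acoustic_logprob = {candidates[0]: -10.0, candidates[1]: -9.5}

# Hypothetical language model scores: the second sentence is far less likely.
lm_logprob = {candidates[0]: math.log(1e-3), candidates[1]: math.log(1e-12)}

best = map_decode(candidates, acoustic_logprob, lm_logprob)
print(best)
```

Even though the second candidate has a slightly better acoustic score in this toy setup, the language model term decides in favor of the first.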
What is a Language Model
A Language model is a probability
distribution over word sequences
– P(“And nothing but the truth”) ≈ 0.001
– P(“And nuts sing on the roof”) ≈ 0
How Language Models work
Hard to compute
– P(“And nothing but the truth”)
Decompose probability
– P(“And nothing but the truth”) = P(“And”) ×
P(“nothing | And”) × P(“but | And nothing”)
× P(“the | And nothing but”) × P(“truth | And
nothing but the”)
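The decomposition above is the chain rule of probability. A minimal sketch follows; the conditional probabilities in TABLE are hypothetical values picked so that the product comes out near the 0.001 order of magnitude, and the names sentence_probability and cond_prob are illustrative only.

```python
from math import prod

# Hypothetical conditional probabilities P(word | full history).
TABLE = {
    (): {"and": 0.1},
    ("and",): {"nothing": 0.2},
    ("and", "nothing"): {"but": 0.5},
    ("and", "nothing", "but"): {"the": 0.5},
    ("and", "nothing", "but", "the"): {"truth": 0.2},
}


def cond_prob(history, word):
    """Look up P(word | history); unseen events get probability 0."""
    return TABLE.get(history, {}).get(word, 0.0)


def sentence_probability(words):
    """Chain rule: multiply P(w_i | w_1 ... w_{i-1}) over all positions i."""
    return prod(cond_prob(tuple(words[:i]), words[i]) for i in range(len(words)))


p = sentence_probability("and nothing but the truth".split())
print(p)
```

Each factor conditions on the entire history, which is why the full decomposition is hard to estimate from data and shorter histories are used in practice.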
Types of Language Modeling
Statistical Language Modeling
N-gram / Trigram Language Modeling
Structured Language Modeling
Statistical Language Model
A statistical language model (SLM) is
a probability distribution P(s) over
strings s that attempts to reflect how
frequently a string s occurs as a
sentence.
The Trigram / N-grams LM
Assume each word depends only on
the previous two (or n-1) words (three
words total – tri means three, gram
means writing)
– P(“the | … whole truth and nothing but”) ≈
P(“the | nothing but”)
– P(“truth | … whole truth and nothing but
the”) ≈ P(“truth | but the”)
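The trigram assumption can be illustrated with plain counts. This is a minimal maximum-likelihood sketch with no smoothing; the padding symbols <s> and </s> and the function names are assumptions made for the example.

```python
from collections import Counter


def train_trigram(tokens):
    """Count trigrams and the two-word contexts that precede each word."""
    padded = ["<s>", "<s>"] + tokens + ["</s>"]
    trigrams = Counter(zip(padded, padded[1:], padded[2:]))
    contexts = Counter()
    for (u, v, _w), count in trigrams.items():
        contexts[(u, v)] += count
    return trigrams, contexts


def trigram_prob(word, u, v, trigrams, contexts):
    """Maximum-likelihood estimate of P(word | u v); 0 for unseen contexts."""
    if contexts[(u, v)] == 0:
        return 0.0
    return trigrams[(u, v, word)] / contexts[(u, v)]


tokens = "the whole truth and nothing but the truth".split()
trigrams, contexts = train_trigram(tokens)

# Only the two previous words matter, however long the history is.
print(trigram_prob("the", "nothing", "but", trigrams, contexts))
print(trigram_prob("truth", "but", "the", trigrams, contexts))
```

With so little data most trigrams are unseen and receive probability 0, which is the data sparsity problem that smoothing addresses.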
Structured Language Models
Language has structure – noun
phrases, verb phrases, etc.
 Use structure of language to detect
long distance information
 Promising results
 But: time consuming; language is
right branching
Perplexity
– is the geometric average inverse probability
– measures language model difficulty, not acoustic difficulty.
– The lower the perplexity, the closer we are to the true model.
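Perplexity as a geometric average of inverse probabilities can be computed as follows. This is a minimal sketch; the function name perplexity is illustrative, and the input is assumed to be the list of probabilities a model assigned to each word of some test text.

```python
import math


def perplexity(word_probs):
    """Geometric mean of 1/p over the per-word probabilities."""
    log_sum = sum(math.log(p) for p in word_probs)
    return math.exp(-log_sum / len(word_probs))


# A model that gives every word probability 1/10 is as uncertain as a
# uniform choice among 10 words.
print(perplexity([0.1] * 5))

# Assigning higher probability to the observed words lowers perplexity.
print(perplexity([0.5, 0.25, 0.5, 0.25]))
```

A single word with probability 0 makes the logarithm undefined, which is one reason unsmoothed models cannot be evaluated this way.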
Language Modeling Techniques
Smoothing
– addresses the problem of data sparsity: there is rarely
enough data to accurately estimate the parameters of a
language model.
– gives a way to combine less specific, more accurate
information with more specific, but noisier data
– E.g., deleted interpolation, Katz (or Good-Turing)
smoothing, and modified Kneser-Ney smoothing
Caching
– is a widely used technique that uses the observation
that recently observed words are likely to occur again.
Models from recently observed data can be combined
with more general models to improve performance.
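Combining a more specific estimate with a more general one can be sketched with simple linear interpolation of a bigram and a unigram model. This is an illustration of the idea only, not Katz or Kneser-Ney smoothing; the weight lam and the toy corpus are hypothetical.

```python
from collections import Counter


def make_interpolated_bigram(tokens, lam):
    """Return P(word | prev) = lam * P_bigram + (1 - lam) * P_unigram."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    contexts = Counter(tokens[:-1])  # how often each word has a successor
    total = len(tokens)

    def prob(word, prev):
        p_general = unigrams[word] / total
        if contexts[prev] == 0:
            return p_general         # unseen context: use the general model
        p_specific = bigrams[(prev, word)] / contexts[prev]
        return lam * p_specific + (1 - lam) * p_general

    return prob


tokens = "a b a c a b".split()
prob = make_interpolated_bigram(tokens, lam=0.5)

print(prob("b", "a"))  # seen bigram
print(prob("c", "b"))  # unseen bigram still gets some probability
```

The unseen bigram is no longer assigned zero, because the unigram term always contributes.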
LM Techniques (continued)
Skipping models
– use the observation that even words that are not directly
adjacent to the target word contain useful information.
Sentence-mixture models
– use the observation that there are many different kinds of
sentences. By modeling each sentence type separately,
performance is improved.
Clustering
– Words can be grouped together into clusters through various
automatic techniques; then the probability of a cluster can be
predicted instead of the probability of the word.
– can be used to make smaller models or better performing ones.
Finding Parameter Values
Split data into training, “heldout”, test
Try lots of different values for λ on
heldout data, pick best
Test on test data
Sometimes, can use tricks like “EM”
(expectation maximization) to find values
Heldout should have (at least) 100-1000
words per parameter.
Need enough test data to be statistically
significant. (1000s of words perhaps)
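The held-out procedure can be sketched as a grid search over an interpolation weight. This is a minimal sketch under stated assumptions: each held-out word is represented by a hypothetical pair (probability under the specific model, probability under the general model), and the weight with the lowest held-out perplexity is kept.

```python
import math


def heldout_perplexity(lam, heldout_pairs):
    """Perplexity of the mixture lam * specific + (1 - lam) * general."""
    log_sum = 0.0
    for p_specific, p_general in heldout_pairs:
        log_sum += math.log(lam * p_specific + (1 - lam) * p_general)
    return math.exp(-log_sum / len(heldout_pairs))


def pick_weight(heldout_pairs, grid):
    """Try every candidate weight and keep the one with lowest perplexity."""
    return min(grid, key=lambda lam: heldout_perplexity(lam, heldout_pairs))


# Hypothetical held-out data: the specific model is sometimes sharp,
# sometimes assigns zero to a word it never saw in training.
heldout_pairs = [(0.5, 0.1), (0.0, 0.05), (0.4, 0.2), (0.0, 0.1)]
grid = [i / 10 for i in range(1, 10)]

best = pick_weight(heldout_pairs, grid)
print(best, heldout_perplexity(best, heldout_pairs))
```

The chosen weight would then be frozen and the final model evaluated once on the separate test data.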
Caching: Real Life
Someone says “I swear to tell the truth”
System hears “I swerve to smell the soup”
Cache remembers!
Person says “The whole truth”, and, with
cache, system hears “The whole soup.” –
errors are locked in.
Caching works well when users correct errors
as they go; it works poorly, or even hurts, without
correction.
Caching
If you say something, you are likely to say it again later.
Interpolate trigram with cache.
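Interpolating a base model with a cache can be sketched as below. This is a minimal sketch assuming a unigram cache of recently observed words; the class name CachedModel, the weight lam, and the toy base model are all hypothetical.

```python
from collections import Counter


class CachedModel:
    """Mix a base model with a unigram cache of recently seen words."""

    def __init__(self, base_prob, lam):
        self.base_prob = base_prob  # function(word, history) -> probability
        self.lam = lam              # weight on the base model
        self.cache = Counter()

    def observe(self, word):
        """Record a word the user has just said (or corrected)."""
        self.cache[word] += 1

    def prob(self, word, history):
        total = sum(self.cache.values())
        if total == 0:
            return self.base_prob(word, history)  # empty cache: base model only
        p_cache = self.cache[word] / total
        return self.lam * self.base_prob(word, history) + (1 - self.lam) * p_cache


def toy_base(word, history):
    """Hypothetical base model that finds both words equally likely."""
    return {"truth": 0.01, "soup": 0.01}.get(word, 0.0)


model = CachedModel(toy_base, lam=0.9)
for _ in range(3):
    model.observe("truth")

print(model.prob("truth", ("the", "whole")))
print(model.prob("soup", ("the", "whole")))
```

If the cache were filled with misrecognized words instead, the same mechanism would favor the errors, which is why user correction matters.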
Skipping
P(z|…rstuvwxy) ≈ P(z|vwxy)
Why not P(z|v_xy) – “skipping” n-gram – skips value of 3-back word.
Example: “P(time|show John a good)” →
P(time | show ____ a good)
P(z|…rstuvwxy) ≈ λ P(z|vwxy) +
μ P(z|vw_y) + (1-λ-μ) P(z|v_xy)
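The interpolation of a full context with skipped contexts can be sketched with masked counts. This is a minimal sketch without smoothing: the weights lam and mu, the toy corpus, and the function names are hypothetical, and a component with an unseen context simply contributes nothing.

```python
from collections import Counter


def masked_context(history, skip):
    """Replace the history positions listed in `skip` with '_'."""
    return tuple("_" if j in skip else w for j, w in enumerate(history))


def train_masked(tokens, n, skip):
    """Count (masked context, next word) pairs for an n-gram model."""
    joint, contexts = Counter(), Counter()
    for i in range(n - 1, len(tokens)):
        ctx = masked_context(tokens[i - n + 1:i], skip)
        joint[(ctx, tokens[i])] += 1
        contexts[ctx] += 1
    return joint, contexts


def skipping_prob(word, history, components):
    """Weighted sum over components of (weight, skip, joint, contexts)."""
    total = 0.0
    for weight, skip, joint, contexts in components:
        ctx = masked_context(history, skip)
        if contexts[ctx]:
            total += weight * joint[(ctx, word)] / contexts[ctx]
    return total


tokens = "show john a good time show mary a good time".split()
lam, mu = 0.5, 0.2  # hypothetical interpolation weights
components = [
    (lam, set(), *train_masked(tokens, 5, set())),      # P(z | v w x y)
    (mu, {2}, *train_masked(tokens, 5, {2})),           # P(z | v w _ y)
    (1 - lam - mu, {1}, *train_masked(tokens, 5, {1})),  # P(z | v _ x y)
]

# "sue" never appeared, but the skipped context "show _ a good" did.
p = skipping_prob("time", ("show", "sue", "a", "good"), components)
print(p)
```

Only the component that skips the unseen name contributes here, so the prediction survives a context the full model has never seen.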
What is P(“Tuesday | party on”)
Similar to P(“Monday | party on”)
Similar to P(“Tuesday | celebration on”)
Put words in clusters:
– WEEKDAY = Sunday, Monday, Tuesday, …
– EVENT=party, celebration, birthday, …
Predictive Clustering Example
Find P(Tuesday | party on)
– Psmooth (WEEKDAY | party on) ×
Psmooth (Tuesday | party on WEEKDAY)
– C( party on Tuesday) = 0
– C(party on Wednesday) = 10
– C(arriving on Tuesday) = 10
– C(on Tuesday) = 100
Psmooth (WEEKDAY | party on) is high
Psmooth (Tuesday | party on WEEKDAY) backs off
to Psmooth (Tuesday | on WEEKDAY)
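The predictive clustering estimate can be worked through with the counts from the slide. This is a simplified sketch: the counts C(on Wednesday) = 100 and C(party on) = 40 are hypothetical additions needed to complete the arithmetic, and the backoff below applies no discounting, so it is not a properly normalized Katz-style model.

```python
from collections import Counter

CLUSTER = {"Tuesday": "WEEKDAY", "Wednesday": "WEEKDAY"}

# Counts from the slide (C(party on Tuesday) = 0 is simply absent).
trigrams = Counter({("party", "on", "Wednesday"): 10,
                    ("arriving", "on", "Tuesday"): 10})
bigrams = Counter({("on", "Tuesday"): 100,
                   ("on", "Wednesday"): 100})   # second count is hypothetical
context_total = {("party", "on"): 40}           # hypothetical C(party on)


def members(cluster):
    return [w for w, c in CLUSTER.items() if c == cluster]


def p_cluster(cluster, u, v):
    """P(cluster | u v) from trigram counts summed over the cluster."""
    count = sum(trigrams[(u, v, w)] for w in members(cluster))
    return count / context_total[(u, v)]


def p_word_in_cluster(word, u, v):
    """P(word | u v cluster), backing off to P(word | v cluster) when unseen."""
    group = members(CLUSTER[word])
    if trigrams[(u, v, word)] > 0:
        return trigrams[(u, v, word)] / sum(trigrams[(u, v, w)] for w in group)
    return bigrams[(v, word)] / sum(bigrams[(v, w)] for w in group)


p = p_cluster("WEEKDAY", "party", "on") * p_word_in_cluster("Tuesday", "party", "on")
print(p)
```

A plain trigram estimate would give 0 for "Tuesday" after "party on", whereas the clustered estimate borrows evidence from "party on Wednesday" and "on Tuesday".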
Microsoft Language Modeling
Microsoft language modeling research falls into several
areas:
Language Model Adaptation. Natural language technology
in general and language models in particular are very brittle
when moving from one domain to another. Current
statistical language models are built from text specific to
newspapers and TV/radio broadcasts which has little to do
with the everyday use of language by a particular
individual. We are investigating means of adapting a
general-domain statistical language model to a new
domain/user when we have access to limited amounts of
sample data from the new domain/user.
Microsoft Language Modeling
Can Syntactic Structure Help? Current language
models make no use of the syntactic properties
of natural language but rather use very simple
statistics such as word co-occurrences. Recent
results show that incorporating syntactic
constraints in a statistical language model
reduces the word error rate on a conventional
dictation task by 10%. We are working on finding
the best way of "putting language into language
models" as well as exploring the new possibilities
opened by such structured language models for
other tasks such as speech and language
understanding.
Microsoft Language Modeling
Speech Utterance Classification. A simple first step to more
natural user interfaces in interactive voice response
systems is automated call routing. Instead of listening to
prompts like "If you are trying to reach department X say
Yes, otherwise say No" or punching keys on your telephone
keypad, one could simply state in a sentence what the
problem is, for example "There is a fraudulent transaction
on my last statement" and get connected to the right
customer service representative. We are developing
technology that aims at classifying speech utterances into a
limited set of classes, enhancing the role of the traditional
language model such that it also assigns a category to a
given utterance.
Microsoft Language Modeling
Building the best language models we
can. In general, the better the language
model, the lower the error rate of the
speech recognizer. By putting together the
best results available on language
modeling, we have created a language
model that outperforms a standard
baseline by 45%, leading to a 10%
reduction in error rate for our speech
recognizer. The system has the best
reported results of any language model.
Microsoft Language Modeling
Language modeling for other applications.
Speech recognition is not the only use for
language models. They are also useful in
fields like handwriting recognition,
spelling correction, even typing Chinese!
Like speech recognition, all of these are
areas where the input is ambiguous in
some way, and a language model can help
us guess the most likely input. We're also
working on finding new uses for language
models, in other areas.
Microsoft Speech Software
Development Kit
enables developers to create, debug and
deploy speech-enabled ASP.NET Web
applications intended for deployment to a
Microsoft Speech Server.
 applications are designed for devices
ranging from telephones to Windows
Mobile™-based devices and desktop PCs.
Speech Application Language Tags (SALT)
SALT is an XML-based API that brings speech interactions to the
Web.
SALT is an extension of HTML and other markup languages
(cHTML, XHTML, WML) that adds a powerful speech interface to
Web pages, while maintaining and leveraging all the advantages
of the Web application model. These tags are designed to be used
for both voice-only browsers (for example, a browser accessed
over the telephone) and multimodal browsers.
SALT is a small set of XML elements, with associated attributes
and DOM object properties, events, and methods, which may be
used in conjunction with a source markup document to apply a
speech interface to the source page. The SALT formalism and
semantics are independent of the nature of the source document,
so SALT can be used equally effectively within HTML and all its
flavors, or with WML, or with any other SGML-derived markup.
What kind of applications can we
build with SALT?
SALT can be used to add speech
recognition, synthesis, and
telephony capabilities to HTML- or
XHTML-based applications, making
them accessible from telephones or
GUI-based devices such as
PCs, tablet PCs, and
wireless personal digital assistants
(PDAs).
XML (Extensible Markup Language)
XML is a collection of protocols for
representing structured data in a text
format that makes it straightforward to
interchange XML documents on different
computer systems.
XML allows new markup to be defined.
XML documents contain structured data.
They can be transformed into appropriate
formats using XSL or XSLT.
The main top-level elements
<prompt …>
– For speech synthesis configuration and prompt playing
<listen …>
– For speech recognizer configuration, recognition
execution and post-processing, and recording
<dtmf …>
– For configuration and control of DTMF collection
<smex …>
– For general-purpose communication with platform components
The input elements <listen> and <dtmf> also
contain grammars and binding controls
<grammar …>
– For specifying input grammar resources
<bind …>
– For processing of recognition results
<record …>
– For recording audio input
Speech Library Example
<input name="Date" type="Dates" />
<input name="PersonToMeet" type="text" />
<input name="Duration" type="time" />
<prompt …> Schedule a meeting
  <value targetElement="Date"/> Date
  <value targetElement="Duration"/> Duration
  <value targetElement="PersonToMeet"/> Person
</prompt>
<listen …> <grammar …/>
  <bind test="/@confidence $lt$ 50"
    targetElement="prompt_confirm" targetMethod="start" />
  <bind test="/@confidence $lt$ 50"
    targetElement="listen_confirm" targetMethod="start" />
  <bind test="/@confidence $ge$ 50"
    targetElement="Date" value="//Meeting/Date" />
  <bind test="/@confidence $ge$ 50"
    targetElement="Duration" value="//Meeting/Duration" />
  <bind test="/@confidence $ge$ 50"
    targetElement="PersonToMeet" value="//Meeting/Person" /> …
Example (continued)
<rule name="MeetingProperties"/>
<ruleref name="Date"/>
<ruleref name="Duration"/>
<ruleref name="Time"/>
<ruleref name="Person"/>
<ruleref name="Subject"/>
.. ..
<ruleref name="Meeting"/>
<xsl:apply-templates name="DayOfWeek"/>
<xsl:apply-templates name="Time"/>
<xsl:apply-templates name="Duration"/>
<xsl:apply-templates name="Person"/>
<l propname="DayOfWeek">
  <p valstr="Sun"> Sunday </p>
  <p valstr="Mon"> Monday </p>
  <p valstr="Mon"> first day </p>
  .. .. ..
  <p valstr="Sat"> Saturday </p>
</l>
Voice: "first day"
Generates an XML element:
<DayOfWeek text="first day">Mon</DayOfWeek>
<l propname="Person">
  <p valstr="Nathan">CEO</p>
  <p valstr="Nathan">Nathan</p>
  <p valstr="Nathan">boss</p>
  <p valstr="Albert">programmer</p>
</l>
Voice: "CEO"
Generates:
<Person text="CEO">Nathan</Person>
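The mapping from a spoken phrase to a normalized value can be sketched outside of SALT. This is an illustration of the idea only, not the speech platform's implementation; the PHRASES table mirrors the lists above, and the function name recognize is hypothetical.

```python
import xml.etree.ElementTree as ET

# Spoken phrase -> normalized value, one table per property (from the lists above).
PHRASES = {
    "DayOfWeek": {"sunday": "Sun", "monday": "Mon", "first day": "Mon",
                  "saturday": "Sat"},
    "Person": {"ceo": "Nathan", "nathan": "Nathan", "boss": "Nathan",
               "programmer": "Albert"},
}


def recognize(propname, spoken):
    """Return an XML element string for a recognized phrase, or None."""
    value = PHRASES[propname].get(spoken.lower())
    if value is None:
        return None                       # phrase is not in the grammar
    element = ET.Element(propname, {"text": spoken})
    element.text = value
    return ET.tostring(element, encoding="unicode")


print(recognize("Person", "CEO"))
print(recognize("DayOfWeek", "first day"))
```

The text attribute keeps what was actually said, while the element content carries the value the application works with.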
XML Result
<calendar:meeting text="…">
<DateTime text="…">
<Time text="…">2:00</Time>
<Duration text="…">3600</Duration>
How SALT Works
– For multimodal applications, SALT can be added to a visual page to
support speech input and/or output. This is a way to speech-enable
individual controls for 'push-to-talk' form-filling scenarios, or to add
more complex mixed initiative capabilities if necessary.
– A SALT recognition may be started by a browser event, such as a pen-down on a textbox, which activates a grammar relevant to
the textbox and binds the recognition result into the textbox.
– For applications without a visual display, SALT manages the
interactional flow of the dialog and the extent of user initiative by
using the HTML eventing and scripting model.
– In this way, the full programmatic control of client-side (or
server-side) code is available to application authors for the
management of prompt playing and grammar activation.
Sample Implementation Architecture
A Web server. This Web server generates Web pages containing
HTML, SALT, and embedded script. The script controls the dialog
flow for voice-only interactions. For example, the script defines
the order for playing the audio prompts to the caller assuming
there are several prompts on a page.
A telephony server. This telephony server connects to the
telephone network. The server incorporates a voice browser
interpreting the HTML, SALT, and script. The browser can run in a
separate process or thread for each caller. Of course, the voice
browser interprets only a subset of HTML since much of HTML
refers to GUI and is not relevant to a voice browser.
A speech server. This speech server recognizes speech, plays
audio prompts, and sends responses back to the user.
The client device. Clients include, for example, a Pocket PC or
desktop PC running a version of Internet Explorer capable of
interpreting HTML and SALT.
SALT Architecture
Multimodal Interactive Notepad
MiPad's speech input addresses the
defects of the handheld, such as the
struggle to wrap your hands around a
small pen and hit the tiny target known as
an on-screen keyboard.
Some of the current limitations of speech
recognition (background noise, multiple
users, accents, and idioms) can be helped
with pen input.
What does it do?
MiPad cleverly sidesteps some of the problems of speech technology by letting the
user touch the pen to a field on the screen, directing the speech recognition engine to
expect certain types of input. The Speech group calls this technology "Tap and Talk."
If you're sending an e-mail, and you tap the "To" field with the pen before you speak,
the system knows to expect a name. It won't try to translate "Helena Bayer" into
"Hello there." The semantic information related to this field is limited, leading to a
reduced error rate.
On the other hand, if you're filling in the subject field and using free-text dictation, the
engine behind MiPad knows to expect anything. This is where the "Tap and Talk"
technology comes in handy again. If the speech recognition engine has translated
your spoken "I saw a bear," into the text "I saw a hair," you can use the stylus to tap
on the word "hair" and repeat "bear," to correct the input. This focused correction, an
evolution of the mouse pointer, is easy and painless compared to having to re-type or
repeat the complete sentence.
The "Tap and Talk" interface is always available on your MiPad device. The user can
give spontaneous commands by tapping the Command button and talking to the
handheld. You might tell your MiPad device, "I want to make an appointment," and the
MiPad will obediently bring up an appointment form for you to fill in with speech, pen,
or both.
Some Projects on Their Way
Projects for Speech Recognition
Robust techniques for speech recognition
in noisy environments
(Funded by EPSRC and Bluechip
Technologies Ltd, Belfast)
 Improved large-vocabulary speech
recognition using syllables
 Multi-modal techniques for improved
speech recognition (e.g., combining audio
and visual information)
Projects for Speech Recognition
Decision-tree unified multi-resolution
models for speech communication on
mobile devices in noisy environments
(Funded by EPSRC in collaboration.)
 Modeling Voice, Accent and Emotion in
Text to Speech Synthesis
(Funded by EPSRC, in collaboration)
 TCS Programme No 3191
(In collaboration with Bluechip
Technologies Ltd, Belfast)
Projects for Language Modeling
Development and Integration of Statistical
Speech and Language Models
(Funded by EPSRC)
 Comparison of Human and Statistical
Language Model Performance
(Funded by EPSRC)
 Improved statistical language modeling
through the use of domains
 Modeling individual words as a means of
increasing the predictive power of a
language model
Robust techniques for speech recognition in
noisy environments
(Funded by EPSRC)
Speech recognition degrades dramatically
when a mismatch occurs between training
and operating conditions.
 Mismatch due to ambient or
communications-channel noise.
 Focus on robust signal pre-processing.
 Assume knowledge about the noise or the
Robust techniques for speech recognition in
noisy environments
(Funded by EPSRC)
Frequency-band corruption
 Partial-time duration corruption
 Partial feature stream
corruption (some components are
more sensitive than others)
 Inaccurate noise-reduction
 Combinations.
Improved large-vocabulary speech
recognition using syllables
Fast bootstrapping of initial phone
models of a new language.
– Requires less training data
Generating baseforms (phonetic
spellings) for phonetic languages.
– Requires deep linguistic knowledge
Improved large-vocabulary speech
recognition using syllables
– Existing acoustic model is used to obtain
initial phone models.
• Bootstrapping through alignment of target language
speech data.
• Bootstrapping through alignment of base language
speech data.
Statistical baseform generation
– Based on context-dependent decision trees
• Tree is built for each letter.
Multi-modal techniques for
improved speech recognition (e.g.,
combining audio and visual
information)
Focus on the problem of combining visual
cues with audio signals for the purpose of
improved automatic machine recognition.
Large-vocabulary continuous speech
recognition (LVCSR) has made significant
progress, yet only under controlled conditions.
Recognition of speech utterances with
visual cues is limited to small vocabularies,
speaker-dependent training, and isolated-word
speech.
Decision-tree unified multi-resolution
models for speech communication on
mobile devices in noisy environments
Re-configurable multi-resolution decision-tree modeling.
Prediction of time-varying spectrum of
non-stationary noise sources.
Developing a unified model for speech
integrating features for recognition and
synthesis including speaker adaptation.
Dynamic multi-resolution models to
mitigate the impact of distortion of low-amplitude short-duration speech.
Modeling Voice, Accent and
Emotion in Text to Speech Synthesis
Emotions: neutral, bored, angry, happy, sad, frightened
Basic Principles of ASR
All ASRs work in two phases
Training phase
– System learns reference patterns
Recognizing phase
– Unknown input pattern is identified by considering the set of references
Three major modules
Signal processing
– Transforms speech
signals into sequence
of feature vectors
Acoustic modeling
– Recognizer matches
the sequence of
observations with
subword models
Language modeling
– Recognized word is
used to construct a
Given the identities of all previous words,
a language model is a conditional
distribution on the identity of the i-th word
in a sequence.
A trigram model treats language as a
second-order Markov process.
The model is clearly false, since it makes the
computationally convenient
approximation that a word depends on
only the two previous words.
Speech Recognition
Speech recognition is all about
understanding the human speech.
 The ability to convert speech into a
sequence of words or meaning and
then into action.
 The challenge is how to achieve this
in the real world where unknown time
varying noise is a factor.
Language Modeling
To be able to provide the
probabilities of phrases occurring
within a given context.
Improve the performance of speech
recognition systems and internet
search engines.
