Introduction to
Conversational Interfaces
Jim Glass ([email protected])
Spoken Language Systems Group
MIT Laboratory for Computer Science
February 10, 2003
Virtues of Spoken Language
• Natural: Requires no special training
• Flexible: Leaves hands and eyes free
• Efficient: Has high data rate
• Economical: Can be transmitted and received inexpensively
Speech interfaces are ideal for information
access and management when:
• The information space is broad and complex,
• The users are technically naive, or
• Speech is the only available modality.
Communication via Spoken Language
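[Diagram: the human's speech input passes through Speech Recognition to Text, and through Understanding to Meaning; the computer's output passes from Meaning through Generation to Text, and through Speech Synthesis to Speech.]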
Components of Conversational Systems
• Speech Recognition
• Language Understanding
• Context Resolution
• Dialogue Management
• Database
• Language Generation
• Speech Synthesis
• Audio
Components of MIT Conversational Systems
[Diagram: GALAXY architecture, with a Hub connecting the components below]
• Speech Recognition: SUMMIT
• Language Understanding: TINA
• Context Resolution: Discourse
• Dialogue Management: Manager
• Database
• Language Generation: GENESIS
• Speech Synthesis: ENVOICE
• Audio
Segment-Based Speech Recognition
• Waveform
• Frame-based measurements (every 5 ms)
• Segment network created by interconnecting spectral landmarks
• Probabilistic search finds most likely phone & word strings
[Figure: segment network for the utterance "computers that talk", with phone labels including k, ax, m, p, uw, dx, er, z, dh, ae, t, ao]
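The idea of a probabilistic search over a segment network can be illustrated with a small sketch. It is not the SUMMIT algorithm: it assumes a toy network whose landmarks are numbered in time order, works at the phone level only (no lexicon or language model), and uses made-up log-probabilities. The function name `best_phone_path` and the `TOY_SEGMENTS` data are hypothetical.

```python
import math

def best_phone_path(num_landmarks, segments):
    """Viterbi-style search over a segment network (a DAG of landmarks).

    segments: list of (start_landmark, end_landmark, phone, log_prob),
    with start_landmark < end_landmark.
    Returns (total_log_prob, phones) for the best path from landmark 0
    to the last landmark.
    """
    best = [(-math.inf, [])] * num_landmarks
    best[0] = (0.0, [])
    # Processing segments by end landmark guarantees that the best score
    # at a segment's start landmark is final before the segment is used.
    for start, end, phone, logp in sorted(segments, key=lambda s: s[1]):
        score = best[start][0] + logp
        if score > best[end][0]:
            best[end] = (score, best[start][1] + [phone])
    return best[-1]

# Hypothetical network for the word "talk" with competing segmentations.
TOY_SEGMENTS = [
    (0, 1, "t", -0.2),
    (1, 2, "ao", -0.5),
    (2, 3, "k", -0.3),
    (1, 3, "ao", -1.5),  # one long vowel segment spanning two landmarks
    (0, 2, "t", -2.0),   # an over-long stop segment
]

score, phones = best_phone_path(4, TOY_SEGMENTS)
print(phones, round(score, 3))
```

Because every segment joins two landmarks, competing segmentations of the same stretch of speech are scored against each other in a single pass.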
Natural Language Understanding
Some syntactic nodes carry semantic tags for creating semantic frame
[Figure: parse tree for "show me flights from boston to denver", with nodes sentence, full_parse, command, subject, display, topic, predicate, flight, source, destination, flight_list, and city]
Semantic frame:
Clause: DISPLAY
  Topic: FLIGHT
    Predicate: FROM
      Topic: CITY
        Name: "Boston"
    Predicate: TO
      Topic: CITY
        Name: "Denver"
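A semantic frame like the one for "show me flights from boston to denver" can be held in a nested structure and flattened into query constraints. The sketch below is illustrative only: the dictionary layout and the function `frame_to_query` are assumptions, not TINA's actual representation.

```python
# Hypothetical nested-dict encoding of the semantic frame on the slide.
FRAME = {
    "clause": "DISPLAY",
    "topic": {
        "name": "FLIGHT",
        "predicates": [
            {"name": "FROM", "topic": {"name": "CITY", "value": "Boston"}},
            {"name": "TO", "topic": {"name": "CITY", "value": "Denver"}},
        ],
    },
}

def frame_to_query(frame):
    """Flatten a semantic frame into key/value constraints for a lookup."""
    query = {
        "action": frame["clause"].lower(),
        "object": frame["topic"]["name"].lower(),
    }
    # Each predicate (FROM, TO) becomes one constraint on the topic.
    for pred in frame["topic"]["predicates"]:
        query[pred["name"].lower()] = pred["topic"]["value"]
    return query

print(frame_to_query(FRAME))
```

Keeping the frame separate from the flattened query means the same frame could also drive context resolution or response generation.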
Dialogue Modeling Strategies
• An effective conversational interface must incorporate
extensive and complex dialogue modeling
• Conversational systems differ in the degree to which
the human or the computer takes the initiative
Computer initiative:
• Computer maintains tight control
• Human is highly restricted
C: Please say the departure city.
Human initiative:
• Human takes complete control
• Computer is totally passive
H: I want to visit my grandmother.
• Our systems use a mixed initiative approach, where both
the human and the computer play an active role
Different Roles of Dialogue Management
• Pre-Retrieval: Ambiguous Input => Unique Query to DB
U: I need a flight from Boston to San Francisco
C: Did you say Boston or Austin? [Clarification (recognition errors)]
U: Boston, Massachusetts
C: I need a date before I can access Travelocity [Clarification (insufficient info)]
U: Tomorrow
C: Hold on while I retrieve the flights for you
• Post-Retrieval: Multiple DB Retrievals => Unique Response
C: I have found 10 flights meeting your specification. When would you like to leave? [Help the user narrow down the choices]
U: In the morning.
C: Do you have a preferred airline?
U: United
C: I found two non-stop United flights leaving in the morning…
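The two roles of dialogue management can be sketched as two small decision functions: one that chooses a clarification or request before the database is queried, and one that decides whether to narrow the results afterwards. This is a minimal sketch under stated assumptions: the slot names, the `max_to_read` threshold, and both function names are hypothetical and do not describe the MIT dialogue manager.

```python
REQUIRED_SLOTS = ("source", "destination", "date")  # assumed slot names

def pre_retrieval_action(slots, confusable=None):
    """Pick the next system move before querying the database.

    slots: dict of slots filled so far.
    confusable: dict mapping a slot to competing recognizer hypotheses.
    """
    confusable = confusable or {}
    # Clarification for recognition errors: "Did you say Boston or Austin?"
    for slot, candidates in confusable.items():
        if len(candidates) > 1:
            return ("clarify", slot, " or ".join(candidates))
    # Clarification for insufficient information: "I need a date ..."
    for slot in REQUIRED_SLOTS:
        if slot not in slots:
            return ("request", slot, None)
    return ("retrieve", None, None)

def post_retrieval_action(results, max_to_read=3):
    """After retrieval, narrow the choices if too many results came back."""
    if len(results) > max_to_read:
        return ("narrow", len(results))
    return ("present", results)

print(pre_retrieval_action({"destination": "San Francisco"},
                           {"source": ["Boston", "Austin"]}))
print(pre_retrieval_action({"source": "Boston",
                            "destination": "San Francisco"}))
print(post_retrieval_action(list(range(10))))
```

Ordering the checks this way resolves recognition ambiguity before asking for missing information, which mirrors the order of the example dialogue.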
Concatenative Speech Synthesis
• Output waveform is generated by concatenating segments from a
pre-recorded speech corpus.
• Concatenation can occur at the phrase, word, or sub-word level.
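A toy version of phrase- and word-level concatenation is sketched below. It assumes recorded segments are stored as lists of samples keyed by the words they cover, and it greedily prefers the longest recorded span. Practical systems also work at the sub-word level and weigh how well adjacent segments join; the function `synthesize` and the tiny `CORPUS` are hypothetical.

```python
def synthesize(words, corpus):
    """Concatenate recorded segments to cover a word sequence.

    corpus: dict mapping a tuple of words to a list of waveform samples.
    At each position the longest recorded span that matches is used, so
    phrase-level units are preferred over single words.
    """
    output, i = [], 0
    while i < len(words):
        for n in range(len(words) - i, 0, -1):
            key = tuple(words[i:i + n])
            if key in corpus:
                output.extend(corpus[key])
                i += n
                break
        else:
            # No recorded unit starts with this word.
            raise KeyError(f"no recorded unit covers {words[i]!r}")
    return output

# Hypothetical corpus with placeholder sample values.
CORPUS = {
    ("the", "price", "is"): [0.1, 0.2],
    ("the",): [9.0],
    ("8970",): [0.3],
    ("dollars",): [0.4, 0.5],
}

print(synthesize("the price is 8970 dollars".split(), CORPUS))
```

Preferring longer units reduces the number of joins, which is where audible discontinuities tend to occur.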
Synthesis Examples
The third ad is a 1996 black Acura Integra with 45380 miles.
The price is 8970 dollars. Please call (404) 399-7682.
• laboratory, from: labyrinth, abracadabra, obligatory
• computer, from: compassion, disputed, cedar city
• science, from: since, giant, since
Continental flight 4695 from Greensboro is expected in
Halifax at 10:08 pm local time.
Multilingual Conversational Interfaces
• Adopts an interlingua approach for multilingual human-machine interactions
• Applications:
– MuXing: Mandarin system for weather information
– Mokusei: Japanese system for weather information
– Spanish systems are also under development
– New speech-to-speech translation work (Phrasebook)
[Diagram: Hub connecting servers for Audio I/O, Speech Recognition, Language Understanding, Discourse Resolution, Dialogue Management, Application Back-end, Language Generation, and Text-to-Speech Conversion. Servers are attached to Models and Rules, and a legend marks components as Language Transparent, Language Independent, or Language Dependent.]
Bilingual Jupiter Demonstration
Multi-modal Conversational Interfaces
• Typing, pointing, clicking can augment/complement speech
• A picture (or a map) is worth a thousand words
• Applications:
– WebGalaxy
– Allows typing and clicking
– Includes map-based navigation
– With display
– Embedded in a web browser
– Current exhibit at MIT Museum
[Diagram: SPEECH RECOGNITION, HANDWRITING RECOGNITION, GESTURE RECOGNITION, and MOUTH & EYES TRACKING feed LANGUAGE UNDERSTANDING, which produces meaning]
WebGalaxy Demonstration
Delegating Tasks to Computers
• Many information-related activities can be done off-line
• Off-line delegation frees the user to attend to other matters
• Application: Orion system
– Task Specification: User interacts with Orion to specify a task
“Call me every morning at 6 and tell me the weather in Boston.”
“Send me e-mail any time between 4 and 6 p.m. if the traffic on Route 93 is at a standstill.”
– Task Execution: Orion leverages existing infrastructure to support interaction with humans
– Event Notification: Orion calls back to deliver information
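A delegated task such as the Route 93 request can be thought of as a time window, a condition, and a notification channel. The sketch below shows one possible encoding. It is an assumption for illustration, not Orion's design: the `DelegatedTask` class, its fields, and the `route_93_speed_mph` observation key are all hypothetical.

```python
from dataclasses import dataclass
from datetime import time
from typing import Callable

@dataclass
class DelegatedTask:
    description: str
    window_start: time                 # earliest time to notify
    window_end: time                   # latest time to notify
    condition: Callable[[dict], bool]  # test on the latest observation
    channel: str                       # e.g. "email" or "phone"

    def should_notify(self, now: time, observation: dict) -> bool:
        """Notify only inside the time window and when the condition holds."""
        in_window = self.window_start <= now <= self.window_end
        return in_window and self.condition(observation)

# "Send me e-mail any time between 4 and 6 p.m. if the traffic on
# Route 93 is at a standstill."
traffic_alert = DelegatedTask(
    description="Route 93 standstill alert",
    window_start=time(16, 0),
    window_end=time(18, 0),
    condition=lambda obs: obs.get("route_93_speed_mph", 99) == 0,
    channel="email",
)

print(traffic_alert.should_notify(time(17, 0), {"route_93_speed_mph": 0}))
```

Separating the condition from the channel lets the same specification be delivered by phone call or e-mail without changing the trigger.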
Audio Visual Integration
• Audio and visual signals both contain information about:
– Identity of the person: Who is talking?
– Linguistic message: What’s (s)he saying?
– Emotion, mood, stress, etc.: How does (s)he feel?
• The two channels of information
– Are often inter-related
– Are often complementary
– Must be consistent
• Integration of these cues can lead to enhanced
capabilities for future human-computer interfaces
Audio Visual Symbiosis
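[Diagram: the acoustic signal and the visual signal are combined for three kinds of information]
• Personal Identity: Speaker ID (acoustic) + Face ID (visual) => Robust Person ID
• Linguistic Message: Speech Recognition (acoustic) + Lip/Mouth Reading (visual) => Robust ASR
• Paralinguistic Information: Acoustic Paralinguistic Detection + Visual Paralinguistic Detection => Robust Paralinguistic Detection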
Multi-modal Interfaces: Beyond Clicking
• Inputs need to be understood in the proper context
“Are there any over here?”
– What does he mean by “any,” and what is he pointing at?
– Does this mean “yes,” “one,” or something else?
• Timing information is a useful way to relate inputs
“Move this one over there”
– Where is she looking or pointing at while saying “this” and “there”?
Multi-modal Fusion: Initial Progress
• All multi-modal inputs are synchronized
– Speech recognizer generates absolute times for words
– Mouse and gesture movements generate {x,y,t} triples
– Network Time Protocol (NTP) is used for msec time resolution
• Speech understanding constrains gesture interpretation
– Initial work identifies an object or a location from gesture inputs
– Speech constrains what, when, and how items are resolved
– Object resolution also depends on information from application
[Diagram: along a shared time axis, the speech input “Move this one over here” is aligned with two pointing events, an (object) and a (location)]
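Time-based fusion of speech and pointing can be sketched as follows. The sketch assumes both input streams already share one clock, that the recognizer supplies start and end times per word, and that pointing arrives as (x, y, t) triples. The `DEICTIC_WORDS` set and the function `align_deictics` are hypothetical, and nearest-in-time matching is only one simple way to relate the two streams.

```python
DEICTIC_WORDS = {"this", "that", "here", "there"}  # assumed word list

def align_deictics(words, pointing):
    """Pair each deictic word with the pointing sample nearest in time.

    words: list of (word, start_time, end_time), times in seconds.
    pointing: list of (x, y, t) triples on the same clock.
    Returns a list of (word, (x, y)).
    """
    pairs = []
    for word, start, end in words:
        if word.lower() in DEICTIC_WORDS and pointing:
            midpoint = (start + end) / 2
            x, y, _ = min(pointing, key=lambda p: abs(p[2] - midpoint))
            pairs.append((word, (x, y)))
    return pairs

# Hypothetical timings for "Move this one over here".
WORDS = [("Move", 0.0, 0.3), ("this", 0.3, 0.5), ("one", 0.5, 0.7),
         ("over", 0.7, 0.9), ("here", 0.9, 1.2)]
POINTING = [(100, 200, 0.42), (250, 120, 0.75), (400, 50, 1.0)]

print(align_deictics(WORDS, POINTING))
```

In this toy run "this" resolves to the first pointing sample (the object) and "here" to the last (the location); a fuller system would also use the application's object inventory to decide what each coordinate refers to.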
Multi-modal Demonstration
• Manipulating planets in a solar-system application
• Created with the SpeechBuilder utility with small changes
• Gestures from vision (Darrell & Demirdjian)
Summary
• Speech and language are inevitable, given:
– The need for mobility and connectivity
– The miniaturization of computers
– Humans’ innate desire to speak
• Progress has been made, e.g.,
– Understanding and responding in constrained domains
– Incorporating multiple languages and modalities
– Automation and delegation
– Rapid system configuration
• Much interesting research remains, e.g.,
– Audiovisual integration
– Perceptual user interfaces
The Spoken Language Systems Group
Research
Scott Cyphers
James Glass
T.J. Hazen
Lee Hetherington
Joseph Polifroni
Shinsuke Sakai
Stephanie Seneff
Michelle Spina
Chao Wang
Victor Zue
Administrative
Marcia Davidson
Ph.D.
Edward Filisko
Karen Livescu
Alex Park
Mitchell Peabody
Ernest Pusateri
Han Shu
Min Tang
Jon Yi
Visitors
Paul Brittain
Thomas Gardos
Rita Singh
S.M.
Alicia Boozer
Brooke Cowan
John Lee
Laura Miyakawa
Ekaterina Saenko
Sy Bor Wang
M.Eng.
Chian Chu
Chia-Huo La
Jonathon Lau
Post-Doctoral
Tony Ezzat