Speech Technologies and
VoiceXML
Chun-Feng Liao
NCCU Department of Computer Science
Intelligent Media Lab
[email protected]
Presentation Agenda
 Voice technology background
• ASR/TTS
 Voice browsing with VoiceXML
 VoiceXML architecture
 VoiceXML programming
 Future of VoiceXML
 Summary
Reference
 [1] Bob Edgar (2001), "The VoiceXML Handbook", NY: CMP Books.
 [2] Dave Raggett (2001), "Getting started with VoiceXML 2.0", W3C.
 [3] Sun Microsystems (1998), "Java Speech Grammar Format Specification v1.0", Sun Microsystems.
 [4] Chetan Sharma and Jeff Kunins (2002), "VoiceXML: Strategies and Techniques for Effective Voice Application Development with VoiceXML 2.0", Wiley.
 [5] Brian Eberman, Jerry Carter, Darren Meyer, David Goddeau (2002), "Building VoiceXML Browsers with OpenVXI", NY: ACM Press.
 [6] Microsoft (2002), "Speech Technology Overview", http://www.microsoft.com/speech/evaluation/techover/
 [7] VoiceGenie Technologies Inc. (2001), "White Paper: Speaking Freely About The VoiceGenie VoiceXML Gateway and the VoiceXML Interpreter", VoiceGenie Technologies Inc.
 [8] W3C (2002), "VoiceXML Specification v2.0", W3C.
Voice Technologies
 In the mid- to late 1990s, personal computers became powerful enough to support automatic speech recognition (ASR).
 The two key underlying technologies behind these advances are speech recognition (SR) and text-to-speech synthesis (TTS).
Speech Recognition
Source: Microsoft Speech.NET Home (http://www.microsoft.com/speech/)
Speech Synthesis
Source: Microsoft Speech.NET Home (http://www.microsoft.com/speech/)
Pervasive Computing Model
 E-business has moved from a client-server model to a web-centric model.
 Once connected to the Internet, people can get any information they want, but they want more convenient ways to connect.
 Lou Gerstner, CEO of IBM: the pervasive computing model is a billion people interacting with a million e-businesses through a trillion interconnected devices.
Voice Browsing
 VoiceXML instead of HTML
 A voice browser instead of an ordinary
web browser
 Phone instead of PC.
VoiceXML Key Design Issues
 Speech Input: speech recognition and
DTMF
 Speech Output: pre-recorded audio and
synthesized speech
 Internet: XML, IP, HTTP, SSL, JavaScript
 Telephony: call transfer, data passing
W3C Voice Browser Working
Group
 Founded May 1999
 60 company members
 Mission — Standards group to prepare
and review markup languages to enable
internet-based speech applications
 http://www.w3.org/Voice
VoiceXML Forum
 Industry Group to promote VoiceXML
 550+ member companies
 Submitted VoiceXML 1.0 to W3C in
May 2000
 http://www.voicexml.org
• VoiceXML v1.0 (May 2000): VoiceXML Forum; specification submitted to the W3C
• VoiceXML v2.0: W3C Voice Browser Working Group; 50+ members collaborating; addressed 400+ change requests
VoiceXML Overview
 A language for specifying voice dialogs.
 Voice dialogs use audio prompts and text-to-speech
(TTS) for output; touch-tone keys (DTMF) and
automatic speech recognition (ASR) for input.
 Main input/output device (initially) is the phone.
 Leverages the Internet for application development
and delivery.
 A standard language enables portability (VoiceXML unified the dialog description language).
VoiceXML Platform
Architecture
VoiceXML Platform
Architecture-1
 Telephone and telephone network: connect the caller's telephone to the telephony server
 VoiceXML Gateway
• Voice browser
• Audio input: speech recognition (ASR), touch-tone (DTMF), audio recording
• Audio output: audio playback, speech synthesis (TTS)
• Interface, call controls
VoiceXML Platform
Architecture-2
 VoiceXML documents
• Dialog and flow control
• Client-side scripting (ECMAScript)
• Speech recognition grammar
• Speech synthesis pronunciation control
 Document servers (web servers)
• Serve static VoiceXML documents or audio files
 Application servers
• Generate VoiceXML documents dynamically
• Run server-side application logic
• Connect to a database or a database interface
Example
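weather.jsp - VoiceXML and JSP
The web server + Servlet/JSP engine (backed by the DB) runs the JSP source:
<% user.storePreference("try"); %>
<form>
  <block>
    Today's temperature is
    <%= weather.getTemp() %>
    degrees
  </block>
</form>
The VoiceXML browser on the voice gateway receives the generated VoiceXML:
<form>
  <block>
    Today's temperature is 25 degrees
  </block>
</form>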
Implementations of VoiceXML
Gateways
 In Taiwan:
• Yes Mobile
• Chunghwa Telecom Laboratories (second-generation voice platform)
• eWings Technologies, Inc
 Free
• IBM VoiceServerSDK
 Open Source
• CMU:OpenVXI
[DEMO]
A Simple VoiceXML
Application
DEMO
 A simple VoiceXML application that introduces the Department of Computer Science.
 Experience shows that building a corresponding HTML version first is helpful.
Document
 A VoiceXML document defines one or more dialogs
 The user is in exactly one dialog at any time
 Each dialog specifies the next dialog to transition to using a URL
doc1.vxml
• Dialog 1, transition: #dialog2
• Dialog 2, transition: http://xyz.com/doc2.vxml
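The doc1.vxml picture above can be sketched as a complete document. This is a minimal illustration, assuming the dialog ids dialog1 and dialog2 and the prompt wording, none of which come from the original slide. It uses <goto> for both kinds of transition: a fragment for a dialog in the same document, and a full URL for another document.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <!-- Dialog 1: transitions to another dialog in the same document -->
  <form id="dialog1">
    <block>
      <prompt>Welcome.</prompt>
      <goto next="#dialog2"/>
    </block>
  </form>

  <!-- Dialog 2: transitions to a different document by URL -->
  <form id="dialog2">
    <block>
      <prompt>Moving on to the next document.</prompt>
      <goto next="http://xyz.com/doc2.vxml"/>
    </block>
  </form>
</vxml>
```

The interpreter starts with the first dialog in document order, so the caller hears dialog1 before dialog2.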
Dialog
 A Dialog describes an interaction
between a user and the system
 Two kinds of dialogs: form and menu
VoiceXML Document Structure.
Form
 A form keeps collecting the information for its fields, as defined by their grammars.
<form>
  <field name="travellers">
    <!-- input -->
    <grammar mode="voice" src="./number.grxml"/>
    <!-- output -->
    <prompt>How many are travelling?</prompt>
    <!-- eval -->
    <filled>
      <submit next="http://travel.com/order"/>
    </filled>
  </field>
</form>
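The form refers to an external grammar file, number.grxml, whose contents the slides do not show. As a rough sketch of what such a file might contain, here is a small grammar in the W3C SRGS XML format. The rule id and the list of words are assumptions for illustration; a real grammar would cover more numbers and would usually add semantic tags.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<grammar xmlns="http://www.w3.org/2001/06/grammar"
         version="1.0" xml:lang="en-US" mode="voice" root="number">
  <!-- Hypothetical rule: accepts exactly one of the listed words -->
  <rule id="number" scope="public">
    <one-of>
      <item>one</item>
      <item>two</item>
      <item>three</item>
      <item>four</item>
      <item>five</item>
    </one-of>
  </rule>
</grammar>
```

With no semantic tags, the field variable typically receives the recognized words themselves.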
Menu
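<menu id="commands">
  <prompt>What service would you like?</prompt>
  <choice next="/cars">Car hire</choice>
  <choice next="/hotels">Hotel reservations</choice>
  <choice next="/news">Today's news</choice>
</menu>
 A menu is in effect a form with no named fields: shorthand for a form with a single anonymous field.
 A menu is a flow-control construct: it transitions to a different URL depending on the user's choice.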
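Since VoiceXML accepts DTMF as well as speech, the same menu can be offered to touch-tone callers. The sketch below assumes the wording of the prompts. It relies on the dtmf attribute of <menu> and on the _prompt and _dtmf variables that <enumerate> provides for each choice.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <!-- dtmf="true" assigns the keys 1, 2, 3, ... to the choices in order -->
  <menu id="commands" dtmf="true">
    <prompt>
      What service would you like?
      <!-- The template is repeated once per choice -->
      <enumerate>
        For <value expr="_prompt"/>, press <value expr="_dtmf"/>.
      </enumerate>
    </prompt>
    <choice next="/cars">Car hire</choice>
    <choice next="/hotels">Hotel reservations</choice>
    <choice next="/news">Today's news</choice>
    <!-- Played when the caller says and presses nothing -->
    <noinput>Please say or key in a service. <reprompt/></noinput>
  </menu>
</vxml>
```

Callers can still speak a choice; the key presses are an additional input path.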
Submit
 Typically used to send results from the client to the server
 Syntax:
<submit next="URI" namelist="var1 var2 ..."/>
 namelist: specifies the fields to pass to the next page.
Submit, Example
<form>
  <field name="dest_city">
    <prompt>Where do you want to go to?</prompt>
    <grammar mode="voice" src="./cities.grxml"/>
  </field>
  <field name="travellers">
    <prompt>How many are travelling to <value expr="dest_city"/>?</prompt>
    <grammar mode="voice" src="./number.grxml"/>
  </field>
  <!-- Runs once both fields are filled -->
  <filled>
    Thank you. Your order is now being processed.
    <submit next="http://travel.com/order" namelist="dest_city travellers"/>
  </filled>
</form>
Variables
 Variables can be manipulated and referenced
• Declare: <field name="user2">
• Assign: <assign name="user1" expr="'peter'"/>
• Clear: <clear namelist="user1 user2"/>
• Reference: How many are travelling to <value expr="dest_city"/>?
- No $ prefix is needed when referencing a variable
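The four operations can be seen together in one small form. This is a sketch only: the grammar file names.grxml, the variable party and the prompt wording are assumptions made for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="names">
    <!-- Declare: <var> declares a variable, here with an initial value -->
    <var name="user1" expr="'peter'"/>
    <var name="party"/>
    <!-- Declare: a field also declares a variable, filled by the caller -->
    <field name="user2">
      <prompt>Who is travelling with <value expr="user1"/>?</prompt>
      <grammar mode="voice" src="./names.grxml"/>
      <filled>
        <if cond="user2 == user1">
          <prompt>That is the same person. Please name someone else.</prompt>
          <!-- Clear: resets user2, so the field is visited again -->
          <clear namelist="user2"/>
        <else/>
          <!-- Assign: expr is an ECMAScript expression -->
          <assign name="party" expr="user1 + ' and ' + user2"/>
          <!-- Reference: plain variable names, no $ prefix -->
          <prompt><value expr="party"/> are travelling.</prompt>
        </if>
      </filled>
    </field>
  </form>
</vxml>
```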
Variable Scope
 Scopes, from outermost to innermost: session, application, document, dialog, anonymous
 Session variables are "read-only" variables provided by the interpreter context
 The anonymous scope is defined by the element containing executable content (<block>, <filled> or an event handler)
 A variable name is searched for from the innermost scope outward
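A short sketch can make the lookup order concrete. It declares the same name in three scopes and reads each one back. The variable name greeting and its values are assumptions for illustration; the dialog. and document. prefixes are the scope-qualified names that VoiceXML 2.0 defines.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <!-- Document scope: visible to every dialog in this document -->
  <var name="greeting" expr="'Hello'"/>

  <form id="scopes">
    <!-- Dialog scope: visible only inside this form -->
    <var name="greeting" expr="'Hi'"/>

    <block>
      <!-- Anonymous scope: exists only while this block executes -->
      <var name="greeting" expr="'Hey'"/>
      <prompt>
        Unqualified name: <value expr="greeting"/>.
        Dialog scope: <value expr="dialog.greeting"/>.
        Document scope: <value expr="document.greeting"/>.
      </prompt>
    </block>
  </form>
</vxml>
```

The unqualified reference should resolve to the innermost declaration, while the qualified references reach the shadowed variables in the outer scopes.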
Error Handling: Events
 Events are used to signal "unexpected" situations
 Events are caught by a <catch> event handler
• <catch event="com.acme.mailreader">...</catch>
• <catch event="nomatch noinput">...</catch>
• Shortcut: <nomatch> is equivalent to <catch event="nomatch">
• Other shortcuts: <noinput>, <error>
Events, Example
<field name="dest_city">
  <prompt>Where do you want to go to?</prompt>
  <grammar mode="voice" src="./cities.grxml"/>
  <nomatch>
    Please say the city you want to fly to.
  </nomatch>
</field>
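Handlers can also react differently as the same event repeats. The following sketch extends the field with a <noinput> handler and with count attributes on <nomatch>. The prompt wording and the decision to give up on the third failure are assumptions, not part of the original example.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="destination">
    <field name="dest_city">
      <prompt>Where do you want to go to?</prompt>
      <grammar mode="voice" src="./cities.grxml"/>
      <!-- Silence: apologise, then replay the field prompt -->
      <noinput>
        I did not hear anything. <reprompt/>
      </noinput>
      <!-- First and second recognition failures -->
      <nomatch count="1">
        Please say the city you want to fly to.
      </nomatch>
      <!-- Third failure: end the session -->
      <nomatch count="3">
        Sorry, I still did not understand. Goodbye.
        <exit/>
      </nomatch>
    </field>
  </form>
</vxml>
```

The interpreter picks the handler with the highest count that does not exceed the number of times the event has occurred, so the count="1" handler also covers the second failure.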
Multimodal Web Browsing
 xHTML + VoiceXML
 SALT
[DEMO]
Multimodal Browsing
Future of the “Voice” web
and VoiceXML
 Sun/SpeechWorks (1999): JSML, JSGF
 VoiceXML Forum (2000): VoiceXML 1.0
 W3C (2003, in CR): VoiceXML 2.0
 W3C: speech synthesis (SSML), speech recognition grammar, speech semantics, NLP, pronunciation lexicon [early], call control [early], voice browser interoperation [early], VoiceXML 3?
 Microsoft-led (2002): SALT (Speech Application Language Tags)
Conclusion
 Speech is the most natural way for humans to communicate, so it will become an important mode of human-computer interaction (HCI).
 VoiceXML has revolutionized speech
recognition & telephony application
development & deployment.
Q&A
Backup
History of VoiceXML
Source:VoiceXML forum(http://www.voicexml.org)
Show : VoiceXML in Daily Life
Classification of Voice
Application
 Basic interactive voice response (IVR)
• Computer: “For stock quotes, press 1. For
trading, press 2. …”
• Human: (presses DTMF “1”)
 Basic speech ASR
• C: “Say the stock name for a price quote.”
• H: “Lucent Technologies”
Classification of Voice
Application
 Advanced speech ASR
• C: “Stock Services, how may I help you?”
• H: “Uh, what’s Lucent trading at?”
 “Near-natural language” ASR
• C: “How may I help you?”
• H: “Um, yeah, I’d like to get the current price of
Lucent Technologies”
• C: “Lucent is up two at sixty eight and a half.”
• H: “OK. I want to buy one hundred shares at
market price.”
• C: “…”
Speech Recognition
 Capturing the analog speech signal
 Digitizing the sound waves and converting them to basic language units (phonemes)
 Constructing words from the phonemes
 Contextually analyzing the words to ensure correct spelling for words that sound alike (such as "write" and "right")
Speech Synthesis
 Speech Synthesis, or text-to-speech, is
the process of converting text into
spoken language.
• Breaking down the words into phonemes;
• Analyzing for special handling of text such
as numbers, currency amounts.
• Generating the digital audio for playback.
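An application can guide the special handling of text such as numbers and currency amounts with markup. The sketch below uses the W3C Speech Synthesis Markup Language (SSML). The sentence is invented for illustration, and the interpret-as values are assumptions: SSML 1.0 does not fix a list of them, so the accepted values depend on the platform.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US">
  <!-- Hint that this text is a currency amount (platform-dependent value) -->
  The fare is <say-as interpret-as="currency">$120.50</say-as>
  <!-- Short pause before the next phrase -->
  <break time="300ms"/>
  for <say-as interpret-as="cardinal">2</say-as> travellers.
  <!-- Speak the alias instead of spelling out the abbreviation -->
  This standard is published by the
  <sub alias="World Wide Web Consortium">W3C</sub>.
</speak>
```

Inside a VoiceXML document the same elements can appear within <prompt>.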
VoiceXML Gateway(detail)
Programming VoiceXML
 Writing a VoiceXML application is
programming.
 Control constructs are procedural (if/else etc.)
 VoiceXML platform iterates through a
<form> until values for all field items
have been collected
VoiceXML System Components
 VoiceXML servers serve as integrators of various hardware and software (CT integration)
 Connected telephony equipment: PBX, telecom boards, call centre
 Software utilities: speech synthesis (TTS), speech recognition (SR), speech grammars, voice biometrics
FIA - Form Interpretation
Algorithm
 The FIA has a main loop that repeatedly
selects a form item and then visits it
 The first form item (in document order) whose form item variable is undefined is selected
 As a result, the user is prompted for each
field item in turn
FIA – Form Example
<form>
  <!-- A <prompt> cannot sit directly under <form>, so it is wrapped in a <block> -->
  <block>
    <prompt>Where do you want to go to and how many are travelling?</prompt>
  </block>
  <!-- Field item 1 -->
  <field name="dest_city">
    <prompt>Where do you want to go to?</prompt>
    <grammar mode="voice" src="./cities.grxml"/>
  </field>
  <!-- Field item 2 -->
  <field name="travellers">
    <prompt>How many are travelling to your destination?</prompt>
    <grammar mode="voice" src="./number.grxml"/>
  </field>
  <!-- other fields -->
</form>
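Besides an undefined variable, VoiceXML 2.0 lets a form item carry a cond attribute that the FIA checks before selecting it. The sketch below shows one way this might be used. The field group_discount, the threshold of 5 and the built-in number and boolean types are assumptions for illustration, and support for built-in types varies by platform.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="trip">
    <!-- Selected first: its variable is undefined -->
    <field name="travellers" type="number">
      <prompt>How many are travelling?</prompt>
    </field>

    <!-- Selected only when the guard condition is true -->
    <field name="group_discount" type="boolean"
           cond="travellers &gt; 5">
      <prompt>Would you like to hear about group discounts?</prompt>
    </field>

    <!-- A block is a form item too; it runs once and is then marked done -->
    <block>
      <prompt>Thank you.</prompt>
    </block>
  </form>
</vxml>
```

For a party of five or fewer, the loop skips group_discount and goes straight to the closing block.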
if, else and elseif
<form>
  ...
  <filled>
    <if cond="travellers &gt; 10">
      Sorry, we cannot handle groups larger than 10 persons.
      <clear namelist="travellers"/>
    <elseif cond="travellers &gt; 5 &amp;&amp; dest_city == 'London'"/>
      Sorry, we cannot handle groups larger than 5 persons travelling to London.
      <clear namelist="dest_city travellers"/>
    <else/>
      <submit next="http://travel.com/order"/>
    </if>
  </filled>
</form>
JSML - JSpeech Markup
Language
 Developed by Sun and SpeechWorks, as a
markup language for text-to-speech dialogs.
 Based on the Java Speech API Markup
Language
http://java.sun.com/products/javamedia/speech/
 Text annotation to provide hints to speech
synthesizers
• Aimed at making TTS speech more natural, more
understandable
JSGF - JSpeech Grammar Format
 Developed by Sun and SpeechWorks as a syntax for expressing speech grammars
 Based on the Java Speech API Grammar Format
http://java.sun.com/products/javamedia/speech/
Microsoft’s SALT
 Speech Application Language Tags
• Microsoft, Cisco, Intel, Comverse, SpeechWorks,
Philips
 A “lightweight” set of tags designed to be
used with HTML and XHTML to enable
lightweight telephony applications driven
from regular Web documents.
 Targeted at supporting multimodal access