Tools for
Sound, Speech, and Multimodal Interaction
Johnny Lee
05-830 Advanced UI Software
• Authoring Tools
– Recording, Playback
– SFX libraries
– Editing, Mixing
• Developer Tools
– Software APIs
– FFT libraries
Recording Sound
Most laptops have built-in mono microphones
Playing Sound
Most laptops have built-in speakers
Multichannel Audio
• ProTools by Digidesign – up to 64 channels of 24-bit, 48 kHz audio I/O
Sound Libraries
• SoundIdeas
– General 6000
– Hanna-Barbera
• Lots of other smaller suppliers of stock
sound libraries
Editing/Mixing Sounds
• LogicAudio, SoundForge, Peak, SoundEdit16, many others
• Edits sound much like a text editor edits text.
• Sophisticated DSP (in some packages)
• Synchronization with video and
MIDI support
MIDI
• “Musical Instrument Digital Interface”
• Hardware communication layer
– 5-pin din, uni-directional with pass-thru
• Software protocol layer
MIDI Commands are 2-3 bytes
Note specification
Device configuration (128 controllers)
Device Control/Synchronization
• Lots of general purpose fields
• Simple electronics (2 resistors and a PIC)
• Semi-popular option for simple
control/robotics applications.
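As a concrete sketch of the 2-3 byte format above, a Note On message packs a status byte (command nibble plus channel number) and two data bytes (note number and velocity), per the MIDI 1.0 byte layout:

```python
def note_on(channel: int, note: int, velocity: int) -> bytes:
    """Build a 3-byte MIDI Note On message.

    Status byte: high nibble 0x9 (Note On), low nibble = channel (0-15).
    Data bytes: note number and velocity, each 0-127.
    """
    assert 0 <= channel <= 15 and 0 <= note <= 127 and 0 <= velocity <= 127
    return bytes([0x90 | channel, note, velocity])

def note_off(channel: int, note: int) -> bytes:
    """Note Off uses status nibble 0x8; velocity 0 is conventional."""
    assert 0 <= channel <= 15 and 0 <= note <= 127
    return bytes([0x80 | channel, note, 0])

# Middle C (note 60) at moderate velocity on channel 0
msg = note_on(0, 60, 64)   # three bytes: 0x90, 0x3C, 0x40
```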
MOD files
• File size can be tiny if a MIDI synthesizer is used at playback time.
• Playback quality depends on the quality
of the synthesizer
• MOD files (module format) combine
MIDI data with WAV samples to produce
high quality consistent playback in a
relatively small file.
Software APIs for sound
Microsoft – DirectX 9.0
• DirectX is:
DirectDraw – 2D drawing
Direct3D – 3D drawing
DirectInput – input/haptic devices
DirectPlay – network gaming
DirectShow – video streams
DirectSound – wave audio I/O
DirectMusic – soundtrack management and MIDI
DirectSetup – DirectX installation routines
DirectSound
• WAV capture
• Multi-channel sound playback
• Full duplex
• 3D specification of sound sources.
• Some real-time DSP: Chorus, Compression, Flange, Distortion, Echo, etc.
DirectMusic
• Coordinates several sound files (MIDI, wav, etc.) into “soundtracks”.
• Sequencing (timelines, cueing, etc.)
• Supports dynamic composition, variation, and transitioning between songs/parts.
• Dynamic content authored in DirectMusic Producer
• Compositions can be made with DLS (downloadable sound) files – a cross-platform “smart” audio file format designed for dynamic loading in interactive applications
• DLS = MIDI + WAV for interactive apps
MacOS X – Core Audio
• Sound Manager – routines for resource management and playing/recording sound
• AudioToolbox – sophisticated DSP architecture
• MIDI Services – device abstraction, control, and I/O
• Audio HAL – medium-level I/O access (real-time, low-latency, multi-channel; floating point is standard)
• IOKit – low-level device access
• Drivers, Hardware - blarg
• Full Java API provided
• Basic data structures and routines for loading,
playing, and stopping sounds.
• I/O device access is somewhat limited.
• I’ve been told that synchronization is a problem.
Voice as Sound
• “Voice as sound: using non-verbal voice input for interactive control.” Takeo Igarashi, John F. Hughes. UIST 2001, pp. 155-156.
• STFT, FFT analysis
• Extension to SUITEKeys
Fourier Transform (FT)
• Simple “properties” of a sound, such as duration and volume, can be read directly from the data file.
• More interesting analysis requires some
DSP – mainly Fourier Transform.
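For example, duration and peak level fall straight out of a WAV file's header and samples. A self-contained sketch using Python's standard wave module (the 440 Hz tone is generated in memory just to have data to inspect):

```python
import io
import math
import struct
import wave

RATE, SECS, FREQ = 8000, 0.5, 440

# Write a short 16-bit mono sine tone so the example is self-contained.
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)                       # 16-bit samples
    w.setframerate(RATE)
    frames = [int(20000 * math.sin(2 * math.pi * FREQ * t / RATE))
              for t in range(int(RATE * SECS))]
    w.writeframes(struct.pack("<%dh" % len(frames), *frames))

# The "simple properties" come straight from the header and raw samples.
buf.seek(0)
with wave.open(buf, "rb") as w:
    n = w.getnframes()
    duration = n / w.getframerate()                        # seconds
    samples = struct.unpack("<%dh" % n, w.readframes(n))
peak = max(abs(s) for s in samples)                        # crude volume measure
```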
Fourier Transform
• FT extracts the frequency content from
a given segment of audio.
Fast Fourier Transform (FFT)
• The FFT is a fast algorithm for computing the discrete Fourier transform
• Implementations available in most languages and math libraries
• Good reference source: Numerical
Recipes in C++
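A minimal illustration of the algorithm, as a textbook radix-2 Cooley-Tukey recursion (not production code; real applications should use a tuned library):

```python
import cmath
import math

def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        tw = cmath.exp(-2j * cmath.pi * k / n) * odd[k]  # twiddle factor
        out[k] = even[k] + tw
        out[k + n // 2] = even[k] - tw
    return out

# One full cosine cycle over 8 samples: all energy lands in bins 1 and 7.
signal = [math.cos(2 * math.pi * k / 8) for k in range(8)]
spectrum = fft(signal)
```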
Speech (spēch)
Speech Synthesis
Three categories of speech synthesizers:
• Articulatory synth – uses a physical model of the physiology of speech production and the physics of sound generation in the vocal tract.
• Formant synth – acoustic-phonetic approach to synthesis. Applies hundreds of “filters” loosely associated with the movement of articulators, using rules.
• Concatenative synth – a segmental database that reflects the major phonological features of a language. Creates smooth transitions and basic processing to match prosodic patterns.
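A toy sketch of the concatenative idea (the unit arrays below are made-up stand-ins for a real segment database): stored waveform units are joined with a short linear crossfade so the seam blends rather than jumps:

```python
def crossfade_concat(a, b, overlap):
    """Join two sample lists, linearly crossfading `overlap` samples at the seam."""
    head = a[:len(a) - overlap]
    seam = [a[len(a) - overlap + i] * (1 - i / overlap) + b[i] * (i / overlap)
            for i in range(overlap)]
    return head + seam + b[overlap:]

# Two toy "units" standing in for database segments.
unit_a = [0.0, 0.5, 1.0, 1.0]
unit_b = [1.0, 1.0, 0.5, 0.0]
joined = crossfade_concat(unit_a, unit_b, 2)
# The two overlapping samples are blended instead of abutted.
```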
AT&T Natural Voices
• US English, UK English, French, Spanish,
German, Korean
• Can build a new voice font from an existing voice
• Examples:
Male Voice
Custom UK English
Voice Font
Phoenix Semantic Frame Parser
• Center for Spoken Language Research,
University of Colorado, Boulder
• System for processing and parsing
natural language
Details and Syntax for creating
frames and networks:
Universal Speech Interfaces
Universal speech interfaces. Ronald Rosenfeld, Dan Olsen, Alex Rudnicky. Interactions, October 2001, Volume 8, Issue 6.
• “In essence, we attempt to do for speech what Palm’s Graffiti™
has done for mobile text entry. “
“Speech is an ambient medium.”
“Speech is descriptive rather than referential.”
“Speech requires modest physical resources.”
“Only speech will scale as digital technology progresses.”
• Three speech interaction techniques: Natural Language (NLI, NLP), Dialog Trees, Command and Control
Universal Speech Interfaces
• “Look and Feel” :: “Sound and Say”
• Universal Metaphors – familiar ways of doing
things across applications.
• Universal User Primitives – standard dialog interaction techniques: detecting and recovering from errors, asking for help, navigation, etc.
• Universal Machine Primitives – standardize machine responses and meanings to increase user comprehension.
Java Speech
• JSAPI – Java Speech API
– Speech Generation
Structure Analysis – Java Synthesis Markup Language (JSML)
Text Pre-Processing – abbreviations, acronyms, “1998”
Text-to-Phoneme Conversion
Prosody Analysis
Waveform Production
– Speech Recognition
Grammar Design - Java Speech Grammar Format (JSGF)
Signal Processing
Phoneme Recognition
Word Recognition
Result Generation
Windows .NET Speech SDK
• Basically the .NET-ified SAPI 5.1 (Speech API)
• Continuous Speech Recognition (US English,
Japanese, and Simplified Chinese)
• Concatenative Speech Synthesis (US English
and Simplified Chinese)
• Interface is broken into two components:
– Application Programming Interface (API)
– Device Driver Interface (DDI)
Windows .NET Speech SDK
• Speech Synthesis API
– ISpVoice::Speak(“my text”, voice);
• Speech Synthesis DDI
– Parses text into an XML doc
– Calls the TTSEngine
– Manages sound and threading details
Windows .NET Speech SDK
• Speech Recognition API
Define context
Define grammar
Request type (dictation or command/control)
Event is fired when recognized
• Speech Recognition DDI
– Interfacing and configuring the SREngine
– Manages sound and threading details.
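The API flow above (define a grammar, request command/control recognition, event fired on a match) can be sketched in a language-agnostic way; the class and rule names below are hypothetical illustrations, not actual SAPI calls:

```python
import re

class CommandGrammar:
    """Toy command/control grammar: phrase patterns mapped to handlers."""
    def __init__(self):
        self.rules = {}                     # compiled pattern -> handler

    def add_rule(self, pattern, handler):
        self.rules[re.compile(pattern, re.IGNORECASE)] = handler

    def recognize(self, utterance):
        """Simulates the 'event is fired when recognized' step."""
        for pattern, handler in self.rules.items():
            m = pattern.fullmatch(utterance)
            if m:
                handler(*m.groups())        # fire the event
                return True
        return False                        # no rule matched: rejected

events = []
grammar = CommandGrammar()
grammar.add_rule(r"open (\w+)", lambda name: events.append(("open", name)))
grammar.add_rule(r"close window", lambda: events.append(("close", None)))

grammar.recognize("Open Mail")              # fires the "open" handler
```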
Windows .NET Speech SDK
• Speech Application Language Tags (SALT) – extension to HTML for speech integration into webpages
• Speech Recognition Grammar Specification
(SRGS) support for field parsing
• Telephony Controls – interfaces with telephone technology to develop voice-only applications
MacOS X Speech
Barely changed since 1996, MacInTalk 3
US English only
Full Java API
Speech Synthesis Manager (PlainTalk)
– algorithmic voice generation
• Speech Recognition Manager
OS wide push-to-talk Command/Control
Customizable vocabulary w/scripting
Uses “Language Model” = grammar
No dictation support
Dragon Naturally Speaking
• Commercial Recognition software
– Dictation
– Command and control
• API available for developers for
application integration
CMU Sphinx
• Open source speech recognizer from CMU
• Auto-builds language model/grammar & vocabulary from examples
• CMU-Cambridge Statistical Language Modeling Toolkit – semi-machine learning algorithms for digesting a large example corpus into a usable language model
• Uses CMU Pronouncing Dictionary
• SphinxTrain - builds new acoustic models
– Audio recording, transcript, pronunciation
dictionary/vocabulary, phoneme list
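A toy version of what “digesting a corpus into a language model” means (the real CMU-Cambridge toolkit adds smoothing, count cutoffs, and ARPA-format output; this just turns bigram counts into conditional probabilities):

```python
from collections import Counter

def bigram_model(corpus):
    """Count bigrams over sentences and convert counts to P(w2 | w1)."""
    bigrams, unigrams = Counter(), Counter()
    for sentence in corpus:
        words = ["<s>"] + sentence.lower().split() + ["</s>"]
        for w1, w2 in zip(words, words[1:]):
            bigrams[(w1, w2)] += 1
            unigrams[w1] += 1
    return {pair: n / unigrams[pair[0]] for pair, n in bigrams.items()}

# A tiny example corpus of command utterances.
corpus = ["open the file", "close the file", "open the window"]
model = bigram_model(corpus)
# e.g. P(file | the) = 2/3, since "the" is followed by "file" 2 times out of 3.
```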
SUITEKeys
• Manaris, B., McCauley, R., MacGyvers, V., An Intelligent Interface for Keyboard and Mouse Control – Providing Full Access to PC Functionality via Speech, Proceedings of the 14th International Florida AI Research Symposium.
• Developed for individuals with motor disabilities.
• Interface layer that generates keyboard and mouse events for
the OS
– Recognizes keyboard strokes/operations: backspace, function twelve, control-alt-delete, page down, press… release
– Recognizes mouse buttons and movement: left-click, move down… stop, 2 units above clock, move to 5-18
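The interface layer can be sketched as a mapping from recognized phrases to synthetic keyboard/mouse events; the function and event names below are hypothetical illustrations, not the actual SUITEKeys implementation:

```python
def interpret(utterance):
    """Map a recognized phrase to a synthetic keyboard/mouse event (toy names)."""
    words = utterance.lower().split()
    if words == ["left-click"]:
        return ("mouse", "left_click")
    if len(words) == 3 and words[0] == "move" and words[2].isdigit():
        return ("mouse", "move", words[1], int(words[2]))   # e.g. "move down 5"
    if words == ["page", "down"]:
        return ("key", "PageDown")
    return None   # unrecognized: fall through to normal dictation
```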
SUEDE
Scott R. Klemmer, Anoop K. Sinha, Jack Chen, James A. Landay, Nadeem Aboobaker, Annie Wang. Proceedings of the 13th annual ACM symposium on User Interface Software and Technology, November 2000.
• Wizard of Oz tool for prototyping speech interfaces
• Allows the developer to quickly generate a
state machine representing the possible
paths through a speech interface and stores
recorded system responses.
• Operator simulates a functional system
during evaluation by stepping through the
state machine.
• Runtime transcripts are recorded for later analysis.
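The idea above can be sketched as a small state machine (hypothetical names, not the actual SUEDE implementation): the designer defines prompts and transitions, and the hidden operator steps through them while a transcript accumulates:

```python
class SpeechStateMachine:
    """Toy Wizard-of-Oz state machine: prompts, transitions, and a transcript."""
    def __init__(self, prompts, transitions, start):
        self.prompts = prompts           # state -> recorded system prompt
        self.transitions = transitions   # (state, response) -> next state
        self.state = start
        self.transcript = []             # runtime transcript for later analysis

    def prompt(self):
        return self.prompts[self.state]

    def wizard_step(self, heard_response):
        """The hidden operator clicks the response they heard the user say."""
        self.transcript.append((self.state, heard_response))
        self.state = self.transitions[(self.state, heard_response)]

sm = SpeechStateMachine(
    prompts={"start": "Check mail or calendar?", "mail": "You have 2 messages."},
    transitions={("start", "mail"): "mail"},
    start="start",
)
sm.wizard_step("mail")   # user said "mail"; machine advances and logs it
```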
Multimodal Interaction
• According to Scott – “The term ‘multi-modal
interface’ usually refers to ‘speech and
something else’ because speech alone wasn’t
good enough.”
• Though, should probably mean more than
one (simultaneous?) input modality
– Point, click, gesture, type, speak, write, touch,
look, bite, shake, think, sweat, etc… (lots of
sensing techniques).
Multimodal Interaction
• Lots of systems have used multiple modalities, but there were no real “tools”, or the modalities weren’t simultaneous.
Cohen, P.R., Cheyer, A., Wang, M., and Baeg, S.C. An open agent architecture. AAAI 94 Spring Symposium Series on Software Agents. AAAI, Menlo Park, CA, 1994; reprinted in Readings in Agents, Morgan Kaufmann, 1997, 197-204.
Brad Myers, Robert Malkin, Michael Bett, Alex Waibel, Ben Bostwick, Robert C. Miller, Jie Yang, Matthias Denecke, Edgar Seemann, Jie Zhu, Choon Hong Peck, Dave Kong, Jeffrey Nichols, Bill Scherlis. “Flexi-modal and Multi-Machine User Interfaces”, IEEE Fourth International Conference on Multimodal Interfaces, Pittsburgh, PA. October 14-16, 2002. pp. 343-348.
Multimodal Interfaces
• A common concept is “mode-ing” or “modifying” interaction.
– Gives extra context for recognizers (e.g. point and speak)
– Multiplies functionality of an interaction (e.g. ToolStone, left/right/no click)
Rekimoto, J., Sciammarella, E. (2000) “ToolStone: effective use of physical
manipulation vocabularies of input devices”. Proceedings of the ACM
Symposium on User Interface Software and Technology, pp. 109-117,
November 2000
• Also, a need for an input interpretation layer for widgets that can be specified in multiple modalities

Kinedit: Affective Messages Using Dynamic Texts