The CUED Speech Group
Dr Mark Gales
Machine Intelligence Laboratory
Cambridge University Engineering Department
1. CUED Organisation
CUED: 6 Divisions (130 Academic Staff, 1100 Undergrads, 450 Postgrads)
• A. ThermoFluids
• B. Electrical Eng
• C. Mechanics
• D. Structures
• E. Management
• F. Information Engineering Division
– Control Lab
– Signal Processing Lab
– Computational and Biological Learning Lab
– Machine Intelligence Lab: Medical Imaging Group, Vision Group, Speech Group
Speech Group: 4 Staff (Bill Byrne, Mark Gales, Phil Woodland, Steve Young), 9 RAs, 12 PhDs
2. Speech Group Overview
• Funded Projects in Recognition/Translation/Synthesis (5-10 RAs)
• PhD Projects in Fundamental Speech Technology Development (10-15 students)
• HTK Software Tools Development
• MPhil in Computer Speech, Text and Internet Technology
• Computer Laboratory NLIP Group
• Computer Speech and Language
• International Community
• Primary research interests in speech processing
– 4 members of Academic Staff
– 9 Research Assistants/Associates
– 12 PhD students
Principal Staff and Research Interests
• Dr Bill Byrne
– Statistical machine translation
– Automatic speech recognition
– Cross-lingual adaptation and synthesis
• Dr Mark Gales
– Large vocabulary speech recognition
– Speaker and environment adaptation
– Kernel methods for speech processing
• Professor Phil Woodland
– Large vocabulary speech recognition/meta-data extraction
– Information retrieval from audio
– ASR and SMT integration
• Professor Steve Young
– Statistical dialogue modelling
– Voice conversion
Research Interests
• data driven techniques
• voice transformation
• HMM-based techniques
• data driven semantic processing
• statistical modelling
• statistical machine translation
• finite state transducer framework
• large vocabulary systems [English, Chinese, Arabic]
• acoustic model training and adaptation
• language model training and adaptation
• rich text transcription & spoken document retrieval
• fundamental theory of statistical modelling and pattern processing
Example Current and Recent Projects
• Global Autonomous Language Exploitation
– DARPA GALE funded (collab with BBN, LIMSI, ISI …)
• HTK Rich Audio Transcription Project (finished 2004)
– DARPA EARS funded
• CLASSiC: Computational Learning in Adaptive Systems for Spoken Conversation
– EU (collab with Edinburgh, France Telecom, …)
• EMIME: Effective Multilingual Interaction in Mobile Environments
– EU (collab with Edinburgh, IDIAP, Nagoya Institute of Technology …)
• R2EAP: Rapid and Reliable Environment Aware Processing
– TREL funded
Also active collaborations with IBM, Google, Microsoft, …
3. Rich Audio Transcription Project
[Figure labels: Natural Speech (English/Mandarin), New algorithms, Rich Transcript]
• DARPA-funded project
– Effective Affordable Reusable Speech-to-text (EARS) program
• Transform natural speech into human readable form
– Need to add meta-data to the ASR output
– For example, mark speaker turns and handle disfluencies
See http://mi.eng.cam.ac.uk/research/projects/EARS/index.html
Rich Text Transcription
ASR Output
okay carl uh do you exercise yeah actually um i belong to a gym down here
gold’s gym and uh i try to exercise five days a week um and now and then
i’ll i’ll get it interrupted by work or just full of crazy hours you know
Meta-Data Extraction (MDE) Markup
Speaker1: / okay carl {F uh} do you exercise /
Speaker2: / {DM yeah actually} {F um} i belong to a gym down here /
/ gold’s gym / / and {F uh} i try to exercise five days a week {F um} /
/ and now and then [REP i’ll + i’ll] get it interrupted by work or just
full of crazy hours {DM you know } /
Final Text
Speaker1: Okay Carl do you exercise?
Speaker2: I belong to a gym down here, Gold’s Gym, and I try to
exercise five days a week and now and then I’ll get it
interrupted by work or just full of crazy hours.
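The step from MDE markup to cleaned word sequences can be sketched roughly as follows. This is a minimal illustration that assumes only the markup conventions visible in the example above: {F …} fillers, {DM …} discourse markers, [REP a + b] repetitions and / unit boundaries. The function name clean_mde and the regular expressions are hypothetical and are not part of the project's tools. Speaker labels are assumed to be stripped beforehand, and capitalisation and punctuation are not restored.

```python
import re
from typing import List

# Fillers and discourse markers, e.g. "{F uh}" or "{DM you know }".
FILLER_OR_DM = re.compile(r"\{(?:F|DM)\s+[^}]*\}")
# Repetitions, e.g. "[REP i'll + i'll]": keep only the part after the "+".
REPETITION = re.compile(r"\[REP\s+[^+\]]*\+\s*([^\]]*)\]")


def clean_mde(markup: str) -> List[str]:
    """Return the cleaned sentence-like units found in one line of MDE markup."""
    text = FILLER_OR_DM.sub(" ", markup)   # drop fillers and discourse markers
    text = REPETITION.sub(r"\1", text)     # resolve repetitions to the repair
    # "/" marks unit boundaries; normalise whitespace inside each unit.
    units = [" ".join(unit.split()) for unit in text.split("/")]
    return [unit for unit in units if unit]


if __name__ == "__main__":
    line = "/ and now and then [REP i'll + i'll] get it interrupted by work /"
    print(clean_mde(line))
```

Keeping the unit boundaries as separate list items leaves room for a later stage to add the punctuation and capitalisation seen in the final text.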
4. Statistical Machine Translation
• Aim is to translate from one language to another
– For example translate text from Chinese to English
• Process involves collecting parallel (bitext) corpora
– Align at document/sentence/word level
• Use statistical approaches to obtain most probable translation
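One common way to write "most probable translation" is the textbook noisy-channel formulation below. The slides do not state which models the group uses, so the symbols are assumptions: f is the source-language (e.g. Chinese) sentence, e a candidate English sentence, P(f | e) a translation model estimated from the aligned bitext, and P(e) a target-language model.

```latex
\hat{e} \;=\; \arg\max_{e} P(e \mid f)
        \;=\; \arg\max_{e} \frac{P(f \mid e)\, P(e)}{P(f)}
        \;=\; \arg\max_{e} P(f \mid e)\, P(e)
```

The denominator P(f) can be dropped because it does not depend on the candidate e.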
GALE: Integrated ASR and SMT
• Member of the AGILE team (led by BBN)
The DARPA Global Autonomous Language Exploitation (GALE) program
has the aim of developing speech and language processing technologies to
recognise, analyse, and translate speech and text into readable English.
• Primary languages for STT/SMT: Chinese and Arabic
See http://mi.eng.cam.ac.uk/research/projects/AGILE/index.html
5. Statistical Dialogue Modelling
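[Figure: dialogue system loop. User speech Yu → Speech Understanding → user dialogue act Au, modelled by P(Au | Yu); Dialogue Manager with system state (Ss, Su) → system dialogue act As → Speech Generation → system speech Ys, modelled by P(Ys | As). Levels shown: Waveforms, Words/Concepts, Dialogue Acts.]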
• Use a statistical framework for all stages
CLASSiC: Project Architecture
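[Figure: pipeline from Speech Input through ASR, NLU, DM, NLG and TTS to 1-Best Signal Selection and Speech output, with a "Context t-1" input; the stages exchange the hypothesis sets st, ut, ht, at, wt and rt, and x marks points where hypotheses may be eliminated.]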
Legend:
ASR: Automatic Speech recognition
NLU: Natural Language Understanding
DM: Dialogue Management
NLG: Natural Language Generation
TTS: Text To Speech
st: Input Sound Signal
ut: Utterance Hypotheses
ht: Conceptual Interpretation Hypotheses
at: Action Hypotheses
wt: Word String Hypotheses
rt: Speech Synthesis Hypotheses
X: possible elimination of hypotheses
See http://classic-project.org
6. EMIME: Speech-to-Speech Translation
• Personalised speech-to-speech translation
– Learn characteristics of a user's speech
– Reproduce the user's speech in synthesis
• Cross-lingual capability
– Map speaker characteristics across languages
• Unified approach for recognition and synthesis
– Common statistical model: hidden Markov models
– Simplifies adaptation (common to both synthesis and recognition)
• Improve understanding of recognition/synthesis
See http://emime.org
7. R2EAP: Robust Speech Recognition
• Current ASR performance degrades with changing noise
• Major limitation on deploying speech recognition systems
Project Overview
• Aims of the project
1. To develop techniques that allow an ASR system to rapidly respond to changing acoustic conditions;
2. While maintaining high levels of recognition accuracy over a wide range of conditions;
3. And be flexible so they are applicable to a wide range of tasks and computational requirements.
• Project started in January 2008 (3 year duration)
• Close collaboration with TREL Cambridge Lab.
– Common development code-base (extended HTK)
– Common evaluation sets
– Builds on current (and previous) PhD studentships
– Monthly joint meetings
See http://mi.eng.cam.ac.uk/~mjfg/REAP/index.html
Approach – Model Compensation
• Model compensation schemes are highly effective BUT slow compared to feature compensation schemes
• Need schemes to improve speed while maintaining performance
• Also automatically detect/track changing noise conditions
8. Toshiba-CUED PhD Collaborations
• To date 5 Research studentships (partly) funded by Toshiba
– Shared software - code transfer both directions
– Shared data sets - both (emotional) synthesis and ASR
– 6 monthly reports and review meetings
• Students and topics
– Hank Liao (2003-2007): Uncertainty decoding for Noise Robust ASR
– Catherine Breslin (2004-2008): Complementary System Generation and Combination
– Zeynep Inanoglu (2004-2008): Recognition and Synthesis of Emotion
– Rogier van Dalen (2007-2010): Noise Robust ASR
– Stuart Moore (2007-2010): Number Sense Disambiguation
• Very useful and successful collaboration
9. HTK Version 3.0 Development
HTK is a free software toolkit for developing HMM-based systems
• 1000’s of users worldwide
• widely used for research by universities and industry
1989 – 1992    V1.0 – 1.4     Initial development at CUED
1993 – 1999    V1.5 – 2.3     Commercial development by Entropic
2000 – date    V3.0 – V3.4    Academic development at CUED
• Development partly funded by Microsoft and DARPA EARS Project
• Primary dissemination route for CU research output
2004 - date: the ATK Real-time HTK-based recognition system
See http://htk.eng.cam.ac.uk
10. Summary
• Speech Group works on many aspects of speech processing
– Large vocabulary speech recognition
– Statistical machine translation
– Statistical dialogue systems
– Speech synthesis and voice conversion
• Statistical machine learning approach to all applications
• World-wide reputation for research
– CUED systems have defined state-of-the-art for the past decade
– Developed a number of techniques widely used by industry
• Hidden Markov Model Toolkit (HTK)
– Freely-available software, 1000’s of users worldwide
– State-of-the-art features (discriminative training, adaptation …)
– HMM Synthesis extension (HTS) from Nagoya Institute of Technology
See http://mi.eng.cam.ac.uk/research/speech