Problems and Prospects
in Collecting Spoken
Language Data
Kishore Prahallad
Suryakanth V Gangashetty
B. Yegnanarayana
Raj Reddy
IIIT Hyderabad, India
Carnegie Mellon University, USA.
1
Outline
Need for a digital library of audio and video data
 Characteristics of spoken language data
 Prototype data collection

– IIIT Hyderabad
– IIT Madras
– Lessons Learnt

Proposal to collect Indian language (IL) data
– as a part of Jim Baker’s global project.
2
Need for Digital Library of Audio &
Video Data

Current and future data will be in audio and video formats

Current technology makes it possible to digitize and store such large
amounts of data

Collection, storage and indexing of such data make it possible to provide
information to current and future generations

Acts as a test bed for the research challenges that exist in organizing,
indexing and retrieving such large data collections
– Algorithms for quicker and easier access to the information present in AV
format, with queries given in text / audio / video modes
– Algorithms using multi-modal data for bio-metric authentication
– Development of multi-lingual speech synthesis and speech recognition
systems
3
Characteristics of Spoken Language Data



Message – Information to be conveyed
Speaker – Who is the speaker?
– His/her background: age, gender, literacy level, knowledge level, mannerisms etc.
Emotions – Anger, sadness, happiness etc.
Idiolect – An individual's distinctive style of speaking
Medium of transmission – Microphone, telephone, satellite etc.
Environment – Party, airport/station etc.
Language
Dialect – Grammar and vocabulary associated with a regional or social use of a language
Culture and civilization
– The richness of vocabulary, grammar etc. indicates the times of the language and the society
4
Characteristics of Spoken
Language Data

How was a language spoken 25 years ago, 50 years ago, 100
years ago and beyond?

How was a famous poem recited or sung by its author?

How was a particular language spoken in different geographical
locations of a state/country?

How has a particular language/dialect evolved over a period of
time?

What were the rare languages/dialects (which are no longer in
existence)? How were they spoken?
5
Phase 0: Prototype data collection
at IIIT Hyd

High quality studio recordings
– 2 hrs of single speaker recordings for speech
synthesis
– Telugu, Hindi, Tamil and Indian-English
– Developed text to speech systems in these 4
languages

Telephone and Cell-phone corpus
– 150 hrs (540 speakers)
– Telugu, Tamil and Marathi
– Developed speech recognition systems in these 3
languages
6
Phase 0: Prototype data collection
at IIT Madras
15 hours (72 speakers)
TV news in Tamil, Telugu and Hindi languages

– Text to speech systems (TTS)
– Language Identification
– Duration modeling for TTS systems
7
Tools Aiding
Acquisition/Correction of Speech Data

Transcription correction tool (TCT)
– Spoken errors at phone, syllable, word level
– Background noise, abrupt begin or end, low SNR
– TCT corrects the above errors at three levels

Audio & Video Transcription Tool
– Used to annotate movie databases

Correction of Segment labels
– Emulabel
8
Lessons Learnt

Speech correction needs 3-6 times more effort
than collection
– Better to collect more data than to correct it

Need a unified framework
– Standardize processes, procedures and tools

Need larger collection of spoken and text
corpora
– For building practical speech systems in
Indian languages
9
Proposal for collection of larger
Spoken Language Data for IL
Focus on information present in speech
mode
Collect spoken language data from all
Indian languages and also from
neighboring countries
Collect about 200,000 (0.2 M) hours of
speech

– As a part of Jim Baker’s global project of
collecting 1 million hours of speech
10
New in our approach

Collection of large speech data, up to 200,000 (0.2 M)
hours
– All Indian languages and dialects
 23 official Indian languages
 Approx. 10,000 hours per language
– All types: Traditional, Read, spoken, conversational, dialog,
movies, broadcast etc.
– All modes: microphone, clean, telephone, cellphone, satellite etc



Standard procedure for organizing, annotating and
indexing
More focus on larger collection (and on elimination rather than
correction)
Make this data available for general public use
11
Key Make-A-Difference Capability

Availability of information (Stories, lectures, poems, books, articles)
in spoken language
 For illiterate
 Vision Impaired

Collection and Storage of spoken language data of popular as well
as rare languages & dialects

Promotes research and development in
– Speech Technology
  Speech-to-speech translation in Indian languages
  Phonetic engine (language independent)
  Speech synthesis (text-to-speech for Indian languages)
  Speaker recognition (text independent and dependent)
  Language identification
  Speech enhancement
  Speech signal processing
– Biometrics:
 Multimodal: Audio-Video modes
– Information Access, Storage and Retrieval
 Audio-video data (indexing)
 Data Mining (searching)
 Speech Coding (Ultra-low bit coding)
12
Implementation Plan

Phase 1: (3.5 months)
– 10 languages
– 33,300 hours

Phase 2: (8 months)
– 10 (of phase 1) languages
– 66,000 hours

Phase 3: (10 months)
– 13 - remaining languages
– 80,000 hours
13
Mid-Term and Final Terms

Mid-Term
– Phase 1, collection of 33,300 hours of speech
– Collection, Storage and Indexing of speech data for
public information access
– Visible research output using the speech data
– Demonstrations of speech technology products
 Speech recognition in 10 languages

Final Term
– Phase 1 + Phase 2
14
Q&A
15
Misc….
16
Impact of Audio Digital Library




Availability of information in spoken language form for
illiterate and others
Promotes research in speech technology for Indian
languages
Enables development of speech technology products useful for
the common man
Examples:
– Speech-speech translation systems
 For information exchange
– Screen readers
 For illiterate and physically challenged
– Naturally speaking dialog systems
 For information access over voice mode
17
Phase 1: Time Estimate

Phase 1:
– 10 official Indian languages
– Parallel collection of data
– ~ 3000 hours per language
 5,000 - 10,000 speakers
 > 10 min of speech each per speaker
– Total: 33,300 hours

Time Estimates: (~ 3.5 months all 10 languages)
– 10 persons-team per language
– Each person works
 8 hours a day
 30 mins of speech recording per hour
– 1-3 speakers per hour
 240 mins of speech per day
– 8-24 speakers per day per person
– Up to 240 speakers per day per team
– 20,000 speakers per language in 84 working days
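
The arithmetic behind this estimate can be checked with a short sketch. It takes the figures from this slide, assumes the upper end of the 1-3 speakers-per-hour range, and assumes all ten team members record in parallel. The function and variable names are illustrative and not part of the proposal.

```python
# Back-of-envelope check of the Phase 1 time estimate.
# Figures are taken from the slide; names are illustrative only.
TEAM_SIZE = 10               # persons per language
HOURS_PER_DAY = 8            # working hours per person per day
SPEECH_MIN_PER_HOUR = 30     # minutes of recorded speech per working hour
MAX_SPEAKERS_PER_HOUR = 3    # assumption: upper end of the 1-3 range
WORKING_DAYS = 84
LANGUAGES = 10


def phase1_estimate():
    """Return the derived daily and total figures for Phase 1."""
    speech_min_per_person_day = HOURS_PER_DAY * SPEECH_MIN_PER_HOUR
    speakers_per_team_day = TEAM_SIZE * HOURS_PER_DAY * MAX_SPEAKERS_PER_HOUR
    speakers_per_language = speakers_per_team_day * WORKING_DAYS
    hours_per_language = (
        TEAM_SIZE * speech_min_per_person_day * WORKING_DAYS / 60
    )
    return {
        "speech_min_per_person_day": speech_min_per_person_day,
        "speakers_per_team_day": speakers_per_team_day,
        "speakers_per_language": speakers_per_language,
        "hours_per_language": hours_per_language,
        "total_hours": hours_per_language * LANGUAGES,
    }


if __name__ == "__main__":
    for name, value in phase1_estimate().items():
        print(f"{name}: {value}")
```

Under these assumptions the sketch gives about 3,360 hours per language and 33,600 hours overall, which is close to the 33,300 hours quoted for Phase 1.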
18
Phase 1: Cost Estimate
Man power cost: Rs 140 Lakhs
 Equipment cost: Rs 55 Lakhs
 Communication cost: Rs 40 Lakhs
 Contingency (10%): Rs 25 Lakhs

Total Cost: Rs 2.6 Crores (~ $ 565,000)
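
As a quick consistency check, the four line items can be summed in a few lines. This is only a sketch using the figures on this slide; the names are illustrative, and the exchange rate is not stated in the proposal but is derived here from the quoted rupee and dollar totals.

```python
# Roll-up of the Phase 1 cost estimate (all inputs from the slide).
LAKH = 100_000
CRORE = 10_000_000

COSTS_LAKHS = {
    "manpower": 140,
    "equipment": 55,
    "communication": 40,
    "contingency": 25,   # quoted as 10%, rounded up on the slide
}
QUOTED_USD = 565_000     # dollar figure quoted on the slide


def phase1_cost_summary():
    """Return the total in rupees and crores, plus derived figures."""
    total_rs = sum(COSTS_LAKHS.values()) * LAKH
    base_lakhs = sum(v for k, v in COSTS_LAKHS.items() if k != "contingency")
    return {
        "total_rs": total_rs,
        "total_crores": total_rs / CRORE,
        "exact_10pct_contingency_lakhs": base_lakhs * 0.10,
        "implied_rs_per_usd": total_rs / QUOTED_USD,
    }


if __name__ == "__main__":
    for name, value in phase1_cost_summary().items():
        print(f"{name}: {value}")
```

The items add up to Rs 260 Lakhs, i.e. Rs 2.6 Crores. An exact 10% contingency on the Rs 235 Lakhs base would be Rs 23.5 Lakhs, so the Rs 25 Lakhs on the slide appears to be rounded up.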
19
Man-Power Cost

Data collection team: Rs 86 Lakhs
– 10 (for data collection) x Rs 10 K per month
– 10 (for data correction) x Rs 10 K per month
– 1 data manager (Rs 15 K per month)
– 4 months' cost: Rs 8,60,000 per language

5 engineers: Rs 4 Lakhs
– B.Tech Level (Rs 20,000 PM)

Gifts per speaker: Rs 50 Lakhs
– Rs 25 per speaker
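
The man-power breakdown can be reproduced with the sketch below. Two inputs are assumptions inferred from other slides rather than stated here: that the engineers are also costed for 4 months, and that gifts cover 20,000 speakers in each of the 10 languages. The names are illustrative.

```python
# Reconstruction of the Rs 140 Lakhs man-power figure.
LAKH = 100_000
MONTHS = 4                      # stated for the data collection team
LANGUAGES = 10
SPEAKERS_PER_LANGUAGE = 20_000  # assumption: taken from the time estimate


def manpower_cost_lakhs():
    """Return each man-power component and the total, in lakhs of rupees."""
    # Per language: 10 collectors + 10 correctors at Rs 10 K, 1 manager at Rs 15 K.
    team_monthly = 10 * 10_000 + 10 * 10_000 + 1 * 15_000
    team_per_language = team_monthly * MONTHS
    data_team = team_per_language * LANGUAGES
    # Assumption: the 5 engineers are costed over the same 4 months.
    engineers = 5 * 20_000 * MONTHS
    gifts = 25 * SPEAKERS_PER_LANGUAGE * LANGUAGES
    return {
        "team_per_language_rs": team_per_language,
        "data_team": data_team / LAKH,
        "engineers": engineers / LAKH,
        "gifts": gifts / LAKH,
        "total": (data_team + engineers + gifts) / LAKH,
    }


if __name__ == "__main__":
    for name, value in manpower_cost_lakhs().items():
        print(f"{name}: {value}")
```

With these inputs the three components come to Rs 86, 4 and 50 Lakhs, matching the Rs 140 Lakhs man-power total in the Phase 1 cost estimate.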
20
Machines Cost

Machines:
– 30 servers: Rs 30 Lakhs
 3 servers per languages
 Each server has 4 ports for data collection
– 30 CTI cards: Rs 20 Lakhs

Storage: 20 TB: Rs 5 Lakhs
– Two copies of 20 TB
21
Communications Cost

Telephonic charges: Rs 20 Lakhs
– Rs 1 per min (local telephonic charges)

Transportation: Rs 20 Lakhs
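
The telephone figure is consistent with billing every recorded minute at the local rate. The sketch below works this through under that assumption, which the slide does not state explicitly; call set-up time and failed calls are ignored. The names are illustrative.

```python
# Rough check of the communications cost.
LAKH = 100_000
TOTAL_SPEECH_HOURS = 33_300   # Phase 1 target
RATE_RS_PER_MIN = 1           # local telephone charge from the slide
TRANSPORT_LAKHS = 20          # transportation figure from the slide


def communication_cost_lakhs():
    """Return (telephone cost, total communications cost) in lakhs."""
    # Assumption: every minute of collected speech is a billed minute.
    billed_minutes = TOTAL_SPEECH_HOURS * 60
    telephone = billed_minutes * RATE_RS_PER_MIN / LAKH
    return telephone, telephone + TRANSPORT_LAKHS


if __name__ == "__main__":
    telephone, total = communication_cost_lakhs()
    print(f"telephone: {telephone:.2f} lakhs, total: {total:.2f} lakhs")
```

This gives roughly Rs 20 Lakhs for telephone charges and Rs 40 Lakhs in total, in line with the communication cost in the Phase 1 estimate.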
22