Illinois Group in Star Challenge
Part I: Visual Data Processing
Part II: Audio Search
Liangliang Cao, Xiaodan Zhuang
University of Illinois at Urbana-Champaign
www.beckman.uiuc.edu
What is Star Challenge?
• Competition to develop the world's next-generation multimedia search technology
• Hosted by the Agency for Science, Technology and Research (A*STAR), Singapore
• A real-world computer vision task which requires large amounts of computation power
But low rewards
• 56 teams from 17 countries
• Round 1 → 8 teams (no rewards)
• Round 2 → 7 teams (no rewards)
• Round 3 → 5 teams (no rewards)
• Grand Final in Singapore: only one team can win US$100,000
But we have a team with no fears…
Vong, Xu, Mert, Dennis, Jason, Andrey, Yuxiao
Xiaodan, Lyon, Paritosh, Mark, Tom, Mandar
Sean, Jui-Ting, Zhen, Huazhong, Xi
Let’s go over our experience
and stories…
Outline
• Problems of Visual Retrieval
• Data
• Features
• Algorithms
• Results (first 3 rounds)
3 Audio Retrieval Tasks

Task | Query | Target | Metric
AT1 | IPA sequence | segments that contain the query IPA sequence, regardless of language | MAP
AT2 | an utterance spoken by different speakers | all segments that contain the query word/phrase/sentence, regardless of spoken language | MAP
AT3 | no queries | extract all recurrent segments which are at least 1 second in length | F-measure

Data set: 25-hour monolingual database in round 1; 13-hour multilingual database in round 3.

"Mean Average Precision":
MAP = (1/L) Σ_{i=1..L} (1/R_i) Σ_{j=1..R_i} P(i, j)
where L is the number of queries, R_i the number of relevant segments for query i, and P(i, j) the precision at the j-th relevant hit.

F-measure:
F = (1/D) Σ_{d=1..D} (1 + β²) P_d R_d / (β² P_d + R_d)

Xiaodan will talk about this part…
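The two metrics above can be sketched in code; the helper names are illustrative, and the F-measure is assumed to be the standard β-weighted form.

```python
# Sketch of MAP and the beta-weighted F-measure (names are illustrative).
def average_precision(relevant, ranked):
    """AP for one query: precision averaged over the ranks of relevant hits."""
    hits, total = 0, 0.0
    for rank, seg in enumerate(ranked, start=1):
        if seg in relevant:
            hits += 1
            total += hits / rank          # precision at each relevant hit
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(queries):
    """MAP over L queries; `queries` is a list of (relevant_set, ranked_list)."""
    return sum(average_precision(rel, rk) for rel, rk in queries) / len(queries)

def f_measure(p, r, beta=1.0):
    """Beta-weighted harmonic mean of precision p and recall r."""
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r) if (p or r) else 0.0
```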
3 Video Retrieval Tasks

Task | Query | Target | Criteria | Data Set
VT1 | a single image; 20 (short) queries | all the similar video segments | "visually similar" | 20 categories, multiple labels possible
VT2 | a short video shot (<10 s); 20 (long) queries | all the similar video segments | perceptually similar | 10 categories, multiple labels possible
VT3 | videos with sound (3~10 s); on the order of 10K | category number | learning the common visual characteristics | 10 (20) categories, including one "others" category

Metric: Mean Average Precision for VT1/VT2,
AP = (1/R) Σ_{r=1..N} P(r) · rel(r),  where P(r) = (1/r) Σ_{i=1..r} rel(i)
and rel(r) is 1 if the result at rank r is relevant; classification accuracy for VT3.
20 VT1 Categories
100. Not-Applicable, none of the labels
101. Crowd (>10 people)
102. Building with sky as backdrop, clearly visible
103. Mobile devices including handphone/PDA
104. Flag
105. Electronic chart, e.g. stock charts, airport departure chart
106. TV chart overlay, including graphs, text, powerpoint style
107. Person using Computer, both visible
108. Track and field, sports
109. Company Trademark, including billboard, logo
110. Badminton court, sports
111. Swimming pool, sports
112. Closeup of hand, e.g. using mouse, writing, etc.
113. Business meeting (>2 people), mostly seated down, table visible
114. Natural scene, e.g. mountain, trees, sea, no people
115. Food on dishes, plates
116. Face closeup, occupying about 3/4 of screen, frontal or side
117. Traffic Scene, many cars, trucks, road visible
118. Boat/Ship, over sea, lake
119. PC Webpages, screen of PC visible
120. Airplane
10 Categories for VT2
201. People entering/exiting door/car
202. Talking face with introductory caption
203. Fingers typing on a keyboard
204. Inside a moving vehicle, looking outside
205. Large camera movement, tracking an object,
person, car, etc
206. Static or minute camera movement, people(s)
walking, legs visible
207. Large camera movement, panning left/right,
top/down of a scene
208. Movie ending credit
209. Woman monologue
210. Sports celebratory hug
5 Categories for VT3
• 101. Crowd (>10 people)
• 102. Building with sky as backdrop, clearly visible
• 107. Person using Computer, both visible
• 112. Closeup of hand, e.g. using mouse, writing, etc.
• 116. Face closeup, occupying about 3/4 of screen, frontal or side
Video+Audio Tasks in Round 3
1) Audio search (AT1 or AT2)
5 queries will be given, either in the form of an IPA sequence or a waveform, and the participants are required to solve 4.
2) Video search (VT1)
5 queries will be given and the participants are required to solve 4.
3) Audio + video search (AT1 + VT2)
The search queries for this task are a combination of an IPA sequence/waveform and a video category. The participants are required to retrieve segments of data which contain sound and video corresponding to the given IPA sequence/waveform and video category, respectively. 3 queries will be given and the participants are required to solve 2.
Examples of Images
More samples
Evaluation Video Data in Round 2
 31 MPEG videos, ~20 hours
 17289 frames for VT1 in total
 40994 frames for VT2 in total
 32508 pseudo key frames, 8486 real key frames
Evaluation Video Data in Round 3
 Video files: 27 MPEG-1 files (13 hours of video/audio in total)
 Key frames for VT1: 10580 .jpg files
 Key frames for VT2: 64546 files in total, including 10580 .jpg files (true key frames) + 53966 .jpg files (pseudo key frames)
 Video resolution: 352×288
Computation Power
 Workstations in IFP
  10 servers, 2~4 CPUs each, 36 CPUs in total
  IFP-32 cluster: 32 dual-core 2.8 GHz 64-bit CPUs
 CSL clusters
  Trusted-ILLIAC: 256 nodes with dual 2.2 GHz Opterons, 2 GB of RAM and 73 GB SCSI Ultra320 disks
  Monolith: 128-node cluster with dual Pentium III CPUs at 1 GHz and 1.5 GB of RAM per node
 TeraGrid
Time Cost for Video Tasks
 Data decompression: 15 minutes
 Video format conversion: 2 hours
 Video segmentation (for VT2): 40 minutes
 Sound track extraction: 30 minutes
 Feature extraction:
  Global Feature 2: 2 hours (C)
  Global Feature 1: 2 hours (C)
  Patch-based Feature 1: 2 hours (C)
  Patch-based Feature 2: 5 hours (MATLAB)
  Semantic Feature 1: 24 hours (MATLAB)
  Semantic Feature 2: 3 hours (C)
  Semantic Feature 3: 4 hours (C)
  Motion Feature 1: 24 hours (MATLAB)
  Motion Feature 2: 3 hours on Trusted-ILLIAC
 Classifier training:
  Classifier 1: 1 hour (on IFP cluster, 25 CPUs, MATLAB)
  Classifier 2: 20 minutes
  Classifier 3: less than 10 minutes
Possible Accelerations for Video
 MATLAB code to C
 Parallel computing
 GPU acceleration:
  Patch-based features
 Load time is the major issue
  Extracting all the features after one load
Features for Round2-VT1
 Image features
  SIFT
  HOG
  GIST
  APC
  LBP
  Color, texture, etc.
 Semantic feature
Features for Round2-VT2
 Character detector
  Harris corner
  morphological operations
 Optical flow
  Lucas-Kanade on spatial intensity gradient
 Gender recognition
  SODA-boost based
 Motion history image
 Spatial interest points
GUFE: Grand Unified Feature Extractor
 Designed by Dennis
 Collects features generated by team members into one standard format
 Retrieval by query expansion based on NN
 Feature normalization/combination
 Result visualization
Observations
1. Samples under the same category are more semantically similar to each other.
2. The shot boundaries are not well defined.
3. Some of the key frames are not labeled correctly, e.g., VT1 101, 103 (26-141).
Algorithms
 Query expansion
Input: a query image and its category number.
0. Preprocessing: compute the matching between the evaluation and the development data.
1. Expand the query image by retrieving all images from the development data set with the same category.
2. Search the evaluation set with the expanded query.
Output: return the top 50/20 results.
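The steps above can be sketched as follows; the feature matrices, category arrays, and Euclidean nearest-neighbor scoring are illustrative assumptions, not the team's actual pipeline.

```python
import numpy as np

# Hypothetical sketch of query-expansion retrieval: expand the query with
# same-category development images, then rank evaluation images by distance
# to the nearest member of the expanded query set.
def expand_and_search(query_vec, query_cat, dev_feats, dev_cats, eval_feats, top_k=50):
    # 1. Expand the query with all development images of the same category.
    expanded = np.vstack([query_vec[None, :], dev_feats[dev_cats == query_cat]])
    # 2. Score each evaluation image by its distance to the nearest expanded query.
    d = np.linalg.norm(eval_feats[:, None, :] - expanded[None, :, :], axis=2)
    scores = d.min(axis=1)
    # 3. Return the indices of the top_k closest evaluation images.
    return np.argsort(scores)[:top_k]
```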
Algorithms
 Query expansion
 GMM-based approach
Motivation: use a GMM to model the distribution of patches.
1. Train a UBM (Universal Background Model) on patches from all training images.
2. MAP-estimate the distribution of the patches belonging to one image, given the UBM.
3. Compute pair-wise image distances based on a patch kernel and within-class covariance normalization.
4. Retrieve images based on the normalized distance.
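A minimal numpy sketch of steps 1-2 above (UBM posteriors and mean-only MAP adaptation); the UBM is assumed to be given as diagonal-covariance parameters, and the patch kernel and within-class covariance normalization of step 3 are omitted.

```python
import numpy as np

# Illustrative sketch: a diagonal-covariance UBM is assumed given as
# (weights, means, variances); only mean adaptation is shown.
def posteriors(patches, weights, means, variances):
    # log N(x; mu_k, diag(var_k)) for each patch/component pair
    d = patches[:, None, :] - means[None, :, :]
    log_g = -0.5 * (np.sum(d * d / variances, axis=2)
                    + np.sum(np.log(2 * np.pi * variances), axis=1))
    log_p = np.log(weights) + log_g
    log_p -= log_p.max(axis=1, keepdims=True)
    p = np.exp(log_p)
    return p / p.sum(axis=1, keepdims=True)

def map_adapt_means(patches, weights, means, variances, relevance=16.0):
    """Mean-only MAP adaptation of the UBM to one image's patches."""
    post = posteriors(patches, weights, means, variances)
    n_k = post.sum(axis=0)                               # soft counts per component
    ex_k = post.T @ patches / np.maximum(n_k[:, None], 1e-10)
    alpha = (n_k / (n_k + relevance))[:, None]           # adaptation coefficients
    return alpha * ex_k + (1 - alpha) * means            # adapted mean supervector
```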
VT1 Performance (#2 of 8)

Category | MAP
101. Crowd (>10 people) | 0.8419
102. Building with sky as backdrop, clearly visible | 0.977
103. Mobile devices including handphone/PDA | 0.028
107. Person using Computer, both visible | 0.2281
109. Company Trademark, including billboard, logo | 0.96
112. Closeup of hand, e.g. using mouse, writing, etc. | 0.4584
113. Business meeting (>2 people), mostly seated down, table visible | 0.0644
115. Food on dishes, plates | 0.2285
116. Face closeup, occupying about 3/4 of screen, frontal or side | 0.9783
117. Traffic Scene, many cars, trucks, road visible | 0.2901
VT2 Performance (#1 of 8)

Category | MAP
202. Talking face with introductory caption | 0.8432
206. Static or minute camera movement, people walking, legs visible | 0.0581
207. Large camera movement, panning left/right, top/down of a scene | 0.7789
208. Movie ending credit | 0.2782
209. Woman monologue (Zhen) | 0.9756
Performance in Round 3 (#1 of 7)

Task 2 (VT1):
Target | Estimated MAP (R=20)
101. Crowd (>10 people) | 0.64
102. Building with sky as backdrop, clearly visible | 1
107. Person using Computer, both visible | 0.7
112. Closeup of hand, e.g. using mouse, writing, etc. | 0.527
116. Face closeup, occupying about 3/4 of screen, frontal or side | 1

Task 3 (AT1 + VT2), R=20:
Retrieval Target | VT2 only | AT1 + VT2
202. Talking face with introductory caption | 1 | 0.03
209. Woman monologue | 0.35 | 0.1
201. People entering door | N/A | N/A
We are:
2nd in audio search
4th in video search
2nd in AV search
1st overall
Illinois Group in Star Challenge
Part II: Audio Search
• A general index/retrieval approach leveraging speech recognition output lattices
• Experience in a real-world audio retrieval task: the Star Challenge
• Experience in speech retrieval in an unknown language
(Audio) Information Retrieval: Problem Definition
• Task description: given a query, find the "most relevant" segments in a database
• Example: the query "k r u: d p r ai s ^ z" ("CRUDE PRICES") is run against thousands of audio files, and the top N file IDs are returned
(Audio) Information Retrieval: "Standard" Methods
• Published algorithms:
 – EXACT MATCH: segment = argmin_i d(query, segment_i), where d is the string edit distance
  • Fast
 – SUMMARY STATISTICS: segment = argmax_i p(query|segment_i); bag-of-words, no concept of "sequence"
  • Good for text, e.g., Google, Yahoo, etc.
 – TRANSFORM AND INFER: segment = argmax_i p(query|segment_i) ≈ argmax_i E(count(query)|segment); word order matters
  • Flexible, but slow…
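The EXACT MATCH method above can be sketched directly; representing each segment as a phone string is an illustrative simplification.

```python
# Rank segments (phone strings) by string edit distance to the query phone string.
def edit_distance(a, b):
    """Levenshtein distance via a rolling one-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[-1] + 1,            # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def exact_match_retrieve(query, segments, top_n=50):
    """segments: list of phone strings; returns indices sorted by distance."""
    order = sorted(range(len(segments)), key=lambda i: edit_distance(query, segments[i]))
    return order[:top_n]
```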
Example: Language-Independent Speech Information Retrieval
• Voice activity detection → perceptual frequency warping → Gaussian mixtures → likelihood vector b_i = p(observation_t | state_t = i)
• Inference algorithm: finite state transducer built from ASR lattices, computing E(count(query) | observations)
• Retrieval ranking = E(count(query) | segment observations)
STAR Challenge Audio Retrieval Tasks

Task | Query | Target
AT1 | IPA sequence | segments that contain the query IPA sequence, regardless of language
AT2 | an utterance spoken by different speakers | all segments that contain the query word/phrase/sentence, regardless of spoken language

Data set: 25-hour monolingual database in round 1; 13-hour multilingual database in round 3.
STAR Challenge Audio Retrieval Tasks
• Genuine retrieval tasks, without pre-defined categories
• The queries are human speech or IPA sequences, in one or multiple languages
• Queries might be only part of the speech in the provided segments in the database
• The returned hits should be ordered, and only the first 50 or 20 are submitted
Indexing / Query / Retrieval
[System diagram: the audio archive goes through feature extraction on short-time windows to produce spectral features; speech recognition turns these into lattices; FSA generation builds an FSM index per segment, and segment indices are combined into group indices. A query goes through FSM-based query construction (empirical or knowledge-based phone confusion) to form an FSM-based query, which FSM-based retrieval matches against all group indices to produce the retrieval results.]
Automatic speech recognition
 Better performance with language-specific systems than with language-independent systems
 No inter-language mismatch between training and testing (acoustic model, pronunciation model, language model)
 e.g., for English data, we can use all knowledge of English (training data, pronunciation dictionary, language model)
 language-specific (e.g., English) word recognition
Automatic speech recognition
 Multilingual database and queries
 ○ We might fail to identify which particular language the test data is in; even if we know, we might not have speech recognition models for that language.
 ○ We might encounter unseen languages (Tamil, Malay…)
 Language-independent phone-based recognition instead of word-based recognition
Summary of corpora of different languages
Recognition results: all languages/datasets are not equal
Acoustic model
 Spectral features: 39-dim PLP, cepstral mean/variance normalization per speaker
 Modeling: HMMs with {11,13,15,17}-mixture Gaussians
 Context-dependent, language-dependent modeling
  Left context - CENTRAL PHONE + right context % language, referred to as language-dependent "triphones"
  e.g., the sound /A/ in different contexts and different languages:
  ^’-A+b%Eng
  ^’-A+b’%Eng
  >-A+cm%Chinese
  …
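The per-speaker cepstral mean/variance normalization mentioned above amounts to z-scoring each speaker's feature frames; a minimal sketch (function name illustrative):

```python
import numpy as np

# Per-speaker CMVN: shift each feature dimension to zero mean and unit variance,
# computed over all frames belonging to one speaker.
def cmvn(frames):
    """frames: (n_frames, n_dims) array of cepstral features for one speaker."""
    mu = frames.mean(axis=0)
    sigma = frames.std(axis=0)
    return (frames - mu) / np.maximum(sigma, 1e-10)   # guard against zero variance
```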
Acoustic model - Context-Dependent Phones
Acoustic model - clustering
 Categories for decision tree questions
  Right or left context
  Distinctive phone features (manner/place of articulation)
  Language identity
  Lexical stress
  Punctuation mark
Language/Sequence model --- N-gram

P(W_i | W_{i-1}, …, W_{i-N+1})

 If there is a pronunciation model, a particular sequence of context-dependent phone models (acoustic models) can be converted to a particular word sequence, which is modeled by an N-gram.
 If there is no pronunciation model, the context-dependent phone sequence is directly modeled by an N-gram.
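A minimal maximum-likelihood sketch of the N-gram model above; real recognizers would add smoothing and back-off, which are omitted here.

```python
from collections import Counter

# Maximum-likelihood N-gram estimation over symbol sequences (words or phones).
def train_ngram(sequences, n=2):
    ngrams, contexts = Counter(), Counter()
    for seq in sequences:
        padded = ['<s>'] * (n - 1) + list(seq) + ['</s>']
        for i in range(n - 1, len(padded)):
            ngrams[tuple(padded[i - n + 1:i + 1])] += 1   # count the full n-gram
            contexts[tuple(padded[i - n + 1:i])] += 1     # count its history
    def prob(w, *history):
        """P(w | history), e.g. prob('b', 'a') for a bigram model."""
        h = tuple(history[-(n - 1):])
        return ngrams[h + (w,)] / contexts[h] if contexts[h] else 0.0
    return prob
```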
Solutions for English Speech Retrieval
 Speech recognition frontend
  Features:
   ○ Perceptual Linear Predictive cepstra + energy
   ○ first/second-order regression coefficients
  Speech recognizer:
   ○ Triphone-clustered English speech recognizer
   ○ English acoustic model, English dictionary, English language model
   ○ Lattice generation using a Viterbi decoder
[Diagram: audio archive → feature extraction on short-time windows → spectral features → speech recognition → lattices]
Using lattices as the representation of speech data
I. A compact way to represent numerous alternative hypotheses output by the speech recognizer
• A simple example: a small lattice encoding "a b a" or "b a"
• Some more complex examples: James 1995; Mangu et al. 1999
Using lattices as the representation of speech data
II. Enable more robust speech retrieval
• The single best hypothesis from speech recognition is not reliable enough.
• Lattices can be represented as finite state machines, which can be used in speech retrieval and take advantage of general weighted finite state machine algorithms.
• Robust matching between the query and the audio files.
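The "expected count of a query given a lattice" idea can be illustrated on a toy weighted DAG; the arc representation and single-symbol query below are simplifying assumptions standing in for the weighted FSM machinery.

```python
import math
from collections import defaultdict

# Toy lattice: a DAG of (src, dst, label, log_prob) arcs from state 0 to `final`,
# with states numbered in topological order. The expected count of a one-symbol
# query is the sum of the posteriors of the arcs carrying that symbol,
# computed by forward-backward over the DAG.
def expected_count(arcs, final, query_label):
    out, inn = defaultdict(list), defaultdict(list)
    for a in arcs:
        out[a[0]].append(a)
        inn[a[1]].append(a)
    states = sorted({s for a in arcs for s in a[:2]})
    alpha = {s: 0.0 for s in states}; alpha[0] = 1.0      # forward path mass
    for s in states:
        for (u, v, lbl, lw) in out[s]:
            alpha[v] += alpha[s] * math.exp(lw)
    beta = {s: 0.0 for s in states}; beta[final] = 1.0    # backward path mass
    for s in reversed(states):
        for (u, v, lbl, lw) in inn[s]:
            beta[u] += beta[s] * math.exp(lw)
    total = alpha[final]
    return sum(alpha[u] * math.exp(lw) * beta[v]
               for (u, v, lbl, lw) in arcs if lbl == query_label) / total
```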
Solutions for English Speech Retrieval
 Indexing the audio archive
  Input from the ASR frontend:
   ○ Lattices of database segments
   ○ English vocabulary
  Construct log semiring automata for each segment
  Construct FST-based segment index files
  Combine segment index files into a few group index files for retrieval
[Diagram: all audio segments → FSA generation → FSM index per segment → group indices]
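As a toy stand-in for the FST-based index, one can map phone n-grams to the segments containing them; indexing one-best phone strings instead of lattices is a deliberate simplification.

```python
from collections import defaultdict

# Map each phone n-gram in a segment's best hypothesis to the segment ids
# containing it (the real system indexes full lattices with FSTs).
def build_index(segments, n=3):
    """segments: dict of segment_id -> list of phones."""
    index = defaultdict(set)
    for seg_id, phones in segments.items():
        for i in range(len(phones) - n + 1):
            index[tuple(phones[i:i + n])].add(seg_id)
    return index

def lookup(index, query_phones, n=3):
    """A segment matches if it contains every n-gram of the query."""
    grams = [tuple(query_phones[i:i + n]) for i in range(len(query_phones) - n + 1)]
    hits = [index.get(g, set()) for g in grams]
    return set.intersection(*hits) if hits else set()
```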
 Indexing the audio archive (example)
Speech recognition output: lattices for two audio files, and the index built from the two files (Allauzen et al., 2004).
Solutions for English Speech Retrieval
 Making FST-based queries
  Queries provided as IPA sequences:
   ○ Build an automaton from the IPA sequence and expand each IPA arc into alternative arcs for 'similar' IPA symbols
   ○ Build a query FSA, incorporating constraints
  Queries provided as audio waveforms:
   ○ Process with the ASR frontend
   ○ Build a log semiring automaton
   ○ Build the query FSA
Example of query expansion (phone sequence to word FSA):
"k r u: d p r ai s ^ z" → "CRUDE PRICES"
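The phone-confusion expansion above can be sketched with a flat list of alternative sequences instead of an FSA; the confusion-table entries below are purely illustrative.

```python
from itertools import product

# Hypothetical phone-confusion table: each phone maps to its plausible
# alternatives (entries are illustrative, not from the actual system).
CONFUSIONS = {'p': ['p', 'b'], 's': ['s', 'z']}

def expand_query(phones, confusions=CONFUSIONS):
    """Expand each query phone into its confusable alternatives and
    enumerate all alternative phone sequences (a flat stand-in for a query FSA)."""
    alts = [confusions.get(p, [p]) for p in phones]
    return [' '.join(seq) for seq in product(*alts)]
```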
Solutions for English Speech Retrieval
 Retrieval using queries
  Parallel retrieval in all group index files
  Order retrieved segment IDs
  Truncate/format results
  Fuse results with different precision/recall tradeoffs obtained by different settings
Solutions for Multilingual Speech Retrieval
• ASR frontend
 – Multilingual word/phone speech recognizer
 – Language-independent phone recognizer
• Indexing the audio archive
 – Multilingual index
 – Language-independent index
• Making FST-based queries
 – Multilingual queries
 – Language-independent queries
• Retrieval using queries
 – Same as in language-specific retrieval
Audio Retrieval
What's the "optimal" model set?
• Language-specific phones/words
• Language-independent phones
• General sub-word units
• Size of inventory: using only the frequent symbols is better
• Data-driven: selected by a clustering tree
Automatic Speech Retrieval in an Unknown Language
FSM-based fuzzy matching and retrieval
[System diagram: the same indexing/retrieval pipeline as before, applied to speech in an unknown language — audio archive → feature extraction on short-time windows → spectral features → speech recognition (with word, triphone, phone, or other acoustic units) → lattices → FSA generation → FSM indices → group indices; a query in the unknown language is turned into an FSM-based query via empirical or knowledge-based phone confusion, and FSM-based retrieval produces the results. Open question: how should the unknown-language speech be represented?]
Automatic Speech Retrieval in an Unknown Language
• Representing unknown-language speech with existing models is modeled as a special case of the cognitive process called assimilation; accommodation (introducing new models) is the open alternative.
• Speech in an unknown language is represented with language-dependent (LD) phones (English phones, Spanish phones, Russian phones, …); the LD models are grouped into language-independent (LID) phone clusters based on pairwise KL divergence, and the speech is then represented as LID phone lattices.
Multilingual Subword Units in an Unknown Language
• Language-dependent versions of the same IPA symbols can end up:
 – in one cluster, e.g., /z/ or /t∫/
 – in different clusters, e.g., /j/ or /I/
• Different IPAs may be similar
[Table: phone recognition and retrieval results (audio query / IPA-sequence query) with Croatian in the training data vs. Croatian treated as an unknown language]
• Clustering gives a more compact phone set, and performs better
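The KL-divergence-based clustering of language-dependent phones can be sketched with single diagonal Gaussians standing in for the real phone models; the model names and parameters below are illustrative.

```python
import numpy as np
from itertools import combinations

# Merge criterion sketch: find the pair of language-dependent phone models with
# the smallest symmetric KL divergence (diagonal-Gaussian stand-ins).
def sym_kl_diag(m1, v1, m2, v2):
    def kl(ma, va, mb, vb):
        # KL(N(ma, diag(va)) || N(mb, diag(vb))) for diagonal covariances
        return 0.5 * np.sum(np.log(vb / va) + (va + (ma - mb) ** 2) / vb - 1.0)
    return kl(m1, v1, m2, v2) + kl(m2, v2, m1, v1)

def closest_pair(models):
    """models: dict name -> (mean, var) arrays; returns the most similar pair."""
    return min(combinations(models, 2),
               key=lambda p: sym_kl_diag(*models[p[0]], *models[p[1]]))
```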
• Thank you.