Large-scale Knowledge Resources
in
Speech and Language Research
Mark Liberman
University of Pennsylvania
LKR2004
3/8/2004
Outline
• Glimpse of LKR in the U.S. landscape
• What is the relationship between
large-scale knowledge resources
and research and development
on speech and language?
• What are some needs and opportunities?
• What are the trends?
• Illustrative examples
Glimpses of the U.S. LKR landscape
• DARPA research areas
– Human Language Technology
– Cognitive Information Processing
• NSF initiatives
– Digital Libraries
– ITR, Human Social Dynamics
– “terascale linguistics”
• Biomedical research:
– text, ontologies, databases, experiments
– collaborations with Japan and Europe
• Language documentation
• Web archives in many disciplines
• ...too many other things to list...
What is the relationship between
large-scale knowledge resources
and research and development
on speech and language?
Speech and language R&D needs LKR
Modeling text: 10^4 to 10^6 words in 1975, 10^9 to 10^12 words today
Modeling speech: 1 to 10 hours in 1975, 10^3 to 10^4 hours today
+ lexicons, parallel text, DBs for entity tracking, etc.
+ a thousand languages and dialects
+ history, social variation, register and genre, ...
Speech and language R&D creates LKR
see above.
but also something entirely new...
Some needs and opportunities
• Standards and tools for LKR
– for creation, improvement, maintenance
– for publication, distribution, archiving
– for search, access and use
• An academic culture
that rewards production and distribution of LKR
– most LKR are a side effect
of individual and small-group research
– virtual “meta-resources” from many sources
• Part of the answer:
integrate LKR into the system of (scientific and
scholarly) publication
Themes and trends
• A New Empiricism
focus on large-scale resources, because
quantity (of data) → quality (of knowledge)
• Language + Life = Meaning
something new emerges from large collections
of symbols, signals, contexts, connections
• People and machines: better together
– cognitive prosthetics
– interactive working, playing and learning
• Failure is the basis for success
if we can measure error, we can learn to improve
Some illustrative examples...
A famous argument
(1) Colorless green ideas sleep furiously.
(2) Furiously sleep ideas green colorless.
“. . . It is fair to assume that neither sentence (1) nor (2)
(nor indeed any part of these sentences) has ever
occurred in an English discourse. Hence, in any
statistical model for grammaticalness, these
sentences will be ruled out on identical grounds as
equally ‘remote’ from English. Yet (1), though
nonsensical, is grammatical, while (2) is not.”
Noam Chomsky, “Syntactic Structures” (1957)
But is it true?
43 years later
• someone finally checked...
– Pereira, “Formal grammar and information theory” (2000)
– simple “aggregate bigram model” using hidden class variables c
– with C=16, trained on ~100MW of newswire data
• the result:
“Furiously sleep ideas green colorless”
is more than 200,000 times less probable than
“Colorless green ideas sleep furiously”
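The slide names the model but does not write it out. As a sketch, assuming the usual formulation of an aggregate bigram (aggregate Markov) model, each word-to-word transition passes through a hidden class c, with C classes in all. The notation below is an assumption for illustration, not taken from the talk:

```latex
P(w_{i+1} \mid w_i) = \sum_{c=1}^{C} P(c \mid w_i)\, P(w_{i+1} \mid c),
\qquad
P(w_1 \dots w_n) = P(w_1) \prod_{i=1}^{n-1} P(w_{i+1} \mid w_i)
```

Because the classes are shared across words, a word pair never seen in training can still receive nonzero probability, so two unseen sentences need not be "equally remote" from English.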
What changed?
• Partly:
– new models and estimation methods
– better computing resources
– more accessible data
• Mostly:
– willingness to look for solutions
– opportunities to apply them
To be fair, this kind of modeling became a real option only about 1980
Now it can be done as an undergraduate term project ...
Social structure from conversation
• Human social dynamics:
model of conversational turn-taking
• U.S. Supreme Court oral arguments
• Modeling is simple and local
– one session modeled at a time (~250 turns)
– data is just sequence of (~250) speaker IDs
• Undergraduate term project in intro course
(credit to: Chris Osborn)
CHIEF JUSTICE WILLIAM H. REHNQUIST: We'll hear argument next in No. 01-298, Paul Lapides v.
the Board of Regents of the University System of Georgia. Spectators are admonished, do not talk
until you get outside the courtroom. The court remains in session. Mr. Bederman.
MR. DAVID J. BEDERMAN: Mr. Chief Justice, and may it please the Court: When a State
affirmatively invokes the jurisdiction of the Federal court by removing a case, that acts as a waiver of
the State's forum immunity to Federal jurisdiction under the Eleventh Amendment. This principle ...
JUSTICE ANTONIN SCALIA: When you say as an actor in any role, does it ever intervene as a
defendant?
MR. BEDERMAN: Yes, Justice Scalia. This Court's precedents seem to indicate that wherever the
State is cast in the role of plaintiff, defendant, intervenor, or claimant, that the entry into the Federal
proceeding submits the State to the jurisdiction of the Federal court.
CHIEF JUSTICE REHNQUIST: How about the Ford Motor Company case?
MR. BEDERMAN: Well, of course, the authorization requirement in Ford Motor -- and that's the
particular holding in Ford Motor that I think is of concern to this Court -- need not be reached here
because, of course, ...
CHIEF JUSTICE REHNQUIST: So, you think a line can be drawn between the State defendant
being drawn in as a respondent or involuntarily as opposed to removing and thereby invoking
Federal jurisdiction.
+ ... 254 turns ...
Two-class “aggregate bigram model”,
trained on a single one-hour argument (01-298),
highest-probability class for each speaker:
class 1 = (
chief justice william h. rehnquist
justice anthony kennedy
justice antonin scalia
justice john paul stevens
justice ruth bader ginsburg
justice sandra day o'connor
justice stephen g. breyer
)
class 2 = (
mr. david j. bederman
mr. irving l. gornstein
ms. devon orland
ms. julie c. parsley
)
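A minimal sketch of how a two-class aggregate bigram model could be fit to a sequence of speaker IDs with EM. This is not the term project's code: the function name `train_aggregate_bigram`, the random initialization, the iteration count, and the toy speaker labels are all assumptions made for illustration.

```python
import math
import random
from collections import Counter


def train_aggregate_bigram(seq, n_classes=2, n_iter=50, seed=0):
    """Fit P(next | prev) = sum_c P(c | prev) * P(next | c) by EM (illustrative)."""
    rng = random.Random(seed)
    symbols = sorted(set(seq))
    pairs = Counter(zip(seq, seq[1:]))  # counts of adjacent (prev, next) speakers

    def normalized(weights):
        total = sum(weights)
        return [w / total for w in weights]

    # to_class[s][c] = P(c | s); emit[c][s2] = P(s2 | c); random positive start.
    to_class = {s: normalized([rng.random() + 0.5 for _ in range(n_classes)])
                for s in symbols}
    emit = [dict(zip(symbols, normalized([rng.random() + 0.5 for _ in symbols])))
            for _ in range(n_classes)]

    history = []  # log-likelihood measured before each parameter update
    for _ in range(n_iter):
        class_counts = {s: [0.0] * n_classes for s in symbols}
        emit_counts = [{s: 0.0 for s in symbols} for _ in range(n_classes)]
        loglik = 0.0
        # E-step: posterior over the hidden class for every observed pair.
        for (s1, s2), n in pairs.items():
            joint = [to_class[s1][c] * emit[c][s2] for c in range(n_classes)]
            p = sum(joint)
            loglik += n * math.log(p)
            for c in range(n_classes):
                post = joint[c] / p
                class_counts[s1][c] += n * post
                emit_counts[c][s2] += n * post
        history.append(loglik)
        # M-step: renormalize expected counts (skip rows with no evidence).
        for s in symbols:
            if sum(class_counts[s]) > 0:
                to_class[s] = normalized(class_counts[s])
        for c in range(n_classes):
            total = sum(emit_counts[c].values())
            if total > 0:
                emit[c] = {s: v / total for s, v in emit_counts[c].items()}
    return to_class, emit, history


# Hypothetical toy session: justices and counsel alternate turns.
toy_rng = random.Random(1)
judges = ["justice_a", "justice_b", "justice_c"]
lawyers = ["counsel_x", "counsel_y"]
toy_turns = [toy_rng.choice(judges if i % 2 == 0 else lawyers) for i in range(250)]

to_class, emit, history = train_aggregate_bigram(toy_turns)
for speaker, probs in to_class.items():
    best = max(range(len(probs)), key=probs.__getitem__)
    print(speaker, "-> class", best + 1)
```

The highest-probability class for a speaker depends only on who tends to speak next, which is why such a model can separate roles without seeing any words.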
So human social roles can emerge
from a trivial statistical model of speaker sequencing
in a formal setting.
And sometimes you don’t need a lot of data.
...though in this case,
it was crucial that Jerry Goldman’s Oyez Project
is publishing all Supreme Court oral arguments
(audio and transcripts)
In most cases the quantity of data is crucial:
Data quantity → knowledge quality
... and available resources
are just starting to pass a threshold
A case where size matters...
• English complex nominals:
sequence of nouns and adjectives, e.g.
Volume Feeding Management Success Formula Award
• Part-of-speech string offers little help in parsing:
    [ stone [ traffic barrier ]]     N N N
    [[ job growth ] statistics ]     N N N
• Apparently, parsing requires “understanding”
The MEDLINE corpus
• U.S. National Library of Medicine
• ~12 million references and abstracts
– biomedical journal articles
– 1966 to present
• ~10^9 words
Parsing by counting (in MEDLINE)
Pattern  Example                            w1 w2    w2 w3
[NN]N    sickle cell anemia                 10561    2422
N[NN]    rat bile duct                      203      22366
[NA]N    information theoretic criterion    112      5
N[AN]    monkey temporal lobe               16       10154
[AN]N    giant cell tumour                  7272     1345
A[NN]    cellular drug transport            262      746
[AA]N    small intestinal activity          8723     120
A[AN]    inadequate topical cooling         4        195
Parsing by counting (google hits)
Parse        Example                  w1 w2      w2 w3
[N [N N]]    stone traffic barrier    338        7,010
[[N N] N]    job growth statistics    349,000    11,600
First attempt at this idea: for AT&T TTS in 1987
First real success: ~15 years later
The difference: It doesn’t really work with 10^7 to 10^8 tokens
It works pretty well with 10^9 to 10^12 tokens
“You can observe a lot just by watching.”
-Yogi Berra
here... “You can analyze a lot just by counting.”
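The counting idea can be sketched in a few lines. This assumes the two numbers on each row of the slides are corpus counts of the first and the second adjacent word pair, and that the parse simply groups the more frequent pair; the function name `bracket` and the tie handling are illustrative choices, not the method used in the original work.

```python
def bracket(words, pair_count):
    """Bracket a three-word nominal by comparing adjacent-pair counts.

    Returns None on a tie, where counting alone gives no answer.
    """
    w1, w2, w3 = words
    left = pair_count.get((w1, w2), 0)   # evidence for [[w1 w2] w3]
    right = pair_count.get((w2, w3), 0)  # evidence for [w1 [w2 w3]]
    if left > right:
        return f"[[{w1} {w2}] {w3}]"
    if right > left:
        return f"[{w1} [{w2} {w3}]]"
    return None


# Pair counts copied from the slides (MEDLINE and Google rows).
counts = {
    ("sickle", "cell"): 10561, ("cell", "anemia"): 2422,
    ("rat", "bile"): 203, ("bile", "duct"): 22366,
    ("stone", "traffic"): 338, ("traffic", "barrier"): 7010,
    ("job", "growth"): 349000, ("growth", "statistics"): 11600,
}

for phrase in ["sickle cell anemia", "rat bile duct",
               "stone traffic barrier", "job growth statistics"]:
    print(bracket(tuple(phrase.split()), counts))
```

With small corpora most pair counts are zero or tied, which is one way to see why the approach needs very large token counts.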
As the SCOTUS example suggests,
“large-scale” is not just the number of words or hours.
Structure, context and external relationships
can also be crucial –
here it was the sequence of speaker identities.
Here’s a simple but compelling example
of how symbol-like structure emerges
as zebra finches practice a song...
This is research by Ofer Tchernichovski (CCNY),
Partha Mitra and others
Zebra finch song learning
Ofer Tchernichovski (CCNY)
[Figure: song spectrogram; axes: Time (ms), 0 to 700, and Frequency (Hz)]
Song motifs vary across individuals
Song imitation –
young birds imitate adults
Tutor’s song
Pupil’s song
Song imitation
* Can be very accurate
* Critical period – developmental learning
* Song template – memory traces of a model
* Learning requires auditory feedback
[Timeline figure: sensory phase and sensory-motor phase along an Age (days) axis, 0 to 100]
Days 35 / 43 / 60:
Start training
Initially:
Social & acoustic isolation
The training system
Laboratory of Animal Behavior, CCNY
Real-time calculation of acoustic features
4 simple acoustic features with articulatory correlates:
– Pitch: low (-) to high (+)
– FM: low (-) to high (+)
– Wiener entropy: pure tone (-) to noise (+)
– Spectral continuity: low (-) to high (+)
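As an illustration of one of these features, here is a minimal sketch of Wiener entropy, assuming the common definition as the log of the ratio of the geometric mean to the arithmetic mean of the power spectrum. The slides do not say how the training system estimates the spectrum, so the input here is simply a list of positive power values, and the function name is hypothetical.

```python
import math


def wiener_entropy(power_spectrum):
    """Log of (geometric mean / arithmetic mean) of a power spectrum.

    0 for a perfectly flat (noise-like) spectrum; increasingly negative
    as energy concentrates in few bins (tone-like).
    """
    if not power_spectrum or any(p <= 0 for p in power_spectrum):
        raise ValueError("power values must be positive and non-empty")
    n = len(power_spectrum)
    log_geometric_mean = sum(math.log(p) for p in power_spectrum) / n
    arithmetic_mean = sum(power_spectrum) / n
    return log_geometric_mean - math.log(arithmetic_mean)


flat = [1.0] * 8              # noise-like: equal power in every bin
peaked = [1e-6] * 7 + [1.0]   # tone-like: nearly all power in one bin
print(wiener_entropy(flat), wiener_entropy(peaked))
```

Under this definition the value is never positive, which matches the negative Mean Entropy values in the deck's data tables.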
The training system
Song recognition
Song analysis
Database table
Start on  Duration  Mean Amp     Mean Pitch   Mean Entropy  Mean FM      Mean Continuity
10972     62        0.10109444   2110.150879  -2.650181532  46.28370285  0.830607355
11042     44        0.221805096  2779.580322  -3.222234249  60.9871254   0.79437232
11136     53        0.203947186  878.0430298  -1.2962991    46.85206223  0.485266626
11465     53        0.14567025   811.8573608  -1.186548352  41.14878082  0.42596662
11521     65        0.139529422  868.633667   -1.330822468  42.92938232  0.542328238
12355     81        0.536730945  982.7991333  -2.679917574  37.7701149   0.523121655
13481     55        0.185585603  733.9207764  -2.271656036  39.42351151  0.816531181
13669     72        0.342740119  772.1679077  -2.455365419  30.38383102  0.765049458
14466     53        0.276962578  699.7897949  -2.140806913  40.342556    0.822018743
14612     55        0.143629089  769.4672852  -1.626844049  34.90858841  0.711382151
16304     47        0.078976907  1122.309326  -1.729982138  48.15994644  0.823718846
16454     76        0.216472968  769.9150391  -2.356431723  39.29466629  0.794104338
16571     54        0.52569139   687.6394043  -1.956387162  37.81315613  0.616944551
17000     58        0.135118335  864.5578613  -2.363121986  31.00643349  0.858065724
17189     51        0.124977574  752.3527222  -1.94250226   36.36558151  0.691144586
17761     58        0.144002378  1021.027527  -2.258356094  40.53672409  0.708231866
17873     47        0.066938281  1339.068604  -1.668018103  46.29984665  0.69986397
18051     38        0.066276349  1847.560913  -2.551876307  38.55633545  0.805839062
18092     81        0.200010121  2080.408936  -3.075473547  50.34065247  0.776402116
18219     66        0.335276693  858.1080933  -1.750756502  46.40740204  0.511499882
18536     69        0.261755675  890.3964233  -1.860459447  42.50422668  0.500995994
19446     46        0.15915972   993.3217773  -1.601477981  43.11263275  0.527124286
20405     51        0.193706796  800.2883911  -1.413753867  41.22149277  0.428571522
20644     65        0.24410592   802.0982666  -1.589150429  39.50386429  0.429761887
20729     61        0.166723967  901.6841431  -1.771348119  47.49161148  0.556119919
20847     51        0.198818251  852.6430664  -1.053611994  48.11198425  0.44106108
23287     68        0.178408563  784.8914185  -2.134843588  41.99195862  0.656920671
24243     70        0.185866207  990.8589478  -2.562700748  39.49663925  0.763919473
Dynamic Vocal Development maps
Duration  Mean Pitch   Mean Entropy  Mean FM
66        802.5073242  -2.626851082  33.58778763
66        704.6381836  -2.524046659  27.59897423
53        812.2409058  -1.880394816  45.26642609
62        744.0402222  -2.562429667  34.36729431
76        1212.450928  -2.24555397   48.8947258
121       663.1687012  -2.535212278  20.65950394
61        719.1973877  -2.427448273  29.89187622
65        1119.903198  -2.556747913  45.04622269
92        980.5782471  -2.776203156  29.98022079
50        1089.148315  -2.479059219  29.93981934
70        811.1593628  -2.734509706  27.13637352
[Figure: scatter plot of Mean FM (0 to 90) against Duration (0 to 500)]
Dynamic Vocal Development (DVD) Map
of a single bird
[Figure: scatter plots of Mean FM (0 to 90) against Duration (0 to 500), one panel each for Day 35, Day 45, Day 55, Day 65, Day 75 and Day 85; annotations: "Onset of training", "Development"]
Language + Life = Meaning
• Text (and speech) structured by:
– conversational context
• time, place, sequence, participants, ...
– content
• types and identities of referenced entities
• explicit links (anaphora, references, hyperlinks)
• implicit links (quotation, imitation, opposition)
– other contextual data
• e.g. neurological, gene expression data in birdsong learning
• gaze, gesture, posture, physiological data in conversation
A small application:
real conversational transcription
• Perfect automatic speech-to-text (STT) yields:
ew very nice yes that’s that’s the ah first car uh well my first
ownership of something major that’s cool i had to buy my
car my other car burned down so it was my first brand new
car uh-huh but i love it so i am very happy
• STT + “metadata” yields “Rich Transcription”:
Speaker 1: Very nice.
Speaker 2: Yes. That’s my first ownership of something major.
Speaker 1: That’s cool. I had to buy my car. My other car burned down. It was my first brand new car.
Speaker 2: Uh-huh.
Speaker 1: But I love it. I am very happy.
One aspect of conversational metadata:
Diarization
Goal: Label acoustic “sources” and their attributes
– speakers, music, noise, DTMF, background events
Source | Attributes
Channel A
Speaker 1 | M
Speaker 2 | F
Music
Channel B
DTMF
Speaker 3 | M
DTMF
Noise | High
[Timeline figure: sources plotted against Time, 5.0 to 35.0]
Interactive annotation
• Supervised learning:
human annotates, machine learns
• Unsupervised learning:
machine looks for structure in raw data
• Semi-supervised learning:
human annotates a few examples,
machine tries to generalize
• “Active learning”:
machine selects cases
that are interesting or uncertain,
asks for human judgments
• Sampling experiments
human checks machine annotation of selected cases,
apply sample confusion matrix to estimate overall statistics
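The last bullet can be made concrete with a small sketch: a human checks a sample of machine-labeled items, and the per-label error rates from that sample are applied to the machine's totals over the whole collection. The function name, the label names and all numbers below are hypothetical, and the sample is assumed to be drawn at random within each machine label.

```python
def estimate_true_totals(machine_totals, sample_confusion):
    """Estimate how many items truly carry each label.

    machine_totals: {machine_label: number of items the machine gave that label}
    sample_confusion: {machine_label: {human_label: count in the checked sample}}
    """
    estimates = {}
    for machine_label, n_items in machine_totals.items():
        row = sample_confusion[machine_label]
        checked = sum(row.values())
        if checked == 0:
            raise ValueError(f"no checked items for label {machine_label!r}")
        for human_label, n_checked in row.items():
            # Share of this machine label that the human judged as human_label.
            share = n_checked / checked
            estimates[human_label] = estimates.get(human_label, 0.0) + n_items * share
    return estimates


# Hypothetical tagger output: 900 tokens tagged NN, 100 tagged VB.
machine_totals = {"NN": 900, "VB": 100}
# Hypothetical human check of 50 NN-tagged and 10 VB-tagged tokens.
sample_confusion = {
    "NN": {"NN": 45, "VB": 5},
    "VB": {"NN": 2, "VB": 8},
}
print(estimate_true_totals(machine_totals, sample_confusion))
```

The estimate is only as good as the sample: with few checked items per label, the per-label shares are noisy.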
The cycle of interactive annotation
[Cycle diagram with stages: Hand Annotation, Hand Correction, Automatic annotation, Machine Learning, (Selective) Sampling/Labeling]
POS tagger
trained on WSJ
applied to MEDLINE:
Same tagger,
after retraining...
(~200 MEDLINE abstracts):
The key to success:
learn to measure failure...
Even a badly flawed measure can produce important gains.
Arabic to English
[Chart: Percent of Human (axis 50% to 100%) in 2002 and 2003 for the Best Research System and the Best COTS System; value labels on the chart: 89%, 58%, 57%, 51%]
One year of quantitative evaluation...
Scoring Method
Percent of Human = (Machine Translation Score / Human Translation Score) x 100
Translation Score = weighted sum of n-gram matches between
the translation being scored (human or machine)
and three good reference translations
Reference translation:
The U.S. island of Guam is maintaining a high state of alert after the Guam
airport and its offices both received an e-mail from someone calling himself
the Saudi Arabian Osama bin Laden and threatening a biological/chemical
attack against public places such as the airport .
Legend: uni-gram match, bi-gram match, tri-gram match
Machine translation:
The American [?] international airport and its the office all receives one calls
self the sand Arab rich business [?] and so on electronic mail , which sends
out ; The threat will be able after public place and so on the airport to start the
biochemistry attack , [?] highly alerts after the maintenance.
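A minimal sketch of a score in the spirit of this slide: a weighted sum of uni-, bi- and tri-gram matches against reference translations, and the percent-of-human ratio built on it. The slide does not give the weights or matching details, so equal weights, lowercased whitespace tokens and clipped matching are assumptions here, and the function names are hypothetical. This is not the official evaluation metric.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def translation_score(candidate, references, max_n=3, weights=None):
    """Weighted sum of n-gram match rates (n = 1..max_n) against references."""
    weights = weights or [1.0 / max_n] * max_n  # assumption: equal weights
    cand = candidate.lower().split()
    refs = [r.lower().split() for r in references]
    score = 0.0
    for n, weight in zip(range(1, max_n + 1), weights):
        cand_counts = ngrams(cand, n)
        total = sum(cand_counts.values())
        if total == 0:
            continue  # candidate too short to contain any n-gram of this order
        # An n-gram may be credited at most as often as it occurs
        # in the single reference where it is most frequent.
        max_ref = Counter()
        for ref in refs:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        matched = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
        score += weight * matched / total
    return score


def percent_of_human(machine, human, references):
    """Machine score as a percentage of the human translator's score."""
    human_score = translation_score(human, references)
    if human_score == 0:
        raise ValueError("human translation shares no n-grams with the references")
    return 100.0 * translation_score(machine, references) / human_score


# Toy usage with made-up sentences (not from the evaluation data).
references = ["the cat sat on the mat"]
print(percent_of_human("the cat sat on mat", "the cat sat on the mat", references))
```

Dividing by the human score normalizes for the fact that even a good human translation does not match the references word for word.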
Best System Outputs
2002
insistent Wednesday may recurred
her trips to Libya tomorrow for flying
Cairo 6-4 ( AFP ) - an official
announced today in the Egyptian
lines company for flying Tuesday is
a company " insistent for flying "
may resumed a consideration of a
day Wednesday tomorrow her trips
to Libya of Security Council decision
trace international the imposed ban
comment .
And said the official " the institution
sent a speech to Ministry of Foreign
Affairs of lifting on Libya air , a
situation her receiving replying are
so a trip will pull to Libya a morning
Wednesday " .
Certain are " the lines is air Libyan I
will start also in of three trips running
weekly to Cairo in the coordination
with Egypt for flying " .
2003
Egyptair Has Tomorrow to Resume
Its Flights to Libya
Cairo 4-6 (AFP) - said an official at
the Egyptian Aviation Company
today that the company egyptair
may resume as of tomorrow,
Wednesday its flights to Libya after
the International Security Council
resolution to the suspension of the
embargo imposed on Libya.
" The official said that the company
had sent a letter to the Ministry of
Foreign Affairs, information on the
lifting of the air embargo on Libya,
where it had received a response,
the first take off a trip to Libya on
Wednesday morning ".
The Libyan Arab Airways will also in
the conduct of the three times a
week in Cairo in coordination with
egyptair ".
Human v. Machine
Human
Egypt Air May Resume its Flights to
Libya Tomorrow
Cairo, April 6 (AFP) - An Egypt Air
official announced, on Tuesday, that
Egypt Air will resume its flights to
Libya as of tomorrow, Wednesday,
after the UN Security Council had
announced the suspension of the
embargo imposed on Libya.
The official said that, "the company
sent a letter to the Ministry of
Foreign Affairs to inquire about the
lifting of the air embargo on Libya,
and in the event that it receives a
response, then the first flight to
Libya, will take off, Wednesday
morning."
He stressed that "the Libyan Airlines
will begin scheduling three weekly
flights to Cairo, in coordination with
Egypt air."
2003
Egyptair Has Tomorrow to Resume
Its Flights to Libya
Cairo 4-6 (AFP) - said an official at
the Egyptian Aviation Company
today that the company egyptair
may resume as of tomorrow,
Wednesday its flights to Libya after
the International Security Council
resolution to the suspension of the
embargo imposed on Libya.
" The official said that the company
had sent a letter to the Ministry of
Foreign Affairs, information on the
lifting of the air embargo on Libya,
where it had received a response,
the first take off a trip to Libya on
Wednesday morning ".
The Libyan Arab Airways will also in
the conduct of the three times a
week in Cairo in coordination with
egyptair ".
Summary
• Speech and Language Research
– needs LKR
– creates LKR
– can help other disciplines deal with LKR
– is helped by other disciplines, who provide
• raw data as well as relevant LKR pieces
• problems, algorithms, inspiration
• The whole is greater than the sum of the parts
– Types, sources and amounts of data
– Collaboration within and across disciplines
– Cooperation of humans and machines