Shallow semantic parsing:
Making the most of limited
training data
Katrin Erk
Sebastian Pado
Saarland University
• Frame semantics:
– “Who does what to whom” analysis:
senses and roles
– Cross-lingual appeal (Boas 2005)
• Prerequisite for use in NLP:
Automatic, robust, accurate methods for
analysis of free text
• Predominant machine learning paradigm:
Supervised classification
– Learn relation between features and classes from
training corpus; guess classes in test corpus
– Gildea and Jurafsky (2002) and many since
Frame-semantic analysis
• Step 1: Frame disambiguation
– WSD-style classification of predicate in
terms of frames
• Step 2: Role assignment
– Classification of nodes in terms of role
Frame-semantic analysis
Creeping in its shadow I reached a point whence I
could look straight through the uncurtained window.
(A. Conan Doyle, The Hound of the Baskervilles)
Problems of supervised
learning setting
• Coverage:
– lemmas may be missing
– frames may be missing
• Languages other than English:
– Training data may not be available
– Can we take advantage of existing
resources for English?
Today’s talk
• Shalmaneser: a system for automatic
frame-semantic analysis
• Unknown sense detection: dealing with
missing frames
• Annotation projection for cross-lingual
data creation
• Summary
Shalmaneser: Automatic
frame-semantic analysis
• Assignment of
– senses (frames) to predicates
– semantic roles
• Aim: easy use, for exploring
applications of frame-semantic analysis
– Input: plain text
– Syntactic
– Visualization with
SALTO tool
Shalmaneser: Automatic
frame-semantic analysis
• Semantic analysis as supervised learning tasks
– Pre-trained classifiers available for English
(FrameNet) and German (SALSA)
• Performance of English models:
– Frame assignment: accuracy 0.93, baseline 0.89
• High baseline because some senses are missing
– Role assignment:
• Role recognition F-score 0.75
• Role labeling Accuracy 0.78
– Not top-scoring, but okay.
Focus on ease of use and on flexibility.
Shalmaneser: Flexibility
• Processing steps linked only by interface
format: Salsa/Tiger XML (Erk & Pado 04)
– Adding a module: just needs to speak
Salsa/Tiger XML
• Model features specified in experiment file,
can be changed easily
• Adding new parser by instantiating an
interface class
• New language: only syntactic preprocessing
Detecting unknown
word senses (frames)
• Unseen senses → normal WSD approach will
assign wrong sense
• Automatically detect senses we haven’t seen before?
Conan Doyle,
The Hound of the Baskervilles.
Syntax: Collins parser
Semantics: Shalmaneser
Unknown sense detection
as outlier detection
• Outlier detection: detect occurrences of
previously unseen events
(overview articles: Markou & Singh 2003a,b)
– training data: positive cases only.
Derive model of “normal” cases
– test data: positive and negative cases
[Figure: training items and test items]
A Nearest Neighbor-based
outlier detection method
• Tax and Duin (2000): simple method,
easy to implement
• Given test point x and its nearest
training neighbor t: Is x closer to t than
t is to its own nearest neighbor?
– Test point x, nearest training neighbor t, nearest
neighbor t’ of t, (Euclidean) distances d:
pNN(x) = d(x, t) / d(t, t’)
Accept x if pNN(x) is below a given threshold
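The nearest-neighbor test above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the talk: the function names, the tuple-based feature vectors, and the default threshold of 1.0 are assumptions made for the example.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest(point, candidates):
    """Return (distance, index) of the candidate closest to point."""
    return min((euclidean(point, c), i) for i, c in enumerate(candidates))

def p_nn(x, training):
    """Ratio d(x, t) / d(t, t'): t is x's nearest training neighbor,
    t' is t's nearest neighbor among the remaining training items."""
    d_xt, t_idx = nearest(x, training)
    t = training[t_idx]
    others = training[:t_idx] + training[t_idx + 1:]
    d_tt, _ = nearest(t, others)
    if d_tt == 0.0:
        # Assumption: duplicate training points make the ratio undefined;
        # treat an exact match as normal and anything else as an outlier.
        return 0.0 if d_xt == 0.0 else math.inf
    return d_xt / d_tt

def is_known(x, training, threshold=1.0):
    """Accept x as a 'normal' case if p_nn is below the threshold."""
    return p_nn(x, training) < threshold

# Toy data (hypothetical): three training items in a 2-d feature space.
training = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
print(is_known((0.5, 0.0), training))  # close to the training data
print(is_known((5.0, 5.0), training))  # far from the training data
```

The threshold trades precision against recall for the label "unknown": a lower threshold flags more test items as outliers.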
Unknown sense detection:
• Evaluation (Erk NAACL 2006):
– Use FrameNet data
– Treat one sense of a lemma as pseudo-unknown
(iterate over all senses)
• Results (assignment of label “unknown”):
– Tax&Duin’s method, one lemma at a time:
Prec 0.70, Rec 0.35
– More data: all data for a frame,
not just that of one lemma
Prec 0.77, Rec 0.82
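The pseudo-unknown setup and the precision/recall figures for the label “unknown” can be illustrated as follows. This is a sketch under assumptions: the helper names and the toy sense labels are hypothetical, and the split of known-sense items into training and test data is left out.

```python
def relabel(senses, held_out):
    """Map gold senses to 'unknown' (the held-out sense) or 'known'."""
    return ["unknown" if s == held_out else "known" for s in senses]

def precision_recall(gold, predicted, label="unknown"):
    """Precision and recall for one label over parallel label lists."""
    true_pos = sum(1 for g, p in zip(gold, predicted) if g == label and p == label)
    predicted_pos = sum(1 for p in predicted if p == label)
    gold_pos = sum(1 for g in gold if g == label)
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    recall = true_pos / gold_pos if gold_pos else 0.0
    return precision, recall

# Toy test items for one lemma; "Arriving" plays the pseudo-unknown sense.
gold = relabel(["Motion", "Motion", "Arriving", "Arriving"], held_out="Arriving")
predicted = ["known", "unknown", "unknown", "known"]
print(precision_recall(gold, predicted))  # (0.5, 0.5)
```

Iterating `held_out` over every sense of a lemma and pooling the counts gives the “iterate over all senses” protocol described above.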
What features are important?
• Best: just context words
• Almost as good: features of 1, 3, 4 together
• Just the subcategorization frame: high precision, low recall
• Subcat frame, plus headwords of arguments: in between 3
and 2, but obviously too sparse
Unknown sense detection as outlier
detection: The bigger picture
• Why assume missing word senses in the
sense inventory and in the training data?
– Growing, unfinished resources, like FrameNet
– Domain-specific senses may be missing from
general-purpose sense inventories
• Outlier detection method presented here:
applicable to any resource that groups words
into senses, e.g. WordNet
• Using outlier detection to detect occurrences
of nonliteral use?
Definitions, Role set: Language-independent
Specific, too
Predicate classes: Language-specific
• For new language, induce:
– Frame-semantic predicate classification
– Corpus with frame-semantic annotation
• Method: Annotation projection in parallel corpus
– Word alignments approximate semantic equivalence
– Corresponding word pairs (predicates)
– Corresponding constituents
• Evaluation: Study on EUROPARL corpus (De/En/Fr)
An idealised example
Peter comes home
Pierre revient à la maison
Frame-semantic classes
• Idea: For each frame, construct list of predicates in
new language occurring aligned to predicates of this
frame => FEEs for new languages
• Main obstacle: Translational divergence
– Corresponding predicates don’t evoke same frame
• Address by shallow, language-independent filtering
(Pado and Lapata AAAI 2005)
– Important: Distributional patterns
• Evaluation: Can obtain predicate classes for German
and French with precision of 65-70%
– Main remaining problem: English polysemy not covered by
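The basic idea of collecting aligned predicates per frame can be sketched as below. Note the assumptions: the function name and data layout are hypothetical, and the simple frequency cutoff `min_count` is only a stand-in for the filtering of Pado and Lapata (AAAI 2005), which this sketch does not reproduce.

```python
from collections import Counter, defaultdict

def induce_frame_lexicon(aligned_pairs, english_frames, min_count=2):
    """Collect target-language predicates aligned to English predicates of a frame.

    aligned_pairs:  iterable of (english_predicate, target_predicate) pairs
                    taken from word alignments in a parallel corpus.
    english_frames: dict mapping an English predicate to the set of frames
                    it can evoke.
    min_count:      assumed cutoff; rarer alignments are discarded.
    """
    counts = defaultdict(Counter)
    for en_pred, tgt_pred in aligned_pairs:
        # A predicate listed under several frames counts toward each of them.
        for frame in english_frames.get(en_pred, ()):
            counts[frame][tgt_pred] += 1
    return {
        frame: sorted(t for t, n in tgt_counts.items() if n >= min_count)
        for frame, tgt_counts in counts.items()
    }

# Toy alignments (hypothetical data).
pairs = [("come", "revenir"), ("come", "revenir"), ("come", "venir"),
         ("arrive", "arriver"), ("arrive", "arriver")]
frames = {"come": {"Arriving"}, "arrive": {"Arriving"}}
print(induce_frame_lexicon(pairs, frames))  # {'Arriving': ['arriver', 'revenir']}
```

Because an ambiguous English predicate contributes to every frame it is listed under, English polysemy is one source of noise in such lists.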
Role annotations (I)
• Idea: For each sentence, transfer semantic
role annotation onto translated sentence
• Obstacle 1: Frame divergence
– Role projection only sensible if frames match
– Good news: In En-De test corpus (Pado and
Lapata HLT/EMNLP 2005), 70% of frames match
• Obstacle 2: Role divergence
– Even if frames are parallel, do roles match?
– Good news: In En-De test corpus, matching
frames show 90% role matches
• Remaining cases mostly elisions (e.g. passive)
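Assuming the frames match, word-based role projection amounts to following alignment links from the role-bearing source tokens. The sketch below uses the idealised example sentence pair; the function name, the role labels “Theme” and “Goal”, and the alignment links are illustrative assumptions.

```python
def project_roles_word_based(source_roles, alignment):
    """Project role spans onto the target sentence via word alignments.

    source_roles: dict mapping a role name to a set of source token indices.
    alignment:    set of (source_index, target_index) links.
    Returns a dict mapping each role to the set of aligned target indices;
    roles without any aligned token are dropped.
    """
    projected = {}
    for role, src_tokens in source_roles.items():
        tgt_tokens = {t for s, t in alignment if s in src_tokens}
        if tgt_tokens:
            projected[role] = tgt_tokens
    return projected

# Source: Peter(0) comes(1) home(2)
# Target: Pierre(0) revient(1) à(2) la(3) maison(4)
source_roles = {"Theme": {0}, "Goal": {2}}
alignment = {(0, 0), (1, 1), (2, 4)}  # "à" and "la" are left unaligned
print(project_roles_word_based(source_roles, alignment))
```

Here the projected “Goal” covers only “maison”, not the full phrase “à la maison”, which illustrates how unaligned words leave word-based projection with incomplete role spans.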
Role annotations (II)
• Obstacle 3: Errors/omissions in automatically induced
word alignments
– Can be overcome by using bracketing information (chunks /
– Induction of cross-lingual correspondences as graph
optimisation problem (Pado and Lapata ACL 2006)
• Evaluation (all exact match F-score):
– Word-based projection: 0.50
– Constituent-based: 0.75
– Upper limit: 0.85
• Remaining errors mostly parsing-related
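To illustrate why bracketing information helps, the sketch below snaps a set of projected tokens to the best-overlapping target constituent and scores the result by exact match. It is a greedy simplification under assumptions (hypothetical function names, Jaccard overlap as the similarity), not the graph optimisation of Pado and Lapata (ACL 2006).

```python
def best_constituent(projected_tokens, constituents):
    """Pick the constituent span that overlaps most with the projected tokens.

    projected_tokens: set of target token indices reached via word alignment.
    constituents:     list of (start, end) inclusive token spans.
    Returns the best span, or None if nothing overlaps.
    """
    def jaccard(span):
        tokens = set(range(span[0], span[1] + 1))
        return len(tokens & projected_tokens) / len(tokens | projected_tokens)

    best = max(constituents, key=jaccard, default=None)
    if best is None or jaccard(best) == 0.0:
        return None
    return best

def exact_match_f(gold, predicted):
    """F-score over sets of (role, span) pairs; a span counts only if identical."""
    correct = len(gold & predicted)
    if correct == 0:
        return 0.0
    precision = correct / len(predicted)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)

# Target: Pierre(0) revient(1) à(2) la(3) maison(4); toy constituent spans.
spans = [(0, 0), (1, 1), (2, 4), (3, 4)]
print(best_constituent({2, 4}, spans))  # (2, 4): the full phrase "à la maison"
```

With both “à” and “maison” aligned, the unaligned “la” is recovered by choosing the enclosing constituent, which word-based projection alone cannot do.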
Summary
• Frame-semantic analysis potentially
interesting for many NLP applications
– Goal of Shalmaneser: flexible and easy-to-use
• Address incompleteness in resources
– Unknown sense detection as outlier detection
• Porting Frame Semantics to new languages
– Parallel corpora for automatic annotation