Extraction of Ontological
Information from Corpora
(and Lexicon)
Dimitrios Kokkinakis
[email protected]
Maria Toporowska Gronostaj
[email protected]
Oslo, 14-16 Sep 2003
Outline

Goals & Observations, Resources

Related Research

Extending the Coverage of Semantic Resources
(S-SIMPLE: Quality but not Quantity)
– Why and How?

Key Issues Investigated for the Acquisition
– Compounding vs. Syntactic Parsing & Large Corpora vs. Defining Lexicons
– Pilot study regarding lexico-syntactic patterns

Enhancement
– What has been achieved?

Error Analysis
– For parts of the studies…

Conclusions & Future Plans
Goals

Extend & enrich the coverage of the Swedish semantic
lexicon:
– as automatically as possible
– as inexpensively as possible (using whatever support was available)
– re-using lexical resources (not necessarily semantic)

Test ideas regarding:
– context similarity
– similarity in NPs of enumerative type (+ evaluation) - breadth
– the power of compounds - breadth
– bootstrapping the SIMPLE content
– using lexico-syntactic patterns for hyper/hypo relations - depth
– (statistical means)
research conducted 00-01
Observations
& Hypotheses






Observation-1:
Take into account the compounding characteristic of Swedish
– + easier to identify (cmp to English-at least in raw text)
– - harder to segment/analyse (cmp to English)
– + a lot of disambiguated compounds in our lexical DB
Observation-2:
Yet another view of context similarity (see Related Research)
Members of a semantic group are often surrounded by other
members of the same group in text; in other words: words
entering into the same syntagmatic relation with other words
can be perceived as semantically similar
Observation-3:
Apply lexico-syntactic patterns à la Hearst for more complex relations
(pilot…) – why? because during the previous 2 steps (see later
discussion) we mainly extract synonymic/co-hyponymic entries
Resources
 Core SIMPLE lexicon
– 10,000 semantic units (6,000 words)
– a vital part of each entry's semantic unit is the notion
of semantic class, whose value is an element in a hierarchically
structured list of 95 semantic classes (LexiQuest)
– content: high quality; manually compiled and verified, but…
– limited vocabulary - quantitatively insufficient for HLT
 Gothenburg Lexical DataBase (GLDB)
– ca 70,000 lexical entries
– monolingual defining lexicon – for human readers (but + RDB-format)
– advantage (particularly for this study): a number of synonymic
compounds
 Corpora
– ca 40 mil. tokens (syntactically analysed)
Related Research
(1)




context similarity plays an important role in word
acquisition
… so, common characteristic of most approaches
is the computation of the semantic similarity
between two words on the basis of the extent to
which words' average contexts of use overlap
usual assumption: members of the same semantic
group co-occur in discourse [cf. Riloff & Shepherd, 97]
use of syntax for generating semantic knowledge
based on distributional evidence & syntagmatic
relations is found in most previous research
Related Research
(2)

Approaches in general – steps:
– Extract word co-occurrences (most crucial part)
usually gathered based on certain relations, e.g. predicate-argument
modifier-modified, adjacency,…
– Define similarities between words on the basis
of co-occurrences (+linguistic knowledge)
combine existing linguistic knowledge (seed lex.) & co-occur. data
for compensating the sparseness of the co-occ. data
– Cluster words on the basis of similarities
e.g. by using the contexts of the words as features and group
together the words that tend to appear in similar context
Related Research
(3a)

Hearst (1992): lexico-syntactic patterns – discovered by
observation - for extracting hyponymy relations from corpora
– e.g. NP {,NP}* {,} and other NP
temples, treasuries and other important civic buildings

Grefenstette (1994): extract corpus-specific semantics in
parsed text using the (weighted) Jaccard measure (between two objects m and n:
the number of shared attributes divided by the number of attributes
in the unique union of the sets of attributes for each object), e.g.
comparing ‘dog’ & ‘cat’ via textually derived attributes and the binary
Jaccard measure
dog/pet-DOBJ dog/eat-SBJ dog/brown dog/shaggy dog/leash
cat/pet-DOBJ cat/pet-DOBJ cat/hairy cat/leash
count({attribs shared by cat and dog}) / count({uniq attribs possessed by cat or dog})
= |{leash, pet-DOBJ}| / |{brown, eat, hairy, leash, pet-DOBJ, shaggy}| = 2/6 = 0,333
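The dog/cat comparison can be reproduced with a few lines of Python. This is only a sketch of the binary Jaccard measure as described on the slide; the function name `jaccard` is illustrative, and the weighted variant is not covered.

```python
def jaccard(attrs_a, attrs_b):
    """Binary Jaccard: shared attributes divided by the union of attributes."""
    a, b = set(attrs_a), set(attrs_b)   # duplicates count once (binary measure)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Attribute lists taken from the slide's example.
dog = ["pet-DOBJ", "eat-SBJ", "brown", "shaggy", "leash"]
cat = ["pet-DOBJ", "pet-DOBJ", "hairy", "leash"]

score = jaccard(dog, cat)
print(round(score, 3))  # 0.333 (2 shared attributes out of 6 unique ones)
```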
Related Research
(3b)
Lin (1998): constructing a thesaurus using syntactically parsed corpora
containing dependency triples: ||word1, relation, word2|| = frequency; the word
similarity measure is defined based on the distributional pattern of words
(“the similarity between 2 objects is defined to be the amount of
information contained in the commonality between the objects divided
by the amount of information in the descriptions of the objects”)
e.g.: ||cell, pobj-of, inside|| = 16 (dependency triple = 2 words + gram. relation)
I(w,r,w’) is the amount of info in ||w,r,w’||:
I(w,r,w') = \log \frac{\|w,r,w'\| \times \|*,r,*\|}{\|w,r,*\| \times \|*,r,w'\|}
similarity between 2 words (w1,w2) is based on:
sim(w_1,w_2) = \frac{\sum_{(r,w) \in T(w_1) \cap T(w_2)} \left( I(w_1,r,w) + I(w_2,r,w) \right)}{\sum_{(r,w) \in T(w_1)} I(w_1,r,w) + \sum_{(r,w) \in T(w_2)} I(w_2,r,w)}
Roark & Charniak (1998): noun-phrase co-occurrence statistics (actually
bigrams ranked by log-likelihood) for semi-automatic semantic lexicon
construction; input is a parsed corpus and initial seed words (= the
most frequent head nouns in a corpus [top 200-500]) – based on conjunctions
(cars and trucks), lists (planes, trains and automobiles),
appositives and noun compounds (pickup truck)
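Lin's information and similarity measures can be sketched as follows. The slide does not define T(w); the sketch assumes it is the set of (relation, word) pairs whose information value for w is positive. The function names and the toy counts are invented for illustration.

```python
import math
from collections import Counter

def information(triples):
    """Map each word w to {(r, w2): I(w, r, w2)}, keeping positive values only."""
    rel_total, head_total, tail_total = Counter(), Counter(), Counter()
    for (w, r, w2), freq in triples.items():
        rel_total[r] += freq            # ||*, r, *||
        head_total[(w, r)] += freq      # ||w, r, *||
        tail_total[(r, w2)] += freq     # ||*, r, w'||
    info = {}
    for (w, r, w2), freq in triples.items():
        value = math.log(freq * rel_total[r]
                         / (head_total[(w, r)] * tail_total[(r, w2)]))
        if value > 0:                   # assumed definition of T(w)
            info.setdefault(w, {})[(r, w2)] = value
    return info

def lin_similarity(info, w1, w2):
    """Information in the shared features divided by the information in all features."""
    t1, t2 = info.get(w1, {}), info.get(w2, {})
    shared = t1.keys() & t2.keys()
    numerator = sum(t1[f] + t2[f] for f in shared)
    denominator = sum(t1.values()) + sum(t2.values())
    return numerator / denominator if denominator else 0.0

# Invented toy counts, for illustration only.
triples = Counter({
    ("cat", "subj-of", "eat"): 4, ("cat", "subj-of", "purr"): 2,
    ("dog", "subj-of", "eat"): 4, ("dog", "subj-of", "bark"): 2,
    ("car", "subj-of", "drive"): 6,
})
info = information(triples)
print(lin_similarity(info, "cat", "dog"))  # share one feature
print(lin_similarity(info, "cat", "car"))  # share no features
```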
Related Research
(3c)


Tokunaga et al. (1997): new words (nouns) are classified
on the basis of relative probabilities of a word belonging to
a given word class, with the probabilities calculated using
noun-verb co-occurrence pairs (Japanese + BGH thesaurus)
– algo. originally developed for document categorization –
each noun is represented by a set of co-occurring verbs
Lin & Pantel (2002): each word is represented by a feature
vector; each feature corresponds to a context in which the
word occurs (threaten with _ is a context, and if handgun
occurred in that context, the context is a feature of
handgun); the value of a feature is the MI between the feature
and the word; similarity between 2 words is calculated
using the cosine coef. of their MI vectors – clustering is then
based on these results
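The feature-vector idea can be illustrated with a small sketch: each word gets a vector of mutual-information values over its contexts, and two words are compared with the cosine coefficient. Plain pointwise MI is used here as a simplifying assumption; the counts and function names are invented.

```python
import math
from collections import Counter

def mi_vectors(pair_counts):
    """Build {word: {context: MI}} from (word, context) -> frequency counts."""
    total = sum(pair_counts.values())
    word_total, ctx_total = Counter(), Counter()
    for (word, ctx), n in pair_counts.items():
        word_total[word] += n
        ctx_total[ctx] += n
    vectors = {}
    for (word, ctx), n in pair_counts.items():
        mi = math.log(n * total / (word_total[word] * ctx_total[ctx]))
        vectors.setdefault(word, {})[ctx] = mi
    return vectors

def cosine(u, v):
    """Cosine coefficient of two sparse vectors stored as dicts."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) \
        * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

# Invented toy counts; "threaten with _" is the context from the slide.
counts = Counter({
    ("handgun", "threaten with _"): 3, ("handgun", "fire _"): 3,
    ("knife", "threaten with _"): 3, ("knife", "cut with _"): 3,
    ("spoon", "eat with _"): 3,
})
vectors = mi_vectors(counts)
print(cosine(vectors["handgun"], vectors["knife"]))  # one shared context
print(cosine(vectors["handgun"], vectors["spoon"]))  # no shared context
```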
So… enhancing
SIMPLE by…

…Analyzing Compounds
a large number of compounds can inherit relevant parts
of semantic info provided that the heads of lexemes occur
in SIMPLE; testing for lexicalisation in GLDB in order to
avoid incorporation of idiomatic or metonymic meanings;
applying compound segmentation

…Semantic similarity in NPs of enumerative type
use of partial parsing on large corpora;
words entering into the same syntagmatic relation with
other words are perceived semantically similar; however,
certain conditions must be satisfied in order to avoid
incorporation of erroneous entries

…Lexico-syntactic patterns
for acquiring concepts higher in the hierarchy
=> see examples
Extending SIMPLE
… illustration

Compounding example:
färja?, kryssningsfartyg?, tankers? och ro-ro-fartyg?
>> No matches (ferries, cruise-ships, tankers and ro-ro-vessels)
färja? kryssnings#fartygVEH tankers? ro-ro-#fartygVEH
>> färjaVEH kryssningsfartygVEH tankersVEH ro-ro-fartygVEH

Enumerative NP example:
juristerOCC-AG, läkareOCC-AG, optikerOCC-AG, psykologer? och
sjukgymnaster? >> 3 Matches
(lawyers, doctors, opticians, psychologists and physiotherapists)
>> condition: if at least 2 have the same tag & the rest none ==> add to
lexicon! >> psykologOCC-AG sjukgymnastOCC-AG

Lexico-syntactic pattern example:
älgar, sorkar, fåglar, kor, hästar och andra djur
(moose, voles, birds, cows, horses and other animals)
Compounding

take advantage of the fact that Swedish is a compounding language
(e.g. >70% of SAOL are compounds)
– single orthographic units
– many compound words are not lexically represented
– generally having predictable meanings - relatively transparent
– most compounds are essentially binary & in most cases both
elements are represented in GLDB
– given a sizeable number of analysed compounds it is possible to
automatically establish a ”semantic compounding profile” for all
lexemes in predictable compounds
– meaning as a function of the meaning of the components related
to each other by an implied predicative functor
– e.g. brödkniv brödX knivY ‘bread knife’ implies ‘Y for (cutting) X’
see Järborg, Kokkinakis & Toporowska-Gronostaj, ’02

used compounds from GLDB's synonym-slot

… and corpora … but they have to be segmented & analysed
Semantic
Compound Definitions
Semantic Definition          Example
Y that is located in/at…     klassrumsdörr classroom+door
Y that is made up of X       kanalsystem canal+system
Y that originates from X     smutsfläck dirt+stain
Y that is aimed at X         kaninjakt rabbit+hunt
Y that is about X            partikelfysik particle+physics
Y that produces X            batterifabrik battery+factory
Y that prevails in X         partiideologi party+ideology
Y that contains X            kaffetermos coffee+thermos
Y that consists of X         kaffepulver coffee+powder
Y that has to do with X      klädbesvär clothes+trouble
.......                      .......
An Example Profile
for ´område´
marknad.1.2.0
avrinning.1.1.0
bangård.1.1.0
mark.1.2.0
barrskog.1.1.0
katastrof.1.1.0
kust.1.1.0
område.1.1.0 <geogr.>
land.1.1.b
Luleå.PM
Medelhavs.PM
marknadsföra.1.1.0
myr.1.1.0
affär.1.2.b
kommunikation.1.2.0
avtal.1.1.0
kompetens.1.1.0
område.1.1.b <abstr.>
kunskap.1.1.a
kultur.1.2.0
kärna.1.1.c
kostnad.1.1.0
läkemedel.1.1.0
motiv.1.2.0
Compounds fr.
GLDB

already disambiguated...

GLDB & S-SIMPLE entries linked to
the sub-senses in GLDB


e.g. S-SIMPLE encodes the non-compound lemma ämne (as having 4
senses, marked 1/1-1/4), which are
disambiguated here by means of their
assignment to the following semantic
types and semantic classes:
– Material: Matter ‘material’
– Substance: Substance ‘stoff’
– Part: Abstract ‘topic’
– Domain: Notion ‘subject,
discipline’
Each of the senses is exemplified in
GLDB with a number of compounds,
comprising 26 in total with ämne as
the head
SIMPLE (5)              GLDB (26)
ämne:1/1:Matter         färgämne:1/1, hornämne:1/1, …
grundämne:1/1:Matter
ämne:1/2:Substance      yxämne:1/2, fruktämne:1/2, …
ämne:1/3:Abstract       predikoämne:1/3, uppsatsämne:1/3, …
ämne:1/4:Notion         läroämne:1/4, skolämne:1/4
Compounds fr.
Corpora
Heuristic compound decomposition/segmentation and matching of the
SIMPLE content with the heads of the segmented compounds
• Try to distinguish the modifier’s characteristics
(pos & semantic category - if any)
• is modifier=adjective or proper-noun? OK
e.g. klocka: digital||klocka; stor||klocka
e.g. anhängare: Hitler||anhängare; Likud||anhängare
• S-SIMPLE as a means of bootstrapping the process
• e.g. glas ‘glass’, extended with compounds having SUBSTANCE as a
modifier: [vatten,vin,öl,likör]glas ‘water, wine, beer’ and ‘liqueur’ glass
• Check against lists of lexicalized ones to eliminate incorrect data =>
GLDB allows the exclusion of such compounds from the derived sets
e.g. feber - 40 compounds from corpora, e.g. scharlakansfeber, but not all are ILLNESS: ‘resfeber’, ‘diamantfeber’
Heuristic Compound
Segmentation



previous attempts to segment Swedish compounds without the help of a “real” lexicon are described in Brodda (1979)
based on the distributional properties of graphemes, trying to identify grapheme combinations indicating possible boundaries (promising for Germanic languages)
mostly automatic with some manual work

Cluster   Example
sd        is||dans (ice-dance)
sg        bidrags||givare (contributor)
tk        bröst||kirurgi (breast surgery)
tp        vit||peppar (white pepper)
dsb       lands||bygd (countryside)
psr       bröllops||resa (honeymoon trip)
psd       kropps||delen (body part)
ftv       luft||värme (air warmth)
rnk       kärn||kraft (nuclear power)
ngss      honungs||sött (honey sweet)
tsfa      besluts||fattare (decision-maker)
gssp      vardags||språket (colloquial language)
spla      femårs||plan (five year plan)
spap      bakplåts||papper (baking-plate paper)
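A minimal sketch of cluster-based segmentation, using only the clusters listed above. The split offsets (how many characters of the cluster stay with the modifier) are read off the slide's examples, and the longest-cluster-first strategy is an assumption; the actual procedure also involved manual work.

```python
# Cluster -> number of cluster characters that stay with the modifier.
# Offsets are inferred from the slide's examples and are illustrative only.
BOUNDARY_CLUSTERS = {
    "sd": 1, "sg": 1, "tk": 1, "tp": 1,
    "dsb": 2, "psr": 2, "psd": 2, "ftv": 2, "rnk": 2,
    "ngss": 3, "tsfa": 2, "gssp": 2, "spla": 1, "spap": 1,
}

def segment(word):
    """Split a compound at the first matching cluster, longest clusters first."""
    for cluster in sorted(BOUNDARY_CLUSTERS, key=len, reverse=True):
        pos = word.find(cluster)
        if pos != -1:
            cut = pos + BOUNDARY_CLUSTERS[cluster]
            return word[:cut] + "||" + word[cut:]
    return word  # no known boundary cluster: leave the word unsegmented

for compound in ["isdans", "landsbygd", "honungssött", "bakplåtspapper"]:
    print(segment(compound))
```

Trying longer clusters first is a design choice: a four-grapheme cluster is stronger evidence for a boundary than a two-grapheme one contained in the same word.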
Compound
Processing cont´d
• Estimation >20-25 compounds per S-SIMPLE entry (for
NOUNS)
• Based on: 1,000 nouns in SIMPLE; increased the
vocabulary to >22,000
• The top-5 non-compound entries from corpora, richest
in compound variants (some very ambiguous!)
program ‘programme, program’ (469 diff. comp.)
arbete ‘work, employment’ (402 diff. comp.)
chef ‘chief’ (390 diff. comp.)
bok ‘book’ (357 diff. comp.)
verksamhet ‘activity, operation’ (299 diff. comp.)
Modifier’s
Characteristics
Compound#SIMPLE class                                     Cluster(s)
bad||toffla#garment                                       dt
barn||vårds||lärare#occupation_agent                      rnv, dsl
bas||bolag#agency                                         sb
bläck||fisk#fish                                          kf
bolags||plundrare#occupation_agent                        gspl
brud||bergs||skola#abstract#agency#functional_space       gss, db
bygg||bolag#agency                                        gb
bygg||företag#agency                                      gf
centralbanks||chef#occupation_agent                       ksch
doping||brott#change                                      ngb
Syntactic Parsing
(1)
Compounds are a valuable resource; but how can we cope
with the rest of the vocabulary?
Corpus-driven approach to acquire semantic lexicons
cf. Kokkinakis, 2001
Investigate how, and to what extent, the flexibility and
robustness of a partial parser can be utilized to fully
automatically extend existing semantic lexicons - cascaded
finite-state syntactic parser;
– Observation: members of a semantic group are often
surrounded by other members of the same group in text; in
other words: words entering into the same syntagmatic relation
with other words are perceived as semantically similar
Syntactic Parsing
(2)
Corpus: 40 mil. tokens (Swedish Language Bank) tagged with Brill's
tagger
Parsing using CASS-SWE in which levels or bundles of rules of very
special characteristics & content can be rapidly created & tested
e.g. specific types of NPs (takes pos-tagged texts as input)

Example - simplified:
– Rule => ‘DETERMINER? COM-NOUN (COM-NOUN F)* COM-NOUN CONJ COM-NOUN’ (färger, penslar, papper och matsäckar)
– Rule => ‘APPOSITION-NOUN? PROP-NOUN+ (F PROP-NOUN)+
CONJ PROP-NOUN+’ (Venezuela, Trinidad och Island)
The number of unique phrases retrieved was ca 36,000 (phrases without
proper names) and ca 72,000 (phrases with proper names)
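The first rule can be approximated outside CASS-SWE with a regular expression over part-of-speech tags. This is not the CASS-SWE grammar format: the one-letter tag codes are hypothetical, and the sketch assumes a reading of the simplified rule in which every non-final noun of the list is followed by a delimiter F, as in the example phrase.

```python
import re

# Hypothetical one-letter codes for the tags named in the rule.
CODES = {"DETERMINER": "D", "COM-NOUN": "N", "F": "F", "CONJ": "C"}

# Assumed reading: DETERMINER? (COM-NOUN F)+ COM-NOUN CONJ COM-NOUN
ENUM_NP = re.compile(r"D?(?:NF)+NCN")

def find_enumerations(tagged):
    """tagged: list of (token, tag) pairs. Returns one noun list per matched NP."""
    codes = "".join(CODES.get(tag, "x") for _, tag in tagged)  # "x" = any other tag
    results = []
    for match in ENUM_NP.finditer(codes):
        span = tagged[match.start():match.end()]
        results.append([token for token, tag in span if tag == "COM-NOUN"])
    return results

tagged = [
    ("köpte", "VERB"),
    ("färger", "COM-NOUN"), (",", "F"),
    ("penslar", "COM-NOUN"), (",", "F"),
    ("papper", "COM-NOUN"), ("och", "CONJ"),
    ("matsäckar", "COM-NOUN"),
]
print(find_enumerations(tagged))
```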
Syntactic Parsing
(3)
1. Gather, pos-annotate & parse large corpora
2. Filter out long NPs; & Filter out knowledge-poor elements
3. 1st Pass: Measure the overlap between the members of the
phrases extracted and the entries in the semantic lexicon;
3a. If conditions apply, add new categorised entries in the
database;
3b. Repeat the previous 2 steps, until very few or nothing is
matched;
4. 2nd Pass: Compound segment members of the phrases left;
4a. Check whether they are lexicalised, do not use them if
they are;
4b. Repeat the process from step (3), this time matching
the heads with the content of the database
Syntactic Parsing
(4)
Large quantities of partially parsed corpora are an important ingredient for
the enrichment and further development of the semantic resources –
cf. all previous attempts: use syntax for generating semantic
knowledge
From the forest of chunks produced, filter out long NPs (>=3 Com.
Nouns), lemmatise, normalise, filter out knowledge-poor elements
(determiners, punctuation) & measure the overlap between the nouns
in the NPs and the entries in S-SIMPLE
If at least 2 of the nouns in the NPs are entries in SIMPLE, with the
same semantic class, then there is a strong indication that the rest
of the nouns are co-hyponyms, thus semantically similar to the
two already encoded in S-SIMPLE – iterate
Apply compound segmentation on the members of the phrases left –
check for lexicalization in a def. dictionary (GLDB), don’t use them if they are
lexicalized – repeat previous step & iterate BUT match the heads!
First Pass
Overlap
Matching a db with the content of the resources against the content of
the phrases
Assume: if at least 2 of the members of a phrase are also entries in the
lexicon, with the same semantic class, and the rest of the phrase
members have not received a semantic annotation, then there is a
strong indication that the rest of the members are co-hyponyms, and
thus semantically similar with the two already encoded in the lexicon.
Accordingly, we annotate them with the same semantic class
e.g. lawyers, doctors, opticians, psychologists and physiotherapists
juristerOCC-AG, läkareOCC-AG, optikerOCC-AG, psykologer? och
sjukgymnaster? ===> 3 Matches
==> condition:
if at least 2 have the same tag & the rest none ==> add to lexicon!
psykologOCC-AG sjukgymnastOCC-AG
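The first-pass assumption can be sketched as a small bootstrapping loop. The data structures (phrases as lists of lemmas, the lexicon as a dict) and the name `first_pass` are assumptions made for illustration; the third phrase with `kurator` is an invented example showing why iterating helps.

```python
def first_pass(phrases, lexicon, min_support=2):
    """Label unknown phrase members when enough known members agree on one class.

    phrases: list of lists of lemmas; lexicon: dict lemma -> semantic class
    (updated in place). Returns the newly added entries.
    """
    added = {}
    changed = True
    while changed:                      # iterate until nothing new is matched
        changed = False
        for phrase in phrases:
            known = [w for w in phrase if w in lexicon]
            unknown = [w for w in phrase if w not in lexicon]
            classes = {lexicon[w] for w in known}
            # At least `min_support` known members, all with the same class.
            if len(known) >= min_support and len(classes) == 1 and unknown:
                label = classes.pop()
                for w in unknown:
                    lexicon[w] = label
                    added[w] = label
                changed = True
    return added

lexicon = {"jurist": "OCC-AG", "läkare": "OCC-AG", "optiker": "OCC-AG",
           "kvinna": "BIO", "barn": "BIO", "möbel": "FURNITURE"}
phrases = [
    ["psykolog", "sjukgymnast", "kurator"],   # only usable after the next phrase
    ["jurist", "läkare", "optiker", "psykolog", "sjukgymnast"],
    ["kvinna", "barn", "husdjur", "möbel"],   # conflicting tags: nothing added
]
print(first_pass(phrases, lexicon))
```

Requiring all known members to share one class is what keeps `husdjur` from being labelled in the mixed BIO/FURNITURE phrase.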
Second Pass
Overlap
A large number of phrases not used; none or only one of the
members of the phrases was covered by SIMPLE, either
the original or the enriched version
Take into account the compounding characteristic of Swedish
(> 70% or 80,000 in SAOL are compounds); Heuristic
decomposition of compounds & matching the SIMPLE
content with the heads of the segmented compounds
Assume: a considerable number of casual or on the fly
created compounds can inherit relevant parts of semantic
info. provided on their heads by SIMPLE
e.g.: färjor?, kryssningsfartyg?, tankers? och ro-ro-fartyg?
===> No matches (ferries, cruise-ships, tankers and ro-ro-vessels)
färja? kryssnings||fartygVEH tankers? ro-ro-||fartygVEH
===> färjaVEH kryssningsfartygVEH tankersVEH ro-ro-fartygVEH
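The second pass can be sketched in the same style: unknown members are segmented, lexicalised compounds are skipped, and the head's class stands in for the compound. The `segment` callable and the `lexicalised` set are placeholders for the heuristic segmenter and the GLDB lookup; their interfaces here are assumptions.

```python
def second_pass(phrase, lexicon, segment, lexicalised, min_support=2):
    """Return {lemma: class} for unknown members, matching on compound heads.

    segment: callable returning (modifier, head) or None for a lemma.
    lexicalised: set of compounds listed in the defining dictionary.
    """
    classes = []
    for word in phrase:
        if word in lexicon:
            classes.append(lexicon[word])
            continue
        parts = segment(word)
        if parts and word not in lexicalised and parts[1] in lexicon:
            classes.append(lexicon[parts[1]])   # inherit the class of the head
        else:
            classes.append(None)
    labels = {c for c in classes if c is not None}
    support = sum(c is not None for c in classes)
    if support >= min_support and len(labels) == 1:
        label = labels.pop()
        return {w: label for w in phrase if w not in lexicon}
    return {}

# Toy segmenter standing in for the heuristic compound segmentation.
SEGMENTS = {"kryssningsfartyg": ("kryssnings", "fartyg"),
            "ro-ro-fartyg": ("ro-ro-", "fartyg")}
phrase = ["färja", "kryssningsfartyg", "tankers", "ro-ro-fartyg"]
result = second_pass(phrase, {"fartyg": "VEH"}, SEGMENTS.get, set())
print(result)
```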
Syntactic Parsing
(5)
• Errors/noise can be eliminated if the semantic tags
of all the words in a phrase are compared
kvinnor:BIO, barn:BIO, husdjur:??? och möbler:FURNITURE
• Ambiguities are propagated
flaskor:CONTAINER-AMOUNT, tallrikar:CONTAINER-AMOUNT, vinglas:???
Result:
Approx. 3,300 new noun entries to the Swe-S could be
identified without any further processing (i.e.
bootstrapping the compound analysis) – and only during
the ‘first pass’
Loooong NPs (1)





har jag ätit ko, gris, lamm, häst, hare, kanin, ren, älg, känguru, orre,
tjäder, duva, kyckling, anka, gås, struts, krokodil, haj, lax, torsk,
abborre, gädda, bläckfisk och en massa firrar till …
ekonom sociolog litteraturvetare stadsplanerare mediaexpert filosof
reklamfolk företrädare formgivare ingenjör författare diktare filmare
popmusiker leksaksfabrikant klädskapare arkitekt journalist
vetenskapsman... (press98)
inflationsutveckling framtidstro orderingång arbetsmarknadspolitik
företagsbeskattning ränteläge handelshinder investeringstakt
råvarupris produktionsutveckling…
slangnipplar slangpumpar flödesmätare gummihandskar
röntgenapparater proteser testcyklar diskmaskiner journalsystem
bensågar kuvöser blodmixrar urintestremsor centrifuger... (press95)
bokstav måttband klocka miniräknare plastbestick barnbild nyckel
batterier filmrulle (SUC)
Loooong NPs (2)




Belgien Danmark Frankrike Grekland Island Italien Kanada Luxemburg
Nederländerna Norge Portugal Spanien Storbritannien Turkiet
Tyskland USA… (p97)
all världens ortnamn : Lahti , Kalundborg , Oslo , Motala , Luleå ,
Moskva , Tromsö , Vasa , Åbo , Rom , Hilversum , Vigra , Bryssel ,
London , Prag , Athlone , Köpenhamn , Stuttgart , München , Riga ,
Stavanger , Paris , Warszawa , Bodö och Wien… (romii)
Birte Heribertson Bodil Mårtensson Anette Norberg Bror Tommy
Borgström Karin Bergqvist Mats Ågren Mattias Renehed Tobias
Ekstrand… (p96)
Robert Hedman , Kjell Jönsson , Ingemar Eriksson , Jonas Runesson ,
Miguel Exposito , Micke Berg , Lars Oscarsson , Fredrik Aliris , Jimmy
Anjevall , Putte Johansson , Petter Jokobsson , Daniel Edfalk , Mattias
Larsson , Daniel , Westerlund , Daniel Johansson , Peter ...
Evaluation (1)
Quantity Evaluation of the Syntactic Parsing
approach (see Kokkinakis, 01)
Results after six iterations:
          Original   Pass-1   Pass-2   Total
SIMPLE    2,921      5,110    1,100    9,131
NAMES     10,550     25,700   --       36,250
Evaluation (2)
Quality Evaluation: Manually, for a number of groups based
on common sense and judgement
Class            Original   New   Wrong/Spurious   Precision
OrganisationNE   1300       395   22               94,4%
Phenomenon       36         29    9                69%
Bio              46         107   12               88,8%
Ideo             17         74    9                97,8%
Vehicle          33         118   17               85,6%
Apparatus        22         27    2                92,6%
Garment          25         184   19               89,7%
Illness          38         66    8                87,9%
Flower           19         26    3                88,5%
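The slide does not state how precision is computed. The figures for the classes below are consistent with (New - Wrong/Spurious) / New, so the sketch assumes that formula; it is an inference from the numbers, not a definition given in the source.

```python
def precision(new, wrong):
    """Assumed formula: share of newly acquired entries judged correct."""
    return (new - wrong) / new

# (New, Wrong/Spurious) pairs copied from the evaluation figures.
rows = {
    "OrganisationNE": (395, 22),
    "Phenomenon": (29, 9),
    "Bio": (107, 12),
    "Vehicle": (118, 17),
    "Apparatus": (27, 2),
}
percentages = {name: round(100 * precision(new, wrong), 1)
               for name, (new, wrong) in rows.items()}
for name, value in percentages.items():
    print(f"{name}: {value}%")
```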
Examples of
Acquired Entries (1)
BIO: any classification of human beings (groups or
individuals) according to a biological characteristic like age,
sex, etc; e.g. adult, twin, brother, bastard, husband,
miss…
ORIGINAL (46): bror, fru, hustru, son, tjej, gudbarn, ...
NEW (107): barn, barnbarnsbarnbarn, children!!, dotter,
dotterdotter, fader, far, farbror, farfader, farfarsfar,
farförälder, farmoder, faster, flickvän, fosterförälder,
fästmö, huskarl, hustru, jungfru, kusin, …
SPURIOUS/WRONG (12): orientarmé, regnskog,
sjukhuspersonal, skilsmässa, sopa, studieförbund, svågra,
totalisatorspel, trapetsartist, tutsier, älder, äppelträd
PRECISION: 88,8%
Examples of
Acquired Entries (2)
APPARATUS: tools or devices used together to provide a
particular functionality for a particular task; e.g.
dishwasher, camera, computer, recorder…
ORIGINAL (22): video, kamera, frys, kopiator, mixer, ...
NEW (27): bandspelare, cd-rom-läsare, cd-spelare, dator,
dvd-spelare, faxapparat, filmkamera, frysbox, handdator,
nätverksdator, radio, skrivare, symaskin,
televisionsapparat, teve-apparat, tv-apparat,
videoapparat, ...
SPURIOUS/WRONG (2): fonduegryta??, skafferi
PRECISION: 92,6%
Examples of
Acquired Entries (3)
VEHICLE: artifacts (or their parts) made for the transport of
goods, livestock or people; e.g. truck, sedan, bicycle, license
plate!!!, submarine…
ORIGINAL (33): kajak, bil, jeep, båt, flotte,…
NEW (118): ambulans, brandbil, buss, charter, direktbuss,
distributionsbil, elbil, flakmoped, flakmoppa, flodbåt, flyg,
flygplan, fordon, fregatt, färja, helikopter, husvagn,
hästfordon, hästkärra, korvett, krigsfartyg, lastvagn, …
SPURIOUS/WRONG (17): anläggningsmaskin,
arbetsmaskin, artilleri, artilleripjäs, entreprenadmaskin,
förband, förvaltningsmyndighet, gräsklippare, skida
PRECISION: 85,6%
Evaluation (3)
Quality Evaluation nr2
Comparison with 2 Synonym Dictionaries: STRÖMBERGS & BONNIERS
SIMPLE                       Label     STR+BON (x+x=unique)   Missing in SIMPLE
bil - car                    VEHICLE   7+8=11                 3 – vagn, kärra, åk
regn - rain                  PHENOM.   17+14=21               15 – väta, ström, flod, dusch, kaskad, våtväder etc.
rederi - shipping company    AGENCY    3+4=6                  5 – skeppsägare, linje, båtbolag, fartygsbolag, sjöfartsbolag
(Missing in STR+BON: ösregn, spöregn, hällregn!)
Error Analysis
Source of Errors:
• Part-of-speech and lemmatisation errors
• A number of long, enumerative NPs with many entries
unknown to the lexicon, where 2 or 3 happened to
carry the same semantic label, so some of the rest
were assigned the wrong one
tröjaGARMENT halsduk strumpaGARMENT underkläder skiva album
=> GARMENT ... assigned to the rest...
• … and of course polysemy
depressionEMOTION ångestEMOTION spänning? => EMOTION ...but
tryckATTRIBUTE spänningEMOTION? vibration tyngdkraftATTRIBUTE
Lexico-syntactic
Patterns


Compounding and enumerative NPs are a good starting
point for acquiring synonyms & co-hyponyms
Pattern based lexico-syntactic recognition is suitable for
acquiring hyperonyms-hyponyms (and partly meronyms)

Language specific patterns

Discovery by observation

A good parser is necessary – good coverage of NPs

Requires more research on the effects of the various
modifiers that can alter the semantic relation
Lexico-syntactic
Patterns (1)
hyperonym-hyponym

NP av (typ/en|märke/t|modell/en|…) ("|'|:)? (NP|(NP,)+)
(och NP|eller NP)?
… en bil av märket Ford Granada …
… okänd soldat som bar gymnastikskor av märket Nike …
… sys bland annat kalsonger och undertröjor av märket
Börje Salming …
… tusen personbilar av modellen S70/V70 i Masas fabrik .
… planen är av typen F117A ( stealth ) …
… fartygen har jaktplan av typen F14 som anpassats att
bära laserstyrda …
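The "av typen/märket/modellen" pattern can be approximated on raw tokenised text with a regular expression. This is a rough sketch: real NPs come from the parser, whereas here the class noun is taken to be the single word before "av" and the instance is a name made of one token plus any following capitalised tokens. Both simplifications are assumptions.

```python
import re

# One word before "av", a type/brand/model noun, then a (possibly multi-word) name.
AV_PATTERN = re.compile(
    r"([\w-]+) av (?:typen|typ|märket|märke|modellen|modell) "
    r"([\w/-]+(?: [A-ZÅÄÖ][\w/-]*)*)"
)

def instance_pairs(text):
    """Return (instance, class noun) pairs found by the pattern."""
    return [(m.group(2), m.group(1)) for m in AV_PATTERN.finditer(text)]

print(instance_pairs("… en bil av märket Ford Granada …"))
print(instance_pairs("… tusen personbilar av modellen S70/V70 i Masas fabrik ."))
```

A sentence such as "planen är av typen F117A" shows the limit of the single-word assumption: the word before "av" is the verb, so a parser-based NP is needed there.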
Lexico-syntactic
Patterns (2)
Hyperonym?-hyponym?

NP ,? (såsom|liksom|som)(NP|(NP,)+|:NP|:(NP,)+) (eller|och) (andra|annat|annan) NP

NP ,?

NP ,? (såsom|liksom|som)
(eller|och) (andra|annat|annan) NP
(andra|annat|annan) NP
… explorer plockar poäng på automatlåda , farthållare , luftkonditionering , radio och
annan utrustning
… fastighetsägaren ville ha en total renovering med ny spis , kyl , frys , spiskåpa och
annan köksinredning
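The "NP, NP och andra/annan NP" examples can be turned into hyponym-hyperonym pairs with a regular expression over tokenised text. The sketch treats every NP as a single word and expects spaces around commas, as in the corpus examples; both are simplifying assumptions, and the function name is illustrative.

```python
import re

WORD = r"[\w-]+"
# (NP ,)* NP (och|eller) (andra|annat|annan) NP, with single-word NPs.
OTHER_PATTERN = re.compile(
    rf"((?:{WORD} , )*{WORD}) (?:och|eller) (?:andra|annat|annan) ({WORD})"
)

def hyponym_pairs(text):
    """Return (hyponym, hyperonym) pairs for each enumeration found."""
    pairs = []
    for match in OTHER_PATTERN.finditer(text):
        hyperonym = match.group(2)
        for hyponym in match.group(1).split(" , "):
            pairs.append((hyponym, hyperonym))
    return pairs

print(hyponym_pairs("älgar , sorkar , fåglar , kor , hästar och andra djur"))
print(hyponym_pairs("ny spis , kyl , frys , spiskåpa och annan köksinredning"))
```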
hyperonym-hyponym

NP : NP (NP ,)+ (m fl|med flera|mm|osv)?
… årets dansband : Arvingarna , Barbados , Joyride , Sound Express .
… riksdagsmännens alla bidrag : barnbidrag , bostadsbidrag , socialbidrag , studiebidrag
osv .
… kroniskt sjuka : epileptiker , hjärtsjuka , njursjuka m fl
… bästa webbplatserna : Spray , Gula Sidorna , Dagens_Nyheter , Passagen ,
Arbetsförmedlingen , Resfeber , Pricerunner , Bidlet , SEB och Bluemarx .
Lexico-syntactic
Patterns (3)
hyperonym-hyponym

NP ,? (? inklusive
(NP|(NP,)+|:NP|:(NP,)+) (och NP|eller NP)? )?

NP ,? (? särskilt
(NP|(NP,)+|:NP|:(NP,)+) (och NP|eller NP)? )?

NP ,? (? speciellt
(NP|(NP,)+|:NP|:(NP,)+) (och NP|eller NP)? )?

NP ,? (? mestadels (NP|(NP,)+|:NP|:(NP,)+) (och NP|eller NP)? )?

NP ,? (? däribland (NP|(NP,)+|:NP|:(NP,)+) (och NP|eller NP)? )?
… en rad företag , däribland Ica , Dagab och Ikea
… Natoländer , inklusive Frankrike , Tyskland , Spanien och Grekland
hyperonym-hyponym

NP som (till exempel|t ex|t.ex.) NP (, NP)*
… stora båtar som till exempel segelfartyg
… storhelger som t ex nyårsdagen , juldagen har vi …
… finns det specialavdelningar att se på mässan? som t ex Classic boat show ,
surfexpo , sjösäkerhet och dykexpo .
Lexico-syntactic
Patterns (4)
hyperonym-hyponym

(sån/a/t|sådan/a/t)? NP ,? (som|såsom) (NP|(NP,)+|:NP|:(NP,)+) (och NP|eller NP)?
… välkända biorullar såsom Carrie , Eldfödd , Stalker , Den onda cirkeln ,
Shining och Matilda
… flera färger såsom lichtgult , svart , vitt , rött , blått , grönt ,
… en rad underspecialiteter , såsom kardiologi , gastro-enterologi , endokrinologi
, hematologi , njurmedicin och reumatologi .
hyperonym-hyponym

NP : NP (, NP)+ (och NP|eller NP)?
… leverantörerna av affärssystem : SAP , Intentia , IFS och IBS
… folksjukdomarna : alkoholism , ätstörningar , medicinmissbruk och
panikångest
… krafter av olika slag : tyngdkraft , muskelkraft , friktionskraft , magnetisk kraft
Lexico-syntactic
Patterns (5)
hyperonym-hyponym

NP (, NP)+ är några av NP
…" Nilens dotter " , " Sorgens stad " och " Marionettmästaren " är några av de filmer …
… La-Seyne-sur-Mer , Orléans , Brest och Dijon är några av de städer…
… språk , internationell rätt , utrikes- och säkerhetspolitik , press- och
informationsfrågor , administration samt muntlig och skriftlig framställning
är några av de ämnen som studeras …
… El Salvador , Kazakstan och Jamaica är några av de länder som nu …
holonym-meronym
•
NP? som? består?SENSE? av NP (, NP)+ (och NP)?
… instrumentalensemblen? som består av flöjt , klarinett , trombon , gitarr , violin ,…
…” De ensamma öarna?” som består av Koufonissi , Iraklia , Donousa och Schinousa
… av företagsamhet som består av produktutveckling , produktion , distribution och
försäljning
Conclusion &
Outlook
simple, surprisingly efficient methods to acquire/enhance
general purpose semantic knowledge from large corpora
profiting from the productive compounding characteristic of Swedish
use of partially parsed corpora for extending semantic
lexicons, a unified way to process compounds
both parsing & compounding are of equal importance: through
parsing we allow the incorporation of new, mainly non-compound
words, through compounding we allow new
compounds of existing entries; Kokkinakis et al. ’00
outlook: better means of evaluation and fewer
spuriously generated entries (many due to pos errors)
Conclusion &
Outlook cont´d
We believe that S-SIMPLE can be extended to a large
semantic resource appropriate for a large number of
(intermediate) NLP tasks;
its compatibility with the manually developed S-SIMPLE
lexicon can be guaranteed and its high quality maintained

near future - NOV ‘03: expect evaluation from VR –
whether our application will get funded or not – passed
the 1st step, but that doesn't guarantee success
==> goal: larger corpus; more comprehensive study; combine
compounding, parsing, patterns and statistics
References











Brodda B. (1979). Något om de svenska ordens fonotax och morfotax: Iakttagelse med utgångspunkt från experiment
med automatisk morfologisk analys. In: ”I huvet på Benny Brodda”. Festskrift till densammes 65-årsdag.
Grefenstette G. (1994). Explorations in Automatic Thesaurus Discovery. Kluwer Academic Publishers.
Hearst M. (1992) Automatic Acquisition of Hyponyms from Large Text Corpora. Proceedings of the 14th International
Conference on Computational Linguistics. Nantes, France
Järborg J., Kokkinakis D. & Toporowska-Gronostaj M. (2002). Lexical and Textual Resources for Sense Recognition and
Description. Proceedings of the 3rd LREC, Las Palmas.
Kokkinakis D., Toporowska Gronostaj M. and Warmenius K. (2000) Annotating, Disambiguating & Automatically Extending
the Coverage of the Swedish SIMPLE Lexicon. Proceedings of the 2nd Language Resources and Evaluation Conference
(LREC), vol. III:1397-1404. Athens, Hellas.
Kokkinakis D. (2001). Syntactic Parsing as a Step for Automatically Augmenting Semantic Lexicons. Proceedings of the
39th Association of Computational Linguistics (ACL) and 10th European Chapter of the Association of Computational
Linguistics (EACL), 13-18. Miltsakaki E., Monz C. and Ribeiro A. (eds). (Companion Volume). CNRS, Toulouse, France.
Lin D. (1998). Automatic Retrieval and Clustering of Similar Words. COLING-ACL98, Montreal, Canada.
Lin D. & Pantel P. (2002). Concept Discovery from Text. Proceedings of the International Conference on Computational
Linguistics. pp. 577-583. Taipei, Taiwan.
Riloff E. & Shepherd J. (1997). A Corpus-Based Approach for Building Semantic Lexicons. Proceedings of the Second
Conference on Empirical Methods in Natural Language Processing, 117-124.
Roark B. & Charniak E. (1998). Noun-phrase co-occurrence statistics for semi-automatic semantic lexicon construction.
Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International
Conference on Computational Linguistics, pages 1110-1116.
Tokunaga et al. (1997). Extending a thesaurus by classifying words. Automatic Information Extraction and Building of
Lexical Semantic Resources for NLP Applications.