6th Intex Workshop &
10 years of (Silberztein, 1993)
Sofia, 28-30 May 2003
6th Intex Workshop, Sofia 28-30 May 2003
Conversion between Intex and
MULTEXT-East Morphosyntactic
Descriptions
Cvetana Krstev, Duško Vitas
University of Belgrade
Tomaž Erjavec
Jožef Stefan Institute, Ljubljana
Motivation
general
• use of different tools
• use of multilingual resources
• comparison of results in NLP
specific
• inclusion of the Serbian language in the MULTEXT-East specification and production of Slovenian Intex resources
• production of a tagged Serbian translation of Orwell's 1984
MULTEXT-East morphosyntactic specification
• aim: an exhaustive description of the morphological and morphosyntactic features of different languages, and the establishment of unique codes for common features
• scope: English, Romanian, Slovene, Czech, Bulgarian, Estonian, Hungarian, Croatian (Concede), and Serbian
14 MULTEXT-East types or PoS - new types cannot be introduced
• Nouns (N)
• Verbs (V)
• Adjectives (A)
• Pronouns (P)
• Determiners (D)
• Adpositions (S)
• Conjunctions (C)
• Numerals (M)
• Interjections (I)
• Abbreviations (Y)
• Particles (Q)
• Adverbs (R)
• Articles (T)
• Residuals (X)
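A minimal sketch of how the fourteen category codes listed above could be held as a lookup table. The names `POS_CODES` and `msd_category` are illustrative assumptions, not part of the MULTEXT-East specification; the only fact taken from the slides is that the first character of an MSD is the category code.

```python
# Category codes as listed on the slide; the first character of an MSD
# names the category. POS_CODES and msd_category are hypothetical names.
POS_CODES = {
    "N": "Noun", "V": "Verb", "A": "Adjective", "P": "Pronoun",
    "D": "Determiner", "S": "Adposition", "C": "Conjunction",
    "M": "Numeral", "I": "Interjection", "Y": "Abbreviation",
    "Q": "Particle", "R": "Adverb", "T": "Article", "X": "Residual",
}


def msd_category(msd: str) -> str:
    """Return the category name of an MSD, or raise on an unknown code."""
    if not msd or msd[0] not in POS_CODES:
        raise ValueError(f"unknown MULTEXT-East category in {msd!r}")
    return POS_CODES[msd[0]]


print(msd_category("Afcfda"))  # Adjective
print(msd_category("Ncmsn"))   # Noun
```

Because new types cannot be introduced, a closed table that rejects unknown codes seems a reasonable fit here.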
Type attributes
• Each type has a set of attributes that are appropriate to it
• Each attribute of a type has a fixed position in the MSD
• Adding new attributes to a type is not recommended
Attribute values
• a set of values is assigned to each attribute
• each value is coded by one alphanumeric character
• new values can be added to an attribute, if necessary
Adjective attribute values/1
Adjective (A), 13 positions
= ============== ============== =
P ATT            VAL            C
= ============== ============== =
1 Type           qualificative  f
                 indefinite     i
                 possessive     s
                 ordinal        o
- -------------- -------------- -
2 Degree         positive       p
                 comparative    c
                 superlative    s
                 elative        e
- -------------- -------------- -
(Language columns: EN RO SL CS BG ET HU HR SR; the per-value marks are not legible.)
Adjective attribute values/2
= ============== ============== =
P ATT            VAL            C
= ============== ============== =
3 Gender         masculine      m
                 feminine       f
                 neuter         n
- -------------- -------------- -
4 Number         singular       s
                 plural         p
                 dual           d
                 paucal         c
- -------------- -------------- -
5 Case           nominative     n
                 genitive       g
                 dative         d
                 accusative     a
                 ... (various more values) ...
(Language columns: EN RO SL CS BG ET HU HR SR; the per-value marks are not legible.)
Adjective attribute values/3
= ============== ============== =
P ATT            VAL            C
= ============== ============== =
6 Definiteness   no             n
                 yes            y
                 short_art      s
                 full_art       f
- -------------- -------------- -
7 Clitic         no             n
                 yes            y
- -------------- -------------- -
8 Animate        no             n
                 yes            y
- -------------- -------------- -
9 Formation      nominal        n
                 compound       c
- -------------- -------------- -
... various Hungarian specific attributes ...
= ============== ============== =
(Language columns: EN RO SL CS BG ET HU HR SR; the per-value marks are not legible.)
An example from the Slovenian MULTEXT-East dictionary
čistejši   čist   Afcfda
The word form čistejši has the lemma čist (Engl. clean); it is described as a qualificative (f) adjective (A) in the comparative degree (c), feminine gender (f), dual number (d), and accusative case (a).
čistejši   čist   Afcmsa--n
The word form čistejši has the lemma čist (Engl. clean); it is described as a qualificative (f) adjective (A) in the comparative degree (c), masculine gender (m), singular number (s), accusative case (a), and not animate (n).
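The two readings above can be recovered mechanically, because every attribute has a fixed position in the MSD. The following is a minimal sketch that uses only the adjective values shown on the slides (positions 1 to 8; the values elided there are not filled in). The names `ADJ_ATTRS` and `decode_adjective` are illustrative assumptions.

```python
# Adjective attributes by position, restricted to the values on the slides.
# "-" means the attribute is not applicable or not specified.
ADJ_ATTRS = [
    ("Type", {"f": "qualificative", "i": "indefinite",
              "s": "possessive", "o": "ordinal"}),
    ("Degree", {"p": "positive", "c": "comparative",
                "s": "superlative", "e": "elative"}),
    ("Gender", {"m": "masculine", "f": "feminine", "n": "neuter"}),
    ("Number", {"s": "singular", "p": "plural", "d": "dual", "c": "paucal"}),
    ("Case", {"n": "nominative", "g": "genitive",
              "d": "dative", "a": "accusative"}),
    ("Definiteness", {"n": "no", "y": "yes",
                      "s": "short_art", "f": "full_art"}),
    ("Clitic", {"n": "no", "y": "yes"}),
    ("Animate", {"n": "no", "y": "yes"}),
]


def decode_adjective(msd: str) -> dict:
    """Decode an adjective MSD into {attribute: value}, skipping '-'."""
    if not msd.startswith("A"):
        raise ValueError(f"not an adjective MSD: {msd!r}")
    decoded = {}
    for (name, values), code in zip(ADJ_ATTRS, msd[1:]):
        if code != "-":
            # Codes outside the slide tables are passed through unchanged.
            decoded[name] = values.get(code, code)
    return decoded


print(decode_adjective("Afcfda"))
print(decode_adjective("Afcmsa--n"))
```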
The first sentence of the Slovene translation of Orwell's 1984, tagged
<w lemma="biti" ana="Vcps-sma">Bil</w>
<w lemma="biti" ana="Vcip3s--n">je</w>
<w lemma="jasen" ana="Afpmsnn">jasen</w>
<c>,</c>
<w lemma="mrzel" ana="Afpmsnn">mrzel</w>
<w lemma="aprilski" ana="Aopmsn">aprilski</w>
<w lemma="dan" ana="Ncmsn">dan</w>
<w lemma="in" ana="Ccs">in</w>
<w lemma="ura" ana="Ncfpn">ure</w>
<w lemma="biti" ana="Vcip3p--n">so</w>
<w lemma="biti" ana="Vmps-pfa">bile</w>
<w lemma="trinajst" ana="Mcnpnl">trinajst</w>
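A short sketch of how the tagged fragment above could be read back into (form, lemma, MSD) records. It assumes that every word is written exactly as `<w lemma="…" ana="…">form</w>` on the slide; a real corpus would more likely be read with an XML parser. `W_TAG` and `parse_words` are hypothetical names.

```python
import re

# Matches the <w> elements exactly as they are written on the slide.
W_TAG = re.compile(r'<w lemma="([^"]*)" ana="([^"]*)">([^<]*)</w>')


def parse_words(text: str) -> list:
    """Return one dict per <w> element; punctuation (<c>) is ignored."""
    return [
        {"form": form, "lemma": lemma, "msd": msd}
        for lemma, msd, form in W_TAG.findall(text)
    ]


sample = """<w lemma="biti" ana="Vcps-sma">Bil</w>
<w lemma="biti" ana="Vcip3s--n">je</w>
<w lemma="jasen" ana="Afpmsnn">jasen</w>
<c>,</c>"""

for word in parse_words(sample):
    print(word["form"], word["lemma"], word["msd"])
```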
Intex MSD for Serbian
• one DELAS entry: cyist,A17
• one of its corresponding DELAF entries:
  cyistiji,cyist.A17:bems1g:bems4q:bems5g:bemp1g:bemp5g
• produced by the regular expression A17.exp
  ..............
  ijemu/:bems3g:bems7g:bens3g:bens7g +
  iji/:bems1g:bems4q:bems5g:bemp1g:bemp5g +
  o/:aens1g:aens4g:aens5g +
  ..............
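As an illustration of the DELAF line format shown above, here is a minimal sketch that splits an entry into its word form, lemma, inflectional class and list of morphosyntactic codes. The function name `parse_delaf` is an assumption. Treating an empty lemma as "same as the form" is inferred from entries such as `{vedar,.A18:…}` later in the slides.

```python
def parse_delaf(line: str) -> dict:
    """Split 'form,lemma.CLASS:code:code…' into its parts."""
    form, rest = line.split(",", 1)
    head, *codes = rest.split(":")
    lemma, _, info = head.partition(".")
    return {
        "form": form,
        "lemma": lemma or form,  # assumption: empty lemma means lemma == form
        "info": info,
        "codes": codes,
    }


entry = parse_delaf("cyistiji,cyist.A17:bems1g:bems4q:bems5g:bemp1g:bemp5g")
print(entry["lemma"], entry["info"], entry["codes"])
```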
Attributes and their values for Serbian adjectives in DELAS/DELAF
Attribute     Value           Code   Attribute  Value           Code
degree        positive        a      case       nominative      1
              comparative     b                 genitive        2
              superlative     c                 dative          3
definiteness  no              k                 accusative      4
              yes             d                 vocative        5
              not applicable  e                 instrumental    6
gender        masculine       m                 locative        7
              feminine        f      animate    yes             v
              neuter          n                 no              q
number        singular        s                 not applicable  g
              plural          p                 (not important)
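The table above can be applied to a DELAF code such as bems1g with a simple positional lookup. This is a sketch only: the attribute order (degree, definiteness, gender, number, case, animate) is inferred from the example codes in these slides rather than stated in them, and `INTEX_ADJ` and `decode_intex_adjective` are hypothetical names.

```python
# Serbian Intex adjective codes, in the order inferred from codes such as
# "bems1g" (an assumption; the slides do not state the order explicitly).
INTEX_ADJ = [
    ("degree", {"a": "positive", "b": "comparative", "c": "superlative"}),
    ("definiteness", {"k": "no", "d": "yes", "e": "not applicable"}),
    ("gender", {"m": "masculine", "f": "feminine", "n": "neuter"}),
    ("number", {"s": "singular", "p": "plural"}),
    ("case", {"1": "nominative", "2": "genitive", "3": "dative",
              "4": "accusative", "5": "vocative", "6": "instrumental",
              "7": "locative"}),
    ("animate", {"v": "yes", "q": "no", "g": "not applicable"}),
]


def decode_intex_adjective(code: str) -> dict:
    """Decode a six-character adjective code into {attribute: value}."""
    if len(code) != len(INTEX_ADJ):
        raise ValueError(f"expected {len(INTEX_ADJ)} characters: {code!r}")
    return {name: values[ch] for (name, values), ch in zip(INTEX_ADJ, code)}


print(decode_intex_adjective("bems1g"))
```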
Syntactic and semantic marks in Serbian DELAS
category      tag         applied to                explanation               example
syntactic     +p2         prepositions              noun is in genitive       bez,PREP+p2
              +Ref        verbs                     reflexive                 dicyiti,V551+Imperf+It+Ref
              +MG, +FG    nouns                     masculine natural gender  budala,N601+Hum+MG
derivational  +VN         nouns                     verbal noun               kiselxenxe,N300+VN
              +Adj        adverbs                   derived from adjectives   fanaticyno,ADV+Adj
              +DerOvaIra  verbs, nouns, adjectives  derivational variety      dezinfikovati,V18+Imperf+...+DerOvaIra
semantic      +Col        adjectives                colors                    zelenkastosiv,A6+Col
              +Hum        nouns                     human                     lxubavnica,N601+Hum
              +Mat        adjectives                material                  kozxnat,A6+Mat
dialectal     +Ek         all                       ekavian                   nedelxa,N600+Ek
              +Cr         all                       croatism                  izopcxen,A1+PP+Cr
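A small sketch of how the marks in a DELAS entry could be separated from the inflectional class and looked up. The explanations are copied from the table above; the names `MARKS` and `parse_delas` are illustrative assumptions, and marks that the table does not explain are simply reported as unknown.

```python
# Explanations taken from the slide; only the marks listed there are known.
MARKS = {
    "p2": "noun is in genitive", "Ref": "reflexive",
    "MG": "masculine natural gender", "VN": "verbal noun",
    "Adj": "derived from adjectives", "DerOvaIra": "derivational variety",
    "Col": "colors", "Hum": "human", "Mat": "material",
    "Ek": "ekavian", "Cr": "croatism",
}


def parse_delas(entry: str):
    """Split 'lemma,CLASS+mark+mark' into (lemma, class, [marks])."""
    lemma, info = entry.split(",", 1)
    cls, *marks = info.split("+")
    return lemma, cls, marks


lemma, cls, marks = parse_delas("budala,N601+Hum+MG")
for mark in marks:
    print(lemma, cls, mark, MARKS.get(mark, "unknown"))
```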
Problems of correspondence between MULTEXT-East MSD and Intex/1
• The necessity to impose the existing coding schema on a particular language
Example: How should the present and past active gerunds be encoded?
In Serbian, for the verb ići (Engl. to go), those gerunds are idući and išavši.
The verb tables of the MULTEXT-East specification contain attribute values that describe them. However, no Slavic language except Bulgarian uses them.
Problems/2
• The common encoding schema does not guarantee that true standardization will be achieved
Example: only in Bulgarian do we find the attribute value 'adjectival' for adverbs (with the examples 'umno, veselo, studeno'); other Slavic languages, at the least, could also make use of that value of the attribute Type.
Problems/3
• Encoding of verb tenses
= ============== ============== =
P ATT            VAL            C
= ============== ============== =
2 VForm          indicative     i
                 subjunctive    s
                 imperative     m
                 conditional    c
                 infinitive     n
                 participle     p
                 gerund         g
                 supine         u
                 transgressive  t
                 quotative      q
- -------------- -------------- -
3 Tense          present        p
                 imperfect      i
                 future         f
                 past           s
                 pluperfect     l
                 aorist         a
(Language columns: EN RO SL CS BG ET HU HR SR; the per-value marks are not legible.)
Problems/3
• The second attribute specifies the verb form, and the third the tense. However, because of composite tenses, some verb forms are used in the construction of several different tenses. In Slovenian, the verb form imel is the past participle of the verb imeti (Engl. to have). Combined with the present indicative of the copula biti (Engl. to be), it forms the perfect tense; combined with the conditional form of the same copula, it forms the conditional.
Problems/3
<w lemma="Winston" ana="Npmsn">Winston</w>
<w lemma="Smith" ana="Npmsn">Smith</w>
<w lemma="biti" ana="Vcip3s--n">je</w>
<w lemma="imeti" ana="Vmps-sma">imel</w>
..........................................
<w lemma="da" ana="Css">da</w>
<w lemma="biti" ana="Vcc">bi</w>
<w lemma="on" ana="Pp3msa--y-n">ga</w>
<w lemma="imeti" ana="Vmps-sma">imel</w>
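The two fragments above show why the tense cannot be read off the participle alone: the same MSD Vmps-sma occurs in both, and only the copula differs. The sketch below illustrates one way such a pair could be classified. It covers just the two constructions shown here; the function name `composite_tense` is an assumption, not part of any of the tools discussed.

```python
def composite_tense(aux_msd: str, participle_msd: str):
    """Classify a copula + participle pair by the copula's MSD.

    Returns "perfect", "conditional", or None when the pair is not one
    of the two Slovenian constructions illustrated on the slide.
    """
    is_participle = participle_msd.startswith("V") and participle_msd[2:3] == "p"
    is_copula = aux_msd.startswith("Vc")
    if not (is_participle and is_copula):
        return None
    vform, tense = aux_msd[2:3], aux_msd[3:4]
    if vform == "i" and tense == "p":   # present indicative of biti
        return "perfect"
    if vform == "c":                    # conditional form of biti
        return "conditional"
    return None


print(composite_tense("Vcip3s--n", "Vmps-sma"))  # perfect
print(composite_tense("Vcc", "Vmps-sma"))        # conditional
```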
Problems/4
• Different interpretations of various grammatical categories across languages, and the lack of a clear cross-linguistic correspondence, are discussed in Przepiórkowski (EACL 2003), for example dual number in Slovene and paucal in Serbian.
• Certain morphosyntactic phenomena have not been taken into consideration, such as various problems of agreement (Vitas, Krstev, to appear).
Application of MSDIntex mapping
to Serbian 1984
{S}{Bio,biti.V77:Gsm}
({je,jesam.V575+Imperf+It+Iref+Aux:Pzsi} +
{je,on.PRO+Prs:sz2fi:sz4fi})
{vedar,.A18:akms1g:akms4q}
({i,.CONJ} + {i,.PAR})
{hladan,.A18:akms1g:akms4q}
{aprilski,.A2+PosQ:adms1g:aems4q:aems5g:aemp1g:aemp5g}
({dan,.A1+PP:akms1g:aems4q} +
{dan,dati.V103+Perf+Tr+Iref+Ref:Tms})
;
{S}
({na,.PREP+p4} + {na,.PREP+p7})
{cyasovnicima,.?}
({je,jesam.V575+Imperf+It+Iref+Aux:Pzsi} +
{je,on.PRO+Prs:sz2fi:sz4fi})
{izbijalo,izbijati.V101+Perf+Tr+It+Iref:Gsn}
{trinaest,.?}
.
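The output above keeps every candidate analysis: alternatives are joined with `+` inside parentheses, and unknown words carry the code `?`. A minimal sketch of how those readings could be extracted follows. The pattern is written for the notation as it appears on the slide; `TOKEN` and `readings` are hypothetical names, not part of Intex.

```python
import re

# {form,lemma.analysis}; the sentence delimiter {S} has no comma and is skipped.
TOKEN = re.compile(r"\{([^,{}]+),([^.{}]*)\.([^{}]*)\}")


def readings(chunk: str) -> list:
    """Return (form, lemma, analysis) for every reading in a chunk."""
    return [
        (form, lemma or form, info)  # an empty lemma means lemma == form
        for form, lemma, info in TOKEN.findall(chunk)
    ]


ambiguous = readings("({na,.PREP+p4} + {na,.PREP+p7})")
unknown = readings("{cyasovnicima,.?}")
print(len(ambiguous), ambiguous)
print(unknown)
```

A chunk with more than one reading needs disambiguation. A single reading can still be ambiguous if it lists several codes after the colon, as in akms1g:akms4q.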
A tool that facilitates lemmatization and disambiguation
Tagged Serbian translation of 1984 after manual disambiguation and resolution of unknown words
{S}{Bio,biti.V77:Gsm}
{je,jesam.V575+Imperf+It+Iref+Aux:Pzsi}
{vedar,.A18:akms1g}
{i,.CONJ}
{hladan,.A18:akms1g}
{aprilski,.A2+PosQ:adms1g}
{dan,.N1:ms1q}
;
{S}
{na,.PREP+p7}
{cyasovnicima,cyasovnik.N5:mp7q}
{je,jesam.V575+Imperf+It+Iref+Aux:Pzsi}
{izbijalo,izbijati.V101+Perf+Tr+It+Iref:Gsn}
{trinaest,.Num+Car}
.
A simple Perl script maps Serbian Intex codes to MULTEXT-East MSD
if (($POS eq "V") && ($kategorije !~ /[XS]/)) {    # the word is a verb
    $glagol = "V" . "---------------";
    if ($semkat =~ /Aux/) {                        # type, attribute 1
        substr($glagol,1,1) = "a";
    } else {
        substr($glagol,1,1) = "m";
    }
    if ($kategorije =~ /([WYGTIFA])/) {            # form, attribute 2
        substr($glagol,2,1) = $1;
    }
    $glagol =~ tr/WYGTIFA/nmppiii/;
    if (($lema eq "biti") && ($kategorije =~ /A/)) {
        substr($glagol,2,1) = "c";
    }
    if ($kategorije =~ /([PIFAGY])/) {             # tense, attribute 3
        substr($glagol,3,1) = $1;
    }
    $glagol =~ tr/PIFAGY/pofasp/;
    if ($kategorije =~ /([xyz])/) {                # person, attribute 4
        substr($glagol,4,1) = $1;
    }
    $glagol =~ tr/xyz/123/;
    ........
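For readers who want to try the mapping outside Perl, the following is a hedged Python rendering of the fragment above. It covers only the four attributes visible on the slide (type, form, tense, person) and returns just the first five characters of the MSD; the rest of the original script is elided and is not reconstructed here. The function and helper names are assumptions, and the outer guard on `$POS` and `[XS]` is left to the caller.

```python
# Translation tables copied from the tr/// operators in the Perl fragment.
VFORM = str.maketrans("WYGTIFA", "nmppiii")
TENSE = str.maketrans("PIFAGY", "pofasp")
PERSON = str.maketrans("xyz", "123")


def first_of(chars: str, text: str) -> str:
    """Return the first character of text found in chars, else '-'."""
    return next((ch for ch in text if ch in chars), "-")


def verb_msd_prefix(lema: str, semkat: str, kategorije: str) -> str:
    """Build 'V' plus attributes 1-4 from Serbian Intex verb codes."""
    vtype = "a" if "Aux" in semkat else "m"                 # attribute 1
    vform = first_of("WYGTIFA", kategorije).translate(VFORM)  # attribute 2
    if lema == "biti" and "A" in kategorije:
        vform = "c"
    tense = first_of("PIFAGY", kategorije).translate(TENSE)   # attribute 3
    person = first_of("xyz", kategorije).translate(PERSON)    # attribute 4
    return "V" + vtype + vform + tense + person


# {je,jesam.V575+Imperf+It+Iref+Aux:Pzsi} and {Bio,biti.V77:Gsm}
print(verb_msd_prefix("jesam", "Imperf+It+Iref+Aux", "Pzsi"))  # Va-p3
print(verb_msd_prefix("biti", "", "Gsm"))                      # Vmps-
```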
Tagged Serbian 1984 using MULTEXT-East MSD
<w lemma="biti" ana="Vmps-sman-n---p">Bio</w>
<w lemma="jesam" ana="Va-p3s-an-y---p">je</w>
<w lemma="vedar" ana="Afpms1n">vedar</w>
<w lemma="i" ana="Ccs">i</w>
<w lemma="hladan" ana="Afpms1n">hladan</w>
<w lemma="aprilski" ana="Aopms1y">aprilski</w>
<w lemma="dan" ana="Ncmsn--n">dan</w>
<w lemma="na" ana="Sps-">na</w>
<w lemma="cyasovnik" ana="Ncmpl--n">cyasovnicima</w>
<w lemma="jesam" ana="Va-p3s-an-y---p">je</w>
<w lemma="izbijati" ana="Vmps-snan-n---e">izbijalo</w>
<w lemma="trinaest" ana="Mc---l">trinaest</w>
Conclusion
• It is possible to convert from Intex to MULTEXT-East.
• It is possible to convert from MULTEXT-East to Intex to a certain extent. Some information cannot be recovered, such as the inflectional class code.
Noun attributes
1. Type
2. Gender
3. Number
4. Case
5. Definiteness
6. Clitic
7. Animate
8. Owner_Number
9. Owner_Person
10. Owned_Number
Verb attributes
1. Type
2. VForm
3. Tense
4. Person
5. Number
6. Gender
7. Voice
8. Negative
9. Definiteness
10. Clitic
11. Case
12. Animate
13. Clitic_s
14. Aspect
Adjective attributes
1. Type
2. Degree
3. Gender
4. Number
5. Case
6. Definiteness
7. Clitic
8. Animate
9. Formation
10. Owner_Number
11. Owner_Person
12. Owned_Number
Adverb attributes
1. Type
2. Degree
3. Clitic
4. Number
5. Person
6. Wh_Type
Values of the attribute VForm of the type Verb
• indicative (i)
• subjunctive (s)
• imperative (m)
• conditional (c)
• infinitive (n)
• participle (p)
• gerund (g)
• supine (u)
• transgressive (t)
• quotative (q)
Values of the attribute Tense of the type Verb
• present (p)
• imperfect (i)
• future (f)
• past (s)
• pluperfect (l)
• aorist (a)