Compositional vs. Frozen Sequences

Jorge Baptista
University of Algarve, Portugal
email@example.com
http://w3.ualg.pt/~jbaptis
Lexicon-Grammar Workshop, Beijing, 16-17 Oct. 2004

Introduction

Compound words and frozen expressions constitute a major part of the lexicon of many languages. Their definition is not easy, and conceptual and terminological discussions abound in the literature.

Traditionally, compounds are defined on semantic grounds, by the criterion of non-compositionality: the global meaning of a multiword expression cannot be calculated from the meaning of its individual elements when they are used separately in the language. Alternatively, they can be defined by formal, syntactic (or combinatorial) constraints.

On semantic grounds, one may distinguish:
semantically ‘opaque’ compound words: dog-collar, dogfight
only ‘half opaque’ compound words: dogfish, fish knife, half-life
semantically ‘transparent’ compound words: heavy element, <date> before present (present = 1950)

Spelling rules are just writing conventions (orthography consecrates writing habits):
fish knife / fish-knife, fish finger / fish-finger

There are formal constraints on word combinations that are not semantically motivated, e.g. between the set of time-related nouns (dawn, morning, afternoon, sunset, evening, night) and the prepositions, determiners or modifiers they combine with:
at noon / *at morning
in the evening / *on the evening
in the morning / *in morning
by morning / by the morning

The semantic criterion relies on the meaning of individual, isolated words. However, the meaning of a word is related to the word’s syntax, i.e. the words it co-occurs with. The meaning of a given word is determined by inserting it in several different sentences and, by carefully controlling formal changes on those sentences, looking for changes (or invariance) in meaning.

There is disagreement about which word combinations are ‘transparent’, ‘half-transparent’ or even ‘opaque’. Intuitions about meaning are almost always vague and too imprecise to be used in a reproducible way.
It is therefore preferable to use syntactic, formal criteria to identify compounds, and to show that words are ‘frozen’ together even if the meaning of the combination is relatively ‘transparent’. Here, ‘frozen’ means that two or more elements of the expression do not show any distributional variation. In the set of time-related nouns, for example, the blocking of distributional variation is unpredictable: the acceptable combinations have to be included in the lexicon and should therefore be treated as compound lexical units.

Every part of speech (PoS) shows both simple and compound words. For example, word combinations such as the man in the street could very well be regarded as an indefinite pronoun (similar to everyone):
Politicians always cared about the opinion of the man in the street

Many compound prepositions and conjunctions have already been included in current dictionaries:
John stopped in the middle of the street
John came to Paris by way of Madrid
John came to Paris in spite of my warnings against it
John came to Paris because of my warnings

There are some (productive?)
rules to produce compound adjectives:
-like: to be life-like, Algol-like languages
-proof: to be (bullet + water + …)-proof

Other compound adjectives are frozen in purely combinatorial ways:
John is (sick and tired + *tired and sick) of that

Moreover, in English, verb + particle combinations forming phrasal verbs can be considered a special case of compound verb:
John ran (for a mile)
John ran away (to Brazil)
The batteries are running down
John ran into Mary
John ran off to Brazil
John ran off with a book
John’s lecture ran on
The printer ran out of paper
The truck ran over the dog
John ran through the entire proceeding

Some compound words can be described in a regular way, by means of finite-state transducers, as, for example, the (potentially infinite) set of compound numerals:
twenty-one, one hundred and twenty-one, twenty-one thousand two hundred and twenty-one …

Texts, particularly scientific and technical texts, contain a high number of compound words. These meaning units must be identified as a block and not as a string of simple words, as their overall meaning is unpredictable and cannot be directly calculated from the meaning of their internal elements.

In this lecture, we will focus on syntactic properties that can be used to identify compounds. Since compounds are a major part of many languages’ lexicon, the task of retrieving them and describing them in dictionaries is not trivial, especially if these dictionaries are meant to be used in natural language processing. There are many statistical methods to retrieve compound (or multiword) lexical units from texts; the linguist’s task is to validate those word combinations as compound lexical units and to build the dictionaries for them. In order to do this, linguists have to rely on syntactic properties, which can only be done by learning the general syntactic rules of the language. It is only then that linguists can find out the combinatorial constraints on those rules shown by multiword expressions.
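As an illustration of the regular description of compound numerals mentioned above, here is a minimal sketch in Python. It is not a finite-state transducer in the strict sense, only a small recursive function standing in for one, and it covers only the pattern shown in the text (hyphenated tens and units, "and" after hundreds) for numbers below one million. The names `UNITS`, `TENS`, `below_thousand` and `numeral` are assumptions made for this example.

```python
# Hypothetical sketch: spelling out English compound numerals below one million.
UNITS = ["zero", "one", "two", "three", "four", "five", "six", "seven",
         "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
         "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]


def below_thousand(n):
    """Spell a number from 1 to 999, e.g. 121 -> 'one hundred and twenty-one'."""
    parts = []
    hundreds, rest = divmod(n, 100)
    if hundreds:
        parts.append(UNITS[hundreds] + " hundred")
    if rest:
        if hundreds:
            parts.append("and")
        if rest < 20:
            parts.append(UNITS[rest])
        else:
            tens, unit = divmod(rest, 10)
            # Tens and units are joined by a hyphen, as in 'twenty-one'.
            parts.append(TENS[tens] + ("-" + UNITS[unit] if unit else ""))
    return " ".join(parts)


def numeral(n):
    """Spell a number from 0 to 999,999 by reusing the same sub-pattern twice."""
    if not 0 <= n < 1_000_000:
        raise ValueError("this sketch only covers 0..999999")
    if n == 0:
        return UNITS[0]
    thousands, rest = divmod(n, 1000)
    parts = []
    if thousands:
        parts.append(below_thousand(thousands) + " thousand")
    if rest:
        parts.append(below_thousand(rest))
    return " ".join(parts)


for value in (21, 121, 21221):
    print(value, "->", numeral(value))
```

The point of the sketch is that the same small sub-pattern (hundreds, tens, units) is reused at each level, which is why a finite description can cover a very large set of compound forms.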
This presentation is structured in two parts: first we will present some of the major syntactic properties distinguishing compound nouns from ordinary noun phrases; in the second part we will give some examples of how the same methodology can be applied to the identification of compound adverbs.

1. Compound nouns.

Probably the best-known case of compounding, compound nouns constitute the largest of all compound word classes. In every domain (scientific, technical, economic, political, etc.) there is a constant need for coining new denominations for new objects, tools, concepts, products and so on, nouns being the most natural part of speech (PoS) to accommodate such new designations.

Many compound nouns are formed by sequences of grammatical categories identical to those appearing in ordinary (i.e. not frozen) noun phrases:
a nice dog (a dog)
a hot dog (a sandwich)
a square table (a table)
a square root (a mathematical function)
Adam’s orange (an orange)
Adam’s apple (a part of the human body)

The distinction between compounds and free word combinations is not as clear-cut as dictionaries and grammars could sometimes lead one to believe. This presentation will show some of the basic syntactic properties that can help to distinguish compounds from free word combinations.

Traditionally, compounding has been studied in the framework of grammar (Morphology). In the Lexicon-Grammar approach, compounds are described with the very same tools used to describe the syntax of noun phrases. In order to identify a compound as such, it is necessary to check whether that particular word combination shows any constraints on the combinatorial properties that one would expect to find in a noun phrase (NP) formed by the same internal PoS sequence (G. Gross 1988, 1989). The method is to compare the grammar of noun phrases with the syntactic properties of a word combination that is a candidate for the status of compound word. Our examples here will consist of already well-known compound nouns.
By analogy, the same methodology can be extended to other, more complex, word combinations.

Let’s take the examples square table / square root. In a free NP with the internal structure Adjective + Noun (AN), where the adjective is often a free modifier of the noun, the predicative function of the adjective on the noun is made explicit by a paraphrase with a relative clause with the auxiliary verb be:
a square table : a table that is square

This is not the case with the compound square root:
a square root : *a root that is square
and also with many other compound nouns, where we say that the adjective loses its predicativity.

Also, free adjectives can be further modified by an adverb:
a square table : a perfectly square table
a table that is perfectly square
but:
a square root : *a perfectly square root
*a root that is perfectly square

When the AN combination is free, both the adjective and the noun can vary, provided that basic distributional constraints are respected. Therefore, table can be replaced by other nouns:
a square (table + door + carpet + …)
in the same way as square can be replaced by other distributionally similar adjectives:
a (square + oval + triangular + oblong + …) table

However, when an AN combination forms a compound noun, distributional variation is blocked:
a square (root + *twig + *branch + …)
a (square + *oval + *triangular + *oblong + …) root

Some strings are ambiguous, such as round table (free combination or compound noun), and only the syntactic environment may help to disambiguate them:
I have bought a round table for my dining room (a piece of furniture)
I have attended a round table on French syntax (an event)

Even if many compound nouns are ambiguous with free word combinations, they are usually much less ambiguous than simple words.

In a free NP, adjectives are just optional modifiers of the noun.
They can be deleted without changing the overall meaning of the NP (nor the meaning of the sentence where the NP is inserted):
John bought a (E + square) table

However, with some abstract nouns that express predicates and are hence called predicative nouns (M. Gross 1981; see below), the presence of a modifier is often obligatory (Meunier 1981; Giry-Schneider 1995; Laporte 1997):
He had an immense esteem for tradition (Henry James, Portrait of a Lady)
*He had esteem for tradition
*He had an esteem for tradition

When the adjective is not a mere modifier of the noun, it usually cannot be deleted, for it is the AN combination that forms a compound lexical unit. This is particularly clear with semantically opaque compound nouns:
John attended a round table on Chinese Syntax
*John attended a table on Chinese Syntax
John calculated the square root of 9
*John calculated the root of 9

But in some compounds, even frozen adjectives can be deleted. For example, most of the time people calculate square roots, so that in some languages (Portuguese, for instance), unless otherwise stated, the adjective quadrada (equivalent to square) can be zeroed without loss of information:
O João calculou a raiz (E + quadrada) de 9
(John calculated the (E + square) root of 9)

In many other cases, however, the adjective in a compound noun functions as a classifier of the noun, distinguishing a particular type of object:
John likes to drink (red + white + …) wine

In this case, the adjective can be zeroed, with some loss of information:
John likes to drink (E + red) wine

The classifying function of an adjective can be detected by means of classifying sentences:
A red wine is a type of wine

NPs with free modifiers cannot enter classifying sentences:
*A square table is a type of table

Of course, compound nouns cannot enter these sentences either:
*A square root is a type of root

When an adjective functions as a classifier, it is sometimes possible to see a (usually) small distributional paradigm:
John calculated the (square + cubic) root of that value
John likes to drink (red + white + …) wine
which is closed for distributional variation:
John calculated the (square + cubic + *triangular + *spherical) root of that value
John likes to drink (red + white + *yellow + *blue …) wine

In this sense, AN combinations where the adjective is a classifier can be described as compound nouns. The distributional paradigm of the classifier adjective can be rather large (acids) and open to the coining of new terms, or relatively small (teeth and vertebrae) and closed to further additions:
John poured some (ascorbic + citric + nitric + …) acid into the solution
The dentist repaired one of my (incisive + canine + molar + …) teeth
John was injured in one of his (cervical + lumbar + …) vertebrae

In the compounds of wine, one finds that many toponyms (Ntop) designating wine-producing regions can replace wine:
John likes to drink a glass of (wine + Porto + Bordeaux + …)

These combinations can be derived by deleting an occurrence of wine:
John likes to drink a glass of (E + Porto + Bordeaux + …) wine

The number of Ntop wine combinations is very large (every wine region), but highly conventional and determined by extra-linguistic factors. Extensive lists can be made, but they are of little linguistic interest.

Some adjectives combine in a highly exclusive way with a very short set of nouns (often only one):
This noun is inflected in the nominative case

In these cases, the noun of some AN compounds (but not all) can be zeroed, leaving the adjective in a (superficial) noun slot:
This noun is inflected in the nominative (E + case)
The dentist repaired my (canine + molar + …) (E + tooth)

With less ‘exclusive’ adjectives, N can be zeroed depending on the syntactic context:
John prefers to drink red (E + wine) to white (E + wine)

This is probably one of the reasons why dictionaries have classified so many adjectives both as adjectives and nouns (see M.
Gross 1998 for further discussion of this subject). This is not always the case:
John was injured in a (*cervical + *lumbar + …)
or it may depend on the language and on the NA combination involved. For Portuguese, for instance, zeroing of N in a similar case is observed with some adjectives but not with others:
O João ficou ferido numa (E + vértebra) (cervical + *dorsal + *lombar + *sacra)

A particular case of AN combinations involves relational adjectives, i.e. adjectives derived from nouns, such as presidential (from President). These adjectives never allow the formation of the relative clause, nor the insertion of an adverbial modifier:
The presidential address to the Congress
*The address to the Congress that was presidential
*The very presidential address to the Congress <was very disturbing>

Nouns such as address express predicates and are therefore called predicative nouns (M. Gross 1981). Relational adjectives, such as presidential, when combined with predicative nouns, do not function as mere modifiers of the noun. Instead, they are derived from a complement NP:
The President’s address to the Congress <was very disturbing>

In this sentence, President is interpreted as an argument (in this case, the subject) of the predicative noun address. This syntactic and semantic relation between the two nouns (President, address) is of the same nature as the relation between a subject and a verb, and it has a formal counterpart in the sentence:
The President made an address to the Congress

We consider this to be an elementary sentence, in which the predicative node is the noun address, which selects its two arguments (President, Congress). In this sentence, to make is a support verb (Vsup; also called light verb): it is devoid of meaning and it functions as a morphological tool to actualize the predicative noun, carrying the tense morphemes that the noun cannot express.
Now, the adjective presidential can enter many other AN combinations involving predicative nouns:
The presidential campaign <…>

However, some of these combinations cannot be derived from the reduction of support verb sentences. In fact, the NP above, The presidential campaign, is ambiguous:

(a) It is ‘the campaign that the President is making’, and the NP is equivalent to:
The president’s campaign <has been extremely violent>

(b) It is a campaign where many people run for the office of President (and not necessarily the President himself), and the NP can appear in sentences such as:
The presidential campaign <takes place in September>

Notice that the regularly derived NP cannot appear in this context:
*The president’s campaign takes place in September

It is therefore necessary to study in detail the properties of all AN combinations where Adj is a relational adjective and N a predicative noun, in order to determine whether the combination can be regularly derived from an elementary sentence with a support verb or whether this derivation is blocked in some way and the combination has become a compound noun (A. Monceaux 1999).

The next case illustrates a curious type of blocking involving relational adjectives.
Consider the relational adjectives solar (sun) and lunar (moon). AN noun phrases with these adjectives can be regularly derived from elementary sentences where moon or sun is an argument of a predicative noun, such as eclipse:
the eclipse of the (moon + sun) <lasted 20 minutes>
the (lunar + solar) eclipse <lasted 20 minutes>
?*the (moon + sun)’s eclipse <lasted 20 minutes>
*the (moon + sun) eclipse <lasted 20 minutes>

There are, however, many AN combinations that one cannot derive from moon or sun:
the lunar month <lasts 28 days>
*the moon’s month <lasts 28 days>
*the month of the moon <lasts 28 days>
*the moon month <lasts 28 days>
the solar year <lasts 365.25 days>
*the sun’s year <lasts 365.25 days>
*the year of the sun <lasts 365.25 days>
?*the sun year <lasts 365.25 days>

Finally, some compounds show morphosyntactic constraints: while their elements can vary in gender and/or number when used independently, together they do not show any variation. For example, national waters is always used in the plural, in spite of the uncountable nature of water:
They prevented the ship from entering (national waters + *national water)

There is a certain degree of institutionalization in compounding. Sometimes several different structures may be available in the language in order to designate the same concept or object, but the language retains only one of them. For a ‘machine used to take photographs’, English could have used:
photographic machine (AN)
photographing machine (V-ing N, as in washing machine)
photo(graph) machine (NN, as in copy machine)
photographier (N-er, as in photocopier)
Instead, it is the simple word camera that is used to name this object.
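The parenthesized paradigms used in the examples of this text can be expanded mechanically into individual test strings, which is convenient when the acceptability judgments are to be stored in a table. The following Python sketch assumes the reading of the notation that the examples suggest: `+` separates alternatives, `E` is the empty (zeroed) element, and `*` marks an unacceptable form. The function name `expand` is an assumption of this example; marks such as `?*`, nested parentheses and elided alternatives (`…`) are not interpreted, and elided alternatives are simply skipped.

```python
import itertools
import re

# A group is a parenthesized span that contains at least one '+'.
# Plain parentheticals such as '(a sandwich)' are left untouched.
GROUP = re.compile(r"\(([^()]*\+[^()]*)\)")


def expand(pattern):
    """Expand '(a + *b + E)' groups into a list of (string, acceptable) pairs."""
    pattern = pattern.strip()
    # A star in front of the whole pattern marks every expansion as unacceptable.
    whole_ok = not pattern.startswith("*")
    # re.split with a capturing group alternates literal text and group bodies.
    pieces = GROUP.split(pattern.lstrip("*"))
    slots = []
    for index, piece in enumerate(pieces):
        if index % 2 == 0:
            slots.append([(piece, True)])  # literal text between groups
            continue
        options = []
        for alternative in piece.split("+"):
            alternative = alternative.strip()
            if alternative in ("", "…", "..."):
                continue  # elided alternatives cannot be expanded
            ok = not alternative.startswith("*")
            text = alternative.lstrip("*").strip()
            options.append(("" if text == "E" else text, ok))
        slots.append(options)
    results = []
    for combination in itertools.product(*slots):
        text = " ".join("".join(part for part, _ in combination).split())
        results.append((text, whole_ok and all(ok for _, ok in combination)))
    return results


for text, ok in expand(
    "They prevented the ship from entering (national waters + *national water)"
):
    print("ok " if ok else "*  ", text)
```

A string is counted as acceptable only if none of the alternatives it was built from carries a star; this is a simplification, since in the source the judgments are given on whole sentences.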
When comparing different languages, one finds that each may adopt a different strategy, hence:
FR: appareil photo (NN) ‘photo apparatus’
*appareil à photographier (N à V), *appareil photographique (NA)
*photograph(i)euse / *photograph(i)eur (N-eur)
PT: máquina fotográfica (NA) ‘photographic machine’
*máquina de fotografar (N de V)
*foto-máquina (NN)
*fotografiadora (N-ora) / *fotografadora (V-ora)

In view of these language differences, many dictionaries used in machine translation may have to include some word combinations regardless of their semantic transparency.

When describing different types of compound nouns, different syntactic properties have to be used to determine their degree of formal frozenness. These properties are the very same that are used to describe the syntactic relations between the elements of a free noun phrase. Compound nouns differ from free noun phrases in that they do not admit some (or any) of these properties.

2. Compound Adverbs.

Compound adverbs pose similar problems. Simple adverbs are already included in dictionaries (if we do not consider the adverbs regularly derived from adjectives with the suffix -ly: rapidly), but many compound adverbs have simply been left out or are described as mere expressive word combinations with no particular lexical status.

The adverbial status of a phrase can be shown by its replacement by simple adverbs:
John is reading Shakespeare (now + at this moment)

For the most part, compound adverbs are formally identical to prepositional phrases, but several combinatorial constraints hold between two or more of their elements. Usually the resulting overall meaning of the expression cannot be calculated from the sum of the meanings of its internal elements.
Thus, we find several time adverbs formed with the time-related noun moment:
<That happened> at (this + that + the) moment
<I was doing this> for the moment
<I didn’t believe it> for a moment
<I did it> on the spur of the moment
<I did it> not a moment too soon

In these adverbs, the combination of preposition and noun is frozen. If we replace moment with another, almost synonymous word, instant, most of these combinations become unacceptable:
<That happened> at (this + that + *the) instant
<I did it> *for the instant
<I didn’t believe it> for an instant
<I did it> *on the spur of the instant
<That happened> ?not an instant too soon

Several adverbs look like ordinary noun phrases:
One moment John was reading quietly, the next moment he was crying

Some of these NP-like adverbs may derive from the deletion of a preposition, while others do not:
(At + *on + E) one moment John was reading quietly, (?*at + ?*on + E) the next (E + moment) he was crying

The current spelling of many simple adverbs betrays their former status as phrases:
John goes jogging (everyday + every night)

The determiner of the noun can sometimes present some formal variation, as in:
at (this + that + the) moment, for (a + one) moment
but it becomes frozen when its replacement involves a clear change in the overall meaning:
John is reading Shakespeare for the moment
I believed for a moment that John was reading Shakespeare

In some adverbs the preposition and the noun may be frozen, but the noun allows for the insertion of modifiers:
<That happened> at that unfortunate moment
<That happened> at the moment we are speaking
<That happened> at this (precise + exact) moment

Some of these insertions may also be frozen:
<That happened> at (this + that + *the) very moment
<That happened> at the (last + *first) moment
<That happened> *at this (imprecise + inexact) moment
or depend on the determiner-modifier combinations involved (for example, a definite article and a relative clause):
<That happened> at (*this + *that + the)
very moment I was speaking

Other constraints on formal variation can be found:
<John arrived> not a moment too (soon + *late)

Some subordinate clauses function as frozen adverbs (M. Gross 1986):
<John will stay in his post> until hell freezes over (= forever)
<John will only get my post> when hens get teeth (= never)
<John will only get my post> when pigs fly (= never)

In these examples, one cannot change any element of the (frozen) subordinate clause. Particular cases of frozen subordination are comparative frozen adverbs, modifying verbs or adjectives:
<John moves> like a bull in a china shop (= clumsily)
<John cried> like Magdalen (= very much)
<The crowd rose to its feet> as one man (= together, at the same time)
<John is as fast> as a bullet (= very fast)
<John is as white> as a sheet (= very white)

Notice, in some cases, the absence of the first comparative particle:
<John is deaf> as a post

Some compound adjectives may have been formed from such comparative structures:
John is stone deaf
John is deaf (as + like) a stone
but others do not admit this paraphrase:
*John is post deaf
*John is bullet fast
*John is sheet white

There are several compound adverbs that select (or modify) only a limited set of verbs (or predicates):
John (knows + learned + recited) the poem by heart

The adverb man-to-man can only modify SPEAK-like verbs:
John (spoke + talked) man-to-man to Paul

However, there are often many unpredictable distributional constraints:
*John (chatted + whispered) man-to-man to Paul
*John gossiped man-to-man with Paul

Certain verb-adverb combinations are so constrained that the adverb can only modify a single verb:
John heard that (E + straight) from the horse’s mouth (= directly from a bona fide source)

Adverbs are optional modifiers of the verb and can usually be zeroed or replaced by other, simple adverbs, but these highly constrained combinations are closer to frozen sentences.
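The verb-adverb constraints illustrated above could be recorded, in a very reduced form, as lexicon entries that list the verbs each compound adverb is attested with. The Python sketch below uses only the examples given in the text; the table name `LICENSED_VERBS`, the function `is_licensed` and the use of verb lemmas are assumptions made for this illustration, and the verb sets are the ones quoted in the examples, not exhaustive inventories.

```python
# Hypothetical mini-lexicon: compound adverb -> verb lemmas attested in the text.
LICENSED_VERBS = {
    "by heart": {"know", "learn", "recite"},
    "man-to-man": {"speak", "talk"},
    "straight from the horse's mouth": {"hear"},
}


def is_licensed(adverb, verb_lemma):
    """Return True or False for a listed adverb, or None if the adverb is unknown."""
    verbs = LICENSED_VERBS.get(adverb)
    if verbs is None:
        return None  # the lexicon has no entry, so no judgment is made
    return verb_lemma in verbs


for adverb, verb in [("man-to-man", "speak"),
                     ("man-to-man", "whisper"),
                     ("by heart", "recite")]:
    print(adverb, "+", verb, "->", is_licensed(adverb, verb))
```

Returning `None` for an unlisted adverb keeps "not described yet" apart from "described and rejected", which matters when such a table is built incrementally.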
Therefore, the linguistic description of compound adverbs is not just a matter of showing their internal word combination constraints. It also involves representing the way they interact with the other elements of the sentence. In this sense, it is not very different from describing the syntax of simple adverbs.

3. Conclusions

The theoretical and methodological framework of Lexicon-Grammar has demonstrated the quantitative importance of compounding in the lexicon of many languages. Using formal criteria to identify compound words has made clear that most of them show an internal PoS structure similar to that of ordinary phrases. Comparing the syntax of free combinations with the restrictions on those formal properties has proved to be the most correct way of identifying compounds without having to rely on vague, imprecise and irreproducible intuitions about meaning.

At the same time, it is the very grammar of the language that comes under scrutiny. Compounds are not just bizarre word combinations; they are a clue to the language’s grammar.

Finally, by adopting a formal, taxonomical approach and by the careful construction of linguistic resources, Lexicon-Grammar enables researchers working on different languages to compare their inventories and their respective syntactic properties (M. Gross 1984; J. Labelle (ed.) 1995). These comparative studies constitute a solid base for many NLP, lexicographic or didactic applications, and eventually for future machine translation.

Bibliography

ACL, 2003. Proceedings of the Workshop on Multiword Expressions: Analysis, Acquisition and Treatment. Sapporo, Japan: ACL.
ACL, 2004. Proceedings of the Workshop on Multiword Expressions: Integrating Processing. Barcelona, Spain: ACL.
Courtois, B.; Garrigues, M.; Gross, G.; Gross, M.; Jung, R.; Mathieu-Colas, M.; Silberztein, M.; Vivès, R., 1997. Dictionnaire électronique des noms composés DELAC : Les composants NA et NN. Rapport Technique du LADL nº 55. Paris: LADL.
Gross, G., 1988.
Degré de figement dans les noms composés. Langages 90: 57-72. Paris: Larousse.
Gross, G., 1990. Définition des noms composés dans un lexique-grammaire. Langue Française 87. Paris: Larousse.
Gross, G., 1996. Les expressions figées : noms composés et d’autres locutions. Paris: Ophrys.
Gross, M., 1975. Méthodes en Syntaxe. Paris: Hermann.
Gross, M., 1981. Les bases empiriques de la notion de prédicat sémantique. Langages 63: 7-52. Paris: Larousse.
Gross, M., 1984. A linguistic environment for comparative Romance syntax. In P. Baldi (ed.), Papers from the 12th Linguistic Symposium on Romance Languages, pp. 373-416. Amsterdam/Philadelphia: John Benjamins.
Gross, M., 1986. Grammaire transformationnelle du français. 3 - Syntaxe de l’adverbe. Paris: ASSTRIL.
Labelle, J. (ed.), 1995. Lexiques-Grammaires comparés et traitements automatiques. Lingvisticae Investigationes Supplementa. Amsterdam/Philadelphia: John Benjamins.
Ranchhod, E.; De Gioia, M., 1996. Comparative Romance Syntax: Frozen adverbs in Italian and in Portuguese. Lingvisticae Investigationes XX-1: 33-85. John Benjamins.