Zurich Open Repository and Archive, University of Zurich, University Library, Strickhofstrasse 39, CH-8057 Zurich, www.zora.uzh.ch

Year: 2003

Posted at the Zurich Open Repository and Archive, University of Zurich. ZORA URL: https://doi.org/10.5167/uzh-20340

Conference or Workshop Item

Originally published at: Volk, Martin (2003). German prepositions and their kin. A survey with respect to the resolution of PP attachment ambiguities. In: Workshop on The Linguistic Dimensions of Prepositions and their Use in Computational Linguistics Formalisms and Applications, Toulouse, 2003.

German prepositions and their kin. A survey with respect to the resolution of PP attachment ambiguities

Martin Volk
Stockholm University, Department of Linguistics
SE-10691 Stockholm
volk@ling.su.se

Abstract

This paper surveys German prepositions and their relatives: contracted prepositions, pronominal adverbs, and reciprocal pronouns. We elaborate on corpus frequencies for these and on their properties with respect to PP attachment. We show that prepositions and contracted prepositions can be handled together. They show an overall attachment tendency towards the noun. But pronominal adverbs and reciprocal pronouns show an overall attachment tendency towards the verb and therefore must be treated separately.¹

¹ This paper is based on my research at the University of Zurich in a project supported by the Swiss National Science Foundation under grant 12-54106.98.

Keywords: Corpus linguistics, ambiguity resolution, unsupervised learning

1 Introduction

Any computer system for natural language processing has to struggle with the problem of ambiguities. If the system is meant to extract precise information from a text, these ambiguities must be resolved. One of the most frequent ambiguities arises from the attachment of prepositional phrases (PPs). A PP that follows a noun (in English or German) can be attached to the noun or to the verb. We did an in-depth study on unsupervised statistical methods to resolve such ambiguities in German sentences based on cooccurrence values derived from a shallow parsed corpus (see [Volk, 2001] and [Volk, 2002]).

Corpus processing consisted of proper name recognition and classification, Part-of-Speech tagging, lemmatization, phrase chunking, and clause boundary detection. We used a corpus of more than 5 million words from the Computer-Zeitung (CZ), a weekly computer science newspaper. In addition to this training corpus, we prepared a 3000 sentence corpus with manually annotated syntax trees. From this treebank we extracted over 4000 test cases with ambiguously positioned PPs for the evaluation of the disambiguation method. We will call these test cases the ‘CZ test set’.

As a basis for this study we surveyed German prepositions and their relatives and we checked for prepositions, contracted prepositions, pronominal adverbs and reciprocal pronouns whether they can mutually benefit from each other with respect to attachment tendencies.

2 German prepositions

Prepositions in German are a class of words relating linguistic elements to each other with respect to a semantic dimension such as local, temporal, causal or modal. They do not inflect and cannot function by themselves as a sentence unit (cf. [Bußmann, 1990]). But, unlike other function words, a German preposition governs the grammatical case of its argument (genitive, dative or accusative). Frequent German prepositions are an, für, in, mit, zwischen.

Prepositions are considered to be a closed word class. Nevertheless it is difficult to determine the exact number of German prepositions. [Schröder, 1990] speaks of “more than 200 prepositions”, but his “Lexikon deutscher Präpositionen” lists only 110 of them. In this dictionary all entries are marked with their case requirement and their semantic features. For instance, ohne requires the accusative and is marked with the semantic functions instrumental, modal, conditional and part-of.²

² See also [Klaus, 1999] for a detailed comparison of the range of German prepositions as listed in a number of recent grammar books.

The lexical database CELEX [Baayen et al., 1995] contains 108 German prepositions with frequency counts derived from corpora of the “Institut für deutsche Sprache”. This results in the arbitrary inclusion of nördlich, nordöstlich, südlich while östlich and westlich are missing.

Searching through 5.5 million tokens of our tagged computer magazine corpus we found around 540,000 preposition tokens corresponding to 99 preposition types.³ These counts do not include contracted prepositions. A list of the 66 most frequent German prepositions with frequencies from our corpus can be found in appendix A.

³ These figures are based on automatically assigned part-of-speech tags. If the tagger systematically mistagged a preposition, the counting procedure does not find it. In the course of the project we realized that this happened to the prepositions à, via and voller as used in the following example sentences (all examples in this paper are from the Computer-Zeitung, Konradin-Verlag, 1993-1997).

(1) Derselbe Service in der Regionalzone (bis zu 50 Kilometern) kostet 23 Pfennig à 60 Sekunden.
(2) Master und Host kommunizieren via IPX.
(3) Windows steckt voller eigener Fehler.

An early frequency count for German by [Meier, 1964] lists 18 prepositions among the 100 most frequent word forms. 17 out of these 18 prepositions are also in our top-20 list. Only gegen is missing which is on rank 23 in our corpus. This means that the usage of the most frequent prepositions is stable over corpora and time.

All frequent prepositions in German have some homograph serving as

• separable verb prefix (e.g. ab, auf, mit, zu),
• clause conjunction (e.g. bis, um)⁴,
• adverb (e.g. auf, für, über) in often idiomatic expressions (e.g. auf und davon, über und über),
• infinitive marker (zu),
• proper name component (von), or
• predicative adjective (e.g. an, auf, aus, in, zu as in Die Maschine ist an/aus. Die Tür ist auf/zu.).

⁴ [Jaworska, 1999] (p. 306) argues that “clause-introducing preposition-like elements are indeed prepositions”.

The most frequent homographic functions are separable verb prefix and conjunction. Fortunately, these functions are clearly marked by their position within the clause. A clause conjunction usually occurs at the beginning of a clause, and a separated verb prefix mostly occurs at the end of a clause (rechte Satzklammer). A part-of-speech tagger can therefore disambiguate these cases.⁵

⁵ Note the high degree of ambiguity for zu which can be a preposition zu ihm, a separated verb prefix sie sieht ihm zu, the infinitive marker ihn zu sehen, a predicative adjective das Fenster ist zu, an adjectival or adverb marker zu gross, zu sehr, or the ordinal number marker sie kommen zu zweit.

Typical (i.e. frequent) prepositions are monomorphemic words (e.g. an, auf, für, in, mit, über, von, zwischen). Many of the less frequent prepositions are derived or complex. They have turned into prepositions over time and still show traces of their origin. They are derived from other parts-of-speech such as

• nouns (e.g. angesichts, zwecks),
• adjectives (e.g. fern, unweit),
• participle forms of verbs (e.g. entsprechend, während; ungeachtet), or
• lexicalized prepositional phrases (e.g. anhand, aufgrund, zugunsten).

German prepositions typically do not allow compounding. It is generally not possible to form a new preposition by a concatenation of prepositions. The two exceptions are gegenüber and mitsamt. Other concatenated prepositions have led to adverbs like inzwischen, mitunter, zwischendurch.

[Helbig and Buscha, 1998] call the monomorphemic prepositions primary prepositions and the derived prepositions secondary prepositions. This distinction is based on the fact that only primary prepositions form prepositional objects, pronominal adverbs (cf. section 2.2) and prepositional reciprocal pronouns (cf. section 2.3).

In addition, this distinction corresponds to different case requirements. The primary prepositions govern accusative (durch, für, gegen, ohne, um) or dative (aus, bei, mit, nach, von, zu) or both (an, auf, hinter, in, neben, über, unter, vor, zwischen). Most of the secondary prepositions govern genitive (angesichts, bezüglich, dank). Some prepositions (most notably während) are in the process of changing from genitive to dative. Some prepositions do not show overt case requirements (je, pro, per; cf. [Schaeder, 1998]) and are used with determiner-less noun phrases.

Some prepositions show other idiosyncrasies. The preposition bis often takes another preposition (in, um, zu as in 4) or combines with the particle hin plus a preposition (as in 5). The preposition zwischen is special in that it requires a plural argument (as in 6), often realized as a coordination of NPs (as in 7).

(4) Portables mit 486er-Prozessor werden bis zu 20 Prozent billiger.
(5) ... und berücksichtigt auch Daten und Datentypen bis hin zu Arrays oder den Records im VAX-Fortran.
(6) Die Verbindungstopologie zwischen den Prozessoren läßt sich als dreidimensionaler Torus darstellen.
(7) Durch Microsoft Access müssen sich die Anwender nicht mehr länger zwischen Bedienerfreundlichkeit und Leistung entscheiden.

Results for PP attachment

We explored various possibilities to extract PP disambiguation information from the automatically annotated CZ corpus. We first used it to gather frequency data on the cooccurrence of pairs: nouns + prepositions and verbs + prepositions.

The cooccurrence value is the ratio of the bigram frequency count freq(word, preposition) divided by the unigram frequency freq(word). For our purposes word can be the verb V or the reference noun N1. The ratio describes the percentage of the cooccurrence of word + preposition against all occurrences of word. It is thus a straightforward association measure for a word pair. The cooccurrence value can be seen as the attachment probability of the preposition based on maximum likelihood estimates. We write:

$$\mathrm{cooc}(W, P) = \frac{\mathrm{freq}(W, P)}{\mathrm{freq}(W)} \quad \text{with } W \in \{V, N_1\}$$

The cooccurrence values for verb V and noun N1 correspond to the probability estimates in [Ratnaparkhi, 1998] except that Ratnaparkhi includes a back-off to the uniform distribution for the zero denominator case. We added special precautions for this case in our disambiguation algorithm. The cooccurrence values are also very similar to the probability estimates in [Hindle and Rooth, 1993].

We started by computing the cooccurrence values over word forms for nouns, prepositions, and verbs based on their part-of-speech tags. In order to compute the pair frequencies freq(N1, P), we search the training corpus for all token pairs in which a noun is immediately followed by a preposition. The treatment of verb + preposition cooccurrences is different from the treatment of N+P pairs since verb and preposition are seldom adjacent to each other in a German sentence. On the contrary, they can be far apart from each other, the only restriction being that they cooccur within the same clause. We use the clause boundary information in our training corpus to enforce this restriction. For computing the cooccurrence values we accept only verbs and nouns with an occurrence frequency of more than 10.

With the N+P and V+P cooccurrence values for word forms we did a first evaluation over the CZ test set with the following simple disambiguation algorithm.

    if ( cooc(N1,P) && cooc(V,P) ) then
        if ( cooc(N1,P) >= cooc(V,P) ) then
            noun attachment
        else
            verb attachment

We found that we can only decide 57% of the test cases with an accuracy of 71.4% (93.9% correct noun attachments and 55.0% correct verb attachments). This shows a striking imbalance between the noun attachment accuracy and the verb attachment accuracy. This imbalance was countered with a noun factor which was automatically derived from the corpus based on the overall attachment tendency of prepositions towards nouns in comparison to their tendency towards verbs (cf. [Volk, 2002]). This move leads to an improvement of the overall attachment accuracy to 81.3%. We then went on to lemmatize all word forms which also included mapping contracted prepositions to their corresponding bare forms.
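The cooccurrence value and the simple disambiguation algorithm described in this section can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the study. The function names, the dictionary layout and all counts are invented for the example. The `&&` test of the algorithm is read here as "both cooccurrence values are available and non-zero". The text says only that a noun factor was derived from the corpus; that it multiplies the noun value before the comparison is an assumption of this sketch, and the default of 1.0 reproduces the algorithm exactly as printed above.

```python
MIN_FREQ = 10  # the text accepts only verbs and nouns occurring more than 10 times


def cooc(word, prep, pair_freq, word_freq, min_freq=MIN_FREQ):
    """Return freq(word, prep) / freq(word), or None if no usable value exists."""
    f_word = word_freq.get(word, 0)
    if f_word <= min_freq:
        # Too rare (this also covers the zero-denominator case).
        return None
    f_pair = pair_freq.get((word, prep), 0)
    if f_pair == 0:
        # No evidence for this pair; treated as "no value" (assumption).
        return None
    return f_pair / f_word


def attach(noun, verb, prep, noun_pairs, noun_freq, verb_pairs, verb_freq,
           noun_factor=1.0):
    """Return 'noun', 'verb', or None when the case cannot be decided."""
    c_noun = cooc(noun, prep, noun_pairs, noun_freq)
    c_verb = cooc(verb, prep, verb_pairs, verb_freq)
    if c_noun is None or c_verb is None:
        return None
    # noun_factor is a hypothetical parameter; how the factor enters the
    # comparison is an assumption of this sketch.
    return "noun" if c_noun * noun_factor >= c_verb else "verb"


# Made-up toy counts, for illustration only.
noun_freq = {"Verbindung": 40, "Daten": 8}
noun_pairs = {("Verbindung", "zwischen"): 10, ("Verbindung", "mit"): 2}
verb_freq = {"darstellen": 50}
verb_pairs = {("darstellen", "zwischen"): 5, ("darstellen", "mit"): 5}

decision = attach("Verbindung", "darstellen", "zwischen",
                  noun_pairs, noun_freq, verb_pairs, verb_freq)
print(decision)
```

With these toy counts the noun value for zwischen is 10/40 and the verb value is 5/50, so the sketch chooses noun attachment. A test case whose noun or verb falls at or below the frequency threshold, such as Daten here, stays undecided, which mirrors the observation that only part of the test set can be decided at all.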