Source: www.lrec-conf.org
Challenges and Solutions for Consistent Annotation of Vietnamese Treebank

Quy T. Nguyen (1, 2), Yusuke Miyao (1, 2), Ha T.T. Le (3), Ngan L.T. Nguyen (4)
(1) The Graduate University for Advanced Studies (SOKENDAI), Japan
(2) National Institute of Informatics, Japan
(3) University of Social Sciences and Humanities, Vietnam
(4) University of Information Technology, Vietnam
quynt@nii.ac.jp, yusuke@nii.ac.jp, trucha.ussh@gmail.com, ngannlt@uit.edu.vn

Abstract

Treebanks are important resources for research in natural language processing, speech recognition, theoretical linguistics, etc. To strengthen the automatic processing of the Vietnamese language, a Vietnamese treebank has been built. However, the quality of this treebank is not satisfactory and is a possible cause of the low performance of Vietnamese language processing. We have been building a new treebank for Vietnamese with about 40,000 sentences annotated with three layers: word segmentation, part-of-speech tagging, and bracketing. In this paper, we describe several challenges of the Vietnamese language and how we solve them in developing annotation guidelines. We also present our methods to improve the quality of the annotation guidelines and to ensure annotation accuracy and consistency. Experimental results show that inter-annotator agreement ratios and accuracy are higher than 90%, which is satisfactory.

Keywords: Vietnamese Treebank, Consistent Annotation, Challenges and Solutions

1. Introduction

Treebanks, corpora annotated with syntactic structures, are important resources for researchers in natural language processing (NLP). Treebanks provide syntactic information that helps improve the quality of NLP tools. To strengthen the automatic processing of the Vietnamese language, Nguyen et al. (2009) built a Vietnamese treebank, named the VLSP treebank, containing 10,000 sentences. However, the quality of the VLSP treebank, including the quality of the annotation scheme, the annotation guidelines, and the annotation process, is not satisfactory and is a possible cause of the low performance of Vietnamese language processing (Nguyen et al., 2012; Nguyen et al., 2013).

We have been building a new Vietnamese treebank with 3,000 texts (about 40,000 sentences) covering 14 topics collected from a Vietnamese online newspaper, Thanhnien news¹. Our treebank is annotated with three layers: word segmentation (WS), part-of-speech (POS) tagging, and bracketing, as shown in Figure 1². We have found that ensuring annotation consistency and accuracy is one of the most important considerations in the annotation of a treebank. This requires clear and complete annotation guidelines. The guidelines contain the annotation scheme, consistent principles for annotating linguistic phenomena, and sufficient examples. These documents are not only used to train annotators but are also valuable references for users of the treebank.

Treeing a Vietnamese sentence

Original sentence:
    Nam kể về tai nạn hôm qua.
    {Nam tells about yesterday's accident.}

1. Word segmentation:
    Nam kể{to tell} về{about} tai_nạn{accident} hôm_qua{yesterday} .

2. POS tagging:
    Nam/Nr kể/Vv về/Cs tai_nạn/Nn hôm_qua/Nt ./PU

3. Bracketing:
    (S (NP-SBJ (Nr-H Nam))
       (VP (Vv-H kể)
           (PP-DOB (Cs-H về)
                   (NP (Nn-H tai_nạn) (NP-TMP (Nt-H hôm_qua)))))
       (PU .))

Figure 1: An example illustrating the process of treeing a Vietnamese sentence.

We prepared three sets of guidelines for the Vietnamese treebank: WS guidelines, POS tagging guidelines, and bracketing guidelines. In this paper, Section 2 describes the general characteristics of the Vietnamese language in comparison with other languages (e.g., English and Chinese) to show that building a high-quality Vietnamese treebank is a challenging problem. We also present our methodology for tackling the challenges in that section. We then discuss difficulties in WS, POS tagging, and bracketing, and how we solve them in developing the annotation guidelines, in Sections 3, 4, and 5, respectively. Finally, in Section 6, we describe our annotation process, how we revise the guidelines during the annotation process, and our methods to ensure annotation consistency and accuracy.

This study is beneficial not only for the development of computational processing technologies for Vietnamese, a language spoken by over 90 million people, but also for similar languages such as Thai and Lao. This study also promotes computational linguistic research on how to transfer methods developed for a widely studied language, like English, to a language that has not yet been intensively studied.

¹ http://thanhnien.vn
² Underscore "_" is used to link syllables of Vietnamese multi-syllable words. Translation for the Vietnamese word is given as a subscript (shown here in curly brackets directly after the word). If the Vietnamese word does not have a translatable meaning, the subscript is blank. Translation for a Vietnamese sentence is given in curly brackets below the original text.

2. Characteristics of Vietnamese language and methodology for guideline preparation

Unlike Western languages, in which blank spaces denote word delimiters, in Vietnamese, blank spaces serve not only as word delimiters but also as syllable delimiters (Diep, 2005; SCSSV, 1983), which causes difficulties in defining words. In addition, unlike English and Japanese, Vietnamese is not an inflectional language, so there are no morphological forms that could provide useful clues for word segmentation and POS tagging. While similar problems also occur with Chinese (Xia et al., 2000), annotating Vietnamese words may be more difficult, because the modern Vietnamese writing system is based on Latin characters, which represent the pronunciation but not the meaning of words, resulting in many homonyms.

Difficulties in Vietnamese occur not only in determining words, as mentioned above, but also in bracketing phrases. One of the reasons is that there are many expressions that have the same POS sequence but different phrase types in Vietnamese. Other difficulties are caused by the fact that word order in Vietnamese is very flexible.

Moreover, there is little consensus in the community about how to define words, phrases, and grammatical structures. Though people agree that Vietnamese is a subject-verb-object (SVO) language, Figure 2a shows a Vietnamese sentence in which the head word of the predicate is not a verb. For sentences that do not have a main verb, we can use the conjunction thì to link the subject and the predicate, as shown in Figure 2b. However, when the conjunction thì is used, linguists disagree about how to bracket the sentence. Diep (2005) considered this sentence a single sentence (Figure 2b), where the conjunction thì links the subject and the predicate. SCSSV (1983), in contrast, considered it a subordinate compound sentence (Figure 2c), on the grounds that the conjunction thì links the two clauses of a subordinate compound sentence.

Meaning: The construction unit is too slow.
Words: Đơn_vị{unit} thi_công{to construct} thì{to be} quá{too} chậm_chạp{slow}

a)  (S (NP-SBJ (Nn-H Đơn_vị) (Vv thi_công))
       (ADJP-PRD (R quá) (Aa-H chậm_chạp))
       (PU .))

b)  (S (NP-SBJ (Nn-H Đơn_vị) (Vv thi_công))
       (Cp thì)
       (ADJP-PRD (R quá) (Aa-H chậm_chạp))
       (PU .))

c)  (S (SPL (NP (Nn-H Đơn_vị) (Vv thi_công)))
       (Cp thì)
       (SPL (ADJP (R quá) (Aa-H chậm_chạp)))
       (PU .))

Figure 2: Examples showing ambiguity of annotating a sentence in Vietnamese.

We prepared three sets of guidelines for the Vietnamese treebank: word segmentation guidelines, POS tagging guidelines, and bracketing guidelines. The problems were tackled on the basis of the following approaches:

• We refer to Vietnamese grammar books (SCSSV, 1983; Diep, 2005) and discuss with our collaborators, who are Vietnamese linguistics experts, to resolve ambiguities and difficulties.

• We study the guidelines of the Chinese Penn Treebank (Xia, 2000b; Xia, 2000a; Xue et al., 2000), the English Penn Treebank (Santorini, 1990; Bies et al., 1995), and the VLSP treebank (Nguyen et al., 2010b; Nguyen et al., 2010a; Nguyen et al., 2010c) and adapt them to our guidelines where possible.

• During the annotation process, annotators³ are requested to discuss with us the constructions that they cannot annotate or find ambiguous. These constructions are important clues for revising the guidelines.

• We conduct nine rounds of measurement of inter-annotator agreement and accuracy, in which two annotators annotate the same data. The inconsistencies and annotation errors found in each round are important clues for improving the annotation guidelines and retraining the annotators.

Details of applying these approaches during the process of building the Vietnamese treebank are explained in the following sections.

³ Our treebank is annotated by two annotators who are graduate linguistics students.

3. Word segmentation guidelines

3.1. Challenges of word segmentation

Words are the most basic units of a treebank (Sciullo and Williams, 1987), and defining words is the first step in the annotation process (Xia, 2000b; Xia, 2000a; Sornlertlamvanich et al., 1999). For languages like English, defining words is almost trivial, because blank spaces denote word delimiters. However, it is a difficult problem in Vietnamese, even for a native speaker. Although most linguists agree that the Vietnamese language has two types of words, single-syllable words (single words) and multi-syllable words (compound words), distinguishing between single and multi-syllable words involves much ambiguity.

The ambiguities of Vietnamese WS occur for the following reasons. First, in Vietnamese, blank spaces serve not only as word delimiters but also as syllable delimiters. Second, there are no morphological marks to act as clues for identifying words. Third, the Vietnamese writing system is based on Latin characters, which represent the pronunciation but not the meaning of words. Expressions that have the same surface form but different word segmentations appear frequently in Vietnamese. Rows 1 and 2 in Table 1, for instance, show two different segmentations of the expression quần áo. Fourth, there is little consistency in segmenting expressions. For example, some linguists consider the expression cá{fish} rô{anabas} {anabas} a compound word but bệnh{illness} sởi{measles} {measles} two words (Hoang, 1998; Diep, 2005). However, these expressions have a similar construction: the combination of a categorization noun⁴ and a specific noun.

No. | Expression (A B)                     | Meaning            | WS
1   | quần{trousers} áo{shirt}             | clothes            | a word
2   | quần{trousers} áo{shirt}             | trousers and shirt | 2 words
3   | ăn{eat} nói{speak}                   | to speak           | a word
4   | tìm{find} kiếm{find}                 | to find            | a word
5   | nồi{pot} đồng{copper}                | copper pot         | 2 words
6   | nồi{pot} bằng{by} đồng{copper}       | copper pot         | 3 words
7   | đen{black} đúa{}                     | black              | a word
8   | cá{fish} heo{pig}                    | dolphin            | a word
9   | cá{fish} lia_thia{betta fish}        | betta fish         | 2 words
10  | nghiên_cứu{research} viên{-er}       | researcher         | 2 words
11  | nhà{-er} nghiên_cứu{research}        | researcher         | 2 words

Table 1: Examples to illustrate the principles of word segmentation.

⁴ Categorization nouns indicate general entities, such as cá{fish} and cây{tree}.

3.2. Policy for annotation of word segmentation

As mentioned above, our purpose for word segmentation is to build a treebank for Vietnamese. Therefore, we consider a word to be the smallest syntactic unit that has a complete meaning and whose internal structure cannot be analyzed by syntactic rules (Sciullo and Williams, 1987). On the basis of this word definition, we propose the following rules to solve the difficulties in Vietnamese word segmentation:

• If A and B⁵ have different meanings and the meaning of the combination form (A_B) is different from that of the split form (A B), we select the form whose meaning is more appropriate for the context. Examples 1 and 2 in Table 1 show an expression that has two different meanings because of different word segmentations.

• If A and B have different meanings and A_B has the same meaning as A or B, the combination form is selected. The example is given in row 3 of Table 1.

• If A and B have the same meaning, the combination form is selected (example 4 in Table 1).

• If another syllable can be inserted between A and B, we select the split form (examples 5 and 6 in Table 1).

• If A is a word and B is not (or vice versa), we select the combination form. Example 7 in Table 1 shows that if đúa is considered a single word, its meaning is undefined. Therefore, it is considered part of a multi-syllable word.

• For an expression consisting of a categorization noun (A) and a specific noun (B), if B indicates something different from what the expression indicates, A_B is considered a compound word. In contrast, if B has a similar meaning to A B, A and B are considered two words (examples 8 and 9 in Table 1).

• An expression consisting of one or more Sino-Vietnamese syllables and an original Vietnamese word, in which the Sino-Vietnamese syllables are the elements used to create new words, is not considered a word (example 10 in Table 1).

• Special classifier nouns are considered single words (example 11 in Table 1).

It should be noted that these rules do not necessarily conform to the rules used by linguists. For example, Diep (2005) considers the Sino-Vietnamese syllable viên{-er} in example 10 in Table 1 a component of the compound word and considers the special classifier noun nhà{-er} a single word. We, on the other hand, consider both viên{-er} and nhà{-er} single words because we found that they have the same grammatical function, which is forming new words. However, in our guidelines, the word types for which there is little consensus among linguists on segmentation are annotated with additional information so that such words can be automatically converted as needed.

⁵ Without loss of generality, we assume the expression we want to segment is A B, where A and B can be syllables or words.

4. Part-of-speech tagging guidelines

4.1. Challenges of POS tagging

Tagging POS for Vietnamese words is not a trivial problem because the words are not marked with morphological features, such as tense, number, gender, etc. While the same problem also appears with Chinese, Vietnamese may be more difficult, because the Vietnamese writing system is based on Latin characters, which represent the pronunciation but not the meaning of words.

Words that have the same surface form and pronunciation but different meanings and grammatical functions occur frequently in text. For example, the word mới can be understood with the two meanings shown in rows 1 and 2 of Table 2. If we consider mới an adjective modifying the preceding word, the noun nghiên_cứu{research}, it means new. If we consider it an adjunct modifying the following word, the verb thực_hiện{to conduct}, it means recently or just.

Determining the POS of words that have the same surface form may be even more ambiguous because a verb or an adjective can appear in the position of a noun, as in the case of báo cáo in rows 3 and 4 of Table 2. From the sentence alone, we have no clue to determine whether báo cáo belongs to the verb class or the noun class. Báo cáo means defend if it is considered a verb (row 3) and thesis if it is considered a noun (row 4).

Ambiguity in POS tagging is also caused by the omission of words, which happens frequently in Vietnamese. For example, if a verb or an adjective plays the same role as a noun, it is actually preceded by a special classifier noun⁶ (as in the case of báo cáo in row 5 of Table 2). Otherwise, a noun is preceded by a classifier noun⁷ (the noun báo cáo in row 6 of Table 2 follows the classifier noun cuốn). However, such useful nouns are usually omitted in Vietnamese sentences, which makes tagging ambiguous.

Some linguists (SCSSV, 1983; Diep, 2005) have claimed that POS can be recognized by referring to the adjuncts that modify the words. For example, adjuncts indicating degree and tense modify adjectives and verbs, respectively. However, this method does not necessarily work well on real texts. In practice, many verbs and adjectives in Vietnamese can be modified by the same adjunct. For example, the adjunct indicating tense, sẽ{will}, shown in Table 2 can modify both the adjective đẹp{beautiful} (row 7) and the verb đi{to go} (row 8).

No. | Word in context                                                                              | Word                     | POS
1   | Một nghiên cứu mới thực hiện tại Nhật. {A new research conducted in Japan.}                 | mới{new}                 | Adjective
2   | Một nghiên cứu mới thực hiện tại Nhật. {A research has just been conducted in Japan.}       | mới{just}                | Adjunct
3   | Báo cáo tốt nghiệp của cô ấy rất tốt. {Her final defense is very good.}                     | báo cáo {defense}        | Verb
4   | Báo cáo tốt nghiệp của cô ấy rất tốt. {Her thesis is very good.}                            | báo cáo {thesis}         | Noun
5   | Việc báo cáo tốt nghiệp của cô ấy rất tốt. {Her final defense is very good.}                | việc báo cáo {defense}   | Verb
6   | Cuốn báo cáo tốt nghiệp của cô ấy rất tốt. {Her thesis is very good.}                       | cuốn báo cáo {thesis}    | Noun
7   | Bạn sẽ đẹp nhất đêm nay. {You will be the most beautiful girl tonight.}                     | sẽ{will}                 | Adjunct
8   | Tôi sẽ đi Nhật vào tối nay. {I will go to Japan tonight.}                                   | sẽ{will}                 | Adjunct

Table 2: Examples illustrating the challenges of POS tagging.

Because of the above characteristics of Vietnamese, it is difficult not only to define the POS tag set but also to tag each word in context. In addition, there is still little consensus among linguists on the methodology for classifying words in Vietnamese. For instance, both Diep (2005) and SCSSV (1983) classified words based on their meanings, their combination ability, and their syntactic functions. However, Diep (2005) considered the words expressing the whole, such as cả{all}, tất_cả{all}, toàn_bộ{all}, etc., pronouns, while SCSSV (1983) considered them nouns, and Hoang (1998) considered cả a pronoun and tất_cả a noun in all contexts.

⁶ Việc is a special classifier noun that is understood as -ion, -ment, -ing, -ity, -ness, and so on when it comes before verbs or adjectives. An expression consisting of the special classifier noun việc and a verb or adjective is understood as a noun in English. For example, học_tập means to learn, so to express learning, we can say việc học_tập.
⁷ Classifier nouns indicate two types of things, animate things and inanimate things.

4.2. Building part-of-speech tag set

In previous work, Nguyen et al. (2009) classified words on the basis of their combination ability and syntactic function. They created a POS tag set for Vietnamese with a total of 17 tags (excluding the tags for unknown words and punctuation). However, this tag set cannot cover all the combination abilities and syntactic functions of Vietnamese words. For example, they used the tag P to annotate all pronouns. However, the pronouns used to express space or time (demonstrative pronouns), such as này{this} and đó{that}, can be modifiers of the head nouns in noun phrases. Personal pronouns, in contrast, always play the role of the head word of a noun phrase.

Therefore, in this work, we created a new POS tag set for Vietnamese. Our criteria for classifying words are also based on the combination abilities and the syntactic functions of the words, like those of the VLSP treebank. However, we referred to the linguistics literature, carefully analyzed the roles of words, and discussed with our linguistics colleagues to create a new POS tag set for Vietnamese with 33 tags, which are shown in Table 3. Using our POS tags, we can recognize the role of a word in a phrase or sentence. For example, demonstrative pronouns that modify head words of noun phrases are annotated with the Pd label, and personal pronouns that are head words of noun phrases are annotated with the Pp label.

No. | POS tag | Meaning of tag
1   | SV      | Sino-Vietnamese syllable
2   | Nc      | Classifier noun
3   | Ncs     | Special classifier noun
4   | Nu      | Unit noun
5   | Nun     | Administrative unit noun
6   | Nw      | Quantifier indicating the whole
7   | Num     | Number
8   | Nq      | Other quantifier
9   | Nr      | Proper noun
10  | Nt      | Noun of time
11  | Nn      | Other noun
12  | Ve      | Existing verb
13  | Vc      | Copula "là" verb
14  | D       | Directional verb
15  | VA      | Verb-adjective
16  | VN      | Verb-noun
17  | NA      | Noun-adjective
18  | Vcp     | Comparative verb
19  | Vv      | Other verb
20  | An      | Ordinal number
21  | Aa      | Other adjective
22  | Pd      | Demonstrative pronoun
23  | Pp      | Other pronoun
24  | R       | Adjunct
25  | Cs      | Preposition or conjunction introducing a clause
26  | Cp      | Other conjunction
27  | ON      | Onomatopoeia
28  | ID      | Idioms
29  | E       | Exclamation word
30  | M       | Modifier word
31  | FW      | Foreign word
32  | X       | Unidentified word
33  | PU      | Punctuation

Table 3: POS tag set designed for our treebank.

4.3. Policy for annotation of part-of-speech

In our POS tagging guidelines, words are tagged on the basis of the following criteria:

• Combination ability of the word. For example, khó_khăn can be understood as difficulty or difficult. However, if it is a noun, it cannot combine with the adjunct rất{very}. If it is an adjective, it cannot combine with the quantifier những{-s/-es}.

• Syntactic function of the word. For example, if the quantifier indicating the whole modifies a noun, it is annotated with an Nw tag. The quantifier indicating the whole is annotated with a Pp tag if it is the head word of a noun phrase.

• Meaning of the word in the sentence. For example, the verb đi{to go} and the adjective đẹp{beautiful} mentioned above have the same combination ability: both are modified by the adjunct sẽ. They also have the same syntactic function, which is head word of a predicate. However, their meanings are different: the adjective expresses a quality, and the verb expresses an action.
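The first two criteria of Section 4.3 can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions, not the annotation procedure used for the treebank: the function names, the `role` argument, and the idea of looking only at the immediately preceding word are hypothetical simplifications. The tags (Nw, Pp, Aa, Nn) and the example words come from Table 3 and the text above.

```python
from typing import Optional

# Words "expressing the whole" listed in Section 4.1 (illustrative, not exhaustive).
WHOLE_QUANTIFIERS = {"cả", "tất_cả", "toàn_bộ"}


def tag_whole_quantifier(word: str, role: str) -> Optional[str]:
    """Syntactic-function criterion for quantifiers indicating the whole.

    `role` is a hypothetical label supplied by the annotator:
    "modifier" if the word modifies a noun, "head" if it heads a noun phrase.
    """
    if word not in WHOLE_QUANTIFIERS:
        return None  # this rule does not apply to other words
    if role == "modifier":
        return "Nw"  # quantifier indicating the whole
    if role == "head":
        return "Pp"  # other pronoun
    return None


def tag_noun_or_adjective(previous_word: Optional[str]) -> Optional[str]:
    """Combination-ability criterion for a noun/adjective-ambiguous word
    such as khó_khăn, using only the preceding word (an assumption).
    """
    if previous_word == "rất":    # a noun cannot combine with rất {very}
        return "Aa"
    if previous_word == "những":  # an adjective cannot combine with những {-s/-es}
        return "Nn"
    return None  # no decisive clue; the meaning criterion is needed


if __name__ == "__main__":
    print(tag_whole_quantifier("tất_cả", "modifier"))
    print(tag_noun_or_adjective("rất"))
```

Returning `None` when neither clue is present reflects the third criterion: when combination ability and syntactic function do not decide the tag, the annotator falls back on the meaning of the word in the sentence.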
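The general word segmentation rules of Section 3.2 (the first five bullets) can likewise be read as a small decision procedure. The sketch below is a hedged illustration: the paper does not specify a precedence among the rules, so the ordering used here is an assumption, and the `Pair` fields are hypothetical judgments that a human annotator would supply. The rules for categorization nouns, Sino-Vietnamese syllables, and special classifier nouns are left out for brevity.

```python
from dataclasses import dataclass


@dataclass
class Pair:
    """Annotator judgments about an expression A B (all fields hypothetical)."""
    a_is_word: bool
    b_is_word: bool
    same_meaning: bool              # A and B mean the same thing
    combined_equals_part: bool      # A_B means the same as A or as B
    insertable: bool                # another syllable can be inserted between A and B
    context_prefers_combined: bool  # the context fits the A_B reading better


def segment(p: Pair) -> str:
    """Return "A_B" (combination form) or "A B" (split form)."""
    if p.a_is_word != p.b_is_word:
        return "A_B"  # one part is not a word on its own (đen đúa)
    if p.insertable:
        return "A B"  # nồi đồng, compare nồi bằng đồng
    if p.same_meaning:
        return "A_B"  # tìm kiếm
    if p.combined_equals_part:
        return "A_B"  # ăn nói
    # Different meanings and A_B differs from A B: decide by context (quần áo).
    return "A_B" if p.context_prefers_combined else "A B"


if __name__ == "__main__":
    den_dua = Pair(True, False, False, False, False, False)
    noi_dong = Pair(True, True, False, False, True, False)
    print(segment(den_dua), "|", segment(noi_dong))
```

Keeping the linguistic judgments as explicit boolean inputs makes clear that the rules only combine decisions an annotator has already made; they do not detect meaning or insertability automatically.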