The IIT Bombay Hindi⇔English Translation System at WMT 2014

Piyush Dungarwal, Rajen Chatterjee, Abhijit Mishra, Anoop Kunchukuttan, Ritesh Shah, Pushpak Bhattacharyya
Department of Computer Science and Engineering
Indian Institute of Technology, Bombay
{piyushdd,rajen,abhijitmishra,anoopk,ritesh,pb}@cse.iitb.ac.in

Abstract

In this paper, we describe our English-Hindi and Hindi-English statistical systems submitted to the WMT14 shared task. The core components of our translation systems are phrase-based (Hindi-English) and factored (English-Hindi) SMT systems. We show that the use of number, case and Tree Adjoining Grammar information as factors helps to improve English-Hindi translation, primarily by generating morphological inflections correctly. We show improvements to the translation systems using pre-processing and post-processing components. To overcome the structural divergence between English and Hindi, we preorder the source-side sentence to conform to the target language word order. Since the parallel corpus is limited, many words are not translated. We translate out-of-vocabulary words and transliterate named entities in a post-processing stage. We also investigate ranking of translations from multiple systems to select the best translation.

1 Introduction

India is a multilingual country with Hindi being the most widely spoken language. Hindi and English act as link languages across the country and as languages of official communication for the Union Government. Thus, the importance of English⇔Hindi translation is obvious. Over the last decade, several rule-based (Sinha, 1995), interlingua-based (Dave et al., 2001) and statistical (Ramanathan et al., 2008) methods have been explored for English-Hindi translation.

In the WMT 2014 shared task, we undertake the challenge of improving translation between the English and Hindi language pair using Statistical Machine Translation (SMT) techniques. The WMT 2014 shared task has provided a standardized test set to evaluate multiple approaches and makes available the largest publicly downloadable English-Hindi parallel corpus. Using these resources, we have developed a phrase-based and a factored SMT system for Hindi-English and English-Hindi translation respectively, with pre-processing and post-processing components to handle the structural divergence and morphological richness of Hindi. Section 2 describes the issues in Hindi⇔English translation.

The rest of the paper is organized as follows. Section 3 describes corpus preparation and the experimental setup. Section 4 and Section 5 describe our English-Hindi and Hindi-English translation systems respectively. Section 6 describes the post-processing operations on the output of the core translation system for handling OOV words and named entities, and for reranking outputs from multiple systems. Section 7 gives the details of our systems submitted to the WMT shared task. Section 8 concludes the paper.

2 Problems in Hindi⇔English Translation

Languages can be differentiated in terms of structural divergences and morphological manifestations. English is structurally classified as a Subject-Verb-Object (SVO) language with poor morphology, whereas Hindi is a morphologically rich, Subject-Object-Verb (SOV) language. Largely, these divergences are responsible for the difficulties in translation using a phrase-based/factored model, which we summarize in this section.

2.1 English-to-Hindi

The fundamental structural differences described earlier result in long-distance verb and modifier movements across English-Hindi. Local reordering models prove to be inadequate to overcome this problem; hence, we transform the source-side sentence using pre-ordering rules to conform to the target word order. The availability of robust parsers for English makes this approach effective for English-Hindi translation.

As far as morphology is concerned, Hindi is richer in terms of case markers, inflection-rich surface forms (including verb forms), etc. Hindi exhibits gender agreement and syncretism in inflections, which are not observed in English. We attempt to enrich the source-side English corpus with linguistic factors in order to overcome this morphological disparity.

2.2 Hindi-to-English

The lack of accurate linguistic parsers makes it difficult to overcome the structural divergence using preordering rules. In order to preorder Hindi sentences, we build rules using shallow parsing information. The source-side reordering helps to reduce the decoder's search complexity and to learn better phrase tables. Some of the other challenges in generating English output are: (1) generation of articles, which Hindi lacks, and (2) heavy overloading of English prepositions, which makes them difficult to predict.

3 Experimental Setup

We process the corpus through appropriate filters for normalization and then create a train-test split.

3.1 English Corpus Normalization

To begin with, the English data was tokenized using the Stanford tokenizer (Klein and Manning, 2003) and then true-cased using truecase.perl provided in the MOSES toolkit.

3.2 Hindi Corpus Normalization

For Hindi data, we first normalize the corpus using the Indic NLP Library (Kunchukuttan et al., 2014) [https://bitbucket.org/anoopk/indic_nlp_library]. Normalization is followed by tokenization, wherein we make use of the trivtokenizer.pl provided with the WMT14 shared task [http://ufallab.ms.mff.cuni.cz/~bojar/hindencorp/]. In Table 1, we highlight some of the post-normalization statistics for the en-hi parallel corpora.

                                English        Hindi
Tokens                          2,898,810      3,092,555
Types                           95,551         118,285
Total characters                18,513,761     17,961,357
Total sentences                 289,832        289,832
Sentences (word count ≤ 10)     188,993        182,777
Sentences (word count > 10)     100,839        107,055

Table 1: en-hi corpora statistics, post normalisation.
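For concreteness, the sketch below shows what the Hindi-side normalization and tokenization of Section 3.2 might look like with the Indic NLP Library cited above. The file names are illustrative, and the library's trivial tokenizer is used here only as a stand-in for the WMT14 trivtokenizer.pl, so this is not the exact pipeline behind the submitted systems.

```python
# Minimal sketch of Hindi-side normalization and tokenization.
# Assumes the Indic NLP Library is installed; file names are illustrative.
# The paper tokenizes with the WMT14 trivtokenizer.pl -- the library's
# trivial tokenizer is used here only to keep the example self-contained.
from indicnlp.normalize.indic_normalize import IndicNormalizerFactory
from indicnlp.tokenize import indic_tokenize

def normalize_and_tokenize_hindi(in_path, out_path):
    normalizer = IndicNormalizerFactory().get_normalizer("hi")
    with open(in_path, encoding="utf-8") as fin, \
         open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            # Canonicalize Unicode variants (e.g. nukta representations)
            normalized = normalizer.normalize(line.strip())
            # Simple whitespace/punctuation tokenization
            tokens = indic_tokenize.trivial_tokenize(normalized, lang="hi")
            fout.write(" ".join(tokens) + "\n")

if __name__ == "__main__":
    normalize_and_tokenize_hindi("corpus.hi", "corpus.norm.tok.hi")
```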
3.3 Data Split

Before splitting the data, we first randomize the parallel corpus. We filter out English sentences longer than 50 words, along with their parallel Hindi translations. After filtering, we select 5,000 sentences which are 10 to 20 words long as the test data, while the remaining 284,832 sentences are used for training.
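A minimal sketch of this filtering and split is given below. The 50-word filter, the 10-20 word test selection and the 5,000-sentence test size follow the description above; the file names, fixed random seed and in-memory handling are illustrative assumptions.

```python
# Sketch of the train/test split described in Section 3.3: shuffle the
# parallel corpus, drop pairs whose English side exceeds 50 words, and
# reserve 5000 pairs with 10-20 English words as the test set.
# File names and the fixed seed are illustrative assumptions.
import random

def split_corpus(en_path, hi_path, test_size=5000, max_len=50, seed=1234):
    with open(en_path, encoding="utf-8") as f_en, \
         open(hi_path, encoding="utf-8") as f_hi:
        pairs = list(zip(f_en.read().splitlines(), f_hi.read().splitlines()))

    random.Random(seed).shuffle(pairs)                          # randomize
    pairs = [p for p in pairs if len(p[0].split()) <= max_len]  # length filter

    test, train = [], []
    for en, hi in pairs:
        n = len(en.split())
        if len(test) < test_size and 10 <= n <= 20:
            test.append((en, hi))       # 10-20 word sentences -> test set
        else:
            train.append((en, hi))      # everything else -> training set
    return train, test

if __name__ == "__main__":
    train, test = split_corpus("corpus.norm.tok.en", "corpus.norm.tok.hi")
    print(len(train), "training pairs,", len(test), "test pairs")
```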
4 English-to-Hindi (en-hi) translation

We use the MOSES toolkit (Koehn et al., 2007a) for carrying out the various experiments. Starting with Phrase Based Statistical Machine Translation (PB-SMT) (Koehn et al., 2003) as the baseline system, we go ahead with pre-ordered PB-SMT, described in Section 4.1. After pre-ordering, we train a Factor Based SMT (Koehn, 2007b) model, where we add factors on the pre-ordered source corpus. In Factor Based SMT we have two variations: (a) supertag as factor, described in Section 4.2, and (b) number and case as factors, described in Section 4.3.

4.1 Pre-ordering source corpus

Research has shown that pre-ordering the source language to conform to the target language word order significantly improves translation quality (Collins et al., 2005). There are many variations of pre-ordering systems, primarily emerging from either rule-based or statistical methods. We use the rule-based pre-ordering approach developed by Patel et al. (2013), which uses the Stanford parser (Klein and Manning, 2003) for parsing English sentences. This approach is an extension of an earlier approach developed by Ramanathan et al. (2008). The existing source reordering system requires the input text to contain only surface forms; however, we extended it to support the surface form along with its factors like POS, lemma, etc. An example of improvement in translation after pre-ordering is shown below:

Example: trying to replace bad ideas with good ideas .
Phr: replace बुरे विचारों को अच्छे विचारों के साथ (replace bure vichaaron ko acche vichaaron ke saath)
Gloss: replace bad ideas good ideas with
Pre-order PBSMT: अच्छे विचारों से बुरे विचारों को बदलने की कोशिश कर रहे हैं (acche vichaaron se bure vichaaron ko badalane ki koshish kara rahe hain)
Gloss: good ideas with bad ideas to replace trying

4.2 Supertag as Factor

The notion of a supertag was first proposed by Joshi and Srinivas (1994). Supertags are elementary trees of Lexicalized Tree Adjoining Grammar (LTAG) (Joshi and Schabes, 1991). They provide syntactic as well as dependency information at the word level by imposing complex constraints in a local context. These elementary trees are combined in some manner to form a parse tree, due to which supertagging is also known as "an approach to almost parsing" (Bangalore and Joshi, 1999). A supertag can also be viewed as a fragment of the parse tree associated with each lexical item. Figure 1 shows an example of the supertagged sentence "The purchase price includes taxes" described in (Hassan et al., 2007). It clearly shows the sub-categorization information available in the verb include, which takes a subject NP to its left and an object NP to its right.

Figure 1: LTAG supertag sequence obtained using the MICA Parser.

Use of supertags as factors has already been studied by Hassan (2007) in the context of Arabic-English SMT. They use a supertag language model along with a supertagged English corpus. Ours is the first study using supertags as factors for English-to-Hindi translation on a pre-ordered source corpus.

We use the MICA Parser (Bangalore et al., 2009) for obtaining supertags. After supertagging, we run the pre-ordering system, preserving the supertags in it. For translation, we create a mapping from source word|supertag to target word. An example of improvement in translation by using supertag as factor is shown below:

Example: trying to understand what your child is saying to you
Phr: आपका बच्चा आपसे क्या कह रहा है यह (aapkaa bacchaa aapse kya kaha rahaa hai yaha)
Gloss: your child you what saying is this
SupertagFact: आपका बच्चा आपसे क्या कह रहा है, उसे समझने की कोशिश करना (aapkaa bacchaa aapse kya kaha rahaa hai, use samajhane kii koshish karnaa)
Gloss: your child to you what saying is , that understand try
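Concretely, the factored setup amounts to emitting each source token in Moses' word|factor notation before training. The sketch below illustrates that annotation step for the word|supertag representation; the supertag labels are placeholders rather than real MICA parser output.

```python
# Sketch of writing a Moses-style factored source corpus where every token
# carries its supertag as an additional factor (word|supertag). The supertag
# labels below are placeholders, standing in for real MICA parser output.
from typing import List

def attach_supertag_factors(tokens: List[str], supertags: List[str]) -> str:
    """Join parallel token and supertag sequences into word|factor tokens."""
    assert len(tokens) == len(supertags), "one supertag per token expected"
    return " ".join(f"{w}|{t}" for w, t in zip(tokens, supertags))

# Toy example for the sentence from Figure 1 (placeholder supertags):
tokens = ["The", "purchase", "price", "includes", "taxes"]
supertags = ["t1", "t2", "t3", "t4", "t5"]
print(attach_supertag_factors(tokens, supertags))
# -> The|t1 purchase|t2 price|t3 includes|t4 taxes|t5
```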
4.3 Number, Case as Factor

In this section, we discuss how to generate correct noun inflections while translating from English to Hindi. Previous work has addressed the problem of data sparsity due to complex verb morphology for English-to-Hindi translation (Gandhe, 2011). Noun inflections in Hindi are affected only by the number and case of the noun. Number can be singular or plural, whereas case can be direct or oblique. We use the factored SMT model to incorporate this linguistic information during training of the translation models. We attach root word, number and case as factors to English nouns. On the other hand, to Hindi nouns we attach root word and suffix as factors. We define the translation and generation steps as follows:

• Translation step (T0): translates English root|number|case to Hindi root|suffix
• Generation step (G0): generates the Hindi surface word from Hindi root|suffix

An example of improvement in translation by using number and case as factors is shown below:

Example: Two sets of statistics
Phr: दो के आंकड़े (do ke aankade)
Gloss: two of statistics
Num-CaseFact: आंकड़ों के दो सेट (aankadon ke do set)
Gloss: statistics of two sets

4.3.1 Generating number and case factors

With the help of syntactic and morphological tools, we extract the number and case of the English nouns as follows:

• Number factor: We use the Stanford POS tagger (Toutanova, 2003) [http://nlp.stanford.edu/software/tagger.shtml] to identify the English noun entities. The POS tagger itself differentiates between singular and plural nouns by using different tags.

• Case factor: It is difficult to find the direct/oblique case of the nouns, as English nouns do not contain this information. Hence, to get the case information, we need to identify features of the English sentence that correspond to the direct/oblique case of the parallel nouns in the Hindi sentence. We use object of preposition, subject, direct object and tense as our features. These features are extracted using the semantic relations provided by Stanford's typed dependencies (Marneffe, 2008).
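A rough sketch of this factor extraction is given below. The singular/plural decision mirrors the POS-tag distinction described above, but the mapping from dependency relations to a direct/oblique guess is a simplified heuristic introduced here for illustration, not the paper's exact feature model.

```python
# Illustrative sketch of attaching root|number|case factors to English nouns.
# The singular/plural decision follows the Penn Treebank noun tags; the
# direct/oblique heuristic from dependency relations is a simplified
# assumption for illustration, not the paper's exact feature set.
PLURAL_TAGS = {"NNS", "NNPS"}
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}
OBLIQUE_RELATIONS = {"pobj", "dobj"}   # e.g. object of preposition / direct object

def noun_factors(token, lemma, pos_tag, dep_relation):
    """Return a Moses-style factored token: surface|root|number|case."""
    if pos_tag not in NOUN_TAGS:
        return token                   # non-nouns are left unfactored here
    number = "pl" if pos_tag in PLURAL_TAGS else "sg"
    case = "oblique" if dep_relation in OBLIQUE_RELATIONS else "direct"
    return f"{token}|{lemma}|{number}|{case}"

# Toy example for "Two sets of statistics" (tags/relations filled in by hand):
words = [("Two", "two", "CD", "num"),
         ("sets", "set", "NNS", "root"),
         ("of", "of", "IN", "prep"),
         ("statistics", "statistics", "NNS", "pobj")]
print(" ".join(noun_factors(*w) for w in words))
# -> Two sets|set|pl|direct of statistics|statistics|pl|oblique
```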
4.4 Results

Listed below are the different statistical systems trained using Moses:

• Phrase Based model (Phr)
• Phrase Based model with pre-ordered source corpus (PhrReord)
• Factor Based model with factors on the pre-ordered source corpus
  – Supertag as factor (PhrReord+STag)
  – Number, Case as factor (PhrReord+NC)

We evaluated the translation systems with BLEU and TER, as shown in Table 2. Evaluation on the development set shows that the factor-based models achieve scores competitive with the baseline system, whereas evaluation on the WMT14 test set shows a significant improvement in the performance of the factor-based models.

                    Development        WMT14
Model               BLEU     TER       BLEU     TER
Phr                 27.62    0.63      8.0      0.84
PhrReord            28.64    0.62      8.6      0.86
PhrReord+STag       27.05    0.64      9.8      0.83
PhrReord+NC         27.50    0.64      10.1     0.83

Table 2: English-to-Hindi automatic evaluation on the development set and on the WMT14 test set.

5 Hindi-to-English (hi-en) translation

As English follows SVO word order and Hindi follows SOV word order, the simple distortion penalty in phrase-based models cannot handle the reordering well. For the shared task, we follow the approach that pre-orders the source sentence to conform to the target word order.

A substantial volume of work has been done in the field of source-side reordering for machine translation. Most of the experiments are based on applying reordering rules at the nodes of the parse tree of the source sentence. These reordering rules can be learnt automatically (Genzel, 2010). But many source languages do not have a good, robust parser. Hence, we can instead use shallow parsing techniques to obtain chunks of words and then reorder them. Reordering rules can be learned automatically from chunked data (Zhang, 2007).

Hindi does not have a functional constituency or dependency parser available as of now, but a shallow parser is available for Hindi [http://ltrc.iiit.ac.in/showfile.php?filename=downloads/shallow_parser.php]. Hence, we follow a chunk-based pre-ordering approach, wherein we develop a set of rules to reorder the chunks in a source sentence. The following are the chunk tags generated by this shallow parser: Noun chunks (NP), Verb chunks (VGF, VGNF, VGNN), Adjectival chunks (JJP), Adverb chunks (RBP), Negatives (NEGP), Conjuncts (CCP), Chunk fragments (FRAGP), and miscellaneous entities (BLK) (Bharati, 2006).

5.1 Development of rules

After chunking an input sentence, we apply hand-crafted reordering rules on these chunks. The following sections describe these rules. Note that we apply the rules in the same order in which they are listed below.

5.1.1 Merging of chunks

After chunking, we merge adjacent chunks if they follow the same order in the target language.

1. Merge {JJP VGF} chunks (consider the merged chunk as a single VGF chunk)
   e.g., वर्णित है (varnit hai), स्थित है (sthit hai)
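To make the rule application concrete, the toy sketch below applies rule 1 to a chunk-tagged fragment. The (tag, words) chunk representation and the rule encoding are illustrative assumptions, standing in for the shallow parser's actual output format and the hand-crafted rule set.

```python
# Toy sketch of rule 1 above: merge an adjacent {JJP, VGF} chunk pair into a
# single VGF chunk. The (tag, words) chunk representation is an illustrative
# assumption, standing in for the shallow parser's actual output format.
from typing import List, Tuple

Chunk = Tuple[str, List[str]]

def merge_jjp_vgf(chunks: List[Chunk]) -> List[Chunk]:
    merged: List[Chunk] = []
    i = 0
    while i < len(chunks):
        tag, words = chunks[i]
        # If an adjectival chunk is immediately followed by a finite verb
        # chunk, collapse the pair into one VGF chunk.
        if tag == "JJP" and i + 1 < len(chunks) and chunks[i + 1][0] == "VGF":
            merged.append(("VGF", words + chunks[i + 1][1]))
            i += 2
        else:
            merged.append((tag, words))
            i += 1
    return merged

# Toy chunked fragment for "varnit hai": [JJP varnit] [VGF hai]
chunks = [("JJP", ["varnit"]), ("VGF", ["hai"])]
print(merge_jjp_vgf(chunks))
# -> [('VGF', ['varnit', 'hai'])]
```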