The IIT Bombay Hindi⇔English Translation System at WMT 2014

Piyush Dungarwal, Rajen Chatterjee, Abhijit Mishra, Anoop Kunchukuttan,
Ritesh Shah, Pushpak Bhattacharyya
Department of Computer Science and Engineering
Indian Institute of Technology, Bombay
{piyushdd,rajen,abhijitmishra,anoopk,ritesh,pb}@cse.iitb.ac.in

Proceedings of the Ninth Workshop on Statistical Machine Translation, pages 90–96,
Baltimore, Maryland USA, June 26–27, 2014. © 2014 Association for Computational Linguistics

Abstract

In this paper, we describe our English-Hindi and Hindi-English statistical systems
submitted to the WMT14 shared task. The core components of our translation systems are
phrase based (Hindi-English) and factored (English-Hindi) SMT systems. We show that the
use of number, case and Tree Adjoining Grammar information as factors helps to improve
English-Hindi translation, primarily by generating morphological inflections correctly.
We show improvements to the translation systems using pre-processing and post-processing
components. To overcome the structural divergence between English and Hindi, we preorder
the source side sentence to conform to the target language word order. Since the parallel
corpus is limited, many words are not translated. We translate out-of-vocabulary words
and transliterate named entities in a post-processing stage. We also investigate ranking
of translations from multiple systems to select the best translation.

1 Introduction

India is a multilingual country with Hindi being the most widely spoken language. Hindi
and English act as link languages across the country and languages of official
communication for the Union Government. Thus, the importance of English⇔Hindi translation
is obvious. Over the last decade, several rule based (Sinha, 1995), interlingua based
(Dave et al., 2001) and statistical methods (Ramanathan et al., 2008) have been explored
for English-Hindi translation.

In the WMT 2014 shared task, we undertake the challenge of improving translation between
the English and Hindi language pair using Statistical Machine Translation (SMT)
techniques. The WMT 2014 shared task has provided a standardized test set to evaluate
multiple approaches and makes available the largest publicly downloadable English-Hindi
parallel corpus. Using these resources, we have developed a phrase-based and a factored
system for Hindi-English and English-Hindi translation respectively, with pre-processing
and post-processing components to handle structural divergence and the morphological
richness of Hindi. Section 2 describes the issues in Hindi⇔English translation.

The rest of the paper is organized as follows. Section 3 describes corpus preparation
and experimental setup. Section 4 and Section 5 describe our English-Hindi and
Hindi-English translation systems respectively. Section 6 describes the post-processing
operations on the output from the core translation system for handling OOV and named
entities, and for reranking outputs from multiple systems. Section 7 mentions the details
regarding our systems submitted to the WMT shared task. Section 8 concludes the paper.

2 Problems in Hindi⇔English Translation

Languages can be differentiated in terms of structural divergences and morphological
manifestations. English is structurally classified as a Subject-Verb-Object (SVO)
language with a poor morphology whereas Hindi is a morphologically rich,
Subject-Object-Verb (SOV) language. Largely, these divergences are responsible for the
difficulties in translation using a phrase based/factored model, which we summarize in
this section.

2.1 English-to-Hindi

The fundamental structural differences described earlier result in large distance verb
and modifier movements across English-Hindi. Local reordering models prove to be
inadequate to overcome the problem; hence, we transformed the source side sentence using
pre-ordering rules to conform to the target word order. Availability of robust parsers
for English makes this approach for English-Hindi translation effective.

As far as morphology is concerned, Hindi is richer in terms of case-markers,
inflection-rich surface forms including verb forms etc. Hindi exhibits gender agreement
and syncretism in inflections, which are not observed in English. We attempt to enrich
the source side English corpus with linguistic factors in order to overcome the
morphological disparity.

2.2 Hindi-to-English

The lack of accurate linguistic parsers makes it difficult to overcome the structural
divergence using preordering rules. In order to preorder Hindi sentences, we build rules
using shallow parsing information. The source side reordering helps to reduce the
decoder's search complexity and learn better phrase tables. Some of the other challenges
in generation of English output are: (1) generation of articles, which Hindi lacks,
(2) heavy overloading of English prepositions, making it difficult to predict them.

3 Experimental Setup

We process the corpus through appropriate filters for normalization and then create a
train-test split.

3.1 English Corpus Normalization

To begin with, the English data was tokenized using the Stanford tokenizer (Klein and
Manning, 2003) and then true-cased using truecase.perl provided in the MOSES toolkit.

3.2 Hindi Corpus Normalization

For Hindi data, we first normalize the corpus using NLP Indic Library (Kunchukuttan
et al., 2014)¹. Normalization is followed by tokenization, wherein we make use of the
trivtokenizer.pl² provided with the WMT14 shared task. In Table 1, we highlight some of
the post normalization statistics for en-hi parallel corpora.

¹ https://bitbucket.org/anoopk/indic_nlp_library
² http://ufallab.ms.mff.cuni.cz/~bojar/hindencorp/

                                   English         Hindi
Tokens                           2,898,810     3,092,555
Types                               95,551       118,285
Total characters                18,513,761    17,961,357
Total sentences                    289,832       289,832
Sentences (word count ≤ 10)        188,993       182,777
Sentences (word count > 10)        100,839       107,055

Table 1: en-hi corpora statistics, post normalisation.

3.3 Data Split

Before splitting the data, we first randomize the parallel corpus. We filter out English
sentences longer than 50 words along with their parallel Hindi translations. After
filtering, we select 5000 sentences which are 10 to 20 words long as the test data,
while the remaining 284,832 sentences are used for training.

4 English-to-Hindi (en-hi) translation

We use the MOSES toolkit (Koehn et al., 2007a) for carrying out various experiments.
Starting with Phrase Based Statistical Machine Translation (PB-SMT) (Koehn et al., 2003)
as the baseline system, we go ahead with pre-order PBSMT described in Section 4.1. After
pre-ordering, we train a Factor Based SMT (Koehn, 2007b) model, where we add factors on
the pre-ordered source corpus. In Factor Based SMT we have two variations: (a) using
Supertag as factor, described in Section 4.2, and (b) using number, case as factors,
described in Section 4.3.

4.1 Pre-ordering source corpus

Research has shown that pre-ordering the source language to conform to target language
word order significantly improves translation quality (Collins et al., 2005). There are
many variations of pre-ordering systems primarily emerging from either rule based or
statistical methods. We use the rule based pre-ordering approach developed by Patel
et al. (2013), which uses the Stanford parser (Klein and Manning, 2003) for parsing
English sentences. This approach is an extension to an earlier approach developed by
Ramanathan et al. (2008). The existing source reordering system requires the input text
to contain only surface form; however, we extended it to support surface form
along with its factors like POS, lemma etc. An example of improvement in translation after
pre-ordering is shown below:

Example: trying to replace bad ideas with good ideas .
Phr: replace बुरे विचारों को अच्छे विचारों के साथ
(replace bure vichaaron ko acche vichaaron ke saath)
Gloss: replace bad ideas good ideas with
Pre-order PBSMT: अच्छे विचारों से बुरे विचारों को बदलने की कोशिश कर रहे हैं
(acche vichaaron se bure vichaaron ko badalane ki koshish kara rahe hain)
Gloss: good ideas with bad ideas to replace trying

4.2 Supertag as Factor

The notion of Supertag was first proposed by Joshi and Srinivas (1994). Supertags are
elementary trees of Lexicalized Tree Adjoining Grammar (LTAG) (Joshi and Schabes, 1991).
They provide syntactic as well as dependency information at the word level by imposing
complex constraints in a local context. These elementary trees are combined in some
manner to form a parse tree, due to which supertagging is also known as "An approach to
almost parsing" (Bangalore and Joshi, 1999). A supertag can also be viewed as a fragment
of a parse tree associated with each lexical item. Figure 1 shows an example of the
supertagged sentence "The purchase price includes taxes" described in (Hassan et al.,
2007). It clearly shows the sub-categorization information available in the verb
include, which takes a subject NP to its left and an object NP to its right.

Figure 1: LTAG supertag sequence obtained using MICA Parser.

Use of supertags as factors has already been studied by Hassan (2007) in the context of
Arabic-English SMT. They use a supertag language model along with a supertagged English
corpus. Ours is the first study in using supertag as factor for English-to-Hindi
translation on a pre-ordered source corpus.

We use MICA Parser (Bangalore et al., 2009) for obtaining supertags. After supertagging
we run the pre-ordering system, preserving the supertags in it. For translation, we
create a mapping from source-word|supertag to target-word. An example of improvement in
translation by using supertag as factor is shown below:

Example: trying to understand what your child is saying to you
Phr: आपका बच्चा आपसे क्या कह रहा है यह
(aapkaa bacchaa aapse kya kaha rahaa hai yaha)
Gloss: your child you what saying is this
SupertagFact: आपका बच्चा आपसे क्या कह रहा है , उसे समझने की कोशिश करना
(aapkaa bacchaa aapse kya kaha rahaa hai, use samajhane kii koshish karnaa)
Gloss: your child to you what saying is , that understand try

4.3 Number, Case as Factor

In this section, we discuss how to generate correct noun inflections while translating
from English to Hindi. There has been previous work done in order to solve the problem
of data sparsity due to complex verb morphology for English to Hindi translation
(Gandhe, 2011). Noun inflections in Hindi are affected by the number and case of the
noun only. Number can be singular or plural, whereas case can be direct or oblique. We
use the factored SMT model to incorporate this linguistic information during training of
the translation models. We attach root-word, number and case as factors to English
nouns. On the other hand, to Hindi nouns we attach root-word and suffix as factors. We
define the translation and generation steps as follows:

• Translation step (T0): Translates English root|number|case to Hindi root|suffix
• Generation step (G0): Generates Hindi surface word from Hindi root|suffix

An example of improvement in translation by using number and case as factors is shown
below:

Example: Two sets of statistics
Phr: दो के आंकड़े
(do ke aankade)
Gloss: two of statistics
Num-CaseFact: आंकड़ों के दो सेट
(aankadon ke do set)
Gloss: statistics of two sets
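
The two steps T0 and G0 from Section 4.3 can be pictured with a small toy sketch. It is only an illustration under stated assumptions: the lookup tables are hand-written, the Hindi forms are romanized, the root spelling "aankadaa" and the suffix labels "e" and "on" are hypothetical, and none of this is the trained Moses model or the authors' code. Only the surface forms "aankade" and "aankadon" come from the example above.

```python
# Toy illustration of the factored mapping in Section 4.3.
# Every table entry below is a hand-written assumption (romanized Hindi),
# standing in for what a trained factored model would learn from data.

# T0: English root|number|case -> Hindi root|suffix (hypothetical entries)
T0 = {
    ("statistic", "plural", "direct"): ("aankadaa", "e"),
    ("statistic", "plural", "oblique"): ("aankadaa", "on"),
}

# G0: Hindi root|suffix -> Hindi surface word (hypothetical entries)
G0 = {
    ("aankadaa", "e"): "aankade",
    ("aankadaa", "on"): "aankadon",
}


def translate_t0(token):
    """Translation step T0: map an English 'root|number|case' token
    to a Hindi 'root|suffix' token."""
    root, number, case = token.split("|")
    hindi_root, suffix = T0[(root, number, case)]
    return f"{hindi_root}|{suffix}"


def generate_g0(hindi_token):
    """Generation step G0: look up the Hindi surface word for 'root|suffix'."""
    hindi_root, suffix = hindi_token.split("|")
    return G0[(hindi_root, suffix)]


def inflect(token):
    """Chain T0 and G0 to get an inflected Hindi noun."""
    return generate_g0(translate_t0(token))


if __name__ == "__main__":
    # The oblique case yields the form used before a postposition such as "ke".
    print(inflect("statistic|plural|oblique"))
    print(inflect("statistic|plural|direct"))
```

In this sketch the case factor alone decides between the two plural forms, which is the distinction the Num-CaseFact output gets right in the example above.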
4.3.1 Generating number and case factors

With the help of syntactic and morphological tools, we extract the number and case of
the English nouns as follows:

• Number factor: We use the Stanford POS tagger³ to identify the English noun entities
  (Toutanova, 2003). The POS tagger itself differentiates between singular and plural
  nouns by using different tags.
• Case factor: It is difficult to find the direct/oblique case of the nouns as English
  nouns do not contain this information. Hence, to get the case information, we need to
  find out features of an English sentence that correspond to the direct/oblique case of
  the parallel nouns in the Hindi sentence. We use object of preposition, subject, direct
  object, tense as our features. These features are extracted using semantic relations
  provided by Stanford's typed dependencies (Marneffe, 2008).

³ http://nlp.stanford.edu/software/tagger.shtml

4.4 Results

Listed below are different statistical systems trained using Moses:

• Phrase Based model (Phr)
• Phrase Based model with pre-ordered source corpus (PhrReord)
• Factor Based Model with factors on pre-ordered source corpus
   – Supertag as factor (PhrReord+STag)
   – Number, Case as factor (PhrReord+NC)

We evaluated translation systems with BLEU and TER as shown in Table 2. Evaluation on
the development set shows that factor based models achieve competitive scores as
compared to the baseline system, whereas evaluation on the WMT14 test set shows
significant improvement in the performance of factor based models.

                    Development        WMT14
Model               BLEU    TER     BLEU    TER
Phr                 27.62   0.63     8.0    0.84
PhrReord            28.64   0.62     8.6    0.86
PhrReord+STag       27.05   0.64     9.8    0.83
PhrReord+NC         27.50   0.64    10.1    0.83

Table 2: English-to-Hindi automatic evaluation on development set and on WMT14 test set.

5 Hindi-to-English (hi-en) translation

As English follows SVO word order and Hindi follows SOV word order, a simple distortion
penalty in phrase-based models cannot handle the reordering well. For the shared task,
we follow the approach that pre-orders the source sentence to conform to target word
order.

A substantial volume of work has been done in the field of source-side reordering for
machine translation. Most of the experiments are based on applying reordering rules at
the nodes of the parse tree of the source sentence. These reordering rules can be
automatically learnt (Genzel, 2010). But many source languages do not have a good robust
parser. Hence, instead we can use shallow parsing techniques to get chunks of words and
then reorder them. Reordering rules can be learned automatically from chunked data
(Zhang, 2007).

Hindi does not have a functional constituency or dependency parser available, as of now.
But a shallow parser⁴ is available for Hindi. Hence, we follow a chunk-based pre-ordering
approach, wherein we develop a set of rules to reorder the chunks in a source sentence.
The following are the chunk tags generated by this shallow parser: Noun chunks (NP),
Verb chunks (VGF, VGNF, VGNN), Adjectival chunks (JJP), Adverb chunks (RBP), Negatives
(NEGP), Conjuncts (CCP), Chunk fragments (FRAGP), and miscellaneous entities (BLK)
(Bharati, 2006).

⁴ http://ltrc.iiit.ac.in/showfile.php?filename=downloads/shallow_parser.php

5.1 Development of rules

After chunking an input sentence, we apply handcrafted reordering rules on these chunks.
The following sections describe these rules. Note that we apply rules in the same order
they are listed below.

5.1.1 Merging of chunks

After chunking, we merge the adjacent chunks, if they follow the same order in the
target language.

1. Merge {JJP VGF} chunks (Consider this chunk as a single VGF chunk)
   e.g., वर्णित है (varnit hai), स्थित है (sthit hai)
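
The {JJP VGF} merge rule of Section 5.1.1 can be sketched as a single pass over a chunked sentence. This is a minimal illustration, not the authors' implementation: the chunk representation as (tag, words) pairs, the function name merge_jjp_vgf, and the romanized sample sentence are all assumptions made for the example.

```python
# Minimal sketch of the {JJP VGF} merge rule from Section 5.1.1.
# Assumption: a chunked sentence is a list of (chunk_tag, [words]) pairs;
# the shallow parser's real output format is not reproduced here.

def merge_jjp_vgf(chunks):
    """Merge each adjacent (JJP, VGF) pair into one VGF chunk."""
    merged = []
    i = 0
    while i < len(chunks):
        tag, words = chunks[i]
        next_is_vgf = i + 1 < len(chunks) and chunks[i + 1][0] == "VGF"
        if tag == "JJP" and next_is_vgf:
            # Treat the adjective + finite verb group as a single VGF chunk.
            merged.append(("VGF", list(words) + list(chunks[i + 1][1])))
            i += 2
        else:
            merged.append((tag, list(words)))
            i += 1
    return merged


if __name__ == "__main__":
    # Hypothetical romanized input; "sthit hai" is the example from the rule.
    sentence = [("NP", ["yaha"]), ("JJP", ["sthit"]), ("VGF", ["hai"])]
    print(merge_jjp_vgf(sentence))
```

Merging first means that later reordering rules can move "sthit hai" as one verb chunk instead of handling the adjective and the verb separately.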