NUIG-Panlingua-KMI Hindi ↔ Marathi MT Systems for Similar Language Translation Task @ WMT 2020

Atul Kr. Ojha†+, Priya Rani†, Akanksha Bansal+, Bharathi Raja Chakravarthi†, Ritesh Kumar*, John P. McCrae†
†Data Science Institute, NUIG, Galway; +Panlingua Language Processing LLP, New Delhi; *Dr. Bhimrao Ambedkar University, Agra
(atulkumar.ojha, priya.rani, bharathi.raja)@insight-centre.org, panlingua@outlook.com, john.mccrae@nuigalway.ie, ritesh78_llh@jnu.ac.in
Abstract

The NUIG-Panlingua-KMI submission to WMT 2020 seeks to push the state of the art in the similar language translation task for the Hindi ↔ Marathi language pair. As part of these efforts, we conducted a series of experiments to address the challenges of translation between similar languages. Among the 4 MT systems prepared for this task, one PBSMT system was prepared for each direction of Hindi ↔ Marathi, and one NMT system was developed for each direction using Byte Pair Encoding (BPE) of subwords. The results show that different architectures in NMT could be an effective method for developing MT systems for closely related languages. Our Hindi-Marathi NMT system was ranked 8th among the 14 teams that participated, and our Marathi-Hindi NMT system was ranked 8th among the 11 teams that participated in the task.

1 Introduction

Developing automated relations between closely related languages is a contemporary concern, especially in the domain of Machine Translation (MT). Hindi and Marathi exhibit a significant overlap in their vocabularies and strong syntactic and lexical similarities. These striking similarities seem promising for enhancing the possibility of mutual inter-comprehension within closely related languages. However, automated translation between such closely related languages is a rather challenging task.

The linguistic similarities and regularities in morphological variation and orthography motivate the use of character-level translation models, which have been applied to translation (Vilar et al., 2007; Chakravarthi et al., 2020) and transliteration (Matthews, 2007; Chakravarthi et al., 2019a; Chakravarthi, 2020). In the past few years, neural machine translation systems have achieved outstanding performance on high-resource languages with the help of open-source toolkits such as OpenNMT (Klein et al., 2017), Marian (Junczys-Dowmunt et al., 2018) and Nematus (Sennrich et al., 2017), which provide various ways of experimenting with different features and architectures; yet NMT fails to achieve the same results with low-resource languages (Chakravarthi et al., 2018, 2019b). However, Sennrich and Zhang (2019) revisited NMT models, tuned hyper-parameters and changed network architectures to optimize NMT for low-resource conditions, and concluded that low-resource NMT is very sensitive to hyper-parameters such as Byte Pair Encoding (BPE) vocabulary size, word dropout, and others.

This paper is an extension of our work Ojha et al. (2019) submitted to the WMT 2019 similar language translation task. Our team therefore adapted the low-resource NMT methods proposed by Sennrich and Zhang (2019) to explore the following broad objectives:

  • to compare the performance of SMT and NMT in the case of closely related, relatively low-resourced language pairs;
  • to find out how to leverage the accuracy of NMT on closely related languages using BPE into subwords; and
  • to analyze the effects of data quality on the performance of the systems.

2 System Description

This section provides an overview of the systems developed for the WMT 2020 Shared Task. In these experiments, the NUIG-Panlingua-KMI team explored two different approaches: phrase-based statistical (Koehn et al., 2003) and neural methods for the Hindi-Marathi and Marathi-Hindi language pairs. In all the submitted systems, we use the Moses (Koehn et al., 2007) and Nematus (Sennrich et al., 2017) toolkits for developing the statistical and neural
Proceedings of the 5th Conference on Machine Translation (WMT), pages 418–423, Online, November 19–20, 2020. © 2020 Association for Computational Linguistics
machine translation systems, respectively. Pre-processing was done to handle noise in the data (for example, sentences in other languages, non-UTF characters, etc.), the details of which are provided in Section 3.1.

2.1 Phrase-based SMT Systems

These systems were built on the Moses open-source toolkit using the KenLM (Heafield, 2011) language model and the GIZA++ (Och and Ney, 2003) aligner. The 'grow-diag-final-and' heuristic was used to extract phrases from the corresponding parallel corpora. In addition, KenLM was used to build 5-gram language models.

2.2 Neural Machine Translation System

Nematus was used to build 2 NMT systems. As mentioned in an earlier section, the data was first pre-processed at the subword level with BPE for neural translation, and then the systems were trained using the Nematus toolkit. Most of the system features were adopted from (Sennrich et al., 2017; Koehn and Knowles, 2017) (see Section 3.3.2).

2.3 Assessment

Assessment of these systems was done with the standard automatic evaluation metrics: BLEU (Papineni et al., 2002), Rank-based Intuitive Bilingual Evaluation Score (RIBES) (Isozaki et al., 2010) and Translation Error Rate (TER) (Snover et al., 2006).

3 Experiments

This section briefly describes the experimental settings for developing the systems.

3.1 Data Preparation

The parallel data-set for these experiments was provided by the WMT Similar Translation Shared Task organisers (http://www.statmt.org/wmt20/similar.html), and the Marathi monolingual data-set was taken from the WMT 2020 Shared Task: Parallel Corpus Filtering for Low-Resource Conditions (https://wmt20similar.cs.upc.edu/). The parallel data was sub-divided into training, tuning, and monolingual sets, as detailed in Table 1. However, the shared data was very noisy. To enhance the data quality, the team had to undertake an extensive pre-processing session focused on identifying and cleaning the data-sets.

Out of 43274 training sentences, the Hindi corpus had Telugu sentences while the Marathi corpus had Meitei sentences intermingled, as shown in the first row of Figure 1. The parallel data had more than 1192 lines that were not comparable with each other, as shown in the second and third rows of Figure 1, where some Hindi sentences had only half the sentence translated in Marathi (second row) and some had blank spaces against their Marathi counterparts (third row). The translation quality of the parallel data was also not up to the mark; in fact, the team could locate a few instances of synthetic data. There were also a few sentences where character encoding was an issue and which were hence completely unintelligible.

    Language Pair      Training   Tuning   Monolingual
    Hindi ↔ Marathi    43274      1411     -
    Marathi            -          -        326748
    Hindi              -          -        75348193

Table 1: Statistics of parallel and monolingual sentences of the Hindi and Marathi languages

3.2 Pre-processing

The following pre-processing steps were performed as part of the experiments:

  a) Both corpora were tokenized and cleaned (sentences of length over 80 words were removed).

  b) For neural translation, the training, validation and test data was pre-processed into subword BPE format. This format was used to prepare the BPE codes and the vocabulary used further.

All these processes were performed using Moses scripts. However, the tokenization was done with the RGNLP team tokenizer (Ojha et al., 2018) and the Indic NLP library (https://github.com/anoopkunchukuttan/indic_nlp_library). These tokenizers were used since Moses does not provide a tokenizer for Indic languages. The RGNLP tokenizer also ensured that the canonical Unicode representation of the characters is retained.

3.3 Development of the NUIG-Panlingua-KMI MT Systems

After removing noise and pre-processing the data, the following steps were followed to build the NUIG-Panlingua-KMI MT systems:
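As a brief aside on the subword step of Section 3.2, Byte Pair Encoding can be sketched as an iterative merge of the most frequent adjacent symbol pair. The snippet below is a minimal toy version for illustration only; the actual systems used standard BPE tooling with a vocabulary size of 30000 (Section 3.3.2), and the sample word frequencies are invented:

```python
import re
from collections import Counter

def get_pair_stats(vocab):
    """Count adjacent symbol pairs over a {symbol-sequence: freq} vocab."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Merge every occurrence of `pair` into a single new symbol."""
    bigram = re.escape(' '.join(pair))
    pattern = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    return {pattern.sub(''.join(pair), word): freq for word, freq in vocab.items()}

def learn_bpe(word_freqs, num_merges):
    # Represent each word as space-separated characters plus an end-of-word marker.
    vocab = {' '.join(list(w)) + ' </w>': f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_stats(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges

# Toy corpus: frequent character pairs are merged first.
merges = learn_bpe({'low': 7, 'lower': 5, 'newest': 2}, 10)
```

Applying the learned merge operations to held-out text segments rare words into known subword units, which is what lets the model share vocabulary between Hindi and Marathi.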
                                       Figure 1: Examples of discrepancies in Hindi-Marathi parallel data
                                             Figure 2: Analysis of the PBSMT and NMT Systems
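The corpus cleaning described in Sections 3.1 and 3.2, dropping over-long sentences and pairs with blank counterparts, amounts to a simple filter over the parallel data. The sketch below is illustrative only (the team's actual scripts also dealt with intermingled Telugu/Meitei sentences and encoding issues, which are not modelled here):

```python
def clean_parallel(src_lines, tgt_lines, max_len=80):
    """Keep sentence pairs where both sides are non-empty and no longer
    than max_len whitespace tokens (cf. Section 3.2 (a))."""
    kept = []
    for src, tgt in zip(src_lines, tgt_lines):
        src, tgt = src.strip(), tgt.strip()
        if not src or not tgt:
            # Blank counterpart lines, as in the third row of Figure 1.
            continue
        if len(src.split()) > max_len or len(tgt.split()) > max_len:
            # Over-long sentences are removed before training.
            continue
        kept.append((src, tgt))
    return kept
```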
3.3.1 Building Primary MT Systems:

As previously mentioned, the Hindi-Marathi and Marathi-Hindi PBSMT systems were built as the primary submissions using Moses. The language model was built first, using KenLM. For the Marathi-Hindi and Hindi-Marathi language pairs, the language models were trained as 5-gram models. After that, the component models were built independently and combined in a log-linear scheme in which each model was assigned a different weight using the Minimum Error Rate Training (Och, 2003) tuning algorithm. To train and tune the systems, we used 40454 and 1411
parallel sentences, respectively, for all language pairs.

3.3.2 Building Contrastive MT Systems:

As mentioned in the previous section, the Nematus toolkit was used to develop the NMT systems. The training was done at the subword and character level. All the NMT experiments were carried out only with a data-set that contained sentences of up to 80 words in length. The neural model was trained for 5000 epochs, using Adam with a default learning rate of 0.002, dropout of 0.01 and mini-batches of 80; the batch size for validation was 40. A vocabulary of size 30000 was extracted for both the Marathi-Hindi and Hindi-Marathi language pairs. The remaining parameters were kept at the default hyper-parameter configuration.

4 Evaluation

All the systems were evaluated using the reference set provided by the shared task organizers. The standard MT evaluation metrics BLEU (Papineni et al., 2002), RIBES (Isozaki et al., 2010) and TER (Snover et al., 2006) were used for automatic evaluation. These results were prepared for the Primary and Contrastive system submissions, which are marked in Table 2 as P and C, where P stands for Primary and C stands for Contrastive. The table gives a quantitative picture of particular differences across the systems, especially with reference to the evaluation scores (Table 2).

        System           BLEU    RIBES    TER
    Hindi-Marathi P       9.38   51.88   91.24
    Hindi-Marathi C       9.76   52.18   91.49
    Marathi-Hindi P      17.38   59.31   81.47
    Marathi-Hindi C      17.39   58.84   81.15

Table 2: Accuracy of Hindi ↔ Marathi MT systems on the BLEU, RIBES and TER metrics

4.1 Results

Overall, we see varying performance among the systems submitted to the task, with some performing much better out-of-sample than others. The NUIG-Panlingua-KMI subword NMT system took 8th position for both the Hindi-Marathi and Marathi-Hindi language pairs, across 14 teams. Our subword NMT system for the Marathi-Hindi language pair showed better results in terms of all three metrics (17.39 BLEU, 58.84 RIBES and 81.15 TER), while the Hindi-Marathi language pair scored 9.76 BLEU, 52.18 RIBES and 91.24 TER. Across both language pairs, the subword-based NMT performed better than PBSMT, as its accuracy was higher in BLEU and its error rate lower in TER, as shown in Table 2.

4.2 Analysis

We used the reference set provided by the shared task organizers to evaluate both the PBSMT and NMT systems. Even though the subword-based NMT system could take advantage of the shared features among similar languages, challenges in translating a few linguistic structures acted as a constraint. Example 1 in Figure 2 is one of the challenging structures that the system was unable to translate: in these sentences the systems could not capture the correct tense and aspect, which is past perfect in the source sentence, whereas the NMT system translated it as simple past. The second most common challenging structures that needed special attention were the postpositions, as shown in Examples 2 and 3 in the figure. In most cases, the system over-generalised the sentences in Marathi and generated unnecessary postposition phrases in Hindi, as in Example 2. Similarly, we can see in Example 3 that, while translating from Hindi to Marathi, both the PBSMT and NMT systems used wrong postpositions.

5 Conclusion

Our experimental results reveal that subword-based NMT can take advantage of the relation between similar languages to boost the accuracy of neural machine translation systems in low-resource data settings. As BPE units are variable-length units and the vocabularies used are much smaller than in morpheme- and word-level models, the problem of data sparsity does not occur; on the contrary, BPE provides an appropriate context for translation between similar languages. However, the quality of the data used to train the systems does affect the quality of the translation. Thus, we conclude that shared features between two languages can be an advantage to leverage the accuracy of NMT systems for closely related languages.

Acknowledgments

This publication has emanated from research in part supported by the Irish Research Council under grant number SFI/18/CRT/6223 (CRT-Centre for Research Training in Artificial Intelligence) co-funded by the European Regional Development Fund, as well as by the EU H2020 programme un-
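As a reference note on the evaluation in Section 4, the BLEU metric reported in Table 2 is corpus-level modified n-gram precision combined with a brevity penalty. The following is a bare-bones, single-reference illustration of that computation, not the shared task's official scoring pipeline:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hyps, refs, max_n=4):
    """Corpus-level BLEU with uniform n-gram weights and brevity penalty,
    clipped against a single reference per hypothesis."""
    match = [0] * max_n
    total = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hyps, refs):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            hc, rc = ngrams(h, n), ngrams(r, n)
            match[n - 1] += sum(min(c, rc[g]) for g, c in hc.items())
            total[n - 1] += max(len(h) - n + 1, 0)
    if min(match) == 0:
        return 0.0  # some n-gram order has no matches at all
    log_prec = sum(math.log(m / t) for m, t in zip(match, total)) / max_n
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / max(hyp_len, 1))
    return 100 * bp * math.exp(log_prec)
```

A perfect hypothesis scores 100, and any corpus with no 4-gram overlap scores 0, which is why short noisy outputs are penalised so heavily on this metric.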