NUIG-Panlingua-KMI Hindi↔Marathi MT Systems for Similar Language Translation Task @ WMT 2020

Atul Kr. Ojha!+, Priya Rani!, Akanksha Bansal+, Bharathi Raja Chakravarthi!, Ritesh Kumar*, John P. McCrae!
!Data Science Institute, NUIG, Galway, +Panlingua Language Processing LLP, New Delhi, *Dr. Bhimrao Ambedkar University, Agra
(atulkumar.ojha,priya.rani,bharathi.raja)@insight-centre.org, panlingua@outlook.com, john.mccrae@nuigalway.ie, ritesh78 llh@jnu.ac.in

Proceedings of the 5th Conference on Machine Translation (WMT), pages 418–423, Online, November 19–20, 2020. © 2020 Association for Computational Linguistics

Abstract

The NUIG-Panlingua-KMI submission to WMT 2020 seeks to push the state of the art in the Similar Language Translation task for the Hindi ↔ Marathi language pair. As part of these efforts, we conducted a series of experiments to address the challenges of translation between similar languages. Of the 4 MT systems prepared for this task, one PBSMT system was prepared for each direction of Hindi ↔ Marathi, and one NMT system was developed for each direction using Byte Pair Encoding (BPE) of subwords. The results show that different architectures in NMT could be an effective method for developing MT systems for closely related languages. Our Hindi-Marathi NMT system was ranked 8th among the 14 teams that participated, and our Marathi-Hindi NMT system was ranked 8th among the 11 teams that participated in the task.

1 Introduction

Developing automated relations between closely related languages is a contemporary concern, especially in the domain of Machine Translation (MT). Hindi and Marathi exhibit a significant overlap in their vocabularies and strong syntactic and lexical similarities. These striking similarities seem promising for enhancing mutual inter-comprehension between closely related languages. However, automated translation between such closely related languages is a rather challenging task.

The linguistic similarities and regularities in morphological variation and orthography motivate the use of character-level translation models, which have been applied to translation (Vilar et al., 2007; Chakravarthi et al., 2020) and transliteration (Matthews, 2007; Chakravarthi et al., 2019a; Chakravarthi, 2020). In the past few years, neural machine translation systems have achieved outstanding performance on high-resource languages with the help of open source toolkits such as OpenNMT (Klein et al., 2017), Marian (Junczys-Dowmunt et al., 2018) and Nematus (Sennrich et al., 2017), which provide various ways of experimenting with different features and architectures. Yet NMT fails to achieve the same results on low-resource languages (Chakravarthi et al., 2018, 2019b). However, Sennrich and Zhang (2019) revisited the NMT models, tuned hyper-parameters and changed network architectures to optimize NMT for low-resource conditions, and concluded that low-resource NMT is very sensitive to hyper-parameters such as Byte Pair Encoding (BPE) vocabulary size, word dropout, and others.

This paper is an extension of our work (Ojha et al., 2019) submitted to the WMT 2019 similar language translation task. Our team therefore adapted the low-resource NMT methods proposed by Sennrich and Zhang (2019) to explore the following broad objectives:

• to compare the performance of SMT and NMT in the case of closely related, relatively low-resourced language pairs,
• to find out how to leverage the accuracy of NMT in closely related languages using BPE into subwords, and
• to analyze the effects of data quality on the performance of the systems.

2 System Description

This section provides an overview of the systems developed for the WMT 2020 Shared Task. In these experiments, the NUIG-Panlingua-KMI team explored two different approaches, phrase-based statistical (Koehn et al., 2003) and neural, for the Hindi-Marathi and Marathi-Hindi language pairs. In all the submitted systems, we use the Moses (Koehn et al., 2007) and Nematus (Sennrich et al., 2017) toolkits for developing the statistical and neural machine translation systems respectively. Pre-processing was done to handle noise in the data (for example, sentences in a different language, non-UTF characters etc.), the details of which are provided in section 3.1.

2.1 Phrase-based SMT Systems

These systems were built on the Moses open source toolkit using the KenLM (Heafield, 2011) language model and the GIZA++ (Och and Ney, 2003) aligner. The 'grow-diag-final-and' heuristic was used to extract phrases from the corresponding parallel corpora. In addition to this, KenLM was used to build 5-gram language models.

2.2 Neural Machine Translation System

Nematus was used to build 2 NMT systems. As mentioned in an earlier section, the data was first pre-processed at the subword level with BPE for neural translation, and then the system was trained using the Nematus toolkit. Most of the system features were adopted from Sennrich et al. (2017) and Koehn and Knowles (2017) (see section 3.3.2).

2.3 Assessment

Assessment of these systems was done with the standard automatic evaluation metrics: BLEU (Papineni et al., 2002), Rank-based Intuitive Bilingual Evaluation Score (RIBES) (Isozaki et al., 2010) and Translation Error Rate (TER) (Snover et al., 2006).

3 Experiments

This section briefly describes the experiment settings for developing the systems.

3.1 Data Preparations

The parallel data-set for these experiments was provided by the WMT Similar Translation Shared Task organisers¹ and the Marathi monolingual data-set was taken from the WMT 2020 Shared Task: Parallel Corpus Filtering for Low-Resource Conditions.² The data was sub-divided into training, tuning, and monolingual sets, as detailed in Table 1. However, the shared data was very noisy. To enhance the data quality, the team had to undertake an extensive pre-processing session focused on identifying and cleaning the data-sets.

Out of 43274 training sentences, the Hindi corpus had Telugu sentences while the Marathi corpus had Meitei sentences intermingled, as shown in the first row of Figure 1. The parallel data had more than 1192 lines that were not comparable with each other, as shown in the second and third rows of Figure 1: some Hindi sentences had only half of the sentence translated in Marathi (second row) and some had blank spaces against their Marathi counterparts (third row). The translation quality of the parallel data was also not up to the mark. In fact, the team could locate a few instances of synthetic data. There were also a few sentences where character encoding was an issue, which made them completely unintelligible.

[Figure 1: Examples of discrepancies in Hindi-Marathi parallel data]

Language Pair      Training   Tuning   Monolingual
Hindi ↔ Marathi    43274      1411     -
Marathi            -          -        326748
Hindi              -          -        75348193

Table 1: Statistics of Parallel and Monolingual Sentences of the Hindi and Marathi Languages

3.2 Pre-processing

The following pre-processing steps were performed as part of the experiments:

a) Both corpora were tokenized and cleaned (sentences of length over 80 words were removed).

b) For neural translation, the training, validation and test data were preprocessed into subword BPE format. This format was used to prepare the BPE codes and the vocabulary used later.

All these processes were performed using Moses scripts. However, the tokenization was done by the RGNLP team tokenizer (Ojha et al., 2018) and the Indic NLP library.³ These tokenizers were used since Moses does not provide a tokenizer for Indic languages. The RGNLP tokenizer also ensured that the canonical Unicode representation of the characters is retained.

3.3 Development of the NUIG-Panlingua-KMI MT Systems

After removing noise and pre-processing the data, the following steps were followed to build the NUIG-Panlingua-KMI MT systems:

¹ http://www.statmt.org/wmt20/similar.html
² https://wmt20similar.cs.upc.edu/
³ https://github.com/anoopkunchukuttan/indic_nlp_library

3.3.1 Building Primary MT Systems:

As previously mentioned, the Hindi-Marathi and Marathi-Hindi PBSMT systems were built as the primary submission using Moses. The language model was built first, using KenLM. For the Marathi-Hindi and Hindi-Marathi language pairs, 5-gram language models were trained. After that, the systems were built independently and combined in a log-linear scheme in which each model was assigned a different weight using the Minimum Error Rate Training (Och, 2003) tuning algorithm. To train and tune the systems, we used 40454 and 1411 parallel sentences, respectively, for all language pairs.

3.3.2 Building Contrastive MT Systems:

As mentioned in the previous section, the Nematus toolkit was used to develop the NMT systems. The training was done at the subword and character level. All the NMT experiments were carried out only with a data-set that contained sentences with a length of up to 80 words. The neural model was trained for 5000 epochs, using Adam with a default learning rate of 0.002, dropout of 0.01 and mini-batches of 80; the batch size for validation was 40. A vocabulary of size 30000 was extracted for both the Marathi-Hindi and Hindi-Marathi language pairs. The remaining parameters were left at the default hyper-parameter configuration.

4 Evaluation

All the systems were evaluated using the reference set provided by the shared task organizers. The standard MT evaluation metrics, BLEU (Papineni et al., 2002), RIBES (Isozaki et al., 2010) and TER (Snover et al., 2006), were used for automatic evaluation. The results were prepared for the Primary and Contrastive system submissions, which are marked in Table 2 as P and C respectively. Table 2 gives a quantitative picture of particular differences across the systems with reference to the evaluation scores.

System             BLEU    RIBES   TER
Hindi-Marathi P    9.38    51.88   91.24
Hindi-Marathi C    9.76    52.18   91.49
Marathi-Hindi P    17.38   59.31   81.47
Marathi-Hindi C    17.39   58.84   81.15

Table 2: Accuracy of Hindi↔Marathi MT Systems at BLEU, RIBES and TER Metrics

4.1 Results

Overall we see varying performance among the systems submitted to the task, with some performing much better out-of-sample than others. The NUIG-Panlingua-KMI subword NMT system took 8th position for both the Hindi-Marathi and Marathi-Hindi language pairs, across 14 teams. Our subword NMT system for the Marathi-Hindi language pair showed better results in terms of all three metrics (17.39 in BLEU, 58.84 in RIBES and 81.15 in TER), while the system for the Hindi-Marathi language pair scored 9.76 in BLEU, 52.18 in RIBES and 91.24 in TER. Across both language pairs, subword based NMT performed better than PBSMT, as its accuracy was higher in BLEU and lower in TER, as shown in Table 2.

4.2 Analysis

We used the reference set provided by the shared task organizers to evaluate both the PBSMT and NMT systems. Even though the subword based NMT system could take advantage of the shared features among similar languages, challenges in translating a few linguistic structures acted as a constraint. Example 1 in Figure 2 shows one of the challenging structures that the system was unable to translate. In these sentences the systems could not capture the correct tense and aspect: the source sentence is in the past perfect, whereas the NMT system translated it as simple past. The second most common challenging structures that needed special attention were the postpositions, as shown in Examples 2 and 3 in the figure. In most cases, the system over-generalised the sentences in Marathi and generated unnecessary postposition phrases in Hindi, as in Example 2. Similarly, Example 3 shows that when translating from Hindi to Marathi both the PBSMT and NMT systems used wrong postpositions.

[Figure 2: Analysis of the PBSMT and NMT's Systems]

5 Conclusion

Our experimental results reveal that subword based NMT could take advantage of the relation between similar languages to boost the accuracy of neural machine translation systems in low-resource data settings. As BPE units are variable-length units and the vocabularies used are much smaller than in morpheme and word-level models, the problem of data sparsity does not occur. On the contrary, BPE provides an appropriate context for translation between similar languages. However, the quality of the data used to train the systems does affect the quality of translation. Thus, we could conclude that shared features between two languages could be an advantage to leverage the accuracy of NMT systems for closely related languages.

Acknowledgments

This publication has emanated from research in part supported by the Irish Research Council under grant number SFI/18/CRT/6223 (CRT-Centre for Research Training in Artificial Intelligence) co-funded by the European Regional Development Fund as well as by the EU H2020 programme un-
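The cleaning described in sections 3.1 and 3.2 (dropping pairs with a blank side, pairs longer than 80 words, and sentences written in another script such as Telugu) could be sketched roughly as follows. This is a minimal illustration and not the team's actual pipeline: the function names, the use of the Devanagari Unicode block (U+0900 to U+097F) as a script check, and the 0.5 ratio threshold are all assumptions made here; only the 80-word limit comes from the paper.

```python
import unicodedata

MAX_WORDS = 80              # length cut-off stated in section 3.2
MIN_DEVANAGARI_RATIO = 0.5  # hypothetical threshold, not taken from the paper


def devanagari_ratio(sentence: str) -> float:
    """Fraction of letter and mark characters that lie in the Devanagari block."""
    relevant = [ch for ch in sentence if unicodedata.category(ch)[0] in ("L", "M")]
    if not relevant:
        return 0.0
    in_block = sum(1 for ch in relevant if "\u0900" <= ch <= "\u097F")
    return in_block / len(relevant)


def keep_pair(hindi: str, marathi: str) -> bool:
    """Return True if a sentence pair survives all three (assumed) filters."""
    for sentence in (hindi, marathi):
        words = sentence.split()
        if not words:                      # blank side of the pair
            return False
        if len(words) > MAX_WORDS:         # over the 80-word limit
            return False
        if devanagari_ratio(sentence) < MIN_DEVANAGARI_RATIO:
            return False                   # mostly written in another script
    return True


def clean_corpus(pairs):
    """Filter a list of (hindi, marathi) tuples."""
    return [(h, m) for h, m in pairs if keep_pair(h, m)]
```

A script check of this kind can only catch foreign-script lines; half-translated pairs and synthetic data, which the paper also reports, would need a different test such as a length-ratio or alignment-score filter.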
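The subword step in section 3.2 b) relies on learning BPE merge operations from the training data. The paper does not say which BPE implementation was used, so the following is only a sketch of the commonly described merge-learning procedure: start from characters, then repeatedly merge the most frequent adjacent symbol pair. The function names and the toy English word counts are assumptions for illustration; a real system would learn tens of thousands of merges from the tokenized Hindi and Marathi corpora.

```python
import re
from collections import Counter


def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(pair, vocab):
    """Replace every occurrence of the symbol pair with its concatenation."""
    bigram = re.escape(" ".join(pair))
    # Lookarounds make sure we only match whole, space-delimited symbols.
    pattern = re.compile(r"(?<!\S)" + bigram + r"(?!\S)")
    merged = "".join(pair)
    return {pattern.sub(merged, word): freq for word, freq in vocab.items()}


def learn_bpe(word_freqs, num_merges):
    """Learn up to num_merges merge operations from a word-frequency dict."""
    # Each word becomes space-separated characters plus an end-of-word marker.
    vocab = {" ".join(word) + " </w>": freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # ties go to the first pair seen
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges, vocab


if __name__ == "__main__":
    toy_counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
    merges, vocab = learn_bpe(toy_counts, 3)
    print(merges)
    print(vocab)
```

Because Hindi and Marathi share a script and much vocabulary, merges learned on one language often apply to the other, which is one plausible reason subword models suit this language pair.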
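The log-linear combination mentioned in section 3.3.1 is usually written as below. This is the standard textbook formulation for phrase-based SMT rather than an equation taken from the paper, and the symbols are introduced here only for illustration: f is the source sentence, e a candidate translation, h_i the feature functions (for example the phrase translation model and the 5-gram language model), and λ_i the weights that Minimum Error Rate Training adjusts on the tuning set.

```latex
\hat{e} \;=\; \operatorname*{arg\,max}_{e} \; \sum_{i=1}^{n} \lambda_i \, h_i(e, f)
```

Under this view, tuning on the 1411 tuning sentences amounts to searching for the weight vector (λ_1, ..., λ_n) whose best-scoring translations give the lowest error against the references.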