NMT based Similar Language Translation for Hindi-Marathi

Vandan Mujadia and Dipti Misra Sharma
Machine Translation - Natural Language Processing Lab
Language Technologies Research Centre
Kohli Center on Intelligent Systems
International Institute of Information Technology - Hyderabad
vandan.mu@research.iiit.ac.in, dipti@iiit.ac.in

Abstract

This paper describes the participation of team F1toF6 (LTRC, IIIT-Hyderabad) in the WMT 2020 similar language translation task. We experimented with an attention-based recurrent neural network (seq2seq) architecture for this task. We explored the use of different linguistic features, such as POS and morph, along with back translation for Hindi-Marathi and Marathi-Hindi machine translation.

Data                 Sents    Token   Type
Hindi (Parallel)     38,246   7.6M    39K
Marathi (Parallel)   38,246   5.6M    66K
Hindi (Mono)         80M      -       -
Marathi (Mono)       3.2M     -       -

Table 1: Hindi-Marathi WMT2020 training data

1 Introduction

Machine Translation (MT) is the field of Natural Language Processing that aims to translate a text from one natural language (e.g. Hindi) into another (e.g. Marathi). The meaning of the source text must be fully preserved in the translated text in the target language. For this purpose, different types of machine translation systems have been developed.

This paper describes our experiments for the WMT-2020 similar language translation task. We focused only on the Hindi-Marathi language pair (in both directions). The two languages share a common origin, as both are Indo-Aryan languages (wikipedia, 2020). Hindi is said to have evolved from Sauraseni Prakrit (wikipedia Hindi, 2020), whereas Marathi is said to have evolved from Maharashtri Prakrit (wikipedia Marathi, 2020). They have since evolved as two major languages in different regions of India.
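As a quick numeric check on Table 1, the type/token ratios can be computed directly from the reported counts (our own arithmetic, not taken from the paper): Marathi shows more than twice as many distinct word forms per running token as Hindi, consistent with its heavier agglutination.

```python
# Type/token ratios from Table 1 (parallel data): a rough proxy for
# morphological richness -- more distinct forms per running token.
hindi_types, hindi_tokens = 39_000, 7_600_000
marathi_types, marathi_tokens = 66_000, 5_600_000

hindi_ttr = hindi_types / hindi_tokens        # ~0.0051
marathi_ttr = marathi_types / marathi_tokens  # ~0.0118

# Marathi's ratio is more than double Hindi's on this data.
assert marathi_ttr > 2 * hindi_ttr
```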
These systems are mainly rule-based machine translation (RBMT) (Forcada et al., 2011), statistical machine translation (SMT) (Koehn, 2009) and neural machine translation (NMT) (Bahdanau et al., 2014). SMT aims to learn a statistical model of the correspondence between a word in the source language and a word in the target language. NMT is an end-to-end approach to automatic translation that requires no heavy hand-crafted feature engineering. Owing to recent advances, NMT has received much attention and has achieved state-of-the-art performance in language translation. With this work, we intend to examine how NMT systems can be used for low-resource, similar-language machine translation.

In this work, we used a recurrent neural network with an attention-based sequence-to-sequence architecture throughout all experiments. Alongside it, we explored morph-induced (Virpioja et al., 2013) sub-word segmentation with byte pair encoding (BPE) (Sennrich et al., 2016b) to enable open-vocabulary translation. We used POS tags as a linguistic feature and back translation to leverage synthetic data for the translation task in both directions. We participated in the WMT-2020 similar language translation task as team "f1plusf6".

2 Data

We utilised the parallel and monolingual corpora provided for the task for the Hindi<->Marathi language pair. Table 1 describes the training data (parallel and monolingual) on which we carried out all experiments. After a manual quality check, we deliberately excluded the Indic WordNet data from training. As this is a constrained task, our experiments do not utilise any other available data.

Proceedings of the 5th Conference on Machine Translation (WMT), pages 414-417, Online, November 19-20, 2020. © 2020 Association for Computational Linguistics

3 Pre-Processing

As a first pre-processing step, we used the IndicNLP Toolkit¹ along with an in-house tokenizer to tokenize and clean both the Hindi and Marathi corpora (train, test, dev and monolingual).

3.1 Morph+BPE Segmentation

Marathi and Hindi are morphologically rich languages, and the comparative token/type ratios in Table 1 show that Marathi is more agglutinative than Hindi. Translating from a morphologically rich, agglutinative language is harder because of its complex morphology and large vocabulary. To address this, we devised a segmentation method that combines morph segmentation with BPE segmentation (Sennrich et al., 2016b) as a pre-processing step.

In this method, we used the unsupervised Morfessor tool (Virpioja et al., 2013) to train a Morfessor model on monolingual data for each language. We then applied this trained Morfessor model to our corpora (train, test, validation) to obtain meaningful stem, morpheme and suffix sub-tokens for each word in each sentence.

We demonstrate this method on a Hindi sentence. Example 1 shows the Hindi text in romanized form, with the corresponding English translation for better understanding. Example 2 shows the same sentence after Morfessor-based segmentation, marked with the token ##. Notice that the Morfessor model has segmented the Hindi words into meaningful stems and suffixes, e.g. maansaahaaree = maansaa + haaree (meat + one who eats). We use this segmentation in our experiments to tackle the difficulties that complex source-side morphology causes in machine translation. On top of this morph-segmented text we applied BPE (Sennrich et al., 2016a), as shown in Example 3. Here @@ is the separator for byte-pair-based segmentation and ## is the separator for morph-based segmentation.

(1) aur jab maansaahaaree pakshee lothon par jhapate , tab abraam ne unhen uda diya .
'And when the carnivorous birds swooped on the carcasses, Abram blew them away.'

(2) aur jab maansaa##haaree pakshee loth##on par jhapat##e , tab ab##raam ne unhen uda diya .
'And when the carnivorous birds swooped on the carcasses, Abram blew them away.'

(3) aur jab maan@@saa##haaree pakshee loth##on par jha@@pat##e , tab ab##raam ne unhen uda diya .
'And when the carnivorous birds swooped on the carcasses, Abram blew them away.'

3.2 Features

For Hindi to Marathi translation, we carried out experiments using Part of Speech (POS) tags as both a word-level and a subword-level feature, as described in (Sennrich and Haddow, 2016). We used the LTRC shallow parser toolkit² to obtain POS tags.

4 Training Configuration

Recurrent neural network (RNN) based machine translation models use an encoder-decoder architecture. The encoder takes the input (source sentence) and encodes it into a single vector (the context vector). The decoder then takes this context vector and generates the output sequence (target sentence) one word at a time (Sutskever et al., 2014). The attention mechanism extends this sequence-to-sequence architecture so that the model need not compress the whole input into a single vector: based on learnt attention weights, it focuses on specific source-side words while generating each target word. More details can be found in (Bahdanau et al., 2014) and (Luong et al., 2015).

For our experiments, we used a sequence-to-sequence NMT model with attention throughout, with the following configuration.
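To make the attention computation concrete, here is a minimal pure-Python sketch of Luong "general" attention — score(h_t, h_s) = h_t' W_a h_s for each source state, softmax-normalised over source positions, then a weighted sum as the context vector. This is our illustration with toy dimensions, not the authors' code.

```python
import math

def luong_general_attention(h_t, H_s, W_a):
    """Luong 'general' attention: score(h_t, h_s) = h_t^T W_a h_s."""
    def matvec(M, v):
        return [sum(m * x for m, x in zip(row, v)) for row in M]

    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    # Unnormalised scores against every source hidden state.
    scores = [dot(h_t, matvec(W_a, h_s)) for h_s in H_s]
    # Softmax over source positions gives the attention weights.
    exps = [math.exp(s - max(scores)) for s in scores]
    weights = [e / sum(exps) for e in exps]
    # Context vector: attention-weighted sum of source states.
    context = [sum(w * h_s[i] for w, h_s in zip(weights, H_s))
               for i in range(len(H_s[0]))]
    return weights, context

# Toy example: 2-dimensional states, 3 source positions, identity W_a
# (which reduces 'general' scoring to plain dot-product attention).
h_t = [1.0, 0.0]
H_s = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
W_a = [[1.0, 0.0], [0.0, 1.0]]
weights, context = luong_general_attention(h_t, H_s, W_a)
```

The first source state, most similar to the decoder state, receives the largest weight, so the context vector leans toward it.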
¹ http://anoopkunchukuttan.github.io/indic_nlp_library/
² http://ltrc.iiit.ac.in/analyzer/

Model                 Feature                          BPE (merge ops)  BLEU
BiLSTM + Luong Attn   Word level                       -                19.70
BiLSTM + Luong Attn   Word + Shared Vocab (SV) + POS   -                20.49
BiLSTM + Luong Attn   BPE                              10K              20.1
BiLSTM + Luong Attn   BPE + SV + MORPH Segmentation    10K              20.44
BiLSTM + Luong Attn   BPE + SV + MORPH + POS           10K              20.62
BiLSTM + Luong Attn   BPE + SV + MORPH + POS + BT      10K              16.49

Table 2: BLEU scores on development data for Hindi-Marathi

Model                 Feature                          BPE (merge ops)  BLEU
BiLSTM + Luong Attn   Word level                       -                21.42
BiLSTM + Luong Attn   Word + Shared Vocab (SV)         -                23.84
BiLSTM + Luong Attn   BPE                              20K              24.56
BiLSTM + Luong Attn   BPE + SV + MORPH Segmentation    20K              25.36
BiLSTM + Luong Attn   BPE + SV + MORPH + POS           20K              25.55
BiLSTM + Luong Attn   BPE + SV + MORPH + POS + BT      20K              23.80

Table 3: BLEU scores on development data for Marathi-Hindi

• Morph + BPE based subword segmentation, POS tags as feature
• Embedding size: 500
• RNN for encoder and decoder: bi-LSTM
• Bi-LSTM dimension: 500
• Encoder-decoder layers: 2
• Attention: Luong (general)
• Copy attention (Gu et al., 2016) on a dynamically generated dictionary
• Label smoothing: 1.0
• Dropout: 0.30
• Optimizer: Adam
• Beam size: 4 (train) and 10 (test)

As these two similar languages share a writing script and a large set of named entities, we used a shared vocabulary across training. We used the OpenNMT-py toolkit (Klein et al., 2020) with the above configuration for our experiments.

5 Back Translation

Back translation is a widely used data augmentation method for low-resource neural machine translation (Sennrich et al., 2016a). We utilised monolingual data (e.g. of Marathi) and an NMT model trained on the given training data for one direction (i.e. Marathi to Hindi) to enrich the training data for the opposite direction (i.e. Hindi-Marathi) with synthetic data. We used around 5M back-translated pairs (after perplexity-based pruning with respect to sentence length) for both translation directions.

Using the configuration described above, we performed experiments with different parameter (feature) configurations. We trained and tested our models at the word level, BPE level and morph+BPE level for input and output. We also used a POS tagger and experimented with a shared vocabulary across the translation task. The results are discussed in the following section.

6 Result

Tables 2 and 3 show the performance of systems with different configurations, in terms of BLEU score (Papineni et al., 2002), for Hindi-Marathi and Marathi-Hindi respectively on the validation data. We achieved development BLEU scores of 20.62 and 25.55, and test BLEU scores of 5.94 and 18.14, for the Hindi-Marathi and Marathi-Hindi systems respectively.

The results show that in low-resource, similar-language settings, sequence-to-sequence MT models can be improved with linguistic information such as morph-based segmentation and POS features. They also show that morph-based segmentation combined with byte pair encoding improves the BLEU score in both directions, with Marathi-Hindi showing a considerable improvement. Our method is therefore most helpful when translating from the morphologically richer language (Marathi) into the comparatively less morphologically rich one (Hindi).

The results also suggest that using back-translated synthetic data for this low-resource language pair reduces overall performance marginally. A possible reason is that, given the small amount of training data, the NMT models over-learn it, and back translation may in fact be helping them generalise better even though the development BLEU score drops.

7 Conclusion

We conclude from our experiments that linguistic-feature-driven NMT for similar low-resource languages is a promising approach. We also believe that morph+BPE segmentation is a promising segmentation method for morphologically richer languages.

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Mikel L. Forcada, Mireia Ginestí-Rosell, Jacob Nordfalk, Jim O'Regan, Sergio Ortiz-Rojas, Juan Antonio Pérez-Ortiz, Felipe Sánchez-Martínez, Gema Ramírez-Sánchez, and Francis M. Tyers. 2011. Apertium: a free/open-source platform for rule-based machine translation. Machine Translation, 25(2):127-144.

Jiatao Gu, Zhengdong Lu, Hang Li, and Victor O.K. Li. 2016. Incorporating copying mechanism in sequence-to-sequence learning. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1631-1640.

Guillaume Klein, François Hernandez, Vincent Nguyen, and Jean Senellart. 2020. The OpenNMT neural machine translation toolkit: 2020 edition. In Proceedings of the 14th Conference of the Association for Machine Translation in the Americas (AMTA 2020), pages 102-109.

Philipp Koehn. 2009. Statistical Machine Translation. Cambridge University Press.

Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1412-1421.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311-318.

Rico Sennrich and Barry Haddow. 2016. Linguistic input features improve neural machine translation. In Proceedings of the First Conference on Machine Translation: Volume 1, Research Papers, pages 83-91.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016a. Improving neural machine translation models with monolingual data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 86-96.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016b. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715-1725.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104-3112.

Sami Virpioja, Peter Smit, Stig-Arne Grönroos, Mikko Kurimo, et al. 2013. Morfessor 2.0: Python implementation and extensions for Morfessor Baseline.

wikipedia. 2020. Indo-Aryan languages - Wikipedia. https://en.wikipedia.org/wiki/Indo-Aryan_languages. (Accessed on 08/17/2020).

wikipedia Hindi. 2020. Shauraseni Prakrit - Wikipedia. https://en.wikipedia.org/wiki/Shauraseni_Prakrit. (Accessed on 08/15/2020).

wikipedia Marathi. 2020. Maharashtri Prakrit - Wikipedia. https://en.wikipedia.org/wiki/Maharashtri_Prakrit. (Accessed on 08/15/2020).