Language Relatedness and Lexical Closeness can help Improve Multilingual NMT: IIT Bombay @ MultiIndicNMT WAT 2021

Jyotsana Khatri, Nikhil Saini, Pushpak Bhattacharyya
Department of Computer Science and Engineering
Indian Institute of Technology Bombay, Mumbai, India
{jyotsanak, nikhilra, pb}@cse.iitb.ac.in

Abstract

Multilingual Neural Machine Translation has achieved remarkable performance by training a single translation model for multiple languages. This paper describes our submission (Team ID: CFILT-IITB) for the MultiIndicMT: An Indic Language Multilingual Task at WAT 2021. We train multilingual NMT systems by sharing encoder and decoder parameters, with a language embedding associated with each token in both the encoder and the decoder. Furthermore, we demonstrate the use of transliteration (script conversion) for Indic languages to reduce the lexical gap when training a multilingual NMT system. Finally, we show improvements in performance by training a multilingual NMT system using languages of the same family, i.e., related languages.

1 Introduction

Neural Machine Translation (Sutskever et al., 2014; Bahdanau et al., 2015; Wu et al., 2016) has become the de facto approach for the automatic translation of language pairs. NMT systems with Transformer-based architectures (Vaswani et al., 2017) have achieved competitive accuracy on data-rich language pairs like English-French. However, NMT systems are data-hungry, and only a few pairs of languages have abundant parallel data. For low-resource settings, techniques like transfer learning (Zoph et al., 2016) and the utilization of monolingual data in an unsupervised setting (Artetxe et al., 2018; Lample et al., 2017, 2018) have been shown to increase translation accuracy. Multilingual Neural Machine Translation is an ideal setting for low-resource MT (Lakew et al., 2018) since it allows sharing of encoder-decoder parameters, word embeddings, and joint or separate vocabularies. It also enables zero-shot translation, i.e., translating between language pairs that were not seen during training (Johnson et al., 2017a).

In this paper, we present our system for MultiIndicMT: An Indic Language Multilingual Task at WAT 2021 (Nakazawa et al., 2021). The task covers 10 Indic languages (Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, and Telugu) and English.

To summarize our approach and contributions, we (i) present a multilingual NMT system with a shared encoder-decoder framework, (ii) show results on many-to-one translation, (iii) use transliteration to a common script to handle the lexical gap between languages, (iv) show how grouping languages by their language family helps multilingual NMT, and (v) use language embeddings with each token in both the encoder and the decoder.

2 Related work

2.1 Neural Machine Translation

Neural Machine Translation architectures consist of encoder layers, attention layers, and decoder layers. The NMT framework takes a sequence of words as input; the encoder generates an intermediate representation, conditioned on which the decoder generates an output sequence. The decoder also attends to the encoder states. Bahdanau et al. (2015) introduced encoder-decoder attention, which allows the decoder to soft-search the parts of the source sentence relevant for predicting the next token. The encoder-decoder can be an LSTM framework (Sutskever et al., 2014; Wu et al., 2016), a CNN (Gehring et al., 2017), or Transformer layers (Vaswani et al., 2017). A Transformer layer combines self-attention over the positionally encoded input with a feed-forward neural network, layer normalization, and residual connections. The decoder in the Transformer has an additional encoder-attention layer that attends to the output states of the Transformer encoder.
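As a point of reference for the description above, the following is a minimal, self-contained sketch of a Transformer encoder-decoder built from PyTorch's nn.Transformer, which bundles the self-attention, cross-attention, feed-forward, layer-normalization, and residual components just described. It is an illustration only; the class name and all sizes here are placeholders, not the configuration used in our experiments (Section 4.3).

```python
# Minimal sketch of a Transformer encoder-decoder for NMT (illustration only).
# Positional encodings are omitted for brevity; nn.Transformer does not add them itself.
import torch
import torch.nn as nn

class ToyNMT(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, d_model=512):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, d_model)
        self.tgt_emb = nn.Embedding(tgt_vocab, d_model)
        # nn.Transformer bundles encoder self-attention, decoder self-attention,
        # decoder-to-encoder cross-attention, feed-forward blocks, layer norm,
        # and residual connections, as described in Section 2.1.
        self.transformer = nn.Transformer(d_model=d_model, nhead=8,
                                          num_encoder_layers=6,
                                          num_decoder_layers=6,
                                          batch_first=True)
        self.out = nn.Linear(d_model, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        # A causal mask keeps the decoder from attending to future target tokens.
        tgt_mask = self.transformer.generate_square_subsequent_mask(tgt_ids.size(1))
        hidden = self.transformer(self.src_emb(src_ids), self.tgt_emb(tgt_ids),
                                  tgt_mask=tgt_mask)
        return self.out(hidden)  # per-token logits over the target vocabulary

model = ToyNMT(src_vocab=8000, tgt_vocab=8000)
logits = model(torch.randint(0, 8000, (2, 7)), torch.randint(0, 8000, (2, 5)))
print(logits.shape)  # torch.Size([2, 5, 8000])
```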
NMT is data-hungry, and only a few pairs of languages have abundant parallel data. In recent years, NMT has been accompanied by several techniques that improve the performance of both low- and high-resource language pairs. Back-translation (Sennrich et al., 2016b) is used to augment the parallel data with synthetically generated parallel data obtained by passing monolingual datasets through previously trained models; current NMT systems also perform on-the-fly back-translation to train the model simultaneously. Tokenization methods like Byte Pair Encoding (Sennrich et al., 2016a) are used in almost all NMT models. Pivoting (Cheng et al., 2017) and transfer learning (Zoph et al., 2016) have leveraged language relatedness by indirectly providing the model with more parallel data from related language pairs.

2.2 Multilingual Neural Machine Translation

Multilingual NMT trains a single model on data from multiple language pairs to improve performance. There are different approaches to incorporating multiple language pairs into a single system, such as multi-way NMT, pivot-based NMT, transfer learning, multi-source NMT, and multilingual NMT (Dabre et al., 2020). Multilingual NMT came into the picture because many languages share a certain amount of vocabulary as well as some structural similarity; such languages can be used together to improve the performance of NMT systems. In this paper, our focus is to analyze the performance of multi-source NMT. The simplest approach is to share the parameters of the NMT model across multiple language pairs. These kinds of systems work better if the languages are related to each other. In Johnson et al. (2017b), the encoder, decoder, and attention are shared for the training of multiple language pairs, and a target-language token is added at the beginning of the target sentence while decoding (see the sketch below). Firat et al. (2016) utilize a shared attention mechanism to train multilingual models. Recently, many approaches have been proposed in which monolingual data of multiple languages is used to pre-train a single model with objectives such as masked language modeling and denoising (Lample and Conneau, 2019; Song et al., 2019; Lewis et al., 2020; Liu et al., 2020). Multilingual pre-training followed by multilingual fine-tuning has also proven to be beneficial (Tang et al., 2020).
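For illustration, a small sketch of this language-tagging idea is given below. The tag format and the side of the pair on which the tag is attached vary across implementations (some prepend it to the source sentence, others to the target), and this is not the mechanism used in our submission, which relies on language embeddings instead (Section 4.3); the tag string "<2en>" and the helper name are hypothetical.

```python
# Hypothetical sketch of language tagging for a shared multilingual NMT model
# (in the spirit of Johnson et al., 2017b). The tag tells the shared model
# which language to translate into.
def tag_example(src_tokens, tgt_tokens, tgt_lang):
    """Prepend a target-language tag to the example before training/decoding."""
    tag = f"<2{tgt_lang}>"
    return [tag] + src_tokens, tgt_tokens

src, tgt = tag_example(["यह", "एक", "उदाहरण", "है"],
                       ["this", "is", "an", "example"], "en")
print(src)  # ['<2en>', 'यह', 'एक', 'उदाहरण', 'है']
```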
2.3 Language Relatedness

Telugu, Tamil, Kannada, and Malayalam are Dravidian languages whose speakers are predominantly found in South India, with some speakers in Sri Lanka and a few pockets of speakers in North India. The speakers of these languages constitute around 20% of the Indian population (Kunchukuttan and Bhattacharyya, 2020). Dravidian languages are agglutinative, i.e., long and complex words are formed by stringing together morphemes without changing them in spelling or phonetics. Most Dravidian languages have a clusivity distinction. Hindi, Bengali, Marathi, Gujarati, Oriya, and Punjabi are Indo-Aryan languages and are primarily spoken in North and Central India and in the neighboring countries of Pakistan, Nepal, and Bangladesh. The speakers of these languages constitute around 75% of the Indian population. Both the Dravidian and the Indo-Aryan language families follow the Subject(S)-Object(O)-Verb(V) order.

Grouping languages by their families has inherent advantages because they form a closely related group with several linguistic phenomena shared amongst them. Indo-Aryan languages are morphologically rich and, compared to English, show strong similarities to one another. A language group also shares vocabulary at both the word and the character level: its languages contain similarly spelled words that are derived from the same root.

2.4 Transliteration

Indic languages share a lot of vocabulary, but most of them use different scripts. Nevertheless, these scripts have phoneme overlap and can be converted easily from one to another using a simple rule-based system. To convert all Indic language data into the same script, we use the IndicNLP library (https://github.com/anoopkunchukuttan/indic_nlp_library), which maps between the different Unicode ranges for the conversion. Converting all Indic language scripts to the same script yields a better shared vocabulary and leads to a smaller subword vocabulary (Ramesh et al., 2021).
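A minimal sketch of this script conversion with the IndicNLP library is shown below. The import path, helper name, and example string are assumptions to verify against the installed library version (some IndicNLP components additionally require the indic_nlp_resources package); the point is only that the mapping is a per-character Unicode-offset rule, so cognates end up identically spelled in the common script.

```python
# Sketch: rule-based script conversion with IndicNLP (assumed API:
# indicnlp.transliterate.unicode_transliterate.UnicodeIndicTransliterator).
from indicnlp.transliterate.unicode_transliterate import UnicodeIndicTransliterator

def to_devanagari(line: str, src_lang: str) -> str:
    """Map an Indic-script sentence into Devanagari; 'hi' is used as the
    language code whose script is Devanagari."""
    return UnicodeIndicTransliterator.transliterate(line, src_lang, "hi")

# Example: a Tamil word rendered in Devanagari before building the shared
# subword vocabulary.
print(to_devanagari("தமிழ்", "ta"))
```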
3 System overview

In this section, we describe the systems submitted to the MultiIndicMT task at WAT 2021. We report results for four types of models:

• Bilingual: trained using only the parallel data for a particular language pair (bilingual models).
• All-En: a multilingual many-to-one system trained using all available parallel data of all language pairs.
• IA-En: a multilingual many-to-one system trained using the Indo-Aryan languages from the provided parallel data.
• DR-En: a multilingual many-to-one system trained using the Dravidian languages from the provided parallel data.

To train our multilingual models, we use a shared encoder-decoder Transformer architecture. To handle the lexical gap between Indic languages in the multilingual models, we convert the data of all Indic languages to a common script; we choose Devanagari as the common script (an arbitrary choice). We also perform a comparative study of systems in which the encoder and decoder are shared only between related languages. For this comparative study, we group the provided languages into two parts based on the language families they belong to, i.e., one system is trained from the Indo-Aryan group to English and another from the Dravidian group to English. Indo-Aryan-to-English covers Bengali, Gujarati, Hindi, Marathi, Oriya, and Punjabi to English, and Dravidian-to-English covers Kannada, Malayalam, Tamil, and Telugu to English. We use a shared subword vocabulary of the languages involved when training multilingual models, and a common vocabulary of the source and target languages to train bilingual models.

4 Experimental details

4.1 Dataset

Our models are trained using only the parallel data provided for the task. The size of the available parallel data and its sources are summarized in Table 1. The validation and test data provided in the task are n-way and contain 1000 sentences for validation and 2390 sentences in the test set.

LangPair  Size   Data sources
bn-en     1.70M  alt, cvit-pib, jw, opensubtitles, pmi, tanzil, ted2020, wikimatrix
gu-en     0.51M  bibleuedin, cvit, jw, pmi, ted2020, urst, wikititles
hi-en     3.50M  alt, bibleuedin, cvit-pib, iitb, jw, opensubtitles, pmi, tanzil, ted2020, wikimatrix
kn-en     0.39M  bibleuedin, jw, pmi, ted2020
ml-en     1.20M  bibleuedin, cvit-pib, jw, opensubtitles, pmi, tanzil, ted2020, wikimatrix
mr-en     0.78M  bibleuedin, cvit-pib, jw, pmi, ted2020, wikimatrix
or-en     0.25M  cvit, mtenglish2odia, odiencorp, pmi
pa-en     0.51M  cvit-pib, jw, pmi, ted2020
ta-en     1.40M  cvit-pib, jw, nlpc, opensubtitles, pmi, tanzil, ted2020, ufal, wikimatrix, wikititles
te-en     0.68M  cvit-pib, jw, opensubtitles, pmi, ted2020, wikimatrix

Table 1: Parallel data for the 10 Indic-English language pairs. Size is the number of parallel sentences (in millions). (bn, gu, hi, kn, ml, mr, or, pa, ta, te, and en denote Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, Telugu, and English respectively.)

4.2 Data preprocessing

We tokenize the English data using the Moses tokenizer (Koehn et al., 2007) and the Indian language data using the IndicNLP library. For the multilingual models, we transliterate (script-map) all Indic language data into the Devanagari script using the IndicNLP library. Our aim here is simply to bring the data of all languages into the same script, hence the choice of Devanagari as the common script is arbitrary. We use fastBPE (https://github.com/glample/fastBPE) to learn BPE (Byte Pair Encoding) codes (Bojanowski et al., 2017). For the bilingual models, we use 60000 BPE codes learned over the combined tokenized data of both languages. The number of BPE codes is set to 100000 for All-En, and to 80000 for DR-En and IA-En.

4.3 Experimental Setup

We use six layers in the encoder, six layers in the decoder, 8 attention heads in both the encoder and the decoder, and an embedding dimension of 1024. The encoder and decoder are trained using the Adam optimizer (Kingma and Ba, 2015) with an inverse square root learning rate schedule. We use the same warmup setting as Song et al. (2019), in which the learning rate is increased linearly from 1e-7 to 0.0001 over some initial steps; the warmup phase is set to 4000 steps. We use mini-batches of 2000 tokens and set the dropout to 0.1 (Gal and Ghahramani, 2016). The maximum sentence length is set to 100 after applying BPE. At decoding time, we use greedy decoding. For our experiments, we use the mt steps from the MASS codebase (https://github.com/microsoft/MASS). Our models are trained using only the parallel data provided for the task; we do not train the model with any kind of pretraining objective. We train bilingual models for 100 epochs and multilingual models for 150 epochs, with the epoch size set to 200000 sentences. Due to resource constraints, we train our models for a fixed number of epochs, which does not guarantee convergence. Similar to MASS (Song et al., 2019), language embeddings are added to each token in the encoder and decoder to distinguish between languages; these language embeddings are learnt during training (see the sketch below).
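The following is a minimal sketch of this input construction, under our own naming and shape conventions rather than the actual MASS code: each token's input vector is the sum of a token embedding, a positional embedding, and a learned embedding of the language the sentence is written in.

```python
# Illustrative sketch (not the MASS implementation): per-token language
# embeddings added to token and positional embeddings.
import torch
import torch.nn as nn

class InputEmbedding(nn.Module):
    def __init__(self, vocab_size, n_langs, d_model=1024, max_len=100):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        self.lang = nn.Embedding(n_langs, d_model)  # learned during training

    def forward(self, token_ids, lang_id):
        # token_ids: (batch, seq_len); lang_id: integer index of the language.
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        lang_ids = torch.full_like(token_ids, lang_id)
        return self.tok(token_ids) + self.pos(positions) + self.lang(lang_ids)

emb = InputEmbedding(vocab_size=100000, n_langs=11)     # 10 Indic languages + English
x = emb(torch.randint(0, 100000, (2, 20)), lang_id=3)   # index 3 = some language id
print(x.shape)  # torch.Size([2, 20, 1024])
```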
4.4 Results and Discussion

We report BLEU scores for our four settings, Bilingual, All-En (multilingual many-to-one), IA-En (multilingual many-to-one, Indo-Aryan to English), and DR-En (multilingual many-to-one, Dravidian to English), in Table 2. We use multi-bleu.perl (https://github.com/moses-smt/mosesdecoder/blob/RELEASE-2.1.1/scripts/generic/multi-bleu.perl) to calculate the BLEU scores of the baseline models. The BLEU score is calculated on tokenized reference and hypothesis files, following the evaluation protocol of the MultiIndicMT task (http://lotus.kuee.kyoto-u.ac.jp/WAT/evaluation/automatic_evaluation_systems/automaticEvaluationEN.html); tokenization is performed using the Moses tokenizer (Koehn et al., 2007). For IA-En, DR-En, and All-En, we report the results provided by the organizers. Table 2 also reports the Adequacy-Fluency Metrics (AM-FM) for Machine Translation (MT) Evaluation (Banchs et al., 2015) provided by the organizers.

                    BLEU                         AM-FM
LangPair  Bilingual  IA-En  DR-En  All-En   IA-En     DR-En     All-En
bn-en     18.52      20.18  -      18.48    0.734491  -         0.730379
gu-en     26.51      31.02  -      28.79    0.776935  -         0.765441
hi-en     33.53      33.7   -      30.9     0.791408  -         0.775032
mr-en     21.28      25.5   -      23.57    0.767347  -         0.751917
or-en     22.6       26.34  -      25.05    0.780009  -         0.770941
pa-en     29.92      32.34  -      29.87    0.782112  -         0.772655
kn-en     17.93      -      24.18  24.01    -         0.744802  0.751223
ml-en     19.52      -      22.84  22.1     -         0.745908  0.744459
ta-en     23.62      -      22.75  21.37    -         0.74509   0.742311
te-en     19.89      -      24.02  22.37    -         0.745885  0.743435

Table 2: Results. XX-en is the translation direction; IA, DR, and All denote the Indo-Aryan, Dravidian, and all-Indic multilingual systems respectively. The columns under the BLEU and AM-FM headings report BLEU and AM-FM scores respectively.

The BLEU scores in Table 2 show that the multilingual models outperform the simpler bilingual models. Although we did not submit bilingual models to the shared task, we use them here as a baseline against which to compare the multilingual models. Moreover, upon grouping languages based on their language families, a significant improvement in BLEU scores is observed, owing to less confusion and better learning of the language representations in the shared encoder-decoder architecture. We ob-