Language Relatedness and Lexical Closeness can help Improve Multilingual NMT: IIT Bombay @ MultiIndicNMT WAT 2021

Jyotsana Khatri, Nikhil Saini, Pushpak Bhattacharyya
Department of Computer Science and Engineering
Indian Institute of Technology Bombay, Mumbai, India
{jyotsanak, nikhilra, pb}@cse.iitb.ac.in

Abstract

Multilingual Neural Machine Translation has achieved remarkable performance by training a single translation model for multiple languages. This paper describes our submission (Team ID: CFILT-IITB) for the MultiIndicMT: An Indic Language Multilingual Task at WAT 2021. We train multilingual NMT systems by sharing encoder and decoder parameters, with a language embedding associated with each token in both the encoder and the decoder. Furthermore, we demonstrate the use of transliteration (script conversion) for Indic languages to reduce the lexical gap when training a multilingual NMT system. Finally, we show improvements in performance by training a multilingual NMT system using languages of the same family, i.e., related languages.

1 Introduction

Neural Machine Translation (Sutskever et al., 2014; Bahdanau et al., 2015; Wu et al., 2016) has become the de facto approach for the automatic translation of language pairs. NMT systems with Transformer-based architectures (Vaswani et al., 2017) have achieved competitive accuracy on data-rich language pairs like English-French. However, NMT systems are data-hungry, and only a few pairs of languages have abundant parallel data. For low-resource settings, techniques like transfer learning (Zoph et al., 2016) and the utilization of monolingual data in an unsupervised setting (Artetxe et al., 2018; Lample et al., 2017, 2018) have been shown to increase translation accuracy. Multilingual Neural Machine Translation is an ideal setting for low-resource MT (Lakew et al., 2018) since it allows sharing of encoder-decoder parameters, word embeddings, and joint or separate vocabularies. It also enables zero-shot translation, i.e., translating between language pairs that were not seen during training (Johnson et al., 2017a).

In this paper, we present our system for MultiIndicMT: An Indic Language Multilingual Task at WAT 2021 (Nakazawa et al., 2021). The task covers 10 Indic languages (Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, and Telugu) and English.

To summarize our approach and contributions, we (i) present a multilingual NMT system with a shared encoder-decoder framework, (ii) show results on many-to-one translation, (iii) use transliteration to a common script to handle the lexical gap between languages, (iv) show how grouping languages by their language family helps multilingual NMT, and (v) use language embeddings with each token in both the encoder and the decoder.

2 Related work

2.1 Neural Machine Translation

Neural Machine Translation architectures consist of encoder layers, attention layers, and decoder layers. The NMT framework takes a sequence of words as input; the encoder generates an intermediate representation, conditioned on which the decoder generates an output sequence. The decoder also attends to the encoder states. Bahdanau et al. (2015) introduced encoder-decoder attention, which allows the decoder to soft-search the parts of the source sentence relevant for predicting the next token. The encoder-decoder can be an LSTM framework (Sutskever et al., 2014; Wu et al., 2016), a CNN (Gehring et al., 2017), or Transformer layers (Vaswani et al., 2017). A Transformer layer combines self-attention over the positionally encoded input with a feed-forward neural network, layer normalization, and residual connections. The decoder in the Transformer has an additional encoder-attention layer that attends to the output states of the Transformer encoder.
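As a point of reference for the description above, the following is a minimal, self-contained sketch of a Transformer encoder-decoder built from PyTorch's nn.Transformer, which bundles the self-attention, cross-attention, feed-forward, layer-normalization, and residual components just described. It is an illustration only; the class name and all sizes here are placeholders, not the configuration used in our experiments (Section 4.3).

```python
# Minimal sketch of a Transformer encoder-decoder for NMT (illustration only).
# Positional encodings are omitted for brevity; nn.Transformer does not add them itself.
import torch
import torch.nn as nn

class ToyNMT(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, d_model=512):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, d_model)
        self.tgt_emb = nn.Embedding(tgt_vocab, d_model)
        # nn.Transformer bundles encoder self-attention, decoder self-attention,
        # decoder-to-encoder cross-attention, feed-forward blocks, layer norm,
        # and residual connections, as described in Section 2.1.
        self.transformer = nn.Transformer(d_model=d_model, nhead=8,
                                          num_encoder_layers=6,
                                          num_decoder_layers=6,
                                          batch_first=True)
        self.out = nn.Linear(d_model, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        # A causal mask keeps the decoder from attending to future target tokens.
        tgt_mask = self.transformer.generate_square_subsequent_mask(tgt_ids.size(1))
        hidden = self.transformer(self.src_emb(src_ids), self.tgt_emb(tgt_ids),
                                  tgt_mask=tgt_mask)
        return self.out(hidden)  # per-token logits over the target vocabulary

model = ToyNMT(src_vocab=8000, tgt_vocab=8000)
logits = model(torch.randint(0, 8000, (2, 7)), torch.randint(0, 8000, (2, 5)))
print(logits.shape)  # torch.Size([2, 5, 8000])
```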
NMT is data-hungry, and only a few pairs of languages have abundant parallel data. In recent years, NMT has been accompanied by several techniques that improve the performance of both low- and high-resource language pairs. Back-translation (Sennrich et al., 2016b) is used to augment the parallel data with synthetically generated parallel data obtained by passing monolingual datasets through previously trained models; current NMT systems also perform on-the-fly back-translation to train the model simultaneously. Tokenization methods like Byte Pair Encoding (Sennrich et al., 2016a) are used in almost all NMT models. Pivoting (Cheng et al., 2017) and transfer learning (Zoph et al., 2016) have leveraged language relatedness by indirectly providing the model with more parallel data from related language pairs.

2.2 Multilingual Neural Machine Translation

Multilingual NMT trains a single model on data from multiple language pairs to improve performance. There are different approaches to incorporating multiple language pairs into a single system, such as multi-way NMT, pivot-based NMT, transfer learning, multi-source NMT, and multilingual NMT (Dabre et al., 2020). Multilingual NMT came into the picture because many languages share a certain amount of vocabulary as well as some structural similarity; such languages can be used together to improve the performance of NMT systems. In this paper, our focus is to analyze the performance of multi-source NMT. The simplest approach is to share the parameters of the NMT model across multiple language pairs. These kinds of systems work better if the languages are related to each other. In Johnson et al. (2017b), the encoder, decoder, and attention are shared for the training of multiple language pairs, and a target-language token is added at the beginning of the target sentence while decoding (see the sketch below). Firat et al. (2016) utilize a shared attention mechanism to train multilingual models. Recently, many approaches have been proposed in which monolingual data of multiple languages is used to pre-train a single model with objectives such as masked language modeling and denoising (Lample and Conneau, 2019; Song et al., 2019; Lewis et al., 2020; Liu et al., 2020). Multilingual pre-training followed by multilingual fine-tuning has also proven to be beneficial (Tang et al., 2020).
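For illustration, a small sketch of this language-tagging idea is given below. The tag format and the side of the pair on which the tag is attached vary across implementations (some prepend it to the source sentence, others to the target), and this is not the mechanism used in our submission, which relies on language embeddings instead (Section 4.3); the tag string "<2en>" and the helper name are hypothetical.

```python
# Hypothetical sketch of language tagging for a shared multilingual NMT model
# (in the spirit of Johnson et al., 2017b). The tag tells the shared model
# which language to translate into.
def tag_example(src_tokens, tgt_tokens, tgt_lang):
    """Prepend a target-language tag to the example before training/decoding."""
    tag = f"<2{tgt_lang}>"
    return [tag] + src_tokens, tgt_tokens

src, tgt = tag_example(["यह", "एक", "उदाहरण", "है"],
                       ["this", "is", "an", "example"], "en")
print(src)  # ['<2en>', 'यह', 'एक', 'उदाहरण', 'है']
```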
2.3 Language Relatedness

Telugu, Tamil, Kannada, and Malayalam are Dravidian languages whose speakers are predominantly found in South India, with some speakers in Sri Lanka and a few pockets of speakers in North India. The speakers of these languages constitute around 20% of the Indian population (Kunchukuttan and Bhattacharyya, 2020). Dravidian languages are agglutinative, i.e., long and complex words are formed by stringing together morphemes without changing them in spelling or phonetics. Most Dravidian languages have a clusivity distinction. Hindi, Bengali, Marathi, Gujarati, Oriya, and Punjabi are Indo-Aryan languages and are primarily spoken in North and Central India and in the neighboring countries of Pakistan, Nepal, and Bangladesh. The speakers of these languages constitute around 75% of the Indian population. Both the Dravidian and the Indo-Aryan language families follow the Subject(S)-Object(O)-Verb(V) order.

Grouping languages by their families has inherent advantages because they form a closely related group with several linguistic phenomena shared amongst them. Indo-Aryan languages are morphologically rich and, compared to English, show strong similarities to one another. A language group also shares vocabulary at both the word and the character level: its languages contain similarly spelled words that are derived from the same root.

2.4 Transliteration

Indic languages share a lot of vocabulary, but most of them use different scripts. Nevertheless, these scripts have phoneme overlap and can be converted easily from one to another using a simple rule-based system. To convert all Indic language data into the same script, we use the IndicNLP library (https://github.com/anoopkunchukuttan/indic_nlp_library), which maps between the different Unicode ranges for the conversion. Converting all Indic language scripts to the same script yields a better shared vocabulary and leads to a smaller subword vocabulary (Ramesh et al., 2021).
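A minimal sketch of this script conversion with the IndicNLP library is shown below. The import path, helper name, and example string are assumptions to verify against the installed library version (some IndicNLP components additionally require the indic_nlp_resources package); the point is only that the mapping is a per-character Unicode-offset rule, so cognates end up identically spelled in the common script.

```python
# Sketch: rule-based script conversion with IndicNLP (assumed API:
# indicnlp.transliterate.unicode_transliterate.UnicodeIndicTransliterator).
from indicnlp.transliterate.unicode_transliterate import UnicodeIndicTransliterator

def to_devanagari(line: str, src_lang: str) -> str:
    """Map an Indic-script sentence into Devanagari; 'hi' is used as the
    language code whose script is Devanagari."""
    return UnicodeIndicTransliterator.transliterate(line, src_lang, "hi")

# Example: a Tamil word rendered in Devanagari before building the shared
# subword vocabulary.
print(to_devanagari("தமிழ்", "ta"))
```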
3 System overview

In this section, we describe the systems submitted to the MultiIndicMT task at WAT 2021. We report results for four types of models:

• Bilingual: trained using only the parallel data for a particular language pair (bilingual models).
• All-En: a multilingual many-to-one system trained using all available parallel data of all language pairs.
• IA-En: a multilingual many-to-one system trained using the Indo-Aryan languages from the provided parallel data.
• DR-En: a multilingual many-to-one system trained using the Dravidian languages from the provided parallel data.

To train our multilingual models, we use a shared encoder-decoder Transformer architecture. To handle the lexical gap between Indic languages in the multilingual models, we convert the data of all Indic languages to a common script; we choose Devanagari as the common script (an arbitrary choice). We also perform a comparative study of systems in which the encoder and decoder are shared only between related languages. For this comparative study, we group the provided languages into two parts based on the language families they belong to, i.e., one system is trained from the Indo-Aryan group to English and another from the Dravidian group to English. Indo-Aryan-to-English covers Bengali, Gujarati, Hindi, Marathi, Oriya, and Punjabi to English, and Dravidian-to-English covers Kannada, Malayalam, Tamil, and Telugu to English. We use a shared subword vocabulary of the languages involved when training multilingual models, and a common vocabulary of the source and target languages to train bilingual models.

4 Experimental details

4.1 Dataset

Our models are trained using only the parallel data provided for the task. The size of the available parallel data and its sources are summarized in Table 1. The validation and test data provided in the task are n-way and contain 1000 sentences for validation and 2390 sentences in the test set.

LangPair  Size   Data sources
bn-en     1.70M  alt, cvit-pib, jw, opensubtitles, pmi, tanzil, ted2020, wikimatrix
gu-en     0.51M  bibleuedin, cvit, jw, pmi, ted2020, urst, wikititles
hi-en     3.50M  alt, bibleuedin, cvit-pib, iitb, jw, opensubtitles, pmi, tanzil, ted2020, wikimatrix
kn-en     0.39M  bibleuedin, jw, pmi, ted2020
ml-en     1.20M  bibleuedin, cvit-pib, jw, opensubtitles, pmi, tanzil, ted2020, wikimatrix
mr-en     0.78M  bibleuedin, cvit-pib, jw, pmi, ted2020, wikimatrix
or-en     0.25M  cvit, mtenglish2odia, odiencorp, pmi
pa-en     0.51M  cvit-pib, jw, pmi, ted2020
ta-en     1.40M  cvit-pib, jw, nlpc, opensubtitles, pmi, tanzil, ted2020, ufal, wikimatrix, wikititles
te-en     0.68M  cvit-pib, jw, opensubtitles, pmi, ted2020, wikimatrix

Table 1: Parallel data for the 10 Indic-English language pairs. Size is the number of parallel sentences (in millions). (bn, gu, hi, kn, ml, mr, or, pa, ta, te, and en denote Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, Telugu, and English respectively.)

4.2 Data preprocessing

We tokenize the English data using the Moses tokenizer (Koehn et al., 2007) and the Indian language data using the IndicNLP library. For the multilingual models, we transliterate (script-map) all Indic language data into the Devanagari script using the IndicNLP library. Our aim here is simply to bring the data of all languages into the same script, hence the choice of Devanagari as the common script is arbitrary. We use fastBPE (https://github.com/glample/fastBPE) to learn BPE (Byte Pair Encoding) codes (Bojanowski et al., 2017). For the bilingual models, we use 60000 BPE codes learned over the combined tokenized data of both languages. The number of BPE codes is set to 100000 for All-En, and to 80000 for DR-En and IA-En.

4.3 Experimental Setup

We use six layers in the encoder, six layers in the decoder, 8 attention heads in both the encoder and the decoder, and an embedding dimension of 1024. The encoder and decoder are trained using the Adam optimizer (Kingma and Ba, 2015) with an inverse square root learning rate schedule. We use the same warmup setting as Song et al. (2019), in which the learning rate is increased linearly from 1e-7 to 0.0001 over some initial steps; the warmup phase is set to 4000 steps. We use mini-batches of 2000 tokens and set the dropout to 0.1 (Gal and Ghahramani, 2016). The maximum sentence length is set to 100 after applying BPE. At decoding time, we use greedy decoding. For our experiments, we use the mt steps from the MASS codebase (https://github.com/microsoft/MASS). Our models are trained using only the parallel data provided for the task; we do not train the model with any kind of pretraining objective. We train bilingual models for 100 epochs and multilingual models for 150 epochs, with the epoch size set to 200000 sentences. Due to resource constraints, we train our models for a fixed number of epochs, which does not guarantee convergence. Similar to MASS (Song et al., 2019), language embeddings are added to each token in the encoder and decoder to distinguish between languages; these language embeddings are learnt during training (see the sketch below).
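The following is a minimal sketch of this input construction, under our own naming and shape conventions rather than the actual MASS code: each token's input vector is the sum of a token embedding, a positional embedding, and a learned embedding of the language the sentence is written in.

```python
# Illustrative sketch (not the MASS implementation): per-token language
# embeddings added to token and positional embeddings.
import torch
import torch.nn as nn

class InputEmbedding(nn.Module):
    def __init__(self, vocab_size, n_langs, d_model=1024, max_len=100):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        self.lang = nn.Embedding(n_langs, d_model)  # learned during training

    def forward(self, token_ids, lang_id):
        # token_ids: (batch, seq_len); lang_id: integer index of the language.
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        lang_ids = torch.full_like(token_ids, lang_id)
        return self.tok(token_ids) + self.pos(positions) + self.lang(lang_ids)

emb = InputEmbedding(vocab_size=100000, n_langs=11)     # 10 Indic languages + English
x = emb(torch.randint(0, 100000, (2, 20)), lang_id=3)   # index 3 = some language id
print(x.shape)  # torch.Size([2, 20, 1024])
```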
4.4 Results and Discussion

We report BLEU scores for our four settings, Bilingual, All-En (multilingual many-to-one), IA-En (multilingual many-to-one, Indo-Aryan to English), and DR-En (multilingual many-to-one, Dravidian to English), in Table 2. We use multi-bleu.perl (https://github.com/moses-smt/mosesdecoder/blob/RELEASE-2.1.1/scripts/generic/multi-bleu.perl) to calculate the BLEU scores of the baseline models. The BLEU score is calculated on tokenized reference and hypothesis files, following the evaluation protocol of the MultiIndicMT task (http://lotus.kuee.kyoto-u.ac.jp/WAT/evaluation/automatic_evaluation_systems/automaticEvaluationEN.html); tokenization is performed using the Moses tokenizer (Koehn et al., 2007). For IA-En, DR-En, and All-En, we report the results provided by the organizers. Table 2 also reports the Adequacy-Fluency Metrics (AM-FM) for Machine Translation (MT) Evaluation (Banchs et al., 2015) provided by the organizers.

                    BLEU                         AM-FM
LangPair  Bilingual  IA-En  DR-En  All-En   IA-En     DR-En     All-En
bn-en     18.52      20.18  -      18.48    0.734491  -         0.730379
gu-en     26.51      31.02  -      28.79    0.776935  -         0.765441
hi-en     33.53      33.7   -      30.9     0.791408  -         0.775032
mr-en     21.28      25.5   -      23.57    0.767347  -         0.751917
or-en     22.6       26.34  -      25.05    0.780009  -         0.770941
pa-en     29.92      32.34  -      29.87    0.782112  -         0.772655
kn-en     17.93      -      24.18  24.01    -         0.744802  0.751223
ml-en     19.52      -      22.84  22.1     -         0.745908  0.744459
ta-en     23.62      -      22.75  21.37    -         0.74509   0.742311
te-en     19.89      -      24.02  22.37    -         0.745885  0.743435

Table 2: Results. XX-en is the translation direction; IA, DR, and All denote the Indo-Aryan, Dravidian, and all-Indic multilingual systems respectively. The columns under the BLEU and AM-FM headings report BLEU and AM-FM scores respectively.

The BLEU scores in Table 2 show that the multilingual models outperform the simpler bilingual models. Although we did not submit bilingual models to the shared task, we use them here as a baseline against which to compare the multilingual models. Moreover, upon grouping languages based on their language families, a significant improvement in BLEU scores is observed, owing to less confusion and better learning of the language representations in the shared encoder-decoder architecture. We ob-