JU-Saarland Submission in the WMT 2019 English–Gujarati Translation Shared Task

Riktim Mondal¹*, Shankha Raj Nayek¹*, Aditya Chowdhury¹*, Santanu Pal², Sudip Kumar Naskar¹, Josef van Genabith²
¹Jadavpur University, Kolkata, India
²Saarland University, Germany
{riktimrules,shankharaj29,adityachowdhury21}@gmail.com
{santanu.pal,josef.vangenabith}@uni-saarland.de
sudip.naskar@cse.jdvu.ac.in
*These three authors have contributed equally.

Abstract

In this paper we describe our joint submission (JU-Saarland) from Jadavpur University and Saarland University to the WMT 2019 news translation shared task for the English–Gujarati language pair within the translation task sub-track. Our baseline and primary submissions are built using a recurrent neural network (RNN) based neural machine translation (NMT) system with an attention mechanism, followed by fine-tuning on in-domain data. Given that the two languages belong to different language families and there is not enough parallel data for this language pair, building a high-quality NMT system for it is a difficult task. We produced synthetic data through back-translation from available monolingual data. We report the automatic evaluation scores of our English–Gujarati and Gujarati–English NMT systems trained at word, byte-pair and character encoding levels, where the word-level RNN is considered the baseline and is used for comparison. Our English–Gujarati system ranked second in the shared task.

1 Introduction

Neural machine translation (NMT) is an approach to machine translation (MT) that uses artificial neural networks to directly model the conditional probability p(y|x) of translating a source sentence (x_1, x_2, ..., x_n) into a target sentence (y_1, y_2, ..., y_m). NMT has consistently performed better than phrase-based statistical MT (PB-SMT) approaches and has provided state-of-the-art results in the last few years. However, one of the major constraints of supervised NMT is that it is not suitable for low-resource language pairs. Thus, to use supervised NMT, low-resource pairs need to resort to other techniques to increase the size of the parallel training dataset. In the WMT 2019 news translation shared task, one such resource-scarce language pair is English–Gujarati. Due to the insufficient volume of parallel corpora available to train an NMT system for this language pair, the creation of more actual/synthetic parallel data for low-resource languages such as Gujarati is an important issue.
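For concreteness, the conditional probability p(y|x) mentioned above is usually factorized token by token; the following is a standard formulation given purely as background, and is not tied to the specific architecture used in this paper:

p(y \mid x) = \prod_{t=1}^{m} p(y_t \mid y_{<t}, x)

where y_{<t} = (y_1, ..., y_{t-1}) denotes the previously generated target tokens, and the model parameters are trained to maximize \sum_{t=1}^{m} \log p(y_t \mid y_{<t}, x) over the parallel training data.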
In this paper, we describe the joint participation of Jadavpur University and Saarland University in the WMT 2019 news translation task for English–Gujarati and Gujarati–English. The released training data is completely different in domain from the development set, and its size is nowhere near the sizable amount of training data typically required for the success of NMT systems. We use additional synthetic data produced through back-translation from the monolingual corpus. This provides significant improvements in translation performance for both our English–Gujarati and Gujarati–English NMT systems. Our English–Gujarati system was ranked second in terms of BLEU (Papineni et al., 2002) and TER (Snover et al., 2006) in the shared task.

2 Related Works

Dungarwal et al. (Dungarwal et al., 2014) developed a statistical method for machine translation, using a phrase-based method for Hindi–English and a factored method for English–Hindi SMT. They showed improvements over existing SMT systems using pre-processing and post-processing components that generated morphological inflections correctly. Imankulova et al. (Imankulova et al., 2017) showed how back-translation and filtering from monolingual data can be used to build an effective translation system for a low-resource language pair like Japanese–Russian. Sennrich et al. (Sennrich et al., 2016a) showed how back-translation of monolingual data can improve an NMT system. Ramesh et al. (Ramesh and Sankaranarayanan, 2018) demonstrated how an existing model such as a bidirectional recurrent neural network can be used to generate parallel sentences for non-English, low-resource language pairs like English–Tamil and English–Hindi, improving both SMT and NMT systems. Choudhary et al. (Choudhary et al., 2018) showed how to build an NMT system for a low-resource language pair like English–Tamil using techniques such as word embeddings and Byte-Pair Encoding (Sennrich et al., 2016b) to handle out-of-vocabulary words.

3 Data Preparation

For our experiments we used both the parallel and the monolingual corpora released by the WMT 2019 organizers. We back-translate the monolingual corpus and use it as an additional synthetic parallel corpus to train our NMT system. The detailed statistics of the corpora are given in Table 1.

Dataset                      Pairs
Parallel Corpora           192,367
Cleaned Parallel Corpora    64,346
Back-translated Data       219,654
Development Data             1,998
Gujarati Test Data           1,016
English Test Data              998

Table 1: Data statistics of the WMT 2019 English–Gujarati translation shared task.
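The back-translated data in Table 1 is simply monolingual text paired with its automatic translations. As an illustration of how such a synthetic corpus can be assembled and combined with the genuine parallel data, the following minimal Python sketch assumes one sentence per line and hypothetical file names (mono.en, mono.bt.gu, train.en, train.gu); the translation step itself (unsupervised SMT or an online translation API, described later in this section and in Section 4.2) is treated as already done, with its output stored in mono.bt.gu.

# Build a synthetic parallel corpus from monolingual sentences and their
# machine translations, then append it to the cleaned parallel training data.
# All file names are hypothetical placeholders, not those used in the paper.

def read_lines(path):
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f]

mono_src = read_lines("mono.en")      # monolingual English sentences
mono_mt  = read_lines("mono.bt.gu")   # their machine translations into Gujarati

assert len(mono_src) == len(mono_mt), "monolingual and translated files must align line by line"

# Keep only non-empty, aligned pairs so that no new noise is introduced.
synthetic = [(s, t) for s, t in zip(mono_src, mono_mt) if s and t]

with open("train.plus-bt.en", "w", encoding="utf-8") as f_en, \
     open("train.plus-bt.gu", "w", encoding="utf-8") as f_gu:
    # First copy the genuine (cleaned) parallel corpus ...
    for s, t in zip(read_lines("train.en"), read_lines("train.gu")):
        f_en.write(s + "\n")
        f_gu.write(t + "\n")
    # ... then append the synthetic pairs.
    for s, t in synthetic:
        f_en.write(s + "\n")
        f_gu.write(t + "\n")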
We performed our experiments on two datasets: one using the parallel corpus provided by WMT 2019 for the Gujarati–English news translation shared task, and the other using the parallel corpus combined with back-translated sentences from the provided monolingual corpus (only the News crawl corpus was used for back-translation) for the same language pair. Since the released parallel corpus was very noisy and contained redundant sentences, we cleaned it; the procedure is described in Section 3.1. In the next step we shuffle the whole corpus, as this reduces variance and makes the model less prone to overfitting. We then split the dataset into three parts: training, validation and test sets. Shuffling is also important for the splitting step, since the test and validation sets must come from the same distribution and must be chosen randomly from the available data. The test set was also shuffled, as this dataset was used for our internal assessment. After cleaning, we randomly selected 64,346 sentence pairs for training, 1,500 sentence pairs for validation and 1,500 sentence pairs as test data. It is to be noted that our validation and test corpora were taken from the released parallel data in order to set up a baseline model. Later, when the WMT19 organizers released the development set, we continued training our models by considering the WMT19 development set as our test set and using a new development set of 3,000 sentences obtained by combining the 1,500 validation and 1,500 test sentences (both taken from the parallel corpus as stated above). While training our final model, the released development set was used. After cleaning, it was obvious that the amount of training data was not enough to train a neural system for such a low-resource language pair. Therefore, a large volume of parallel data needs to be prepared, which can be produced either by manual translation by professional translators or by scraping parallel data from the internet. However, these processes are costly, tedious and sometimes inefficient (in the case of scraping from the internet). As the released data was insufficient, we use back-translation to generate more training data. For back-translation we applied two methods: first, unsupervised statistical machine translation as described in (Artetxe et al., 2018), and second, the Doc translation API¹ (the API uses Google Translate as of April 2019). We explain the extraction of sentences and the corresponding results obtained with these methods in Section 4.2. The synthetic dataset which we have generated can be found here².

¹https://www.onlinedoctranslator.com/en/
²https://github.com/riktimmondal/Synthetic-Data-WMT19-for-En-Gu-Language-pair

3.1 Data Preprocessing

To train an efficient machine translation system, the available raw parallel corpus needs to be cleaned so that the system produces consistent and reliable translations. The released version of the raw parallel corpus contained redundant pairs which need to be removed to obtain better results, as demonstrated in previous works (Johnson et al., 2017). The redundant pairs are of the following types:

• The source is the same for different targets.
• The source is different for the same target.
• Repeated identical sentence pairs.

The redundancy in the translation pairs makes the model prone to overfitting and hence prevents it from recognizing new features. Thus, one of the sentence pairs is kept while the other redundant pairs are removed. Some sentence pairs contained a mixture of both languages and were also identified as redundant. These pairs strictly need to be eliminated, as otherwise the vocabularies of the individual languages contain alphanumeric characters of the other language, which results in inconsistent encoding and decoding during the encoder-decoder application steps for the considered language pair. We tokenize the English side using the Moses (Koehn et al., 2007) tokenizer and, for Gujarati, we use the Indic NLP library tokenization tool³. Punctuation normalization was also done.

³http://anoopkunchukuttan.github.io/indic_nlp_library/
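The redundancy filtering just described can be approximated in a few lines of Python. The sketch below is illustrative only: file names are hypothetical, the first occurrence of any repeated source or target is the one that is kept, and the Gujarati Unicode block (U+0A80–U+0AFF) is used as a rough test for script mixing; the actual cleaning pipeline used for the shared task may differ in its details.

import re

# Gujarati script lives in the Unicode block U+0A80..U+0AFF.
GUJARATI = re.compile(r"[\u0A80-\u0AFF]")

def is_mixed(en_line, gu_line):
    # Rough heuristic: the English side should contain no Gujarati script,
    # and the Gujarati side should contain at least some Gujarati script.
    return bool(GUJARATI.search(en_line)) or not GUJARATI.search(gu_line)

def clean_corpus(en_path, gu_path, out_en, out_gu):
    seen_pairs, seen_src, seen_tgt = set(), set(), set()
    kept = 0
    with open(en_path, encoding="utf-8") as fe, open(gu_path, encoding="utf-8") as fg, \
         open(out_en, "w", encoding="utf-8") as oe, open(out_gu, "w", encoding="utf-8") as og:
        for en, gu in zip(fe, fg):
            en, gu = en.strip(), gu.strip()
            if not en or not gu or is_mixed(en, gu):
                continue                           # drop empty or script-mixed pairs
            if (en, gu) in seen_pairs:             # repeated identical pair
                continue
            if en in seen_src or gu in seen_tgt:   # same source / same target already seen
                continue
            seen_pairs.add((en, gu)); seen_src.add(en); seen_tgt.add(gu)
            oe.write(en + "\n"); og.write(gu + "\n")
            kept += 1
    return kept

# Example call (hypothetical file names):
# kept = clean_corpus("raw.en", "raw.gu", "clean.en", "clean.gu")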
3.2 Data Postprocessing

Postprocessing, such as detokenization (Klein et al., 2017) and punctuation normalization⁴ (Koehn et al., 2007), was performed on our translated data (on the test set) to produce the final translated output.

⁴normalize-punctuation.perl

4 Experiment Setup

We explain our experimental setups in the following two subsections. The first contains the setup used for our final submission and the second describes all the other supporting experimental setups. We use the OpenNMT toolkit (Klein et al., 2017) for our experiments. We performed several experiments where the parallel corpus is fed to the model in space-separated character format, space-separated word format, and space-separated Byte-Pair Encoding (BPE) format (Sennrich et al., 2016b). For our final (i.e., primary) submission for the English–Gujarati task, the source input words were converted to BPE whereas the Gujarati words were kept as they are. For our Gujarati–English submission, both the source and the target were in simple word-level format.

4.1 Primary System Description

Our primary NMT systems are based on an attention-based uni-directional RNN (Cho et al., 2014) for Gujarati–English and a bi-directional RNN (Cheng et al., 2016) for English–Gujarati.

hyper-parameter             Value
Model-type                  text
Model-dtype                 fp32
Attention-layer             2
Attention-Head/layer        8
Hidden-layers               500
Batch-Size                  256
Training-steps              160,000
Source vocab-size           50,000
Target vocab-size           50,000
learning-rate               warm-up+decay*
global-attention function   softmax
tokenization-strategy       wordpiece
RNN-type                    LSTM

Table 2: Hyper-parameter configurations for Gujarati–English translation using a unidirectional RNN (Cho et al., 2014); *the learning rate was initially set to 1.0.

Table 2 shows the hyper-parameter configurations for our Gujarati–English translation system. We initially trained our model on the cleaned parallel corpus provided by WMT 2019 for up to 100K training steps. Thereafter, we fine-tune this generic model on the domain-specific corpus (containing 219K sentences back-translated using the Doc Translator API), changing the learning rate to 0.5, with decay starting from 130K training steps with a decay factor of 0.5, and keeping the other hyper-parameters the same as in Table 2.

hyper-parameter             Value
Model-type                  text
Model-dtype                 fp32
Encoder-type                BRNN
Attention-layer             2
Attention-Head/layer        8
Hidden-layers               512
Batch-Size                  256
Training-steps              135,000
Source vocab-size           26,859
Target vocab-size           50,000
learning-rate               warm-up+decay
global-attention function   softmax
tokenization-strategy       Byte-pair Encoding
RNN-type                    LSTM

Table 3: Hyper-parameter configurations for English–Gujarati translation using a bi-directional RNN (Cheng et al., 2016).

To build our English–Gujarati translation system, we initially trained a generic model like our Gujarati–English translation system. However, in this case we use different hyper-parameter configurations, as listed in Table 3. Additionally, here we use byte-pair encoding on the English side with 32K merge operations. We do not perform the BPE operation on the Gujarati corpus; we keep the original word format for Gujarati. Our generic model was trained for up to 100K training steps and then fine-tuned on the domain-specific parallel corpus with the English side in BPE and the Gujarati side in word-level format. During fine-tuning, we reduce the learning rate from 1.0 to 0.25, with decay starting from 120K training steps and a decay factor of 0.5. The other hyper-parameter configurations remain unchanged.
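Since the English side of the primary English–Gujarati system is segmented with BPE (32K merge operations) while the Gujarati side stays at word level, the following Python sketch shows how such one-sided segmentation could be prepared with the subword-nmt package (Sennrich et al., 2016b). It is a sketch under assumptions: the file names are hypothetical, the subword-nmt Python interface (learn_bpe / apply_bpe) is assumed to be available, and the paper does not state which BPE implementation was actually used.

from subword_nmt.learn_bpe import learn_bpe
from subword_nmt.apply_bpe import BPE

# 1. Learn 32K BPE merge operations on the English training side only.
with open("train.en", encoding="utf-8") as infile, \
     open("bpe.codes.en", "w", encoding="utf-8") as outfile:
    learn_bpe(infile, outfile, num_symbols=32000)

# 2. Apply the learned codes to the English side of every split;
#    the Gujarati side is left untouched (word-level format).
with open("bpe.codes.en", encoding="utf-8") as codes:
    bpe = BPE(codes)

for split in ("train", "dev", "test"):
    with open(split + ".en", encoding="utf-8") as src, \
         open(split + ".bpe.en", "w", encoding="utf-8") as out:
        for line in src:
            out.write(bpe.segment(line.strip()) + "\n")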
The hyper-parameters used for the English–Gujarati task in our primary system submission were also tested for the reverse direction; however, this did not perform as well as the primary system, and the final system was modified accordingly.

4.2 Other Supporting Experiments

In this section we describe all the supporting experiments that we performed for this shared task, ranging from statistical MT to NMT in both supervised and unsupervised settings. All the results and experiments discussed below were evaluated on the released development set (treating it as the test set). These models were not tested on the released test set, as they produced poor BLEU scores on the development set.

We used a uni-directional RNN with LSTM units trained on the 64,346 pre-processed sentence pairs (cf. Section 3) with 120K training steps and a learning rate of 1.0. For English–Gujarati, where the input was space-separated words on both sides, we achieved the highest BLEU score of 4.15 after fine-tuning on 10K sentences selected from the cleaned parallel corpus, each containing more than 8 tokens (words). The BLEU score dropped to 3.56 when applying BPE on both sides. For the other direction (Gujarati–English), we obtained the highest BLEU scores of 5.13 and 5.09 at the word level and BPE level, respectively. We also tried a transformer-based NMT model (Vaswani et al., 2017), which however gave extremely poor results under similar experimental settings; the highest BLEU we achieved was 0.74 for Gujarati–English and 0.96 for English–Gujarati. The transformer model was trained for up to 100K training steps with a batch size of 64 on a single GPU, and the positional encoding layer size was set to 2.

Since the training data size was not sufficient, we used back-translation to generate additional synthetic sentence pairs from the monolingual corpus released in WMT 2019. We initially used monoses (Artetxe et al., 2018), which is based on unsupervised statistical phrase-based machine translation, to translate the monolingual sentences from English to Gujarati. We used 2M English sentences to train the monoses system. The training process took around 6 days on our modest 64 GB server. However, the results were extremely poor, with a BLEU score of 0.24 for English–Gujarati and 0.01 for the opposite direction, without using the preprocessed parallel corpus. Moreover, after adding the preprocessed parallel corpus, the BLEU score dropped significantly. This motivated us to use an online document translator, in our case the Google translation API, for back-translating sentence pairs from the released monolingual dataset. The back-translated data was later combined with our preprocessed parallel corpus for our final model. Additionally, we also tried a simple unidirectional RNN model at the character level; however, this also failed to improve performance. We have compiled all the results in Table 4.

5 Primary System Results

Our primary submission for English–Gujarati, using a bidirectional RNN model with BPE on the English side (see Section 4.1) and word format on the Gujarati side, gave the best result. On the other hand, the Gujarati–English primary submission, based on a uni-directional RNN model with both English and Gujarati in word format, gave the best result. Before submission, we performed punctuation normalization, unicode normalization and detokenization for each run. Table 5 shows the published results of our primary submissions on the WMT 2019 test set. Table 6 shows our own experimental results on the development set.

6 Conclusion and Future Work

In this paper, we applied NMT to one of the most challenging language pairs, English–Gujarati, for which the availability of parallel corpora is really scarce