JU-Saarland Submission in the WMT 2019 English–Gujarati Translation Shared Task

Riktim Mondal¹*, Shankha Raj Nayek¹*, Aditya Chowdhury¹*, Santanu Pal², Sudip Kumar Naskar¹, Josef van Genabith²
¹Jadavpur University, Kolkata, India
²Saarland University, Germany
{riktimrules,shankharaj29,adityachowdhury21}@gmail.com
{santanu.pal,josef.vangenabith}@uni-saarland.de
sudip.naskar@cse.jdvu.ac.in
*These three authors have contributed equally.

Abstract

In this paper we describe our joint submission (JU-Saarland) from Jadavpur University and Saarland University to the WMT 2019 news translation shared task for the English–Gujarati language pair within the translation task sub-track. Our baseline and primary submissions are built using a recurrent neural network (RNN) based neural machine translation (NMT) system with an attention mechanism, followed by fine-tuning on in-domain data. Given that the two languages belong to different language families and there is not enough parallel data for this language pair, building a high-quality NMT system for it is a difficult task. We produced synthetic data through back-translation from available monolingual data. We report the automatic evaluation scores of our English–Gujarati and Gujarati–English NMT systems trained at word, byte-pair and character encoding levels, where the word-level RNN is considered the baseline and is used for comparison. Our English–Gujarati system ranked second in the shared task.

1 Introduction

Neural machine translation (NMT) is an approach to machine translation (MT) that uses artificial neural networks to directly model the conditional probability p(y|x) of translating a source sentence (x_1, x_2, ..., x_n) into a target sentence (y_1, y_2, ..., y_m). NMT has consistently performed better than phrase-based statistical MT (PB-SMT) approaches and has provided state-of-the-art results in the last few years. However, one of the major constraints of supervised NMT is that it is not suitable for low-resource language pairs. Thus, to use supervised NMT, low-resource pairs need to resort to other techniques to increase the size of the parallel training dataset. In the WMT 2019 news translation shared task, one such resource-scarce language pair is English–Gujarati. Due to the insufficient volume of parallel corpora available to train an NMT system for this language pair, the creation of more actual/synthetic parallel data for low-resource languages such as Gujarati is an important issue.
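For concreteness, the conditional probability p(y|x) mentioned above is usually factorized token by token; the following is a standard formulation given purely as background, and is not tied to the specific architecture used in this paper:

p(y \mid x) = \prod_{t=1}^{m} p(y_t \mid y_{<t}, x)

where y_{<t} = (y_1, ..., y_{t-1}) denotes the previously generated target tokens, and the model parameters are trained to maximize \sum_{t=1}^{m} \log p(y_t \mid y_{<t}, x) over the parallel training data.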
In this paper, we describe the joint participation of Jadavpur University and Saarland University in the WMT 2019 news translation task for English–Gujarati and Gujarati–English. The released training data is completely different in domain from the development set, and its size is nowhere near the sizable amount of training data typically required for the success of NMT systems. We use additional synthetic data produced through back-translation from the monolingual corpus. This provides significant improvements in translation performance for both our English–Gujarati and Gujarati–English NMT systems. Our English–Gujarati system was ranked second in terms of BLEU (Papineni et al., 2002) and TER (Snover et al., 2006) in the shared task.

2 Related Works

Dungarwal et al. (Dungarwal et al., 2014) developed a statistical method for machine translation, using a phrase-based method for Hindi–English and a factored method for English–Hindi SMT. They showed improvements over existing SMT systems using pre-processing and post-processing components that generated morphological inflections correctly. Imankulova et al. (Imankulova et al., 2017) showed how back-translation and filtering from monolingual data can be used to build an effective translation system for a low-resource language pair like Japanese–Russian. Sennrich et al. (Sennrich et al., 2016a) showed how back-translation of monolingual data can improve an NMT system. Ramesh et al. (Ramesh and Sankaranarayanan, 2018) demonstrated how an existing model such as a bidirectional recurrent neural network can be used to generate parallel sentences for non-English, low-resource language pairs like English–Tamil and English–Hindi, improving both SMT and NMT systems. Choudhary et al. (Choudhary et al., 2018) showed how to build an NMT system for a low-resource language pair like English–Tamil using techniques such as word embeddings and Byte-Pair Encoding (Sennrich et al., 2016b) to handle out-of-vocabulary words.

3 Data Preparation

For our experiments we used both the parallel and the monolingual corpora released by the WMT 2019 organizers. We back-translate the monolingual corpus and use it as an additional synthetic parallel corpus to train our NMT system. The detailed statistics of the corpora are given in Table 1.

Dataset                      Pairs
Parallel Corpora           192,367
Cleaned Parallel Corpora    64,346
Back-translated Data       219,654
Development Data             1,998
Gujarati Test Data           1,016
English Test Data              998

Table 1: Data statistics of the WMT 2019 English–Gujarati translation shared task.
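The back-translated data in Table 1 is simply monolingual text paired with its automatic translations. As an illustration of how such a synthetic corpus can be assembled and combined with the genuine parallel data, the following minimal Python sketch assumes one sentence per line and hypothetical file names (mono.en, mono.bt.gu, train.en, train.gu); the translation step itself (unsupervised SMT or an online translation API, described later in this section and in Section 4.2) is treated as already done, with its output stored in mono.bt.gu.

# Build a synthetic parallel corpus from monolingual sentences and their
# machine translations, then append it to the cleaned parallel training data.
# All file names are hypothetical placeholders, not those used in the paper.

def read_lines(path):
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f]

mono_src = read_lines("mono.en")      # monolingual English sentences
mono_mt  = read_lines("mono.bt.gu")   # their machine translations into Gujarati

assert len(mono_src) == len(mono_mt), "monolingual and translated files must align line by line"

# Keep only non-empty, aligned pairs so that no new noise is introduced.
synthetic = [(s, t) for s, t in zip(mono_src, mono_mt) if s and t]

with open("train.plus-bt.en", "w", encoding="utf-8") as f_en, \
     open("train.plus-bt.gu", "w", encoding="utf-8") as f_gu:
    # First copy the genuine (cleaned) parallel corpus ...
    for s, t in zip(read_lines("train.en"), read_lines("train.gu")):
        f_en.write(s + "\n")
        f_gu.write(t + "\n")
    # ... then append the synthetic pairs.
    for s, t in synthetic:
        f_en.write(s + "\n")
        f_gu.write(t + "\n")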
We performed our experiments on two datasets: one using the parallel corpus provided by WMT 2019 for the Gujarati–English news translation shared task, and the other using the parallel corpus combined with back-translated sentences from the provided monolingual corpus (only the News crawl corpus was used for back-translation) for the same language pair. Since the released parallel corpus was very noisy and contained redundant sentences, we cleaned it; the procedure is described in Section 3.1. In the next step we shuffle the whole corpus, as this reduces variance and makes the model less prone to overfitting. We then split the dataset into three parts: training, validation and test sets. Shuffling is also important for the splitting step, since the test and validation sets must come from the same distribution and must be chosen randomly from the available data. The test set was also shuffled, as this dataset was used for our internal assessment. After cleaning, we randomly selected 64,346 sentence pairs for training, 1,500 sentence pairs for validation and 1,500 sentence pairs as test data. It is to be noted that our validation and test corpora were taken from the released parallel data in order to set up a baseline model. Later, when the WMT19 organizers released the development set, we continued training our models by considering the WMT19 development set as our test set and using a new development set of 3,000 sentences obtained by combining the 1,500 validation and 1,500 test sentences (both taken from the parallel corpus as stated above). While training our final model, the released development set was used. After cleaning, it was obvious that the amount of training data was not enough to train a neural system for such a low-resource language pair. Therefore, a large volume of parallel data needs to be prepared, which can be produced either by manual translation by professional translators or by scraping parallel data from the internet. However, these processes are costly, tedious and sometimes inefficient (in the case of scraping from the internet). As the released data was insufficient, we use back-translation to generate more training data. For back-translation we applied two methods: first, unsupervised statistical machine translation as described in (Artetxe et al., 2018), and second, the Doc translation API¹ (the API uses Google Translate as of April 2019). We explain the extraction of sentences and the corresponding results obtained with these methods in Section 4.2. The synthetic dataset which we have generated can be found here².

¹https://www.onlinedoctranslator.com/en/
²https://github.com/riktimmondal/Synthetic-Data-WMT19-for-En-Gu-Language-pair

3.1 Data Preprocessing

To train an efficient machine translation system, the available raw parallel corpus needs to be cleaned so that the system produces consistent and reliable translations. The released version of the raw parallel corpus contained redundant pairs which need to be removed to obtain better results, as demonstrated in previous works (Johnson et al., 2017). The redundant pairs are of the following types:

• The source is the same for different targets.
• The source is different for the same target.
• Repeated identical sentence pairs.

The redundancy in the translation pairs makes the model prone to overfitting and hence prevents it from recognizing new features. Thus, one of the sentence pairs is kept while the other redundant pairs are removed. Some sentence pairs contained a mixture of both languages and were also identified as redundant. These pairs strictly need to be eliminated, as otherwise the vocabularies of the individual languages contain alphanumeric characters of the other language, which results in inconsistent encoding and decoding during the encoder-decoder application steps for the considered language pair. We tokenize the English side using the Moses (Koehn et al., 2007) tokenizer and, for Gujarati, we use the Indic NLP library tokenization tool³. Punctuation normalization was also done.

³http://anoopkunchukuttan.github.io/indic_nlp_library/
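The redundancy filtering just described can be approximated in a few lines of Python. The sketch below is illustrative only: file names are hypothetical, the first occurrence of any repeated source or target is the one that is kept, and the Gujarati Unicode block (U+0A80–U+0AFF) is used as a rough test for script mixing; the actual cleaning pipeline used for the shared task may differ in its details.

import re

# Gujarati script lives in the Unicode block U+0A80..U+0AFF.
GUJARATI = re.compile(r"[\u0A80-\u0AFF]")

def is_mixed(en_line, gu_line):
    # Rough heuristic: the English side should contain no Gujarati script,
    # and the Gujarati side should contain at least some Gujarati script.
    return bool(GUJARATI.search(en_line)) or not GUJARATI.search(gu_line)

def clean_corpus(en_path, gu_path, out_en, out_gu):
    seen_pairs, seen_src, seen_tgt = set(), set(), set()
    kept = 0
    with open(en_path, encoding="utf-8") as fe, open(gu_path, encoding="utf-8") as fg, \
         open(out_en, "w", encoding="utf-8") as oe, open(out_gu, "w", encoding="utf-8") as og:
        for en, gu in zip(fe, fg):
            en, gu = en.strip(), gu.strip()
            if not en or not gu or is_mixed(en, gu):
                continue                           # drop empty or script-mixed pairs
            if (en, gu) in seen_pairs:             # repeated identical pair
                continue
            if en in seen_src or gu in seen_tgt:   # same source / same target already seen
                continue
            seen_pairs.add((en, gu)); seen_src.add(en); seen_tgt.add(gu)
            oe.write(en + "\n"); og.write(gu + "\n")
            kept += 1
    return kept

# Example call (hypothetical file names):
# kept = clean_corpus("raw.en", "raw.gu", "clean.en", "clean.gu")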
3.2 Data Postprocessing

Postprocessing, such as detokenization (Klein et al., 2017) and punctuation normalization⁴ (Koehn et al., 2007), was performed on our translated data (on the test set) to produce the final translated output.

⁴normalize-punctuation.perl

4 Experiment Setup

We explain our experimental setups in the following two subsections. The first contains the setup used for our final submission and the second describes all the other supporting experimental setups. We use the OpenNMT toolkit (Klein et al., 2017) for our experiments. We performed several experiments where the parallel corpus is fed to the model in space-separated character format, space-separated word format, and space-separated Byte-Pair Encoding (BPE) format (Sennrich et al., 2016b). For our final (i.e., primary) submission for the English–Gujarati task, the source input words were converted to BPE whereas the Gujarati words were kept as they are. For our Gujarati–English submission, both the source and the target were in simple word-level format.

4.1 Primary System Description

Our primary NMT systems are based on an attention-based uni-directional RNN (Cho et al., 2014) for Gujarati–English and a bi-directional RNN (Cheng et al., 2016) for English–Gujarati.

hyper-parameter             Value
Model-type                  text
Model-dtype                 fp32
Attention-layer             2
Attention-Head/layer        8
Hidden-layers               500
Batch-Size                  256
Training-steps              160,000
Source vocab-size           50,000
Target vocab-size           50,000
learning-rate               warm-up+decay*
global-attention function   softmax
tokenization-strategy       wordpiece
RNN-type                    LSTM

Table 2: Hyper-parameter configurations for Gujarati–English translation using a unidirectional RNN (Cho et al., 2014); *the learning rate was initially set to 1.0.

Table 2 shows the hyper-parameter configurations for our Gujarati–English translation system. We initially trained our model on the cleaned parallel corpus provided by WMT 2019 for up to 100K training steps. Thereafter, we fine-tune this generic model on the domain-specific corpus (containing 219K sentences back-translated using the Doc Translator API), changing the learning rate to 0.5, with decay starting from 130K training steps with a decay factor of 0.5, and keeping the other hyper-parameters the same as in Table 2.

hyper-parameter             Value
Model-type                  text
Model-dtype                 fp32
Encoder-type                BRNN
Attention-layer             2
Attention-Head/layer        8
Hidden-layers               512
Batch-Size                  256
Training-steps              135,000
Source vocab-size           26,859
Target vocab-size           50,000
learning-rate               warm-up+decay
global-attention function   softmax
tokenization-strategy       Byte-pair Encoding
RNN-type                    LSTM

Table 3: Hyper-parameter configurations for English–Gujarati translation using a bi-directional RNN (Cheng et al., 2016).

To build our English–Gujarati translation system, we initially trained a generic model like our Gujarati–English translation system. However, in this case we use different hyper-parameter configurations, as listed in Table 3. Additionally, here we use byte-pair encoding on the English side with 32K merge operations. We do not perform the BPE operation on the Gujarati corpus; we keep the original word format for Gujarati. Our generic model was trained for up to 100K training steps and then fine-tuned on the domain-specific parallel corpus with the English side in BPE and the Gujarati side in word-level format. During fine-tuning, we reduce the learning rate from 1.0 to 0.25, with decay starting from 120K training steps and a decay factor of 0.5. The other hyper-parameter configurations remain unchanged.
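Since the English side of the primary English–Gujarati system is segmented with BPE (32K merge operations) while the Gujarati side stays at word level, the following Python sketch shows how such one-sided segmentation could be prepared with the subword-nmt package (Sennrich et al., 2016b). It is a sketch under assumptions: the file names are hypothetical, the subword-nmt Python interface (learn_bpe / apply_bpe) is assumed to be available, and the paper does not state which BPE implementation was actually used.

from subword_nmt.learn_bpe import learn_bpe
from subword_nmt.apply_bpe import BPE

# 1. Learn 32K BPE merge operations on the English training side only.
with open("train.en", encoding="utf-8") as infile, \
     open("bpe.codes.en", "w", encoding="utf-8") as outfile:
    learn_bpe(infile, outfile, num_symbols=32000)

# 2. Apply the learned codes to the English side of every split;
#    the Gujarati side is left untouched (word-level format).
with open("bpe.codes.en", encoding="utf-8") as codes:
    bpe = BPE(codes)

for split in ("train", "dev", "test"):
    with open(split + ".en", encoding="utf-8") as src, \
         open(split + ".bpe.en", "w", encoding="utf-8") as out:
        for line in src:
            out.write(bpe.segment(line.strip()) + "\n")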
The hyper-parameters used for the English–Gujarati task in our primary system submission were also tested for the reverse direction; however, this did not perform as well as the primary system, and the final system was modified accordingly.

4.2 Other Supporting Experiments

In this section we describe all the supporting experiments that we performed for this shared task, ranging from statistical MT to NMT in both supervised and unsupervised settings. All the results and experiments discussed below were evaluated on the released development set (treating it as the test set). These models were not tested on the released test set, as they produced poor BLEU scores on the development set.

We used a uni-directional RNN with LSTM units trained on the 64,346 pre-processed sentence pairs (cf. Section 3) with 120K training steps and a learning rate of 1.0. For English–Gujarati, where the input was space-separated words on both sides, we achieved the highest BLEU score of 4.15 after fine-tuning on 10K sentences selected from the cleaned parallel corpus, each containing more than 8 tokens (words). The BLEU score dropped to 3.56 when applying BPE on both sides. For the other direction (Gujarati–English), we obtained the highest BLEU scores of 5.13 and 5.09 at the word level and BPE level, respectively. We also tried a transformer-based NMT model (Vaswani et al., 2017), which however gave extremely poor results under similar experimental settings; the highest BLEU we achieved was 0.74 for Gujarati–English and 0.96 for English–Gujarati. The transformer model was trained for up to 100K training steps with a batch size of 64 on a single GPU, and the positional encoding layer size was set to 2.

Since the training data size was not sufficient, we used back-translation to generate additional synthetic sentence pairs from the monolingual corpus released in WMT 2019. We initially used monoses (Artetxe et al., 2018), which is based on unsupervised statistical phrase-based machine translation, to translate the monolingual sentences from English to Gujarati. We used 2M English sentences to train the monoses system. The training process took around 6 days on our modest 64 GB server. However, the results were extremely poor, with a BLEU score of 0.24 for English–Gujarati and 0.01 for the opposite direction, without using the preprocessed parallel corpus. Moreover, after adding the preprocessed parallel corpus, the BLEU score dropped significantly. This motivated us to use an online document translator, in our case the Google translation API, for back-translating sentence pairs from the released monolingual dataset. The back-translated data was later combined with our preprocessed parallel corpus for our final model. Additionally, we also tried a simple unidirectional RNN model at the character level; however, this also failed to improve performance. We have compiled all the results in Table 4.

5 Primary System Results

Our primary submission for English–Gujarati, using a bidirectional RNN model with BPE on the English side (see Section 4.1) and word format on the Gujarati side, gave the best result. On the other hand, the Gujarati–English primary submission, based on a uni-directional RNN model with both English and Gujarati in word format, gave the best result. Before submission, we performed punctuation normalization, unicode normalization and detokenization for each run. Table 5 shows the published results of our primary submissions on the WMT 2019 test set. Table 6 shows our own experimental results on the development set.

6 Conclusion and Future Work

In this paper, we applied NMT to one of the most challenging language pairs, English–Gujarati, for which the availability of parallel corpora is really scarce