Yandex School of Data Analysis approach to English-Turkish translation at WMT16 News Translation Task

Anton Dvorkovich (1,2), Sergey Gubanov (2), and Irina Galinskaya (2)
{dvorkanton, esgv, galinskaya}@yandex-team.ru
(1) Yandex School of Data Analysis, 11/2 Timura Frunze St., Moscow 119021, Russia
(2) Yandex, 16 Leo Tolstoy St., Moscow 119021, Russia

Proceedings of the First Conference on Machine Translation, Volume 2: Shared Task Papers, pages 281-288, Berlin, Germany, August 11-12, 2016. (c) 2016 Association for Computational Linguistics.

Abstract

We describe the English-Turkish and Turkish-English translation systems submitted by the Yandex School of Data Analysis team to the WMT16 news translation task. We successfully applied hand-crafted morphological (de-)segmentation of Turkish, syntax-based pre-ordering of English in English-Turkish, and post-ordering of English in Turkish-English. We perform desegmentation using SMT and propose a simple yet efficient modification of post-ordering. We also show that Turkish morphology and word order can be handled in a fully automatic manner with only a small loss of BLEU.

1 Introduction

Yandex School of Data Analysis participated in the WMT16 shared task "Machine Translation of News" in the Turkish-English language pair.

Machine translation between English and Turkish is a challenging task due to the strong differences between the languages. In particular, Turkish has rich agglutinative morphology, and the word order differs between the languages (SOV in Turkish, SVO in English).

To deal with these dissimilarities, we preprocess both the source and target parts of the parallel corpus before training: we perform morphological segmentation of Turkish and reordering of English into Turkish word order, aiming to achieve a monotone one-to-one correspondence between tokens to aid SMT.

Since we changed the target side of the parallel corpus, at runtime we had to do post-processing: desegmentation of Turkish for EN-TR and post-ordering of English words for TR-EN. We employ additional SMT decoders to solve both tasks, which results in two-stage translation.

For morphological segmentation and English-to-Turkish reordering we tried both rule-based/supervised and fully unsupervised approaches.

2 Data & common system components

In our two systems (Turkish-English and English-Turkish) we used several common components, described below. The specific application of these tools differs between the Turkish-English and English-Turkish systems, so we discuss it separately in Sections 3 and 4.

2.1 Phrase-based translator

We used an in-house implementation of phrase-based MT (Koehn et al., 2003) with the Berkeley Aligner (Liang et al., 2006) and MERT tuning (Och, 2003).

2.2 English syntactic parser

We used an in-house transition-based English dependency parser similar to (Zhang and Nivre, 2011).

2.3 English-to-Turkish reorderers

We used two different reorderers that put English words into Turkish order. Both reorderers need an English dependency parse tree as input.

The rule-based reorderer modifies parse trees using rules similar to Tregex (Levy and Andrew, 2006), adapted to dependency trees [1]. We used a set of about 70 hand-crafted rules; an example of a rule is given in Figure 1.

    w1 role 'PMOD'
    and .--> (w2 not role 'CONJ')
    :: move group w1 before node w2;

Figure 1: Sample dependency tree reordering rule

[1] Our dependency tree reordering tool is available at https://github.com/yandex/dep_tregex
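The rule in Figure 1 matches a node w1 with role PMOD, relates it to a node w2 that does not carry the role CONJ, and moves the whole word group headed by w1 so that it precedes w2. As a rough illustration of that kind of transformation only (this is not the dep_tregex implementation; the token representation and function names below are our own), the core operation of moving a dependency-headed group might look like the following Python sketch:

    from typing import List, NamedTuple

    class Token(NamedTuple):
        word: str
        head: int   # index of the head token in the sentence, -1 for the root
        role: str   # dependency relation label, e.g. 'PMOD'

    def subtree_indices(tokens: List[Token], node: int) -> List[int]:
        """Indices of `node` and all of its transitive dependents, in sentence order."""
        keep = {node}
        changed = True
        while changed:
            changed = False
            for i, tok in enumerate(tokens):
                if tok.head in keep and i not in keep:
                    keep.add(i)
                    changed = True
        return sorted(keep)

    def move_group_before(tokens: List[Token], w1: int, w2: int) -> List[int]:
        """Return the new word order (as original indices) with the group headed
        by w1 placed immediately before w2; other tokens keep their order."""
        group = subtree_indices(tokens, w1)
        if w2 in group:                 # degenerate case: target inside the moved group
            return list(range(len(tokens)))
        order: List[int] = []
        for i in range(len(tokens)):
            if i in group:
                continue                # emitted just before w2 instead
            if i == w2:
                order.extend(group)
            order.append(i)
        return order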
The automatic reorderer uses word alignments on a parallel corpus to construct reference reorderings, and then trains a feed-forward neural-network classifier that makes node-swapping decisions (de Gispert et al., 2015).

2.4 Turkish morphological analyzers

We used an in-house finite-state transducer similar to (Oflazer, 1994) for Turkish morphological tagging, and a structured perceptron similar to (Sak et al., 2007) for morphological disambiguation.

As an alternative, we trained our own implementation of an unsupervised morphology model, following (Soricut and Och, 2015), with a single distinctive feature: in each connected component C of the morphological graph, we select the lemma as argmax over w in C of (log f(w) - α·l(w)), where l(w) is word length and f(w) is word frequency [2]. This is a heuristic, justified by the facts that (1) the lemma tends to be shorter than the other surface forms of a word, and (2) log f(w) is proportional to l(w) (Strauss et al., 2007). We also make use of morphology induction for unseen words, as described in the original paper. The automatic method requires no disambiguation and yields no part-of-speech tags or morphological features.

[2] We used α = 0.6 throughout our experiments.
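As a concrete reading of this heuristic, the sketch below (our own illustration, not code from the paper; the frequency table and the example counts are invented, with α = 0.6 as in footnote 2) selects the lemma of one connected component:

    import math

    def select_lemma(component, freq, alpha=0.6):
        """Pick the lemma of a connected component of the morphological graph
        as the argmax over its words of log f(w) - alpha * l(w)."""
        return max(component, key=lambda w: math.log(freq[w]) - alpha * len(w))

    # With made-up counts, the short and frequent surface form wins:
    # select_lemma({"arkadaş", "arkadaşlar", "arkadaşlarına"},
    #              {"arkadaş": 5000, "arkadaşlar": 900, "arkadaşlarına": 120})
    # -> "arkadaş"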
2.5 Turkish morphological segmenter

We used three strategies for segmenting Turkish words into less sparse units.

The "simple" strategy splits a word into a lemma and a chain of affixes. The latter is chosen as the suffix of the surface form starting from the (l+1)-th letter, where l is the lemma's length:

    arkadaşlarına  ->  arkadaş $larına      ("to his friends")

The "rule-based" strategy uses hand-crafted rules similar to (Oflazer and El-Kahlout, 2007), (Yeniterzi and Oflazer, 2010) or (Bisazza and Federico, 2009) to split a word into a lemma and groups of morphological features, some of which may remain attached to the lemma. The rules are designed to achieve a better correspondence between Turkish and English words. This strategy requires the morphological analyzer to output features as well as the lemma:

    arkadaşlarına  ->  arkadaş+a3pl +p3sg +dat      ("to his friends")

The "aggressive rule-based" strategy, in addition, forcefully splits all features attached to the lemma into a separate group:

    arkadaşlarına  ->  arkadaş +a3pl +p3sg +dat      ("to his friends")
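For illustration, a minimal sketch of the "simple" strategy is given below; the lemma is assumed to come from a morphological analyzer, and the naive join-back function is only illustrative (the submitted EN-TR system performs desegmentation with an SMT decoder instead):

    def simple_segment(surface: str, lemma: str) -> list:
        """'Simple' strategy: the lemma plus one '$'-marked auxiliary token holding
        the rest of the surface form, i.e. the suffix from the (l+1)-th letter."""
        tail = surface[len(lemma):]
        return [lemma] if not tail else [lemma, "$" + tail]

    def naive_desegment(tokens: list) -> list:
        """Illustrative inverse only: glue each '$'-token back onto the previous word."""
        words = []
        for tok in tokens:
            if tok.startswith("$") and words:
                words[-1] += tok[1:]
            else:
                words.append(tok)
        return words

    # simple_segment("arkadaşlarına", "arkadaş")  ->  ["arkadaş", "$larına"]
    # naive_desegment(["arkadaş", "$larına"])     ->  ["arkadaşlarına"]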
2.6 NMT reranker

Finally, we used a sequence-to-sequence neural network with attention (Bahdanau et al., 2014) as a feature for 100-best reranking. We used hidden layer and embedding sizes of 100, and vocabulary sizes of 40,000 (the Turkish side was morphologically segmented).

2.7 Data

For training the translation model, the language models, and the NMT reranker, we used only the provided constrained data (the SETIMES 2 parallel Turkish-English corpus, and the monolingual Turkish and English Common Crawl corpora).

Throughout our experiments, we used BLEU (Papineni et al., 2002) on the provided devset (news-dev2016) to estimate the performance of our systems, tuning MERT on a random sample of 1000 sentences from the SETIMES corpus (these sentences, which we refer to as "the SETIMES subsample", were excluded from the training data). For the final submissions, we tuned MERT directly on news-dev2016.

Due to this setup, we provide BLEU scores on news-dev2016 for our intermediate experiments and on news-test2016 for our final systems.

3 Turkish-English system

3.1 Baseline

For a baseline, we trained a standard phrase-based system: Berkeley Aligner (IBM Model 1 and HMM, both for 5 iterations); a phrase table with up to 5 tokens per phrase, 40-best translation options per source phrase, and Good-Turing smoothing; a 5-gram lowercased LM with stupid backoff and pruning of singleton n-grams due to memory constraints; MERT on the SETIMES subsample; and a simple reordering model, penalized only by movement distance, with the distortion limit set to 16.

We lowercased both the training and development corpora, taking into account Turkish specifics: I -> ı, İ -> i.

The baseline system achieves 10.84 uncased BLEU on news-dev2016 (here and below, we ignore case in BLEU computation).

#   System description                                         BLEU (uncased), dev [3]   BLEU (uncased), test [3]
1   Baseline, phrase-based                                      11.68                     11.50
2   (1) + automatic morph., simple seg.                         12.16                     -
3   (1) + FST/perceptron morph., simple seg.                    11.75                     -
4   (1) + FST/perceptron morph., rule-based seg.                12.93                     -
5   (1) + FST/perceptron morph., aggressive rule-based seg.     14.06                     -
6   (5) + "reordered" post-ordering, rule-based reorderer       14.24                     -
7   (5) + "translated" post-ordering, rule-based reorderer      15.13                     -
8   (2) + "translated" post-ordering, automatic reorderer       13.43                     13.39
9   (7) + NMT reranking in first stage                          15.49                     15.12

Table 1: Our TR-EN setups on news-dev2016 and news-test2016 (the submitted system is run #9)

[3] We tune on the SETIMES subsample for the "dev" column and on news-dev2016 for the "test" column, so the same line lists results for two sets of MERT coefficients.

3.2 Morphological segmentation

In the Turkish-to-English translator we directly applied the Turkish morphological segmenters (see Section 2.5) as an initial step in the pipeline (Oflazer and El-Kahlout, 2007; Bisazza and Federico, 2009).

The effect of the different morphological tagging and segmentation methods is shown in Table 1. The FST/perceptron analyzer with aggressive rule-based segmentation (run #5) turned out to be the most successful method, bringing +2.60 BLEU.

Our segmenters split Turkish words into lemmas and auxiliary tokens like $ini or +a3sg. To account for the increased number of tokens on the Turkish side, we increased the maximum length of a target phrase from 5 to 10 (while still allowing only up to 5 non-auxiliary tokens in a phrase). To further decrease sparsity we also removed all diacritics from the intermediate segmented Turkish; the possible ambiguity in translations caused by this is handled by the English LM.

For rule-based segmentation we note that it is beneficial to aggressively separate away the lemma and the morphological features that would normally be attached to it (that is, if we acted according to the rules). We think the reason for this is the presence of errors and non-optimal decisions in our segmentation rules, but we still consider the extra split helpful:

• If we do the extra split, a wordform is segmented into a lemma and several auxiliary tokens, so if we have seen just the lemma, we might still translate the unseen wordform correctly.

• Excessive segmentation does not really hurt a phrase-based system, as shown by (Chang et al., 2008).

3.3 Post-ordering

It is not possible to directly apply the English-to-Turkish reorderer as a preprocessing step in this translation direction, and we also could not construct a Turkish-to-English reorderer (due to the absence of a Turkish parser).

Instead, we reordered the target side of the parallel corpus during the training phase using the rule-based reorderer described in Section 2.3, and employed a second-stage translator to restore English word order at runtime, following (Sudoh et al., 2011).

As shown in Figure 2, the first, "monotonous translation" stage is trained to translate from Turkish into English that was reordered into Turkish order [4], and the second, "reordering" stage is trained to translate from reordered English into normal English, relying on the LM and the baseline reordering inside the phrase-based decoder.

[Figure 2: Two-stage post-ordering. Turkish -> (MT stage 1) -> English in Turkish order -> (MT stage 2) -> English.]

[4] This does not mean we completely disable the baseline reordering mechanism in the decoder at this stage; that would have made sense only if (a) our English-to-Turkish reorderer were perfect and (b) the two languages could be perfectly aligned using just word reordering. Obviously, neither of those is the case.

Figures 3 and 4 illustrate the training of the two-stage post-ordering system. We explore two options for training the second, "reordering" stage: as its source side, we can either use (a) the reordered English sentences, or (b) the Turkish sentences translated into reordered English with the first-stage translator.

[Figure 3: Training the "monotonous translation" stage of the post-ordering system. The English side of the corpus is reordered into Turkish order, and stage 1 is trained on Turkish -> English (in Turkish order).]

[Figure 4: Two options for training the "reordering" stage of the post-ordering system. The source side, English in Turkish order, is obtained either by (a) reordering the English side or (b) translating the Turkish side with stage 1; the target side is normal English.]

The two decoders have two sets of MERT coefficients. We tune them jointly and iteratively: first, we tune the first-stage decoder (with the second-stage coefficients fixed), optimizing BLEU of the whole-system output; then we tune the second-stage decoder (with the first-stage coefficients fixed), again optimizing the whole-system BLEU; and so on.

As shown in Table 1, the best results are achieved using "translated Turkish" for training the second-stage translator, yielding an additional +1.60 BLEU.
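The alternating optimization just described can be pictured as a simple loop. The sketch below is schematic: the tuning and evaluation callables, the round limit, and the stopping threshold are our own assumptions rather than details given in the paper.

    from typing import Callable, Dict, Tuple

    Weights = Dict[str, float]

    def tune_two_stage(
        tune_stage1: Callable[[Weights, Weights], Weights],   # MERT for decoder 1, decoder 2 fixed
        tune_stage2: Callable[[Weights, Weights], Weights],   # MERT for decoder 2, decoder 1 fixed
        whole_system_bleu: Callable[[Weights, Weights], float],
        w1: Weights,
        w2: Weights,
        max_rounds: int = 5,
        min_gain: float = 0.05,
    ) -> Tuple[Weights, Weights]:
        """Alternate MERT between the two decoders, always scoring the output
        of the full two-stage pipeline on the tuning set."""
        best = whole_system_bleu(w1, w2)
        for _ in range(max_rounds):
            w1 = tune_stage1(w1, w2)        # second-stage coefficients held fixed
            w2 = tune_stage2(w1, w2)        # first-stage coefficients held fixed
            score = whole_system_bleu(w1, w2)
            if score - best < min_gain:     # stop once whole-system BLEU plateaus
                return w1, w2
            best = score
        return w1, w2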
3.4 NMT reranking

Finally, we enhanced the first-stage translator with a 100-best reranking that uses the decoder features and the neural sequence-to-sequence network described in Section 2.6. To train the network, we used the same corpus that was used to train the first-stage PBMT translator (incorporating Turkish segmentation and English reordering).

NMT reranking yields an additional +0.47 BLEU.

3.5 Final system

The complete pipeline of our submitted system is shown in Figure 5.

We selected the setup that performed best during our experiments (#9 in Table 1) and re-tuned it on the development set; for contrastive runs we also re-tuned the baseline and the "fully automatic" systems (#1 and #8, respectively). See Table 1 for the results.

Our best setup reaches 15.17 BLEU, which is a +3.17 BLEU improvement over the baseline. The system without the hand-crafted rules achieves a lower improvement of +1.89 BLEU, which is a nice gain nevertheless. Comparing runs #2 and #3, we see that the decrease in BLEU is not due to the quality of the morphological analysis; comparing runs #3 and #5, we see that the difference in quality is purely due to the segmentation scheme.

4 English-Turkish system

4.1 Baseline

As a baseline, we trained the same phrase-based system as in Section 3.1 (except that we did not prune singleton n-grams in the Turkish language model).

The baseline system achieves 8.51 uncased BLEU on news-dev2016.

4.2 Pre-ordering

We directly apply the English-to-Turkish reorderers described in Section 2.3 as a pre-processing step in the phrase-based MT pipeline, as in e.g. (Xia and McCord, 2004; Collins et al., 2005). Results are shown in Table 2.

The rule-based reorderer earns +1.65 BLEU against the baseline (run #2), so we selected it as a