Yandex School of Data Analysis approach to English-Turkish translation at WMT16 News Translation Task

Anton Dvorkovich (1,2), Sergey Gubanov (2), and Irina Galinskaya (2)
{dvorkanton, esgv, galinskaya}@yandex-team.ru
(1) Yandex School of Data Analysis, 11/2 Timura Frunze St., Moscow 119021, Russia
(2) Yandex, 16 Leo Tolstoy St., Moscow 119021, Russia

Proceedings of the First Conference on Machine Translation, Volume 2: Shared Task Papers, pages 281-288, Berlin, Germany, August 11-12, 2016. (c) 2016 Association for Computational Linguistics.

Abstract

We describe the English-Turkish and Turkish-English translation systems submitted by the Yandex School of Data Analysis team to the WMT16 news translation task. We successfully applied hand-crafted morphological (de-)segmentation of Turkish, syntax-based pre-ordering of English in English-Turkish, and post-ordering of English in Turkish-English. We perform desegmentation using SMT and propose a simple yet efficient modification of post-ordering. We also show that Turkish morphology and word order can be handled in a fully automatic manner with only a small loss of BLEU.

1 Introduction

Yandex School of Data Analysis participated in the WMT16 shared task "Machine Translation of News" in the Turkish-English language pair.

Machine translation between English and Turkish is a challenging task due to the strong differences between the languages. In particular, Turkish has rich agglutinative morphology, and the word order differs between the languages (SOV in Turkish, SVO in English).

To deal with these dissimilarities, we preprocess both the source and target parts of the parallel corpus before training: we perform morphological segmentation of Turkish and reordering of English into Turkish word order, aiming to achieve a monotone one-to-one correspondence between tokens to aid SMT.

Since we changed the target side of the parallel corpus, at runtime we had to do post-processing: desegmentation of Turkish for EN-TR and post-ordering of English words for TR-EN. We employ additional SMT decoders to solve both tasks, which results in two-stage translation.

For morphological segmentation and English-to-Turkish reordering we tried both rule-based/supervised and fully unsupervised approaches.

2 Data & common system components

In our two systems (Turkish-English and English-Turkish) we used several common components, described below. The specific application of these tools differs between the Turkish-English and English-Turkish systems, so we discuss it separately in Sections 3 and 4.

2.1 Phrase-based translator

We used an in-house implementation of phrase-based MT (Koehn et al., 2003) with the Berkeley Aligner (Liang et al., 2006) and MERT tuning (Och, 2003).

2.2 English syntactic parser

We used an in-house transition-based English dependency parser similar to (Zhang and Nivre, 2011).

2.3 English-to-Turkish reorderers

We used two different reorderers that put English words into Turkish order. Both reorderers need an English dependency parse tree as input.

The rule-based reorderer modifies parse trees using rules similar to Tregex (Levy and Andrew, 2006), adapted to dependency trees [1]. We used a set of about 70 hand-crafted rules; an example of a rule is given in Figure 1.

    w1 role 'PMOD'
    and .--> (w2 not role 'CONJ')
    :: move group w1 before node w2;

Figure 1: Sample dependency tree reordering rule

[1] Our dependency tree reordering tool is available at https://github.com/yandex/dep_tregex
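The rule in Figure 1 matches a node w1 with role PMOD, relates it to a node w2 that does not carry the role CONJ, and moves the whole word group headed by w1 so that it precedes w2. As a rough illustration of that kind of transformation only (this is not the dep_tregex implementation; the token representation and function names below are our own), the core operation of moving a dependency-headed group might look like the following Python sketch:

    from typing import List, NamedTuple

    class Token(NamedTuple):
        word: str
        head: int   # index of the head token in the sentence, -1 for the root
        role: str   # dependency relation label, e.g. 'PMOD'

    def subtree_indices(tokens: List[Token], node: int) -> List[int]:
        """Indices of `node` and all of its transitive dependents, in sentence order."""
        keep = {node}
        changed = True
        while changed:
            changed = False
            for i, tok in enumerate(tokens):
                if tok.head in keep and i not in keep:
                    keep.add(i)
                    changed = True
        return sorted(keep)

    def move_group_before(tokens: List[Token], w1: int, w2: int) -> List[int]:
        """Return the new word order (as original indices) with the group headed
        by w1 placed immediately before w2; other tokens keep their order."""
        group = subtree_indices(tokens, w1)
        if w2 in group:                 # degenerate case: target inside the moved group
            return list(range(len(tokens)))
        order: List[int] = []
        for i in range(len(tokens)):
            if i in group:
                continue                # emitted just before w2 instead
            if i == w2:
                order.extend(group)
            order.append(i)
        return order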
The automatic reorderer uses word alignments on a parallel corpus to construct reference reorderings, and then trains a feed-forward neural-network classifier that makes node-swapping decisions (de Gispert et al., 2015).

2.4 Turkish morphological analyzers

We used an in-house finite-state transducer similar to (Oflazer, 1994) for Turkish morphological tagging, and a structured perceptron similar to (Sak et al., 2007) for morphological disambiguation.

As an alternative, we trained our own implementation of an unsupervised morphology model, following (Soricut and Och, 2015), with a single distinctive feature: in each connected component C of the morphological graph, we select the lemma as argmax over w in C of (log f(w) - α·l(w)), where l(w) is word length and f(w) is word frequency [2]. This is a heuristic, justified by the facts that (1) the lemma tends to be shorter than the other surface forms of a word, and (2) log f(w) is proportional to l(w) (Strauss et al., 2007). We also make use of morphology induction for unseen words, as described in the original paper. The automatic method requires no disambiguation and yields no part-of-speech tags or morphological features.

[2] We used α = 0.6 throughout our experiments.
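As a concrete reading of this heuristic, the sketch below (our own illustration, not code from the paper; the frequency table and the example counts are invented, with α = 0.6 as in footnote 2) selects the lemma of one connected component:

    import math

    def select_lemma(component, freq, alpha=0.6):
        """Pick the lemma of a connected component of the morphological graph
        as the argmax over its words of log f(w) - alpha * l(w)."""
        return max(component, key=lambda w: math.log(freq[w]) - alpha * len(w))

    # With made-up counts, the short and frequent surface form wins:
    # select_lemma({"arkadaş", "arkadaşlar", "arkadaşlarına"},
    #              {"arkadaş": 5000, "arkadaşlar": 900, "arkadaşlarına": 120})
    # -> "arkadaş"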
2.5 Turkish morphological segmenter

We used three strategies for segmenting Turkish words into less sparse units.

The "simple" strategy splits a word into a lemma and a chain of affixes. The latter is chosen as the suffix of the surface form starting from the (l+1)-th letter, where l is the lemma's length:

    arkadaşlarına  ->  arkadaş $larına      ("to his friends")

The "rule-based" strategy uses hand-crafted rules similar to (Oflazer and El-Kahlout, 2007), (Yeniterzi and Oflazer, 2010) or (Bisazza and Federico, 2009) to split a word into a lemma and groups of morphological features, some of which may remain attached to the lemma. The rules are designed to achieve a better correspondence between Turkish and English words. This strategy requires the morphological analyzer to output features as well as the lemma:

    arkadaşlarına  ->  arkadaş+a3pl +p3sg +dat      ("to his friends")

The "aggressive rule-based" strategy, in addition, forcefully splits all features attached to the lemma into a separate group:

    arkadaşlarına  ->  arkadaş +a3pl +p3sg +dat      ("to his friends")
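For illustration, a minimal sketch of the "simple" strategy is given below; the lemma is assumed to come from a morphological analyzer, and the naive join-back function is only illustrative (the submitted EN-TR system performs desegmentation with an SMT decoder instead):

    def simple_segment(surface: str, lemma: str) -> list:
        """'Simple' strategy: the lemma plus one '$'-marked auxiliary token holding
        the rest of the surface form, i.e. the suffix from the (l+1)-th letter."""
        tail = surface[len(lemma):]
        return [lemma] if not tail else [lemma, "$" + tail]

    def naive_desegment(tokens: list) -> list:
        """Illustrative inverse only: glue each '$'-token back onto the previous word."""
        words = []
        for tok in tokens:
            if tok.startswith("$") and words:
                words[-1] += tok[1:]
            else:
                words.append(tok)
        return words

    # simple_segment("arkadaşlarına", "arkadaş")  ->  ["arkadaş", "$larına"]
    # naive_desegment(["arkadaş", "$larına"])     ->  ["arkadaşlarına"]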
2.6 NMT reranker

Finally, we used a sequence-to-sequence neural network with attention (Bahdanau et al., 2014) as a feature for 100-best reranking. We used hidden layer and embedding sizes of 100, and vocabulary sizes of 40,000 (the Turkish side was morphologically segmented).

2.7 Data

For training the translation model, the language models, and the NMT reranker, we used only the provided constrained data (the SETIMES 2 parallel Turkish-English corpus, and the monolingual Turkish and English Common Crawl corpora).

Throughout our experiments, we used BLEU (Papineni et al., 2002) on the provided devset (news-dev2016) to estimate the performance of our systems, tuning MERT on a random sample of 1000 sentences from the SETIMES corpus (these sentences, which we refer to as "the SETIMES subsample", were excluded from the training data). For the final submissions, we tuned MERT directly on news-dev2016.

Due to this setup, we provide BLEU scores on news-dev2016 for our intermediate experiments and on news-test2016 for our final systems.

3 Turkish-English system

3.1 Baseline

For a baseline, we trained a standard phrase-based system: Berkeley Aligner (IBM Model 1 and HMM, both for 5 iterations); a phrase table with up to 5 tokens per phrase, 40-best translation options per source phrase, and Good-Turing smoothing; a 5-gram lowercased LM with stupid backoff and pruning of singleton n-grams due to memory constraints; MERT on the SETIMES subsample; and a simple reordering model, penalized only by movement distance, with the distortion limit set to 16.

We lowercased both the training and development corpora, taking into account Turkish specifics: I -> ı, İ -> i.

The baseline system achieves 10.84 uncased BLEU on news-dev2016 (here and below, we ignore case in BLEU computation).

#   System description                                         BLEU (uncased), dev [3]   BLEU (uncased), test [3]
1   Baseline, phrase-based                                      11.68                     11.50
2   (1) + automatic morph., simple seg.                         12.16                     -
3   (1) + FST/perceptron morph., simple seg.                    11.75                     -
4   (1) + FST/perceptron morph., rule-based seg.                12.93                     -
5   (1) + FST/perceptron morph., aggressive rule-based seg.     14.06                     -
6   (5) + "reordered" post-ordering, rule-based reorderer       14.24                     -
7   (5) + "translated" post-ordering, rule-based reorderer      15.13                     -
8   (2) + "translated" post-ordering, automatic reorderer       13.43                     13.39
9   (7) + NMT reranking in first stage                          15.49                     15.12

Table 1: Our TR-EN setups on news-dev2016 and news-test2016 (the submitted system is run #9)

[3] We tune on the SETIMES subsample for the "dev" column and on news-dev2016 for the "test" column, so the same line lists results for two sets of MERT coefficients.

3.2 Morphological segmentation

In the Turkish-to-English translator we directly applied the Turkish morphological segmenters (see Section 2.5) as an initial step in the pipeline (Oflazer and El-Kahlout, 2007; Bisazza and Federico, 2009).

The effect of the different morphological tagging and segmentation methods is shown in Table 1. The FST/perceptron analyzer with aggressive rule-based segmentation (run #5) turned out to be the most successful method, bringing +2.60 BLEU.

Our segmenters split Turkish words into lemmas and auxiliary tokens like $ini or +a3sg. To account for the increased number of tokens on the Turkish side, we increased the maximum length of a target phrase from 5 to 10 (while still allowing only up to 5 non-auxiliary tokens in a phrase). To further decrease sparsity we also removed all diacritics from the intermediate segmented Turkish; the possible ambiguity in translations caused by this is handled by the English LM.

For rule-based segmentation we note that it is beneficial to aggressively separate away the lemma and the morphological features that would normally be attached to it (that is, if we acted according to the rules). We think the reason for this is the presence of errors and non-optimal decisions in our segmentation rules, but we still consider the extra split helpful:

• If we do the extra split, a wordform is segmented into a lemma and several auxiliary tokens, so if we have seen just the lemma, we might still translate the unseen wordform correctly.

• Excessive segmentation does not really hurt a phrase-based system, as shown by (Chang et al., 2008).

3.3 Post-ordering

It is not possible to directly apply the English-to-Turkish reorderer as a preprocessing step in this translation direction, and we also could not construct a Turkish-to-English reorderer (due to the absence of a Turkish parser).

Instead, we reordered the target side of the parallel corpus during the training phase using the rule-based reorderer described in Section 2.3, and employed a second-stage translator to restore English word order at runtime, following (Sudoh et al., 2011).

As shown in Figure 2, the first, "monotonous translation" stage is trained to translate from Turkish into English that was reordered into Turkish order [4], and the second, "reordering" stage is trained to translate from reordered English into normal English, relying on the LM and the baseline reordering inside the phrase-based decoder.

[Figure 2: Two-stage post-ordering. Turkish -> (MT stage 1) -> English in Turkish order -> (MT stage 2) -> English.]

[4] This does not mean we completely disable the baseline reordering mechanism in the decoder at this stage; that would have made sense only if (a) our English-to-Turkish reorderer were perfect and (b) the two languages could be perfectly aligned using just word reordering. Obviously, neither of those is the case.

Figures 3 and 4 illustrate the training of the two-stage post-ordering system. We explore two options for training the second, "reordering" stage: as its source side, we can either use (a) the reordered English sentences, or (b) the Turkish sentences translated into reordered English with the first-stage translator.

[Figure 3: Training the "monotonous translation" stage of the post-ordering system. The English side of the corpus is reordered into Turkish order, and stage 1 is trained on Turkish -> English (in Turkish order).]

[Figure 4: Two options for training the "reordering" stage of the post-ordering system. The source side, English in Turkish order, is obtained either by (a) reordering the English side or (b) translating the Turkish side with stage 1; the target side is normal English.]

The two decoders have two sets of MERT coefficients. We tune them jointly and iteratively: first, we tune the first-stage decoder (with the second-stage coefficients fixed), optimizing BLEU of the whole-system output; then we tune the second-stage decoder (with the first-stage coefficients fixed), again optimizing the whole-system BLEU; and so on.

As shown in Table 1, the best results are achieved using "translated Turkish" for training the second-stage translator, yielding an additional +1.60 BLEU.
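The alternating optimization just described can be pictured as a simple loop. The sketch below is schematic: the tuning and evaluation callables, the round limit, and the stopping threshold are our own assumptions rather than details given in the paper.

    from typing import Callable, Dict, Tuple

    Weights = Dict[str, float]

    def tune_two_stage(
        tune_stage1: Callable[[Weights, Weights], Weights],   # MERT for decoder 1, decoder 2 fixed
        tune_stage2: Callable[[Weights, Weights], Weights],   # MERT for decoder 2, decoder 1 fixed
        whole_system_bleu: Callable[[Weights, Weights], float],
        w1: Weights,
        w2: Weights,
        max_rounds: int = 5,
        min_gain: float = 0.05,
    ) -> Tuple[Weights, Weights]:
        """Alternate MERT between the two decoders, always scoring the output
        of the full two-stage pipeline on the tuning set."""
        best = whole_system_bleu(w1, w2)
        for _ in range(max_rounds):
            w1 = tune_stage1(w1, w2)        # second-stage coefficients held fixed
            w2 = tune_stage2(w1, w2)        # first-stage coefficients held fixed
            score = whole_system_bleu(w1, w2)
            if score - best < min_gain:     # stop once whole-system BLEU plateaus
                return w1, w2
            best = score
        return w1, w2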
3.4 NMT reranking

Finally, we enhanced the first-stage translator with a 100-best reranking that uses the decoder features and the neural sequence-to-sequence network described in Section 2.6. To train the network, we used the same corpus that was used to train the first-stage PBMT translator (incorporating Turkish segmentation and English reordering).

NMT reranking yields an additional +0.47 BLEU.

3.5 Final system

The complete pipeline of our submitted system is shown in Figure 5.

We selected the setup that performed best during our experiments (#9 in Table 1) and re-tuned it on the development set; for contrastive runs we also re-tuned the baseline and the "fully automatic" systems (#1 and #8, respectively). See Table 1 for the results.

Our best setup reaches 15.17 BLEU, which is a +3.17 BLEU improvement over the baseline. The system without the hand-crafted rules achieves a lower improvement of +1.89 BLEU, which is a nice gain nevertheless. Comparing runs #2 and #3, we see that the decrease in BLEU is not due to the quality of the morphological analysis; comparing runs #3 and #5, we see that the difference in quality is purely due to the segmentation scheme.

4 English-Turkish system

4.1 Baseline

As a baseline, we trained the same phrase-based system as in Section 3.1 (except that we did not prune singleton n-grams in the Turkish language model).

The baseline system achieves 8.51 uncased BLEU on news-dev2016.

4.2 Pre-ordering

We directly apply the English-to-Turkish reorderers described in Section 2.3 as a pre-processing step in the phrase-based MT pipeline, as in e.g. (Xia and McCord, 2004; Collins et al., 2005). Results are shown in Table 2.

The rule-based reorderer earns +1.65 BLEU against the baseline (run #2), so we selected it as a