NUIG-Panlingua-KMI Hindi↔Marathi MT Systems for Similar Language Translation Task @ WMT 2020
Atul Kr. Ojha!+, Priya Rani!, Akanksha Bansal+, Bharathi Raja Chakravarthi!, Ritesh Kumar*, John P. McCrae!
!Data Science Institute, NUIG, Galway, +Panlingua Language Processing LLP, New Delhi, *Dr. Bhimrao Ambedkar University, Agra
(atulkumar.ojha,priya.rani,bharathi.raja)@insight-centre.org,
panlingua@outlook.com,john.mccrae@nuigalway.ie,
ritesh78 llh@jnu.ac.in
Proceedings of the 5th Conference on Machine Translation (WMT), pages 418–423, Online, November 19–20, 2020. © 2020 Association for Computational Linguistics
Abstract
The NUIG-Panlingua-KMI submission to WMT 2020 seeks to push the state of the art in the Similar Language Translation task for the Hindi↔Marathi language pair. As part of these efforts, we conducted a series of experiments to address the challenges of translation between similar languages. Of the 4 MT systems prepared for this task, one PBSMT system was prepared for each direction of Hindi↔Marathi, and one NMT system was developed for each direction using Byte Pair Encoding (BPE) of subwords. The results show that different architectures in NMT could be an effective method for developing MT systems for closely related languages. Our Hindi-Marathi NMT system was ranked 8th among the 14 teams that participated, and our Marathi-Hindi NMT system was ranked 8th among the 11 teams that participated in the task.
1 Introduction
Developing automated relations between closely related languages is a contemporary concern, especially in the domain of Machine Translation (MT). Hindi and Marathi exhibit a significant overlap in their vocabularies and strong syntactic and lexical similarities. These striking similarities seem promising for mutual inter-comprehension between closely related languages. However, automated translation between such closely related languages is a rather challenging task.
The linguistic similarities and regularities in morphological variations and orthography motivate the use of character-level translation models, which have been applied to translation (Vilar et al., 2007; Chakravarthi et al., 2020) and transliteration (Matthews, 2007; Chakravarthi et al., 2019a; Chakravarthi, 2020). In the past few years, neural machine translation systems have achieved outstanding performance for high-resource languages with the help of open source toolkits such as OpenNMT (Klein et al., 2017), Marian (Junczys-Dowmunt et al., 2018) and Nematus (Sennrich et al., 2017), which provide various ways of experimenting with different features and architectures. Yet NMT fails to achieve the same results with low-resource languages (Chakravarthi et al., 2018, 2019b). However, Sennrich and Zhang (2019) revisited the NMT models, tuned hyper-parameters and changed network architectures to optimize NMT for low-resource conditions, and concluded that low-resource NMT is very sensitive to hyper-parameters such as Byte Pair Encoding (BPE) vocabulary size, word dropout, and others.
This paper is an extension of our work (Ojha et al., 2019) submitted to the WMT 2019 similar language translation task. Our team adapted the methods for the low-resource setting for NMT proposed by Sennrich and Zhang (2019) to explore the following broad objectives:
• to compare the performance of SMT and NMT in the case of closely related, relatively low-resourced language pairs,
• to find out how to improve the accuracy of NMT for closely related languages using BPE segmentation into subwords, and
• to analyze the effects of data quality on the performance of the systems.
2 System Description
This section provides an overview of the systems developed for the WMT 2020 Shared Task. In these experiments, the NUIG-Panlingua-KMI team explored two different approaches, phrase-based statistical (Koehn et al., 2003) and neural, for the Hindi-Marathi and Marathi-Hindi language pairs. In all the submitted systems, we use the Moses (Koehn et al., 2007) and Nematus (Sennrich et al., 2017) toolkits for developing statistical and neural
machine translation systems, respectively. Pre-processing was done to handle noise in the data (for example, sentences in other languages, non-UTF characters, etc.); the details are provided in Section 3.1.
2.1 Phrase-based SMT Systems
These systems were built on the Moses open source toolkit using the KenLM (Heafield, 2011) language model and the GIZA++ (Och and Ney, 2003) aligner. The 'grow-diag-final-and' heuristic was used to extract phrases from the corresponding parallel corpora. In addition to this, KenLM was used to build 5-gram language models.
2.2 Neural Machine Translation System
Nematus was used to build 2 NMT systems. As mentioned in an earlier section, the data was first pre-processed at the subword level with BPE for neural translation, and then the systems were trained using the Nematus toolkit. Most of the system features were adopted from Sennrich et al. (2017) and Koehn and Knowles (2017) (see Section 3.3.2).
2.3 Assessment
Assessment of these systems was done on the standard automatic evaluation metrics: BLEU (Papineni et al., 2002), Rank-based Intuitive Bilingual Evaluation Score (RIBES) (Isozaki et al., 2010) and Translation Error Rate (TER) (Snover et al., 2006).
3 Experiments
This section briefly describes the experiment settings for developing the systems.
3.1 Data Preparations
The parallel data-set for these experiments was provided by the WMT Similar Translation Shared Task¹ organisers, and the Marathi monolingual data-set was taken from the WMT 2020 Shared Task: Parallel Corpus Filtering for Low-Resource Conditions². The data was sub-divided into training, tuning, and monolingual sets, as detailed in Table 1. However, the shared data was very noisy.
To enhance the data quality, the team had to undertake an extensive pre-processing session focused on identifying and cleaning the data-sets. Out of the 43274 training sentences, the Hindi corpus had Telugu sentences while the Marathi corpus had Meitei sentences intermingled, as shown in the first row of Figure 1. The parallel data had more than 1192 lines that were not comparable with each other, as shown in the second and third rows of Figure 1: some Hindi sentences had only half of the sentence translated in Marathi (second row) and some had blank spaces against their Marathi counterparts (third row). The translation quality of the parallel data was also not up to the mark. In fact, the team could locate a few instances of synthetic data. There were also a few sentences where character encoding was an issue, and these were completely unintelligible.
Language Pair    Training  Tuning  Monolingual
Hindi↔Marathi    43274     1411    -
Marathi          -         -       326748
Hindi            -         -       75348193
Table 1: Statistics of Parallel and Monolingual Sentences of the Hindi and Marathi Languages
3.2 Pre-processing
The following pre-processing steps were performed as part of the experiments:
a) Both corpora were tokenized and cleaned (sentences of length over 80 words were removed).
b) For neural translation, the training, validation and test data were preprocessed into BPE subword format. This format was used to prepare the BPE segmentation and the vocabulary used in later steps.
All these processes were performed using Moses scripts. However, the tokenization was done by the RGNLP team tokenizer (Ojha et al., 2018) and the Indic NLP Library³. These tokenizers were used since Moses does not provide a tokenizer for Indic languages. Also, the RGNLP tokenizer ensured that the canonical Unicode representation of the characters was retained.
3.3 Development of the NUIG-Panlingua-KMI MT Systems
After removing noise and pre-processing the data, the NUIG-Panlingua-KMI MT systems were built in the steps described in Sections 3.3.1 and 3.3.2.
¹ http://www.statmt.org/wmt20/similar.html
² https://wmt20similar.cs.upc.edu/
³ https://github.com/anoopkunchukuttan/indic_nlp_library
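The kinds of noise described in Section 3.1 (blank counterparts, half-translated lines, sentences in another script) and the 80-word length limit from Section 3.2 can be illustrated with a small filter. This is only a sketch under stated assumptions, not the team's actual pipeline: the function names, the length-ratio threshold `max_ratio` and the script threshold `min_script` are hypothetical, and whitespace splitting stands in for the real tokenizers.

```python
def devanagari_ratio(text):
    """Fraction of alphabetic characters that fall in the Devanagari block."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    in_block = sum(1 for ch in letters if "\u0900" <= ch <= "\u097f")
    return in_block / len(letters)


def clean_parallel(pairs, max_len=80, max_ratio=2.0, min_script=0.5):
    """Filter (source, target) pairs; all thresholds except max_len are assumptions."""
    kept = []
    for src, tgt in pairs:
        src_tok, tgt_tok = src.split(), tgt.split()
        # Drop pairs where one side is blank.
        if not src_tok or not tgt_tok:
            continue
        # Drop sentences longer than 80 words (the limit stated in Section 3.2).
        if len(src_tok) > max_len or len(tgt_tok) > max_len:
            continue
        # Drop pairs whose lengths differ too much (e.g. half-translated lines).
        longer = max(len(src_tok), len(tgt_tok))
        shorter = min(len(src_tok), len(tgt_tok))
        if longer / shorter > max_ratio:
            continue
        # Hindi and Marathi are both written in Devanagari, so drop other scripts.
        if devanagari_ratio(src) < min_script or devanagari_ratio(tgt) < min_script:
            continue
        kept.append((" ".join(src_tok), " ".join(tgt_tok)))
    return kept


if __name__ == "__main__":
    sample = [
        ("यह एक वाक्य है", "हे एक वाक्य आहे"),
        ("यह एक वाक्य है", "   "),
        ("నమస్కారం అందరికీ", "नमस्कार सर्वांना"),
    ]
    for src, tgt in clean_parallel(sample):
        print(src, "|||", tgt)
```

A script-based check like this cannot detect poor translation quality or synthetic data; those cases would still need manual inspection or a separate filter.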
Figure 1: Examples of discrepancies in Hindi-Marathi parallel data
Figure 2: Analysis of the PBSMT and NMT’s Systems
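The BPE subword preparation mentioned in Section 3.2 (step b) rests on a simple merge-learning loop: repeatedly find the most frequent adjacent symbol pair and merge it into a new symbol. The toy sketch below illustrates that idea only. It is not the code used for the submitted systems, the function names are hypothetical, and the English word counts are illustrative.

```python
import collections
import re


def get_stats(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = collections.Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(pair, vocab):
    """Replace every occurrence of the symbol pair with its concatenation."""
    bigram = re.escape(" ".join(pair))
    # Match the pair only at symbol boundaries (whitespace-delimited).
    pattern = re.compile(r"(?<!\S)" + bigram + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}


def learn_bpe(word_freqs, num_merges):
    """Learn up to num_merges merge operations from a word-frequency dict."""
    # Words start as sequences of code points plus an end-of-word marker.
    # For Devanagari this splits combining marks from their base letters.
    vocab = {" ".join(word) + " </w>": freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        stats = get_stats(vocab)
        if not stats:
            break
        best = max(stats, key=stats.get)
        merges.append(best)
        vocab = merge_pair(best, vocab)
    return merges, vocab


if __name__ == "__main__":
    toy_counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
    merges, vocab = learn_bpe(toy_counts, num_merges=10)
    for pair in merges:
        print(pair)
```

In practice the number of merges sets the subword vocabulary size, which is one of the hyper-parameters that Sennrich and Zhang (2019) report low-resource NMT to be sensitive to.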
3.3.1 Building Primary MT Systems:
As previously mentioned, the Hindi-Marathi and Marathi-Hindi PBSMT systems were built as the primary submission using Moses. The language model was built first, using KenLM. For the Marathi-Hindi and Hindi-Marathi language pairs, the language models were trained on 5-grams. After that, the systems were built independently and combined in a log-linear scheme in which each model was assigned a different weight using the Minimum Error Rate Training (Och, 2003) tuning algorithm. To train and tune the systems, we used 40454 and 1411 parallel sentences, respectively, for all language pairs.
3.3.2 Building Contrastive MT Systems:
As mentioned in the previous section, the Nematus toolkit was used to develop the NMT systems. The training was done on the subword and character level. All the NMT experiments were carried out only with a data-set that contained sentences with a length of up to 80 words. The neural model was trained for 5000 epochs, using Adam with a default learning rate of 0.002, dropout of 0.01 and mini-batches of 80; the batch size for validation was 40. A vocabulary of size 30000 was extracted for both the Marathi-Hindi and Hindi-Marathi language pairs. The remaining parameters were left at the default hyper-parameter configuration.
4 Evaluation
All the systems were evaluated using the reference set provided by the shared task organizers. The standard MT evaluation metrics, BLEU (Papineni et al., 2002), RIBES (Isozaki et al., 2010) and TER (Snover et al., 2006), were used for automatic evaluation. The results for the Primary and Contrastive system submissions are listed in Table 2, where P stands for Primary and C stands for Contrastive. The table gives a quantitative picture of the differences across the systems in terms of evaluation scores.
System            BLEU   RIBES  TER
Hindi-Marathi P   9.38   51.88  91.24
Hindi-Marathi C   9.76   52.18  91.49
Marathi-Hindi P   17.38  59.31  81.47
Marathi-Hindi C   17.39  58.84  81.15
Table 2: Accuracy of Hindi↔Marathi MT Systems on the BLEU, RIBES and TER Metrics
4.1 Results
Overall, we see varying performance among the systems submitted to the task, with some performing much better out-of-sample than others. The NUIG-Panlingua-KMI subword NMT system took 8th position for both the Hindi-Marathi and Marathi-Hindi language pairs, across 14 teams. Our subword NMT system for the Marathi-Hindi language pair showed better results in terms of all three metrics (17.39 in BLEU, 58.84 in RIBES and 81.15 in TER), while the Hindi-Marathi language pair scored 9.76 in BLEU, 52.18 in RIBES and 91.24 in TER. Across both language pairs, subword-based NMT performed better than PBSMT, as its accuracy was higher in BLEU and lower in TER, as shown in Table 2.
4.2 Analysis
We used the reference set provided by the shared task organizers to evaluate both the PBSMT and NMT systems. Even though the subword-based NMT system could take advantage of the features shared among similar languages, challenges in translating a few linguistic structures acted as a constraint. Example 1 in Figure 2 is one of the challenging structures that the system was unable to translate. In these sentences the systems could not capture the correct tense and aspect: the source sentence is in the past perfect, whereas the NMT system translated it as simple past. The second most common challenging structures that needed special attention were the postpositions, as shown in Examples 2 and 3 in the figure. In most cases, the system over-generalised the sentences in Marathi and generated unnecessary postposition phrases in Hindi, as in Example 2. Similarly, Example 3 shows that while translating from Hindi to Marathi both the PBSMT and NMT systems used wrong postpositions.
5 Conclusion
Our experimental results reveal that subword-based NMT could take advantage of the relation between similar languages to boost the accuracy of neural machine translation systems in low-resource data settings. As BPE units are variable-length units and the vocabularies used are much smaller than those of morpheme- and word-level models, the problem of data sparsity does not occur. On the contrary, BPE provides an appropriate context for translation between similar languages. However, the quality of the data used to train the systems does affect the quality of translation. Thus, we could conclude that shared features between two languages could be an advantage for improving the accuracy of NMT systems for closely related languages.
Acknowledgments
This publication has emanated from research in part supported by the Irish Research Council under grant number SFI/18/CRT/6223 (CRT-Centre for Research Training in Artificial Intelligence) co-funded by the European Regional Development Fund as well as by the EU H2020 programme un-