Domain Adaptation for Hindi-Telugu Machine Translation
using Domain Specific Back Translation
Hema Ala, Vandan Mujadia, Dipti Misra Sharma
LTRC, IIIT Hyderabad
hema.ala@research.iiit.ac.in, vandan.mu@research.iiit.ac.in, dipti@iiit.ac.in
Abstract

In this paper, we present a novel approach for domain adaptation in Neural Machine Translation which aims to improve translation quality on a new domain. Adapting to new domains is a highly challenging task for Neural Machine Translation with limited data, and it becomes even more difficult for technical domains such as Chemistry and Artificial Intelligence due to specific terminology, etc. We propose a Domain Specific Back Translation method which uses available monolingual data and generates synthetic data in a different way. This approach uses Out Of Domain words. The approach is very generic and can be applied to any language pair and any domain. We conduct our experiments on the Chemistry and Artificial Intelligence domains for Hindi and Telugu in both directions. It has been observed that the use of synthetic data created by the proposed algorithm improves the BLEU scores significantly.

1 Introduction

Neural Machine Translation (NMT) systems have recently achieved a breakthrough in translation quality by learning an end-to-end system (Bahdanau et al., 2014; Sutskever et al., 2014). These systems perform well on the general domain on which they are trained, but they fail to produce good translations for a new domain the model is unaware of.

Adapting to a new domain is a highly challenging task for NMT systems, and it becomes even more challenging for technical domains like Chemistry and Artificial Intelligence, as they contain many domain-specific words. In a typical domain adaptation scenario like ours, we have a tremendous amount of general data on which we train an NMT model; we can regard this as a baseline model. Now, given data from a new domain, the challenge is to improve the translation quality on that domain using the small amount of available parallel domain data. We adopted two technical domains, namely Chemistry and Artificial Intelligence, for our Hindi -> Telugu and Telugu -> Hindi experiments.

The parallel data for the mentioned technical domains is very scarce, hence we used back translation to create synthetic data. Instead of using synthetic data directly, which may contain a lot of noise, we used domain monolingual data to create synthetic data in a different way (see section 3.4), such that the translation of domain terms and the context around them is accurate.

2 Background & Motivation

As noted by Chu and Wang (2018), there are two important distinctions to make among domain adaptation methods for Machine Translation (MT). The first is based on data requirements: supervised adaptation relies on in-domain parallel data, while unsupervised adaptation has no such requirement. There is also a difference between model-based and data-based methods. Model-based methods make explicit modifications to the model architecture, such as jointly learning domain discrimination and translation (Britz et al., 2017), interpolation of language modeling and translation (Gulcehre et al., 2015; Domhan and Hieber, 2017), and domain control by adding tags and word features (Kobus et al., 2016). Zeng et al. (2019) proposed an iterative dual domain adaptation framework for NMT, which continuously exploits the mutual complementarity between in-domain and out-of-domain corpora for translation knowledge transfer. Apart from this, Freitag and Al-Onaizan (2016) proposed two approaches:
Proceedings of Recent Advances in Natural Language Processing, pages 26–34
Sep 1–3, 2021.
https://doi.org/10.26615/978-954-452-072-4_004
one is to continue training the baseline model (the general model) only on the in-domain data, and the other is to ensemble the continued model with the baseline model at decoding time. Coming to the data-based methods for domain adaptation, it can be done in two ways: combining in-domain and out-of-domain parallel corpora for supervised adaptation (Luong et al., 2015), or generating pseudo-parallel corpora from in-domain monolingual data for unsupervised adaptation (Sennrich et al., 2015a; Currey et al., 2017).

Our approach follows a combination of both supervised and unsupervised approaches: we first combine domain data (Chemistry and Artificial Intelligence) with general data and train a domain adaptation model. Then, as an unsupervised approach, we use available domain monolingual data for back translation and use the result to create a domain adaptation model. Burlot and Yvon (2019) explained how monolingual data can be used effectively in MT systems. Inspired by Burlot and Yvon (2019), instead of just adding domain parallel data, which is very small in amount compared to the general data, we used the available domain monolingual data to generate synthetic parallel data.

Burlot and Yvon (2019) analyzed various ways to integrate monolingual data into an NMT framework, focusing on their impact on quality and domain adaptation. A simple way to use monolingual data in MT is to turn it into synthetic parallel data and let the training procedure run as usual (Bojar and Tamchyna, 2011), but this kind of synthetic data may contain substantial noise, which degrades performance on domain data. Therefore, we present an approach which generates synthetic data in a way that makes it more reliable and improves the translation. In the context of phrase-based statistical machine translation, Daume III and Jagarlamudi (2011) noted that unseen (OOV) words account for a large portion of translation errors when switching to new domains; this problem still exists even in NMT. Considering this issue, and inspired by Huck et al. (2019), we propose a novel approach called domain specific back translation which uses Out Of Domain (OOD) words to create synthetic data from monolingual data; it will be discussed in detail in section 3.4. Huck et al. (2019) also created synthetic data using OOV words, in a different way, whereas we use OOD words to create synthetic data.

3 Methodology

As discussed in section 2, there are many approaches for domain adaptation, mainly divided into model-based and data-based methods. Our approach falls under data-based methods; we discuss this in detail in section 3.3. Though many domain adaptation works exist in MT, to the best of our knowledge there is no such work for Indian languages, especially one which considers technical domains like Chemistry and Artificial Intelligence. Hence there is a great need to work on Indian languages, most of which are morphologically rich, and on these kinds of technical domains, to improve the translation of domain-specific text that contains many domain terms.

We conducted all our experiments for Hindi and Telugu in both directions for Chemistry and Artificial Intelligence. The languages in the pair (Hindi-Telugu) considered in our experiments are morphologically rich; therefore, there exist many postpositions, inflections, etc. In order to handle all these morphological inflections we used Byte Pair Encoding (BPE); a detailed explanation of BPE is given in section 3.2.

3.1 Neural Machine Translation

An NMT system attempts to model the conditional probability of the target sentence given the source sentence. There exist several techniques to parameterize these conditional probabilities. Kalchbrenner and Blunsom (2013) used a combination of a convolutional neural network and a recurrent neural network, Sutskever et al. (2014) used a deep Long Short Term Memory (LSTM) model, Cho et al. (2014) used an architecture similar to the LSTM, and Bahdanau et al. (2014) used a more elaborate neural network architecture that uses an attention mechanism over the input sequence. However, all these approaches are based on RNNs, LSTMs, etc., and because of the characteristics of RNNs they are not conducive to parallel training, so that
the model training time is often long. Addressing this issue, Vaswani et al. (2017) proposed the Transformer framework based on a self-attention mechanism. Inspired by Vaswani et al. (2017), we used the Transformer architecture in all our experiments.

3.2 Byte Pair Encoding

BPE (Gage, 1994) is a data compression technique that substitutes the most frequent pair of bytes in a sequence with a byte that does not occur within that data. Using this we can acquire a vocabulary of the desired size and can handle rare and unknown words as well (Sennrich et al., 2015b). As Telugu and Hindi are morphologically rich languages, Telugu in particular being a highly agglutinative language, there is a need to handle postpositions, compound words, etc. BPE helps with this by separating suffixes, prefixes, and compound words. NMT with BPE has made significant improvements in translation quality for low-resource, morphologically rich languages (Pinnis et al., 2017). We adopted the same for our experiments and got the best results with a vocabulary size of 30000.

3.3 Domain Adaptation

Domain adaptation aims to improve the translation performance of a model (trained on general data) on a new domain by leveraging the available domain parallel data. As discussed in section 2, there are multiple approaches to do this, broadly divided into model-based and data-based; our approach falls under data-based methods, where one can combine the available small amount of domain parallel data with general data. In this paper we show how the use of domain-specific synthetic data improves translation performance significantly. The main goal of this method is to use domain-specific synthetic parallel data created by the approach described in section 3.4, along with the small amount of domain parallel data.

3.4 Domain Specific Back Translation

In our experiments we followed a data-based approach: we combined domain data with general data and trained a new model as a domain adaptation model.

Algorithm 1: Generic Algorithm for Domain Specific Back Translation

Let L1 and L2 be a language pair (translation can be done in both directions, L1 -> L2 and L2 -> L1).
1. Training Corpus: take all available L1 - L2 data (except domain data).
2. Train two NMT models (1. L1 -> L2 [L1-L2], 2. L2 -> L1 [L2-L1]).
3. for domain in all domains do
   1. Take the L1 domain data and list all Out Of Domain words with respect to the L1 Training Corpus [say this is OOD-L1 with respect to the given domain].
   2. Take the L2 domain data and list all Out Of Domain words with respect to the L2 Training Corpus [say this is OOD-L2 with respect to the given domain].
   end
4. Now take monolingual data for L1 and L2.
5. for all domains do
   1. Get N sentences from the L1 monolingual data where OOD-L1 words are present [Mono-L1].
   2. Get N sentences from the L2 monolingual data where OOD-L2 words are present [Mono-L2].
   3. Run L2-L1 on Mono-L2 to get back-translated data for L1 -> L2 (BT[L1-L2]).
   4. Run L1-L2 on Mono-L1 to get back-translated data for L2 -> L1 (BT[L2-L1]).
   end
*. Steps to extract OOD words (mentioned in step 3) for all domains and all languages:
   for word in unique words of the domain data do
      if word not in unique words of the general data
      then that word is extracted as an OOD word with respect to that domain
   end
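The data-selection core of Algorithm 1 can be sketched in Python. This is a minimal illustration, not the authors' code: `translate_l2_l1` is a hypothetical stand-in for a trained L2 -> L1 NMT model, and all function names are illustrative. It shows OOD extraction as a set difference, OOD-based selection of monolingual sentences, and the construction of synthetic pairs by back translation.

```python
# Minimal sketch of Algorithm 1 (domain specific back translation).
# `translate_l2_l1` is a hypothetical stand-in for a trained NMT model.

def unique_words(sentences):
    """Vocabulary of a corpus: the set of unique whitespace tokens."""
    return {tok for sent in sentences for tok in sent.split()}

def ood_words(domain_sents, general_sents):
    """Step 3: domain words absent from the general training corpus."""
    return unique_words(domain_sents) - unique_words(general_sents)

def select_monolingual(mono_sents, ood, n):
    """Step 5: keep up to n monolingual sentences containing an OOD word."""
    picked = [s for s in mono_sents if ood & set(s.split())]
    return picked[:n]

def domain_specific_bt(mono_l2, ood_l2, translate_l2_l1, n):
    """Steps 5.3/5.4: back-translate selected L2 sentences to obtain
    synthetic (L1, L2) pairs for training an L1 -> L2 model."""
    selected = select_monolingual(mono_l2, ood_l2, n)
    return [(translate_l2_l1(s), s) for s in selected]
```

The resulting pairs are then mixed with the general and in-domain parallel data when training the adapted model, as described in section 3.3.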
Because the domain data is very scarce, we can use available monolingual data to generate synthetic parallel data. Leveraging monolingual data has attained significant improvements in NMT (Domhan and Hieber, 2017; Burlot and Yvon, 2019; Bojar and Tamchyna, 2011; Gulcehre et al., 2015). Using back translation we can generate synthetic parallel data, but it might be very noisy, which will decrease the domain-specific translation performance. Hence we need an approach which extracts only useful sentences and creates synthetic data. Our approach addresses this by creating domain-specific back-translated data using the procedure given in Algorithm 1.

Domain-specific back translation tries to improve overall translation quality, particularly the translation of domain terms and, implicitly, the domain-specific context around them. The generic algorithm for domain-specific back translation is described in Algorithm 1. The algorithm is very generic and can be applied to any language pair and any domain. In our experiments, we adopted two domains, namely Chemistry and Artificial Intelligence, and one language pair, Hindi and Telugu, in both directions.

Domains    #Sentences  #Tkns(te)  #Unique Tkns(te)  #Tkns(hi)  #Unique Tkns(hi)
General    431975      5021240    443052            7995403    123716
AI         5272        57051      11900             89392      5479
Chemistry  3600        72166      10166             97243      6792

Table 1: Parallel Data for Hindi - Telugu

Langs   #Sent   #Tkns    UTkns
Hindi   16345   175931   17405
Telugu  39583   339612   86942

Table 2: Monolingual Data

Domain-Lang       #Sentences  #Tkns
AI-Hindi          14014       438848
AI-Telugu         22241       285234
Chemistry-Hindi   28672       982700
Chemistry-Telugu  34322       425515

Table 3: Selected monolingual data for domain specific back translation

Let us instantiate Algorithm 1 with L1 as Hindi and L2 as Telugu, with Chemistry and Artificial Intelligence as the domains. Each step of the algorithm can then be interpreted as follows. Step 1. The training corpus is the general data mentioned in Table 1. Step 2. We train two models using the training corpus from the above step, one from Hindi to Telugu and the other from Telugu to Hindi. These models can be treated as base models. Step 3. This step finds the OOD words; it can be done as follows (in Algorithm 1 this step is explained in detail at the end). Step 3.1. Get the unique words from the general corpus, say Gen-Unique, for both languages. Step 3.2. Get the unique words from the Chemistry corpus, Chem-Unique, for both languages. Step 3.3. Get the unique words from the AI corpus, AI-Unique, for both languages. Step 3.4. Now take each word from Chem-Unique and check for it in Gen-Unique; if it is not found, it is considered a Chemistry OOD word. We get OOD Hindi and OOD Telugu words with respect to Chemistry. Step 3.5. Take each word from AI-Unique and check for it in Gen-Unique; if it is not found, it is considered an AI OOD word. We get OOD Hindi and OOD Telugu words with respect to AI. Step 4. Take the monolingual data for both languages mentioned in Table 2. Step 5. Extract sentences from the Hindi monolingual data where Hindi OOD words w.r.t. Chemistry are present [Chem-Mono-Hindi]. Step 5.1. Extract sentences from the Telugu monolingual data where Telugu OOD words w.r.t. Chemistry are present [Chem-Mono-Telugu]. Step 5.2. Extract sentences from the Hindi monolingual data where Hindi OOD words w.r.t. AI are present [AI-Mono-Hindi]. Step 5.3. Extract sentences from the Telugu monolingual data where Telugu OOD words w.r.t. AI are present [AI-Mono-Telugu]. Step 6. Run the Hindi -> Telugu model from step 2 on Chem-Mono-Hindi to get back-translated data [BT-Chem-Hindi-Telugu]. Step 6.1. Run the Telugu -> Hindi model from step 2 on Chem-Mono-Telugu to get back-translated data [BT-Chem-Telugu-Hindi]. Step
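Section 3.2 describes BPE only in prose. Its merge-learning loop (Gage, 1994; Sennrich et al., 2015b) can be illustrated with a toy sketch; the names and whitespace-based symbol handling here are simplifications for illustration, and in practice one would use an off-the-shelf BPE implementation with the 30000-size vocabulary mentioned above.

```python
# Toy sketch of BPE merge learning in the style of Sennrich et al. (2015b).
# Simplified: ignores token-boundary edge cases of str.replace.
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        syms = word.split()
        for a, b in zip(syms, syms[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace each occurrence of the pair with its concatenation."""
    merged, joined = " ".join(pair), "".join(pair)
    return {w.replace(merged, joined): f for w, f in vocab.items()}

def learn_bpe(words, num_merges):
    """Learn num_merges merge operations from a list of words."""
    vocab = {" ".join(w): f for w, f in Counter(words).items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges, vocab
```

After two merges on the toy corpus ["low", "low", "lower"], the frequent word "low" becomes a single symbol while the rare suffix of "lower" stays segmented, which is exactly the behavior that lets BPE handle the postpositions and compounds of Hindi and Telugu.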