Multi-Lingual Phrase-Based Statistical Machine Translation for
Arabic-English
Ahmed Bastawisy and Mohamed Elmahdy
Computer Science Department
German University in Cairo, Cairo, Egypt
ahmed.bastawisy@student.guc.edu.eg, mohamed.elmahdy@guc.edu.eg
Proceedings of Recent Advances in Natural Language Processing, pages 86–89,
Varna, Bulgaria, Sep 4–6 2017.
https://doi.org/10.26615/978-954-452-049-6_013

Abstract

In this paper, we implement a multi-lingual Statistical Machine Translation (SMT) system for Arabic-English translation. Arabic text can be categorized into standard and dialectal Arabic. These two forms of Arabic differ significantly. Different mono-lingual and multi-lingual hybrid SMT approaches are compared. Mono-lingual systems consistently yield better translation accuracy in one Arabic form and poor accuracy in the other. Multi-lingual SMT models that are trained with pooled parallel MSA/dialectal data result in better overall accuracy. However, since the available parallel MSA data are much larger than the dialectal data, multi-lingual models are biased towards MSA. In this work, we propose a multi-lingual combination of different mono-lingual systems using an Arabic form classifier. The outcome of the classifier directs the system to use the appropriate mono-lingual models (standard, dialectal, or mixture). Testing the different SMT systems shows that the proposed classifier-based SMT system outperforms mono-lingual and data-pooled multi-lingual systems.

1 Introduction

The Arabic language is the largest living Semitic language. Arabic is spoken by more than 350 million people around the world. It is also one of the six official languages of the United Nations, and the first official language of twenty-two countries known as the Arab world. Arabic is also used as a second language by more than 1.2 billion people.

Modern Standard Arabic (MSA) is currently considered the formal Arabic variety among all Arabic speakers. MSA is used in news broadcasts, newspapers, formal speech, books, movie subtitling, and whenever the target audience or readers come from different nationalities. However, MSA is not the natural language for everyday communication or on social networks. In fact, dialectal Arabic is usually used in these cases.

A major problem in all Arabic Natural Language Processing tasks, and in particular Statistical Machine Translation (SMT), is the existence of the Arabic dialects. There exist significant syntactic, morphological, and lexical differences between MSA and the different Arabic dialects. That is why they are sometimes considered completely different languages (Soudi et al., 2012; Elmahdy et al., 2012).

Considerable effort has been devoted to improving Arabic-English SMT, but most of it has focused on MSA rather than dialectal Arabic. This is mainly due to the fact that the vast majority of available parallel Arabic data are for MSA, whilst relatively sparse and limited parallel data are available for dialectal Arabic (Alqudsi et al., 2014).

To tackle the problem of dialectal Arabic parallel data sparsity, many previous works have normalized dialectal words/phrases into corresponding MSA equivalents. This normalization, or pivoting, is basically a rule-based approach to paraphrase dialectal words into MSA. It allows the usage of existing MSA SMT systems (Salloum and Habash, 2013; Sawaf, 2010).

In (Zbib et al., 2012), instead of relying on normalization or pivoting, the authors collected extra dialectal Arabic parallel data and used it in combination with existing MSA data. Results showed that the proposed pooling technique improved translation accuracy for dialectal Arabic. However, MSA translation accuracy slightly decreased.

Because of the complex morphological nature of Arabic, some prior work, as in (Lee, 2004), focused on MSA morphological analysis to improve Arabic SMT.

The aim of this work is to build a multilingual Arabic SMT system that supports MSA as well as dialectal Arabic. Another goal is that the addition of dialectal Arabic should not affect MSA translation accuracy. Moreover, since available MSA data are always larger than dialectal data, the system should not be biased towards MSA.

In this paper, we propose training three different Arabic SMT models: one model for MSA, another for dialectal Arabic, and a hybrid model that is trained with a pool of parallel Arabic-English data for MSA and dialectal Arabic. A pre-classifier is built to choose the appropriate model to be used.

2 Translation Models

Throughout this work, all translation models were built using the Giza aligner and the Moses SMT engine (Koehn et al., 2007). Three translation models have been created: an MSA-English model, a dialectal-English model, and a hybrid-English model. To train the MSA-English translation model, a parallel dataset of 26M words was utilized from the ISI Arabic-English Automatically Extracted Parallel Text corpus (Munteanu and Marcu, 2007). An independent MSA-English evaluation set of 300K words was used to tune the model. An MSA-English test set of 300K words was used to evaluate MSA-English translation accuracy.

To train the dialectal-English translation model, a parallel dataset of 2.7M words was utilized from the Arabic-Dialect/English Parallel Text corpus (Raytheon BBN Technologies et al., 2012) (notice the huge difference between the size of available MSA and dialectal data). An independent dialectal-English evaluation set of 300K words was used to tune the model. A dialectal-English test set of 300K words was used to evaluate dialectal-English translation accuracy.

The hybrid translation model has been trained by pooling both training sets of MSA and dialectal parallel data, which consist of 26M MSA words and 2.7M dialectal words. Model tuning was performed using the two evaluation sets of MSA and dialectal Arabic.

A statistical tri-gram language model is trained for English. The language model training set consists of 688M words from 2011 and 2012 articles (News Crawl), described in (Sofia, 2013). The English language model is used to estimate the prior probability in all of the proposed SMT techniques.

The three translation models have been tested with the three testing sets (MSA, dialectal, MSA+dialectal). As shown in Table 1, the MSA model achieved BLEU scores of 34.8, 2.6, and 18.7 on the MSA, dialectal, and MSA+dialectal testing sets, respectively. It is clear that the MSA model performs poorly on dialectal Arabic data. Using the dialectal Arabic model, the results were 4.1, 15.9, and 10.0 on MSA, dialectal, and MSA+dialectal data, respectively. It is clear that the dialectal model performs better on dialectal data, and performs poorly with MSA data. The hybrid model achieved a more balanced accuracy across both MSA and dialectal Arabic: 33.2, 12.3, and 22.8 BLEU for MSA, dialectal, and MSA+dialectal, respectively. The hybrid model seemed to be slightly biased towards MSA, as the relative decrease in accuracy was 4.6% with respect to the MSA baseline model on MSA data, and 22.6% with respect to the dialectal baseline model on dialectal data.

Translation    Parallel data type
model          MSA     Dialect.   MSA+Dialect.
MSA            34.8    2.6        18.7
Dialectal      4.1     15.9       10.0
Hybrid         33.2    12.3       22.8

Table 1: BLEU score for the different SMT systems on MSA, dialectal, and MSA+dialectal data.

3 Classification-Based Translation

Although the mono-lingual MSA and dialectal Arabic-English SMT systems performed poorly outside their own variety, the hybrid system that was trained with both MSA and dialectal data resulted in better overall accuracy. The aim of classification-based translation is to further improve the accuracy across both dialectal Arabic and MSA, and to overcome the bias problem of the hybrid model.

Two classification techniques have been used. The first technique classifies input Arabic text into two classes, Standard and Dialectal, and translates each with the appropriate system. The second technique classifies input Arabic text into three classes, Standard, Hybrid, and Dialectal, and then uses the appropriate system accordingly.

A tri-gram MSA language model is built for the sake of classification. More than 355M words from the Arabic Gigaword corpus (Parker et al., 2011) were used to train the MSA language model. The MSA language model is used in text classification by scoring every input sentence. Sentences with high log likelihood are classified as MSA, whilst sentences with low log likelihood are classified as dialectal.

3.1 First Classification Technique

In this technique, text segments are classified into two categories: MSA or dialectal. A two-pass optimization search was performed to find the optimal language model scoring threshold between the MSA and dialectal classes.

In the first pass, a coarse search was performed by varying the classification threshold from 0.0 to -10.0 with a coarse step of 1.0. For each iteration, classification accuracy is evaluated. The initial optimal threshold was found to be -4.0, which resulted in a classification accuracy of 95.58% on the evaluation sets of MSA and dialectal Arabic.

In the second optimization pass, a fine search was performed around the initial -4.0 threshold, varying the value from -3.0 to -5.0 with a step of 0.1. Figure 1 shows the classifier's accuracy with a fine step of 0.1 (x = x − 0.1). As shown in the graph, the optimal threshold is -3.7, which resulted in a classification accuracy of 96.64%. Thus, a threshold of -3.7 has been used.

Figure 1: Fine tuning graph for MSA/dialectal classification threshold.

In this technique, the classifier splits the test set into two file groups. The first group contains the segments classified as MSA, which scored more than the threshold of -3.7. The second group contains the segments classified as dialectal Arabic, which scored below or equal to the threshold of -3.7.

After that, each classified Arabic text file was translated with the corresponding SMT system, and then all translations were evaluated with the BLEU score.

3.2 Second Classification Technique

In this technique, instead of having a sharp threshold between the MSA and dialectal classes, we have created a window that contains the optimal threshold. Any sentence with a score that lies in this window is assigned to a third class, labeled the mixture class. It is assumed that any sentence in this class (very close to the threshold) might contain a mixture of dialectal and MSA words, which is a common case on social media, for instance. The optimal window range has been found to be from -2.7 to -5.45. The three classes in this case are: dialectal, MSA, and mixture.

The test set is classified into three file groups. The first group contains the MSA sentences, which scored more than the window upper bound of -2.7. The second group contains the sentences classified as mixture (Hybrid), which scored within the window from -2.7 to -5.45. The third group contains the sentences classified as dialectal, which scored less than the window lower bound of -5.45.

After that, each classified Arabic text file was translated with the corresponding SMT system, and then all translations were evaluated with the BLEU score.

4 Experimental Results

The two classification-based translation techniques have been tested on a test set that combines both testing sets of MSA (300K words) and dialectal Arabic (300K words).

The first classification technique resulted in a BLEU score of 29.1, outperforming the hybrid model with a relative increase in accuracy of 27.6%, as shown in Table 2.

The second classification technique resulted in a BLEU score of 29.0, outperforming the hybrid model with a relative increase of 27.2%.

As shown in Table 2, both techniques have significantly improved translation accuracy in comparison to all of the three baseline systems. This means that introducing a pre-classification stage might be a helpful step in improving the performance of Arabic machine translation systems.

The BLEU score is slightly better in the first classification technique than the second one, with an absolute difference of 0.1. This suggests that it is enough to classify input Arabic text into just two categories instead of three.

Technique             BLEU    Relative
Hybrid Model          22.8    baseline
Classifier-based 1    29.1    +27.6%
Classifier-based 2    29.0    +27.2%

Table 2: Translation accuracy on MSA+dialectal parallel data for the hybrid model, classifier-based technique 1, and classifier-based technique 2.

5 Conclusions

This paper has focused mainly on enhancing the accuracy of SMT across MSA and dialectal Arabic. Three baseline Arabic-English SMT systems were built: MSA, dialectal, and hybrid. The MSA system resulted in significantly low accuracy on dialectal data, whilst the dialectal system resulted in low accuracy on MSA data. The hybrid system performed with a better average accuracy across both MSA and dialectal data.

In order to classify input text into the correct variety of Arabic (dialectal or MSA), two classification techniques have been proposed. The first technique classifies the testing data into two categories, one to be translated with the MSA model, and the other to be translated with the dialectal model. The second technique classifies the testing data into three classes, one to be translated with the MSA model, one to be translated with the hybrid model, and the last one to be translated with the dialectal model.

Both techniques have significantly improved translation accuracy on a balanced testing set that contains equal amounts of MSA and dialectal data. The first technique resulted in a slightly better BLEU score than the second one.

References

Arwa Alqudsi, Nazlia Omar, and Khalid Shaker. 2014. Arabic machine translation: a survey. Artificial Intelligence Review 42(4):549–572.

Mohamed Elmahdy, Rainer Gruhn, and Wolfgang Minker. 2012. Novel Techniques for Dialectal Arabic Speech Recognition. Springer-Verlag New York, 1st edition.

Philipp Koehn, Hieu Hoang, et al. 2007. Moses: Open Source Toolkit for Statistical Machine Translation. Annual Meeting of the Association for Computational Linguistics (ACL).

Young-Suk Lee. 2004. Morphological analysis for statistical machine translation. In Proceedings of HLT-NAACL 2004: Short Papers. Association for Computational Linguistics, pages 57–60.

Dragos Stefan Munteanu and Daniel Marcu. 2007. ISI Arabic-English Automatically Extracted Parallel Text LDC2007T08. Web Download. Philadelphia: Linguistic Data Consortium.

Robert Parker et al. 2011. Arabic Gigaword Fifth Edition (LDC2011T11). Linguistic Data Consortium.

Raytheon BBN Technologies, Linguistic Data Consortium, and Sakhr Software. 2012. Arabic-Dialect/English Parallel Text (LDC2012T09). Linguistic Data Consortium.

Wael Salloum and Nizar Habash. 2013. Dialectal Arabic to English machine translation: Pivoting through modern standard Arabic. In HLT-NAACL. pages 348–358.

Hassan Sawaf. 2010. Arabic dialect handling in hybrid machine translation. In Proceedings of the Conference of the Association for Machine Translation in the Americas (AMTA), Denver, Colorado.

Sofia. 2013. News Crawl (articles from 2011 and 2012). Web Download. Shared Task: Machine Translation.

Abdelhadi Soudi, Ali Farghaly, Gunter Neumann, and Rabih Zbib. 2012. Challenges for Arabic Machine Translation. Natural Language Processing 9. John Benjamins.

Rabih Zbib, Erika Malchiodi, Jacob Devlin, David Stallard, Spyros Matsoukas, Richard Schwartz, John Makhoul, Omar F Zaidan, and Chris Callison-Burch. 2012. Machine translation of Arabic dialects. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, pages 49–59.
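
The routing rule used by the two classification techniques (Sections 3.1 and 3.2) can be illustrated with a short Python sketch. This is only an illustration under stated assumptions, not the authors' implementation: the names `classify` and `route` are hypothetical, the language-model scores are assumed to be supplied by an external MSA language model, and treating the window bounds themselves as belonging to the mixture class is an assumption, since the paper only states which scores fall strictly above or below the window.

```python
from typing import Dict, List, Optional, Tuple

MSA, MIXTURE, DIALECTAL = "msa", "mixture", "dialectal"


def classify(score: float, threshold: float = -3.7,
             window: Optional[Tuple[float, float]] = None) -> str:
    """Map an MSA language-model score to an Arabic-form label.

    window=None      -> two-way rule: score > threshold is MSA, else dialectal.
    window=(hi, lo)  -> three-way rule: above hi is MSA, below lo is dialectal,
                        anything in between (bounds included, an assumption)
                        is the mixture class.
    """
    if window is None:
        return MSA if score > threshold else DIALECTAL
    upper, lower = window
    if score > upper:
        return MSA
    if score < lower:
        return DIALECTAL
    return MIXTURE


def route(scored: List[Tuple[str, float]],
          window: Optional[Tuple[float, float]] = None) -> Dict[str, List[str]]:
    """Group (sentence, score) pairs by the model that should translate them."""
    groups: Dict[str, List[str]] = {MSA: [], MIXTURE: [], DIALECTAL: []}
    for sentence, score in scored:
        groups[classify(score, window=window)].append(sentence)
    return groups


# Invented placeholder sentences and scores, for illustration only.
scored = [("s1", -2.1), ("s2", -3.7), ("s3", -4.9), ("s4", -6.0)]
two_way = route(scored)                            # threshold -3.7
three_way = route(scored, window=(-2.7, -5.45))    # window from Section 3.2
print(two_way)
print(three_way)
```

Each resulting group would then be passed to the corresponding SMT system (MSA, hybrid, or dialectal), which is outside the scope of this sketch.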
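
The two-pass threshold search of Section 3.1 (a coarse pass from 0.0 to -10.0 in steps of 1.0, then a fine pass of step 0.1 within 1.0 of the coarse optimum) might look roughly like the following Python sketch. The helper names `accuracy`, `grid`, and `two_pass_search` are hypothetical, ties are broken in favour of the first (highest) threshold as an assumption, and the labelled scores are invented toy values rather than the paper's evaluation data. They were chosen so that the search happens to land on -4.0 and -3.7.

```python
from typing import List, Tuple

# Each item is (language-model score, True if the sentence is really MSA).
Scored = List[Tuple[float, bool]]


def accuracy(threshold: float, scored: Scored) -> float:
    """Fraction of sentences where 'score > threshold' matches the MSA label."""
    correct = sum((score > threshold) == is_msa for score, is_msa in scored)
    return correct / len(scored)


def grid(start: float, stop: float, step: float) -> List[float]:
    """Descending thresholds from start to stop inclusive (rounded to limit float drift)."""
    n = int(round((start - stop) / step))
    return [round(start - i * step, 10) for i in range(n + 1)]


def two_pass_search(scored: Scored) -> Tuple[float, float]:
    """Return (coarse optimum, fine optimum) of the classification threshold."""
    coarse = max(grid(0.0, -10.0, 1.0), key=lambda t: accuracy(t, scored))
    fine = max(grid(coarse + 1.0, coarse - 1.0, 0.1),
               key=lambda t: accuracy(t, scored))
    return coarse, fine


# Toy labelled scores, invented for illustration only.
toy: Scored = [(-1.0, True), (-2.5, True), (-3.2, True), (-3.6, True),
               (-3.8, False), (-4.4, False), (-5.0, False), (-7.0, False)]
coarse_best, fine_best = two_pass_search(toy)
print(coarse_best, fine_best)
```

The coarse pass keeps the number of accuracy evaluations small, and the fine pass only refines the neighbourhood of the best coarse value.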

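
The relative changes quoted alongside Tables 1 and 2 follow from the usual formula (score − baseline) / baseline. The short Python check below recomputes them from the BLEU scores in the tables. The function name `relative_change` is hypothetical and introduced only for this worked example.

```python
def relative_change(score: float, baseline: float) -> float:
    """Percentage change of `score` with respect to `baseline`."""
    return (score - baseline) / baseline * 100.0


# BLEU scores copied from Tables 1 and 2.
changes = {
    "hybrid vs. MSA model on MSA data": relative_change(33.2, 34.8),
    "hybrid vs. dialectal model on dialectal data": relative_change(12.3, 15.9),
    "classifier-based 1 vs. hybrid": relative_change(29.1, 22.8),
    "classifier-based 2 vs. hybrid": relative_change(29.0, 22.8),
}

for name, value in changes.items():
    print(f"{name}: {value:+.1f}%")
```

Rounded to one decimal place, these reproduce the -4.6%, -22.6%, +27.6%, and +27.2% figures reported in the text.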