Multi-Lingual Phrase-Based Statistical Machine Translation for Arabic-English

Ahmed Bastawisy and Mohamed Elmahdy
Computer Science Department
German University in Cairo, Cairo, Egypt
ahmed.bastawisy@student.guc.edu.eg, mohamed.elmahdy@guc.edu.eg

Proceedings of Recent Advances in Natural Language Processing, pages 86–89,
Varna, Bulgaria, Sep 4–6 2017.
https://doi.org/10.26615/978-954-452-049-6_013

Abstract

In this paper, we implement a multi-lingual Statistical Machine Translation (SMT) system for Arabic-English translation. Arabic text can be categorized into standard and dialectal Arabic. These two forms of Arabic differ significantly. Different mono-lingual and multi-lingual hybrid SMT approaches are compared. Mono-lingual systems always result in better translation accuracy in one Arabic form and poor accuracy in the other. Multi-lingual SMT models that are trained with pooled parallel MSA/dialectal data result in better accuracy. However, since the available parallel MSA data are much larger than the dialectal data, multi-lingual models are biased towards MSA. We propose in this work a multi-lingual combination of different mono-lingual systems using an Arabic form classifier. The outcome of the classifier directs the system to use the appropriate mono-lingual models (standard, dialectal, or mixture). Testing the different SMT systems shows that the proposed classifier-based SMT system outperforms mono-lingual and data-pooled multi-lingual systems.

1   Introduction

The Arabic language is the largest living Semitic language. Arabic is spoken by more than 350 million people around the world. It is also one of the six official languages of the United Nations, and the first official language of the twenty-two countries known as the Arab world. Arabic is also used as a second language by more than 1.2 billion people.

Modern Standard Arabic (MSA) is currently considered the formal Arabic variety across all Arabic people. MSA is used in news broadcasts, newspapers, formal speech, books, movie subtitling, and whenever the target audience or readers come from different nationalities. However, MSA is not the natural language for everyday communication or on social networks. In these cases, dialectal Arabic is usually used.

A major problem in all Arabic Natural Language Processing tasks, and in particular Statistical Machine Translation (SMT), is the existence of the Arabic dialects. There are significant syntactic, morphological, and lexical differences between MSA and the different Arabic dialects. That is why they are sometimes considered completely different languages (Soudi et al., 2012; Elmahdy et al., 2012).

Considerable effort has been exerted to improve Arabic-English SMT, but most of it has focused on MSA rather than dialectal Arabic. This is mainly because the vast majority of available parallel Arabic data are for MSA, whilst only relatively sparse and limited parallel data are available for dialectal Arabic (Alqudsi et al., 2014).

To tackle the problem of dialectal Arabic parallel data sparsity, many previous works have normalized dialectal words/phrases into their corresponding MSA equivalents. This normalization, or pivoting, is basically a rule-based approach to paraphrase dialectal words into MSA. It allows the usage of existing MSA SMT systems (Salloum and Habash, 2013; Sawaf, 2010).

In (Zbib et al., 2012), instead of relying on normalization or pivoting, the authors collected extra dialectal Arabic parallel data and used it in combination with existing MSA data. Results showed that this pooling technique improved translation accuracy for dialectal Arabic. However, MSA translation accuracy slightly decreased.

Because of the complex morphological nature of Arabic, some prior work, as in (Lee, 2004), focused on MSA morphological analysis to improve Arabic SMT.

The aim of this work is to build a multi-lingual Arabic SMT system that supports MSA as well as dialectal Arabic. Another goal is that the addition of dialectal Arabic should not affect MSA translation accuracy. Moreover, since available MSA data are always larger than dialectal data, the system should not be biased towards MSA.

In this paper, we propose training three different Arabic SMT models: one model for MSA, another for dialectal Arabic, and a hybrid model that is trained with a data pool of parallel Arabic-English data for both MSA and dialectal Arabic. A pre-classifier is built to choose the appropriate model to be used.

2   Translation Models

Throughout this work, all translation models were built using the Giza aligner and the Moses SMT engine (Koehn et al., 2007). Three translation models have been created: an MSA-English model, a dialectal-English model, and a hybrid-English model.

To train the MSA-English translation model, a parallel dataset of 26M words was utilized from the ISI Arabic-English Automatically Extracted Parallel Text corpus (Munteanu and Marcu, 2007). An independent MSA-English evaluation set of 300K words was used to tune the model. An MSA-English test set of 300K words is used to evaluate MSA-English translation accuracy.

To train the dialectal-English translation model, a parallel dataset of 2.7M words was utilized from the Arabic-Dialect/English Parallel Text corpus (Raytheon BBN Technologies et al., 2012) (notice the huge difference between the sizes of the available MSA and dialectal data). An independent dialectal-English evaluation set of 300K words was used to tune the model. A dialectal-English test set of 300K words is used to evaluate dialectal-English translation accuracy.

The hybrid translation model has been trained by pooling both training sets of MSA and dialectal parallel data, which together consist of 26M MSA words and 2.7M dialectal words. Model tuning was performed using the two evaluation sets of MSA and dialectal Arabic.

A statistical tri-gram language model is trained for English. The language model training set consists of 688M words from 2011 and 2012 articles (News Crawl), described in (Sofia, 2013). The English language model is used to estimate the prior probability in all of the proposed SMT techniques.

The three translation models have been tested with the three testing sets (MSA, dialectal, MSA+dialectal). As shown in Table 1, the MSA model resulted in BLEU scores of 34.8, 2.6, and 18.7 on the MSA, dialectal, and MSA+dialectal testing sets respectively. It is clear that the MSA model performs poorly on dialectal Arabic data. Using the dialectal Arabic model, the results were 4.1, 15.9, and 10.0 on MSA, dialectal, and MSA+dialectal respectively. The dialectal model performs better on dialectal data and poorly on MSA data. The hybrid model resulted in acceptable accuracy across both MSA and dialectal Arabic, with 33.2, 12.3, and 22.8 BLEU for MSA, dialectal, and MSA+dialectal respectively. The hybrid model seems to be somewhat biased towards MSA: its accuracy changed by -4.6% relative to the MSA baseline model on MSA data, but by -22.6% relative to the dialectal baseline model on dialectal data.

  Translation            Parallel data type
  model          MSA     Dialect.    MSA+Dialect.
  MSA            34.8      2.6          18.7
  Dialectal       4.1     15.9          10.0
  Hybrid         33.2     12.3          22.8

Table 1: BLEU score for the different SMT systems on MSA, dialectal, and MSA+dialectal data.

3   Classification-Based Translation

Before adding the classifier, the accuracy of the MSA and dialectal Arabic-English SMT systems was poor across the different variants, while the hybrid system that was trained with both MSA and dialectal data resulted in better accuracy. The aim of Classification-Based Translation is to further improve the accuracy across both dialectal Arabic and MSA, and to overcome the bias problem of the hybrid model.

Two classification techniques have been used. The first technique classifies input Arabic text into two classes, Standard and Dialectal, and translates each class with the appropriate system. The second technique classifies input Arabic text into three classes, Standard, Hybrid and Dialectal, and then uses the appropriate system accordingly.

A tri-gram MSA language model is built for the sake of classification. More than 355M words from the Arabic Gigaword corpus (Parker et al., 2011) were used to train this MSA language model.

The MSA language model is used in text classification by scoring every input sentence with the language model. Sentences with high log likelihood are classified as MSA, whilst sentences with low log likelihood are classified as Dialectal.

3.1   First Classification Technique

In this technique, text segments are classified into two categories: MSA or dialectal. A two-pass optimization search was made to find the optimal language model scoring threshold between the MSA and dialectal classes.

In the first pass, a coarse search was performed by varying the classification threshold from 0.0 to -10.0 with a coarse step of 1.0. For each iteration, classification accuracy is evaluated. The initial optimal threshold was found to be -4.0, which resulted in a classification accuracy of 95.58% on the evaluation sets of MSA and dialectal Arabic.

In the second optimization pass, a fine step search was performed around the initial -4.0 threshold, varying the value from -3.0 to -5.0 with a step of 0.1. Figure 1 shows the classifier's accuracy for the fine search (the threshold is decreased by 0.1 at each iteration). As shown in the graph, the optimal threshold is -3.7, which resulted in a classification accuracy of 96.64%. Thus, a threshold of -3.7 has been used.

Figure 1: Fine tuning graph for MSA/dialectal classification threshold.

In this technique, the classifier splits the test set into two file groups. The first group contains the segments classified as MSA, which scored more than the threshold of -3.7. The second group contains the segments classified as dialectal Arabic, which scored below or equal to the threshold of -3.7.

After that, each classified Arabic text file was translated with the corresponding SMT system, and then all translations were evaluated with the BLEU score.

3.2   Second Classification Technique

In this technique, instead of having a sharp threshold between the MSA and dialectal classes, we have created a window around the optimal threshold. Any sentence with a score that lies in this window is assigned to a third class, labeled the mixture class. It is assumed that any sentence in this class (close to the threshold) might contain a mixture of dialectal and MSA words, which is a common case on social media for instance. The optimal window range has been found to be from -2.7 to -5.45. The three classes in this case are: Dialectal, MSA, and mixture.

The test set is classified into three file groups. The first group contains the MSA sentences, which scored more than the window upper bound of -2.7. The second group contains the sentences classified as Hybrid Arabic, which scored within the window from -2.7 to -5.45. The third group contains the sentences classified as Dialectal, which scored less than the window lower bound of -5.45.

After that, each classified Arabic text file was translated with the corresponding SMT system, and then all translations were evaluated with the BLEU score.

4   Experimental Results

The two classification-based translation techniques have been tested on a test set that combines both testing sets of MSA (300K words) and dialectal Arabic (300K words).

The first classification technique resulted in a BLEU translation accuracy of 29.0 + 0.1 = 29.1 absolute, outperforming the hybrid model with a relative increase in accuracy of 27.6%, as shown in Table 2.

The second classification technique resulted in a BLEU translation accuracy of 29.0 absolute, outperforming the hybrid model with a relative increase of 27.2%.

As shown in Table 2, both techniques have significantly improved translation accuracy in comparison to all three baseline systems. This means that introducing a pre-classification stage might be a helpful step in improving the performance of Arabic machine translation systems.

The BLEU score is slightly better in the first classification technique than in the second one, with an absolute difference of 0.1. This suggests that it is enough to classify input Arabic text into just two categories instead of three.

  Technique             BLEU    Relative
  Hybrid Model          22.8    baseline
  Classifier-based 1    29.1    +27.6%
  Classifier-based 2    29.0    +27.2%

Table 2: Translation accuracy on MSA+dialectal parallel data for the hybrid model, classifier-based technique 1, and classifier-based technique 2.

5   Conclusions

This paper has focused mainly on enhancing the accuracy of SMT across MSA and dialectal Arabic. Three baseline Arabic-English SMT systems were built: MSA, dialectal, and Hybrid. The MSA system resulted in significantly low accuracy on dialectal data, whilst the dialectal system resulted in low accuracy on MSA data. The hybrid system performed with a better average accuracy across both MSA and dialectal data.

In order to classify input text into the correct variety of Arabic (dialectal or MSA), two classification techniques have been proposed. The first technique classifies the testing data into two categories, one to be translated with the MSA model, and the other to be translated with the dialectal model. The second technique classifies the testing data into three classes, one to be translated with the MSA model, one to be translated with the hybrid model, and the last one to be translated with the dialectal model.

Both techniques have significantly improved translation accuracy on a balanced testing set that contains equal amounts of MSA and dialectal data. The first technique resulted in a slightly better BLEU score than the second one.

References

Arwa Alqudsi, Nazlia Omar, and Khalid Shaker. 2014. Arabic machine translation: a survey. Artificial Intelligence Review 42(4):549–572.

Mohamed Elmahdy, Rainer Gruhn, and Wolfgang Minker. 2012. Novel Techniques for Dialectal Arabic Speech Recognition. Springer-Verlag New York, 1st edition.

Philipp Koehn, Hieu Hoang, et al. 2007. Moses: Open Source Toolkit for Statistical Machine Translation. Annual Meeting of the Association for Computational Linguistics (ACL).

Young-Suk Lee. 2004. Morphological analysis for statistical machine translation. In Proceedings of HLT-NAACL 2004: Short Papers. Association for Computational Linguistics, pages 57–60.

Dragos Stefan Munteanu and Daniel Marcu. 2007. ISI Arabic-English Automatically Extracted Parallel Text LDC2007T08. Web Download. Philadelphia: Linguistic Data Consortium.

Robert Parker et al. 2011. Arabic Gigaword Fifth Edition (LDC2011T11). Linguistic Data Consortium.

Raytheon BBN Technologies, Linguistic Data Consortium, and Sakhr Software. 2012. Arabic-Dialect/English Parallel Text (LDC2012T09). Linguistic Data Consortium.

Wael Salloum and Nizar Habash. 2013. Dialectal Arabic to English machine translation: Pivoting through Modern Standard Arabic. In HLT-NAACL, pages 348–358.

Hassan Sawaf. 2010. Arabic dialect handling in hybrid machine translation. In Proceedings of the Conference of the Association for Machine Translation in the Americas (AMTA), Denver, Colorado.

Sofia. 2013. News Crawl (articles from 2011 and 2012). Web Download. Shared Task: Machine Translation.

Abdelhadi Soudi, Ali Farghaly, Günter Neumann, and Rabih Zbib. 2012. Challenges for Arabic Machine Translation. Natural Language Processing 9. John Benjamins.

Rabih Zbib, Erika Malchiodi, Jacob Devlin, David Stallard, Spyros Matsoukas, Richard Schwartz, John Makhoul, Omar F Zaidan, and Chris Callison-Burch. 2012. Machine translation of Arabic dialects. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, pages 49–59.
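
Illustrative sketch (not part of the original paper). The two-pass threshold search of Section 3.1 and the routing rules of Sections 3.1 and 3.2 can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the function names (`accuracy`, `grid`, `two_pass_search`, `route`) and the toy scores are hypothetical, the scores stand in for the MSA language-model log likelihoods (the paper does not specify how they are normalized), and the fine pass is generalized to one unit either side of the coarse optimum, which reproduces the -3.0 to -5.0 range when the coarse optimum is -4.0.

```python
from typing import List, Optional, Sequence, Tuple


def accuracy(scores: Sequence[float], is_msa: Sequence[bool], threshold: float) -> float:
    """Fraction of sentences classified correctly when score > threshold means MSA."""
    correct = sum((s > threshold) == label for s, label in zip(scores, is_msa))
    return correct / len(scores)


def grid(start: float, stop: float, step: float) -> List[float]:
    """Descending grid start, start - step, ..., stop (inclusive).

    Values are built from an integer index and rounded to limit float drift.
    """
    n = round((start - stop) / step)
    return [round(start - i * step, 10) for i in range(n + 1)]


def two_pass_search(scores: Sequence[float], is_msa: Sequence[bool]) -> Tuple[float, float]:
    """Coarse pass (0.0 to -10.0, step 1.0), then fine pass (step 0.1) around its optimum."""
    coarse = max(grid(0.0, -10.0, 1.0), key=lambda t: accuracy(scores, is_msa, t))
    fine = max(grid(coarse + 1.0, coarse - 1.0, 0.1),
               key=lambda t: accuracy(scores, is_msa, t))
    return coarse, fine


def route(score: float, threshold: float = -3.7,
          window: Optional[Tuple[float, float]] = None) -> str:
    """Pick the translation model for one sentence score.

    Without a window: two classes (Section 3.1).
    With window=(upper, lower), e.g. (-2.7, -5.45): three classes (Section 3.2).
    """
    if window is not None:
        upper, lower = window
        if score > upper:
            return "msa"
        if score < lower:
            return "dialectal"
        return "hybrid"
    return "msa" if score > threshold else "dialectal"


# Toy evaluation data (hypothetical scores, not taken from the paper).
toy_scores = [-1.2, -2.5, -3.15, -3.65, -3.75, -4.4, -5.9, -7.0]
toy_is_msa = [True, True, True, True, False, False, False, False]

coarse_t, fine_t = two_pass_search(toy_scores, toy_is_msa)
print("coarse threshold:", coarse_t, "fine threshold:", fine_t)
print(route(-4.0), route(-4.0, window=(-2.7, -5.45)))
```

When several thresholds tie on accuracy, `max` keeps the first one on the descending grid, i.e. the highest threshold; the paper does not say how ties were broken, so this is a design choice of the sketch.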
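
Illustrative sketch (not part of the original paper). The relative BLEU changes quoted in Sections 2 and 4 follow from the scores in Table 1 and Table 2 as (score - baseline) / baseline. The short Python check below recomputes them; the function name `relative_change` and the variable names are hypothetical, while the BLEU values are copied from the two tables.

```python
def relative_change(score: float, baseline: float) -> float:
    """Relative change of `score` with respect to `baseline`, in percent."""
    return (score - baseline) / baseline * 100.0


# BLEU scores copied from Table 1 and Table 2.
comparisons = [
    ("hybrid vs. MSA model on MSA data", 33.2, 34.8),
    ("hybrid vs. dialectal model on dialectal data", 12.3, 15.9),
    ("classifier-based 1 vs. hybrid on MSA+dialectal", 29.1, 22.8),
    ("classifier-based 2 vs. hybrid on MSA+dialectal", 29.0, 22.8),
]

for label, score, baseline in comparisons:
    print(f"{label}: {relative_change(score, baseline):+.1f}%")
```

Rounded to one decimal place, the four values come out as -4.6%, -22.6%, +27.6% and +27.2%, matching the figures reported in the text.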