Making Headlines in Hindi: Automatic English to Hindi News Headline Translation

Aditya Joshi (1,2), Kashyap Popat (2), Shubham Gautam (2), Pushpak Bhattacharyya (2)
(1) IITB-Monash Research Academy, IIT Bombay
(2) Dept. of Computer Science and Engineering, IIT Bombay
{adityaj,kashyap,shubhamg,pb}@cse.iitb.ac.in

Abstract

News headlines exhibit stylistic peculiarities. The goal of our translation engine ‘Making Headlines in Hindi’ is to achieve automatic translation of English news headlines to Hindi while retaining the Hindi news headline styles. There are two central modules of our engine: the modified translation unit based on Moses and a co-occurrence-based post-processing unit. The modified translation unit provides two machine translation (MT) models: phrase-based and factor-based (both using in-domain data). In addition, a co-occurrence-based post-processing option may be turned on by a user. Our evaluation shows that this engine handles some linguistic phenomena observed in Hindi news headlines.

1   Introduction

‘Making Headlines in Hindi’ is a web-based translation engine for English to Hindi news headline translation. Hindi (https://en.wikipedia.org/wiki/Hindi) is a widely spoken Indian language and has several news publications. The aim of our translation engine is to translate English news headlines to Hindi, preserving the content as well as the Hindi news headline structure to the extent possible. The engine is based on Moses (http://www.statmt.org/moses/) and has two central parts: a modified translation unit and a co-occurrence-based post-processing unit. The modified translation unit consists of phrase-based MT (Koehn et al., 2003) and factor-based MT (Koehn et al., 2007). The automatic post-processing module performs co-occurrence-based replacement for correct sense translation of words by replacing the translation of a word with the most frequently co-occurring translation candidate.

This paper is organized as follows. Section 2 presents challenges of translating news headlines. Section 3 describes the UI layout. Section 4 discusses technical details of the modified translation unit, while Section 5 describes the post-processing module that uses co-occurrence-based replacement of words. Finally, Section 6 presents an evaluation of the engine, while Section 7 concludes our work.

2   Challenges of News Headline Translation

Hindi news headlines have stylistic features that pose challenges to translation as follows:

  1. S-V-O order: Hindi news headlines often follow the S-V-O order as opposed to S-O-V as commonly seen in Hindi sentences. A common news headline is ‘ab tihaaD jel mein biskooT banayenge chauTala’ (Now Chautala will make biscuits in Tihar jail), where the verb ‘banayenge’ (will make) precedes the object ‘chauTala’ (Chautala).

  2. Numbers for people: Use of numbers to indicate a group of people, as in English news headlines, is also common in Hindi news headlines. For example, the word ‘Five’ in ‘Five held for molesting woman’ stands for five people.

  3. Preferred choice of words: Words that are commonly used in news headlines are often different from accurate translations. For example, ‘RBI’ (abbreviation for ‘Reserve Bank of India’) is common in English news headlines; however, instead of using its transliterated form, news headlines tend to translate it to
     ‘rizarv bank’ (Reserve Bank) in Hindi news headlines.

  4. Missing verbs: Often, verbs are also dropped, as in the case of ‘mahakumbh mein ajab-gajab santon kii bheeD’ (Herds of fascinating saints in Mahakumbh (fair)), where a form of the word ‘be’ has been dropped.

3   UI Layout

The interface of the engine is divided into two vertical blocks for clarity: one for input and another for output. The input to the translation engine consists of:

  (a) Text area for English news headline(s),

  (b) Option to select Phrase-based v/s Factor-based model,

  (c) Checkboxes for co-occurrence-based replacement, transliteration for OOVs and displaying the alignment table for the output: each of these options can be turned on/off.

While one out of the two options in (b) must be selected, the check-boxes in (c) are optional. Each of the components stated above is described in Section 4.

The output consists of:

  (a) The best five translations obtained in Hindi

  (b) A color-coded alignment table, if the option to display the alignment table is selected: this helps to understand how each word got translated and then reordered.

  (c) Time taken for translation

Figure 1 shows a snapshot of the UI. Moses-Baseline indicates the naive translation engine while Moses-MLM-Dict is the modified phrase model.

Figure 1: Making Headlines in Hindi: Snapshot of Output

4   Modified Translation Unit

We implemented two translation models: phrase-based and factor-based. The training corpus consisted of parallel corpora obtained from (a) Gyan-nidhi (http://www.cdacnoida.in/snlp/digital library/gyan nidhi.asp), consisting of 2,27,123 sentences, and (b) Mahashabdkosh (http://www.e-mahashabdkosh.cdac.in/), consisting of 46,825 judicial sentences. To transliterate out-of-vocabulary words, we modified the transliteration engine provided by Chinnakotla et al. (2010). The original transliteration engine was trained for Hindi to English transliteration. For the purpose of our engine, we re-trained this model for English to Hindi transliteration. This section describes each of these components.

4.1   Phrase-based Model

The phrase-based MT model was trained using Moses (Koehn et al., 2007). In order to improve the quality of translation, we modify different components of the model in two ways. To preserve sentence order, we use a modified language model: a language model trained using in-domain data consisting of 20,220 news headlines from the BBC Hindi website (http://www.bbc.co.uk/hindi/) and 2,02,335 news headlines from Dainik Bhaskar (http://www.bhaskar.com/) archives of 2010 and 2011. The fact that this modified language model is a better fit to the target data is highlighted by the perplexity value obtained using the SRILM toolkit (Stolcke, 2002). For bi-grams, the perplexity of the Dainik Bhaskar corpus with a test news headline corpus was 434.06, while the perplexity of a corpus consisting of tourism documents was 1205.58. A similar trend was observed in the case of tri-grams. To enrich the translation mapping table available, we added a bilingual dictionary to the parallel corpus used for training the translation
model. This bilingual dictionary was downloaded from CFILT, IIT Bombay (http://www.cfilt.iitb.ac.in). This dictionary contains a total of 1,28,240 mappings and includes words as well as phrases. The fact that this dictionary enriches translations is observed in the case of a news headline containing the word ‘catch-22’. This word does not occur in the parallel news headlines. However, it gets correctly translated to ‘jaTil’ according to the entry in the dictionary.

4.2   Factor-based Model

Our factor-based MT model uses a set of factors along with words for translation. The factors used on the source and target side are as follows.

1) On the source side, we use POS, lemma, tense and number. The POS tags are obtained from the Stanford POS tagger (http://www-nlp.stanford.edu/software/tagger.shtml), while the lemmas are obtained from the MIT Wordnet stemmer (http://projects.csail.mit.edu/jwi/api/edu/mit/jwi/morph/WordnetStemmer.html). Tense and number are derived from POS tags.

2) On the target side, we use the CFILT hybrid POS tagger (http://www.cfilt.iitb.ac.in/Tools.html) to obtain POS tags.

The factors are combined using options available in Moses. The lemma, tense and number on the source side generate the translated word on the target side. On the target side, words generate POS features. By generating the best possible translations using a POS-based target language model, we hope to obtain translations in a POS order best suited to the news headline domain.

5   Post-processing: Co-occurrence-based Replacement

The engine provides an optional co-occurrence-based replacement strategy to post-process the output. A manual evaluation showed that 14 out of 50 headlines were incorrect because of an incorrect sense of one or more words. To overcome this problem, we implemented a post-processing strategy that automatically edits the output obtained from the MT model using co-occurrence statistics as found in the in-domain news headline corpus. To elaborate how this works, consider the English news headline ‘crpf jawan held on molestation charge’. The translation obtained was ‘crpf jawaan par aayojit utpiDan chaarj’ (molestation charge organized on crpf jawan). The word ‘held’ gets translated to ‘aayojit’ (organized/conducted) as opposed to ‘giraftar’ (arrested). The language model relies on n-grams and hence does not take into account the correct sense of words in cases where the words do not occur together. For this purpose, we implemented a post-processing strategy that considers co-occurrence statistics of a target word with all other words in the sentence to find the best sense translation. In the case of the above example, using the co-occurrences in a news headline corpus, we select the sense of ‘held’ in Hindi which occurs most frequently with the other words and replace the word with this translation. We do not consider co-occurrence statistics for function words. We understand that the above strategy does not work in the case of inflected forms of words in Hindi.

6   Evaluation

We evaluated the engine using a test set of 787 headlines downloaded from the website of a popular English daily, The Hindu (www.thehindu.com), and manually translated into Hindi by native speakers. A BLEU score of 13.40 is obtained for phrase-based MT and 5.73 for factor-based MT. In order to understand how the engine performs for different kinds of linguistic phenomena, we also performed a qualitative evaluation of the output. The following are examples of output from our engine. They handle different linguistic phenomena as follows:

  1. Ambiguity:
       Input: Industrialist remembered
       Output: udyogpati ko yaad kiyaa (Industrialist remembered)
     The input sentence in domains other than news headlines is ambiguous, as it could mean that either an industrialist was remembered or an industrialist remembered something. In the case of a news headline, however, the former holds true. This is correctly reflected in the Hindi translation.

  2. S-V-O order:
       Input: Now Jaganmohan will make biscuits in jail
       Output: ab jaganmohan karenge biskoot jel mein (Now Jaganmohan will do biscuits in jail)
     The verb ‘will do’ gets placed correctly in the target sentence, thus preserving the verb order. However, the translation ‘karenge’ (will do) is incorrect and must be ‘banayenge’ (will make).

  3. Numbers for people:
       Input: Five killed in bomb blast
       Output: paanch bum visfot mein maare gaye (Five killed in bomb blast)
     The output sentence is a perfect translation and correctly translates ‘five’ as ‘paanch’. However, the news headline order is not retained in this case.

  4. Missing verbs:
       Input: Veteran journalist dead
       Output: diggaj patrakaar mrut (Veteran journalist dead)
     The output sentence is a perfect translation although a form of ‘be’ is absent in the source sentence.

  5. Translation of idioms:
       Input: Croatia and Serbia bury the hatchet
       Output: kroatia aur serbia jhagDa khatam karna (Croatia and Serbia do-end-quarrel)
     The idiom ‘bury the hatchet’ gets correctly translated to ‘jhagDa khatam karna’ (to end a quarrel) as a complete entity. This is a direct mapping from the bilingual dictionary and does not have the correct inflection.

  6. Sense correction due to co-occurrence-based replacement:
       Input: No hike in AMU tuition fees
       Moses-MLM-Dict: amu adhyaapan fees mein koi pad-yaatra (hike (trek) in AMU tuition fees)
       Moses-CoOcc: amu shikshan fees mein koi vriddhi (hike (increase) in AMU tuition fees)
     We observe that our post-processing unit improves the output in some cases. The original output translates ‘hike’ as ‘pad-yaatra’ (hike, in the sense of a trek). The co-occurrence-based replacement unit identifies and corrects the sense to ‘vriddhi’ (increase). We understand that the ‘no’ gets missed out in the translation.

7   Conclusion & Future Work

We presented ‘Making Headlines in Hindi’, a translation engine that aims to translate English news headlines to Hindi while preserving news headline styles in the target language. Our engine includes a phrase-based model and a factor-based model. The phrase-based model uses an in-domain language model and a bilingual dictionary. The factor-based model uses factors like POS, lemma, tense and number. In addition, we also described our post-processing strategy that performs co-occurrence-based replacement of words to obtain the correct sense of target language words. An evaluation of the output of our translation engine shows that it performs well for many linguistic styles used in Hindi news headlines.

The co-occurrence-based strategy is naive. As future work, the co-occurrence-based strategy can be improved to incorporate inflections of words. Also, other approaches to improve translation quality may be considered.

References

Manoj Kumar Chinnakotla, Om P. Damani and Avijit Satoskar. 2010. Transliteration for Resource-Scarce Languages. Proc. ACM Trans. Asian Lang. Inf. Process.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, Evan Herbst. 2007. Moses: Open Source Toolkit for Statistical Machine Translation. Proc. of ACL 2007, demonstration session, Prague, Czech Republic.

Philipp Koehn and Hieu Hoang. 2007. Factored Translation Models. Proc. of EMNLP-CoNLL 2007, Prague, Czech Republic.

Philipp Koehn, Franz Josef Och and Daniel Marcu. 2003. Statistical phrase-based translation. Proc. of NAACL 2003, Edmonton, Canada.

A. Stolcke. 2002. SRILM - An extensible language modeling toolkit. Proc. International Conference on Spoken Language Processing, vol. 2.
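
As an illustration of the perplexity comparison reported in Section 4.1, the following is a minimal sketch of how bigram perplexity can be computed for a test set under a language model trained on a given corpus. It is not the SRILM implementation: add-one smoothing, the `<unk>` handling, the function names `train_bigram` and `perplexity`, and the toy English sentences are all assumptions made for this example. The only point it shows is that a model trained on headline-like text assigns lower perplexity to headline-like test data than to out-of-domain text.

```python
import math
from collections import Counter


def train_bigram(sentences):
    """Collect history and bigram counts from whitespace-tokenised sentences."""
    histories, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        histories.update(tokens[:-1])            # every token that precedes another
        bigrams.update(zip(tokens, tokens[1:]))  # adjacent token pairs
    # Predictable vocabulary: training words plus end marker and unknown token.
    vocab = {w for s in sentences for w in s.split()} | {"</s>", "<unk>"}
    return histories, bigrams, vocab


def perplexity(model, sentences):
    """Bigram perplexity with add-one smoothing (an assumption of this sketch)."""
    histories, bigrams, vocab = model
    log_prob, n_events = 0.0, 0
    for sentence in sentences:
        words = [w if w in vocab else "<unk>" for w in sentence.split()]
        tokens = ["<s>"] + words + ["</s>"]
        for prev, cur in zip(tokens, tokens[1:]):
            p = (bigrams[(prev, cur)] + 1) / (histories[prev] + len(vocab))
            log_prob += math.log(p)
            n_events += 1
    return math.exp(-log_prob / n_events)


# Hypothetical toy data standing in for the headline and tourism corpora.
headlines = ["five held for theft", "two held for fraud", "five killed in blast"]
model = train_bigram(headlines)
pp_in = perplexity(model, ["two killed in blast"])
pp_out = perplexity(model, ["the museum opens daily for visitors"])
print(pp_in, pp_out)
```

With real data, the same comparison would be run with a toolkit such as SRILM, whose smoothing differs from the add-one scheme assumed here, so absolute values would not match those reported in the paper.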
                                    
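
The co-occurrence-based replacement described in Section 5 can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the engine's actual code: the source does not say where the alternative translations of a word come from, so the `candidates` mapping is hypothetical, as are the function names, the romanized toy headlines and the single function word. Each replaceable word is scored by how often it co-occurs, within headlines of the in-domain corpus, with the other content words of the output, and the highest-scoring candidate is kept.

```python
from collections import Counter
from itertools import combinations


def build_cooccurrence(headlines):
    """Count how often two distinct words appear in the same headline."""
    counts = Counter()
    for headline in headlines:
        words = sorted(set(headline.split()))
        for a, b in combinations(words, 2):
            counts[(a, b)] += 1
    return counts


def cooccurrence(counts, a, b):
    """Symmetric lookup: pairs are stored in sorted order."""
    return counts[(a, b) if a <= b else (b, a)]


def replace_by_cooccurrence(output_words, candidates, counts,
                            function_words=frozenset()):
    """Swap a word for the alternative that co-occurs most with its context."""
    result = list(output_words)
    for i, word in enumerate(output_words):
        options = candidates.get(word)
        if not options or word in function_words:
            continue
        # Function words are ignored when gathering co-occurrence evidence.
        context = [w for j, w in enumerate(output_words)
                   if j != i and w not in function_words]

        def score(option):
            return sum(cooccurrence(counts, option, c) for c in context)

        # The original word is listed first, so ties keep it unchanged.
        alternatives = [word] + [o for o in options if o != word]
        result[i] = max(alternatives, key=score)
    return result


# Hypothetical romanized in-domain headlines and candidate senses.
corpus = [
    "jawaan giraftar",
    "chaarj par jawaan giraftar",
    "utpiDan chaarj par giraftar",
    "baithak aayojit",
]
counts = build_cooccurrence(corpus)
candidates = {"aayojit": ["giraftar"]}  # alternative senses of 'held'
mt_output = "crpf jawaan par aayojit utpiDan chaarj".split()
fixed = replace_by_cooccurrence(mt_output, candidates, counts,
                                function_words=frozenset({"par"}))
print(" ".join(fixed))
```

Because the lookup works on exact surface forms, this sketch shares the limitation noted in Section 5: inflected forms of the same Hindi word are counted as unrelated words.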