Making Headlines in Hindi: Automatic English to Hindi News Headline Translation

Aditya Joshi (1,2), Kashyap Popat (2), Shubham Gautam (2), Pushpak Bhattacharyya (2)
(1) IITB-Monash Research Academy, IIT Bombay
(2) Dept. of Computer Science and Engineering, IIT Bombay
{adityaj,kashyap,shubhamg,pb}@cse.iitb.ac.in

Abstract

News headlines exhibit stylistic peculiarities. The goal of our translation engine ‘Making Headlines in Hindi’ is to achieve automatic translation of English news headlines to Hindi while retaining the Hindi news headline styles. There are two central modules of our engine: the modified translation unit based on Moses and a co-occurrence-based post-processing unit. The modified translation unit provides two machine translation (MT) models: phrase-based and factor-based (both using in-domain data). In addition, a co-occurrence-based post-processing option may be turned on by a user. Our evaluation shows that this engine handles some linguistic phenomena observed in Hindi news headlines.

1 Introduction

‘Making Headlines in Hindi’ is a web-based translation engine for English to Hindi news headline translation. Hindi (https://en.wikipedia.org/wiki/Hindi) is a widely spoken Indian language and has several news publications. The aim of our translation engine is to translate English news headlines to Hindi, preserving the content as well as the Hindi news headline structure to the extent possible. The engine is based on Moses (http://www.statmt.org/moses/) and has two central parts: a modified translation unit and a co-occurrence-based post-processing unit. The modified translation unit consists of phrase-based MT (Koehn et al., 2003) and factor-based MT (Koehn et al., 2007). The automatic post-processing module performs co-occurrence-based replacement for correct sense translation of words by replacing the translation of a word with the most frequently co-occurring translation candidate.

This paper is organized as follows. Section 2 presents challenges of translating news headlines. Section 3 describes the UI layout. Section 4 discusses technical details of the modified translation unit, while Section 5 describes the post-processing module that uses co-occurrence-based replacement of words. Finally, Section 6 presents an evaluation of the engine, while Section 7 concludes our work.

2 Challenges of News Headline Translation

Hindi news headlines have stylistic features that pose challenges to translation as follows:

1. S-V-O order: Hindi news headlines often follow the S-V-O order as opposed to S-O-V as commonly seen in Hindi sentences. A common news headline is ‘ab tihaaD jel mein biskooT banayenge chauTala’ (Now Chautala will make biscuits in Tihar jail), where the verb ‘banayenge’ (will make) precedes the object ‘chauTala’ (Chautala).

2. Numbers for people: Use of numbers to indicate a group of people, as in the case of English news headlines, is also common in Hindi news headlines. For example, the word ‘Five’ in ‘Five held for molesting woman’ stands for five people.

3. Preferred choice of words: Words that are commonly used in news headlines are often different from accurate translations. For example, ‘RBI’ (abbreviation for ‘Reserve Bank of India’) is common in English news headlines. However, instead of using its transliterated form, Hindi news headlines tend to translate it to ‘rizarv bank’ (Reserve Bank).

4. Missing verbs: Often, verbs are also dropped, as in the case of ‘mahakumbh mein ajab-gajab santon kii bheeD’ (Herds of fascinating saints in Mahakumbh (fair)), where a form of the word ‘be’ has been dropped.

3 UI Layout

The interface of the engine is divided into two vertical blocks for clarity: one for input and another for output. The input to the translation engine consists of:

(a) Text area for English news headline(s),

(b) Option to select Phrase-based v/s Factor-based model,

(c) Checkboxes for co-occurrence based replacement, transliteration for OOVs and displaying alignment table for the output. Each of these options can be turned on/off.

While one out of the two options in (b) must be selected, the check-boxes in (c) are optional. Each of the components stated above is described in Section 4.

The output consists of:

(a) The best five translations obtained in Hindi

(b) A color-coded alignment table, in case the option to display the alignment table is selected. This helps to understand how each word got translated and then reordered.

(c) Time taken for translation

Figure 1 shows a snapshot of the UI. Moses-Baseline indicates the naive translation engine while Moses-MLM-Dict is the modified phrase model.

Figure 1: Making Headlines in Hindi: Snapshot of Output

4 Modified Translation Unit

We implemented two translation models: phrase-based and factor-based. The training corpus consisted of a parallel corpus obtained from (a) Gyan-nidhi (http://www.cdacnoida.in/snlp/digital_library/gyan_nidhi.asp), consisting of 2,27,123 sentences, and (b) Mahashabdkosh (http://www.e-mahashabdkosh.cdac.in/), consisting of 46,825 judicial sentences. To transliterate out-of-vocabulary words, we modified the transliteration engine provided by Chinnakotla et al. (2010). The original transliteration engine was trained for Hindi to English transliteration. For the purpose of our engine, we re-trained this model for English to Hindi transliteration. This section describes each of these components.

4.1 Phrase-based Model

The phrase-based MT model was trained using Moses (Koehn et al., 2007). In order to improve the quality of translation, we modify different components of the model in two ways.

To preserve sentence order, we use a modified language model, that is, a language model trained using in-domain data consisting of 20,220 news headlines from the BBC Hindi website (http://www.bbc.co.uk/hindi/) and 2,02,335 news headlines from the Dainik Bhaskar (http://www.bhaskar.com/) archives of 2010 and 2011. The fact that this modified language model is a better fit to the target data is highlighted by the perplexity value obtained using the SRILM toolkit (Stolcke, 2002). For bi-grams, the perplexity of the Dainik Bhaskar corpus with a test news headline corpus was 434.06, while the perplexity of a corpus consisting of tourism documents was 1205.58. A similar trend was observed in the case of tri-grams.

To enrich the translation mapping table available, we added a bilingual dictionary to the parallel corpus used for training the translation model. This bilingual dictionary was downloaded from CFILT, IIT Bombay (http://www.cfilt.iitb.ac.in). This dictionary contains a total of 1,28,240 mappings and includes words as well as phrases. The fact that this dictionary enriches translations is observed in the case of a news headline containing the word ‘catch-22’. This word does not occur in the parallel news headlines. However, it gets correctly translated to ‘jaTil’ according to the entry in the dictionary.

4.2 Factor-based Model

Our factor-based MT model uses a set of factors along with words for translation. The factors used on the source and target side are as follows.

1) On the source side, we use POS, lemma, tense and number. The POS tags are obtained from the Stanford POS tagger (http://www-nlp.stanford.edu/software/tagger.shtml) while the lemmas are obtained from the MIT Wordnet stemmer (http://projects.csail.mit.edu/jwi/api/edu/mit/jwi/morph/WordnetStemmer.html). Tense and number are derived from POS tags.

2) On the target side, we use the CFILT hybrid POS tagger (http://www.cfilt.iitb.ac.in/Tools.html) to obtain POS tags.

The factors are combined using options available in Moses. The lemma, tense and number on the source side generate the translated word on the target side. On the target side, words generate POS features. By generating the best possible translations using a POS-based target language model, we hope to obtain translations in a POS order best suited to the news headline domain.

5 Post-processing: Co-occurrence-based Replacement

The engine provides an optional co-occurrence-based replacement strategy to post-process the output. A manual evaluation showed that 14 out of 50 headlines were incorrect because of the incorrect sense of one or more words. To overcome this problem, we implemented a post-processing strategy that automatically edits the output obtained from the MT model using co-occurrence statistics as found in the in-domain news headline corpus.

To elaborate how this works, consider the English news headline ‘crpf jawan held on molestation charge’. The translation obtained was ‘crpf jawaan par aayojit utpiDan chaarj’ (molestation charge organized on crpf jawan). The word ‘held’ gets translated to ‘aayojit’ (organized/conducted) as opposed to ‘giraftar’ (arrested). The language model relies on n-grams and hence does not take into account the correct sense of words in cases where the words do not occur together. For this purpose, our post-processing strategy considers co-occurrence statistics of a target word with all other words in the sentence to find the best sense translation. In the case of the above example, using the co-occurrences in a news headline corpus, we select the sense of ‘held’ in Hindi which occurs most frequently with the other words and replace the word with this translation. We do not consider co-occurrence statistics for function words. We understand that the above strategy does not work in the case of inflected forms of words in Hindi.

6 Evaluation

We evaluated the engine using a test set of 787 headlines downloaded from the website of a popular English daily, The Hindu (www.thehindu.com), and manually translated into Hindi by native speakers. A BLEU score of 13.40 is obtained for phrase-based MT and 5.73 for factor-based MT. In order to understand how the engine performs for different kinds of linguistic phenomena, we also performed a qualitative evaluation of the output. The following are examples of output from our engine. They handle different linguistic phenomena as follows:

1. Ambiguity:
Input: Industrialist remembered
Output: udyogpati ko yaad kiyaa (Industrialist remembered)
The input sentence in domains other than news headlines is ambiguous, as it could mean that either an industrialist was remembered or an industrialist remembered something. In the case of a news headline, however, the former holds true. This is correctly reflected in the Hindi translation.

2. S-V-O order:
Input: Now Jaganmohan will make biscuits in jail
Output: ab jaganmohan karenge biskoot jel mein (Now Jaganmohan will do biscuits in jail)
The verb ‘will do’ gets placed correctly in the target sentence, thus preserving the verb order. However, the translation ‘karenge’ (will do) is incorrect and must be ‘banayenge’ (will make).

3. Numbers for people:
Input: Five killed in bomb blast
Output: paanch bum visfot mein maare gaye (Five killed in bomb blast)
The output sentence is a perfect translation and correctly translates ‘five’ as ‘paanch’. However, the news headline order is not retained in this case.

4. Missing verbs:
Input: Veteran journalist dead
Output: diggaj patrakaar mrut (Veteran journalist dead)
The output sentence is a perfect translation although a form of ‘be’ is absent in the source sentence.

5. Translation of idioms:
Input: Croatia and Serbia bury the hatchet
Output: kroatia aur serbia jhagDa khatam karna (Croatia and Serbia do-end-quarrel)
The idiom ‘bury the hatchet’ gets correctly translated to ‘jhagDa khatam karna’ (to end a quarrel) as a complete entity. This is a direct mapping from the bilingual dictionary and does not have the correct inflection.

6. Sense correction due to co-occurrence based replacement:
Input: No hike in AMU tuition fees
Moses-MLM-Dict: amu adhyaapan fees mein koi pad-yaatra (hike (trek) in AMU tuition fees)
Moses-CoOcc: amu shikshan fees mein koi vriddhi (hike (increase) in AMU tuition fees)
We observe that our post-processing unit improves the output in some cases. The original output translates ‘hike’ as ‘pad-yaatra’ (hike, in the sense of a trek). The co-occurrence-based replacement unit identifies and corrects the sense to ‘vriddhi’ (increase). We understand that the ‘no’ gets missed out in the translation.

7 Conclusion & Future Work

We presented ‘Making Headlines in Hindi’, a translation engine that aims to translate English news headlines to Hindi while preserving news headline styles in the target language. Our engine includes a phrase-based model and a factor-based model. The phrase-based model uses an in-domain language model and a bilingual dictionary. The factor-based model uses factors like POS, lemma, tense and number. In addition, we also described our post-processing strategy that performs co-occurrence-based replacement of words to obtain the correct sense of target language words. An evaluation of the output of our translation engine shows that it performs well for many linguistic styles used in Hindi news headlines.

The co-occurrence-based strategy is naive. As future work, the co-occurrence-based strategy can be improved to incorporate inflections of words. Also, other approaches to improve translation quality may be considered.

References

Manoj Kumar Chinnakotla, Om P. Damani and Avijit Satoskar. 2010. Transliteration for Resource-Scarce Languages. Proc. ACM Trans. Asian Lang. Inf. Process.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin and Evan Herbst. 2007. Moses: Open Source Toolkit for Statistical Machine Translation. Proc. of ACL 2007, demonstration session, Prague, Czech Republic.

Philipp Koehn and Hieu Hoang. 2007. Factored Translation Models. Proc. of EMNLP-CoNLL 2007, Prague, Czech Republic.

Philipp Koehn, Franz Josef Och and Daniel Marcu. 2003. Statistical phrase-based translation. Proc. of NAACL 2003, Edmonton, Canada.

A. Stolcke. 2002. SRILM - An extensible language modeling toolkit. Proc. International Conference on Spoken Language Processing, vol. 2.
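
As an illustration of the co-occurrence-based replacement described in Section 5, the following is a minimal sketch, not the engine's actual implementation. It assumes that candidate sense translations for an output word are available as a simple mapping (the paper does not say where candidates come from), and it uses a tiny invented corpus of romanized headlines. All function names, the candidate mapping and the toy corpus are hypothetical.

```python
from collections import Counter
from itertools import combinations


def build_cooccurrence(headlines):
    """Count, for each unordered word pair, the headlines containing both words."""
    counts = Counter()
    for headline in headlines:
        words = sorted(set(headline.split()))
        for a, b in combinations(words, 2):
            counts[(a, b)] += 1
    return counts


def cooccurrence(counts, a, b):
    """Look up the count for an unordered pair of words."""
    return counts[tuple(sorted((a, b)))]


def replace_senses(output_words, candidates, counts, function_words=frozenset()):
    """Replace each word by the candidate that co-occurs most with the rest.

    `candidates` maps an output word to alternative translations (assumed
    input). Function words are neither replaced nor used as context. On a
    tie, the original word is kept because it is listed first.
    """
    result = []
    for i, word in enumerate(output_words):
        alternatives = candidates.get(word)
        if not alternatives or word in function_words:
            result.append(word)
            continue
        context = [w for j, w in enumerate(output_words)
                   if j != i and w not in function_words]
        options = [word] + list(alternatives)
        best = max(options,
                   key=lambda c: sum(cooccurrence(counts, c, w) for w in context))
        result.append(best)
    return result


# Toy in-domain corpus (invented for illustration only).
corpus = [
    "jawaan giraftar",
    "utpiDan chaarj par giraftar",
    "crpf jawaan giraftar",
    "baithak aayojit",
]
counts = build_cooccurrence(corpus)

mt_output = "crpf jawaan par aayojit utpiDan chaarj".split()
candidates = {"aayojit": ["giraftar"]}  # hypothetical sense candidates for 'held'
fixed = replace_senses(mt_output, candidates, counts, function_words={"par"})
print(" ".join(fixed))
```

In this toy setting ‘giraftar’ co-occurs with the context words ‘crpf’, ‘jawaan’, ‘utpiDan’ and ‘chaarj’ while ‘aayojit’ does not, so ‘aayojit’ is replaced. Because the lookup is on surface forms, an inflected variant of a word would not match, which mirrors the limitation noted in Section 5.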