Making Headlines in Hindi: Automatic English to Hindi News Headline Translation

Aditya Joshi (1,2), Kashyap Popat (2), Shubham Gautam (2), Pushpak Bhattacharyya (2)
(1) IITB-Monash Research Academy, IIT Bombay
(2) Dept. of Computer Science and Engineering, IIT Bombay
{adityaj,kashyap,shubhamg,pb}@cse.iitb.ac.in

Abstract

News headlines exhibit stylistic peculiarities. The goal of our translation engine ‘Making Headlines in Hindi’ is to achieve automatic translation of English news headlines to Hindi while retaining the Hindi news headline styles. There are two central modules of our engine: the modified translation unit based on Moses and a co-occurrence-based post-processing unit. The modified translation unit provides two machine translation (MT) models: phrase-based and factor-based (both using in-domain data). In addition, a co-occurrence-based post-processing option may be turned on by a user. Our evaluation shows that this engine handles some linguistic phenomena observed in Hindi news headlines.

1 Introduction

‘Making Headlines in Hindi’ is a web-based translation engine for English to Hindi news headline translation. Hindi [1] is a widely spoken Indian language and has several news publications. The aim of our translation engine is to translate English news headlines to Hindi, preserving the content as well as the Hindi news headline structure to the extent possible. The engine is based on Moses [2] and has two central parts: a modified translation unit and a co-occurrence-based post-processing unit. The modified translation unit consists of phrase-based MT (Koehn et al., 2003) and factor-based MT (Koehn and Hoang, 2007). The automatic post-processing module performs co-occurrence-based replacement for correct sense translation of words by replacing the translation of a word with the most frequently co-occurring translation candidate. This paper is organized as follows. Section 2 presents the challenges of translating news headlines. Section 3 describes the UI layout. Section 4 discusses technical details of the modified translation unit, while Section 5 describes the post-processing module that uses co-occurrence-based replacement of words. Finally, Section 6 presents an evaluation of the engine, while Section 7 concludes our work.

[1] https://en.wikipedia.org/wiki/Hindi
[2] http://www.statmt.org/moses/

2 Challenges of News Headline Translation

Hindi news headlines have stylistic features that pose challenges to translation as follows:

1. S-V-O order: Hindi news headlines often follow the S-V-O order as opposed to S-O-V as commonly seen in Hindi sentences. A common news headline is ‘ab tihaaD jel mein biskooT banayenge chauTala’ (Now Chautala will make biscuits in Tihar jail), where the verb ‘banayenge’ (will make) precedes the object ‘chauTala’ (Chautala).

2. Numbers for people: Use of numbers to indicate a group of people, like in the case of English news headlines, is also common in Hindi news headlines. For example, the word ‘Five’ in ‘Five held for molesting woman’ stands for five people.

3. Preferred choice of words: Words that are commonly used in news headlines are often different from accurate translations. For example, ‘RBI’ (abbreviation for ‘Reserve Bank of India’) is common in English news headlines - however, instead of using its transliterated form, news headlines tend to
translate it to ‘rizarv bank’ (Reserve Bank) in Hindi news headlines.

4. Missing verbs: Often, verbs are also dropped, as in the case of ‘mahakumbh mein ajab-gajab santon kii bheeD’ (Herds of fascinating saints in Mahakumbh (fair)), where a form of the word ‘be’ has been dropped.

Figure 1: Making Headlines in Hindi: Snapshot of Output

3 UI Layout

The interface of the engine is divided into two vertical blocks for clarity: one for input and another for output. The input to the translation engine consists of:

(a) Text area for English news headline(s),

(b) Option to select Phrase-based v/s Factor-based model,

(c) Checkboxes for co-occurrence-based replacement, transliteration for OOVs and displaying alignment table for the output: Each of these options can be turned on/off.

While one out of the two options in (b) must be selected, check-boxes in (c) are optional. Each of the components stated above is described in Section 4.

The output consists of:

(a) The best five translations obtained in Hindi

(b) A color-coded alignment table, in case the option to display the alignment table is selected: This helps to understand how each word got translated and then reordered.

(c) Time taken for translation

Figure 1 shows a snapshot of the UI. Moses-Baseline indicates the naive translation engine while Moses-MLM-Dict is the modified phrase model.

4 Modified Translation Unit

We implemented two translation models: phrase-based and factor-based. The training corpus consisted of parallel corpus obtained from (a) Gyan-nidhi [3] consisting of 2,27,123 sentences and (b) Mahashabdkosh [4] consisting of 46,825 judicial sentences. To transliterate out-of-vocabulary words, we modified the transliteration engine provided by Chinnakotla et al. (2010). The original transliteration engine was trained for Hindi to English transliteration. For the purpose of our engine, we re-trained this model for English to Hindi transliteration. This section describes each of these components.

[3] http://www.cdacnoida.in/snlp/digital library/gyan nidhi.asp
[4] http://www.e-mahashabdkosh.cdac.in/

4.1 Phrase-based Model

The phrase-based MT model was trained using Moses (Koehn et al., 2007). In order to improve the quality of translation, we modify different components of the model in two ways. To preserve sentence order, we use a modified language model - a language model trained using in-domain data consisting of 20,220 news headlines from BBC Hindi website [5] and 2,02,335 news headlines from Dainik Bhaskar [6] archives of 2010 and 2011. The fact that this modified language model is a better fit to the target data is highlighted by the perplexity value obtained using the SRILM toolkit (Stolcke, 2002). For bi-grams, the perplexity of the Dainik Bhaskar corpus with a test news headline corpus was 434.06 while the perplexity of a corpus consisting of tourism documents was 1205.58. A similar trend was observed in the case of tri-grams.

[5] http://www.bbc.co.uk/hindi/
[6] http://www.bhaskar.com/

To enrich the translation mapping table available, we added a bilingual dictionary to the parallel corpus used for training the translation
model. This bilingual dictionary was downloaded from CFILT, IIT Bombay [7]. This dictionary contains a total of 1,28,240 mappings and includes words as well as phrases. The fact that this dictionary enriches translations is observed in the case of a news headline containing the word ‘catch-22’. This word does not occur in the parallel news headlines. However, it gets correctly translated to ‘jaTil’ according to the entry in the dictionary.

[7] http://www.cfilt.iitb.ac.in

4.2 Factor-based Model

Our factor-based MT model uses a set of factors along with words for translation. The factors used on source and target side are as follows.

1) On the source side, we use POS, lemma, tense and number. The POS tags are obtained from Stanford POS tagger [8] while the lemmas are obtained from MIT Wordnet stemmer [9]. Tense and number are derived from POS tags.

2) On the target side, we use CFILT hybrid POS tagger [10] to obtain POS tags.

The factors are combined using options available in Moses. The lemma, tense and number on the source side generate the translated word on the target side. On the target side, words generate POS features. By generating best possible translations using a POS-based target language model, we hope to obtain translations in a POS order best suited to the news headline domain.

[8] http://www-nlp.stanford.edu/software/tagger.shtml
[9] http://projects.csail.mit.edu/jwi/api/edu/mit/jwi/morph/WordnetStemmer.html
[10] http://www.cfilt.iitb.ac.in/Tools.html

5 Post-processing: Co-occurrence-based Replacement

The engine provides an optional co-occurrence-based replacement strategy to post-process the output. A manual evaluation showed that 14 out of 50 headlines were incorrect because of incorrect sense of one or more words. To overcome this problem, we implemented a post-processing strategy that automatically edits output obtained from the MT model using co-occurrence statistics as found in the in-domain news headline corpus. To elaborate how this works, consider the English news headline ‘crpf jawan held on molestation charge’. The translation obtained was ‘crpf jawaan par aayojit utpiDan chaarj’ (molestation charge organized on crpf jawan). The word ‘held’ gets translated to ‘aayojit’ (organized/conducted) as opposed to ‘giraftar’ (arrested). The language model relies on n-grams and hence does not take into account the correct sense of words in cases where the words do not occur together. For this purpose, we implemented a post-processing strategy that considers co-occurrence statistics of a target word with all other words in the sentence to find the best sense translation. In case of the above example, using the co-occurrences in a news headline corpus, we select the sense of ‘held’ in Hindi which occurs most frequently with other words and replace the word with this translation. We do not consider co-occurrence statistics for function words. We understand that the above strategy does not work in the case of inflected forms of words in Hindi.

6 Evaluation

We evaluated the engine using a test set of 787 headlines downloaded from the website of a popular English daily, The Hindu [11], and manually translated into Hindi by native speakers. A BLEU score of 13.40 is obtained for phrase-based MT and 5.73 for factor-based MT.

[11] www.thehindu.com

In order to understand how the engine performs for different kinds of linguistic phenomena, we also performed a qualitative evaluation of the output. The following are examples of output from our engine. They handle different linguistic phenomena as follows:

1. Ambiguity:
Input: Industrialist remembered
Output: udyogpati ko yaad kiyaa (Industrialist remembered)
The input sentence in domains other than news headlines is ambiguous as it could mean that either an industrialist was remembered or an industrialist remembered something. In case of news headline, however, the former holds true. This is correctly reflected in the Hindi translation.

2. S-V-O order:
Input: Now Jaganmohan will make biscuits in jail
Output: ab jaganmohan karenge biskoot jel mein (Now Jaganmohan will do biscuits in jail)
The verb ‘will do’ gets placed correctly in the target sentence thus preserving the verb order. However, the translation ‘karenge’ (will do) is incorrect and must be ‘banayenge’ (will make).

3. Numbers for people:
Input: Five killed in bomb blast
Output: paanch bum visfot mein maare gaye (Five killed in bomb blast)
The output sentence is a perfect translation and correctly translates ‘five’ as ‘paanch’. However, the news headline order is not retained in this case.

4. Missing verbs:
Input: Veteran journalist dead
Output: diggaj patrakaar mrut (Veteran journalist dead)
The output sentence is a perfect translation although a form of ‘be’ is absent in the source sentence.

5. Translation of idioms:
Input: Croatia and Serbia bury the hatchet
Output: kroatia aur serbia jhagDa khatam karna (Croatia and Serbia do-end-quarrel)
The idiom ‘bury the hatchet’ gets correctly translated to ‘jhagDa khatam karna’ (to end a quarrel) as a complete entity. This is a direct mapping from the bilingual dictionary and does not have the correct inflection.

6. Sense correction due to co-occurrence-based replacement:
Input: No hike in AMU tuition fees
Moses-MLM-Dict: amu adhyaapan fees mein koi pad-yaatra (hike (trek) in AMU tuition fees)
Moses-CoOcc: amu shikshan fees mein koi vriddhi (hike (increase) in AMU tuition fees)
We observe that our post-processing unit improves the output in some cases. The original output translates ‘hike’ as ‘pad-yaatra’ (hike). The co-occurrence-based replacement unit identifies and corrects the sense to ‘vriddhi’ (increase). We understand that the ‘no’ gets missed out in the translation.

7 Conclusion & Future Work

We presented ‘Making Headlines in Hindi’, a translation engine that aims to translate English news headlines to Hindi while preserving news headline styles in the target language. Our engine includes a phrase-based model and a factor-based model. The phrase-based model uses an in-domain language model and a bilingual dictionary. The factor-based model uses factors like POS, lemma, tense and number. In addition, we also described our post-processing strategy that performs co-occurrence-based replacement of words to obtain the correct sense of target language words. An evaluation of the output of our translation engine shows that it performs well for many linguistic styles used in Hindi news headlines.

The co-occurrence-based strategy is naive. As future work, the co-occurrence-based strategy can be improved to incorporate inflections of words. Also, other approaches to improve translation quality may be considered.

References

Manoj Kumar Chinnakotla, Om P. Damani and Avijit Satoskar. 2010. Transliteration for Resource-Scarce Languages. Proc. ACM Trans. Asian Lang. Inf. Process.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, Evan Herbst. 2007. Moses: Open Source Toolkit for Statistical Machine Translation. Proc. of ACL 2007, demonstration session, Prague, Czech Republic.

Philipp Koehn and Hieu Hoang. 2007. Factored Translation Models. Proc. of EMNLP-CoNLL 2007, Prague, Czech Republic.

Philipp Koehn, Franz Josef Och and Daniel Marcu. 2003. Statistical phrase-based translation. Proc. of NAACL 2003, Edmonton, Canada.

A. Stolcke. 2002. SRILM - An extensible language modeling toolkit. Proc. International Conference on Spoken Language Processing, vol. 2.
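
As an illustration of the perplexity comparison reported in Section 4.1, the following is a minimal Python sketch of how an in-domain bigram language model can be compared with an out-of-domain one on test headlines. It is not the SRILM procedure behind the reported values: the add-one smoothing, the function names and the toy romanized sentences are all assumptions made for this example.

```python
import math
from collections import Counter

def train_bigram(sentences):
    """Collect history counts, bigram counts and the vocabulary from whitespace-tokenized text."""
    history, bigrams, vocab = Counter(), Counter(), {"</s>", "<unk>"}
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        vocab.update(tokens[1:-1])
        history.update(tokens[:-1])              # every token that serves as a bigram history
        bigrams.update(zip(tokens, tokens[1:]))
    return history, bigrams, vocab

def perplexity(model, sentences):
    """Perplexity under an add-one smoothed bigram model (an assumption, not SRILM's smoothing)."""
    history, bigrams, vocab = model
    vocab_size = len(vocab)
    log_prob, n_events = 0.0, 0
    for sentence in sentences:
        words = [w if w in vocab else "<unk>" for w in sentence.split()]
        tokens = ["<s>"] + words + ["</s>"]
        for prev, word in zip(tokens, tokens[1:]):
            p = (bigrams[(prev, word)] + 1) / (history[prev] + vocab_size)
            log_prob += math.log(p)
            n_events += 1
    return math.exp(-log_prob / n_events)

# Hypothetical toy data standing in for the headline and tourism corpora.
headline_lm = train_bigram(["paanch bum visfot mein maare gaye",
                            "bum visfot mein do ghayal"])
tourism_lm = train_bigram(["yeh mandir bahut sundar hai",
                           "yahan ka mausam sundar hai"])
test_headlines = ["bum visfot mein paanch ghayal"]

ppl_headline = perplexity(headline_lm, test_headlines)
ppl_tourism = perplexity(tourism_lm, test_headlines)
print(ppl_headline < ppl_tourism)
```

On this toy data the headline-trained model assigns a lower perplexity to the test headline than the tourism-trained model, which is the direction of the trend described in Section 4.1.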
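
The co-occurrence-based replacement of Section 5 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the engine's implementation: the candidate-translation table, the function-word list, the function names and the three-headline toy corpus (in romanized Hindi) are hypothetical, and ties are resolved in favour of the original MT output.

```python
from collections import Counter
from itertools import combinations

def build_cooccurrence(headlines):
    """Count, for each unordered word pair, the number of headlines containing both words."""
    counts = Counter()
    for headline in headlines:
        for a, b in combinations(sorted(set(headline.split())), 2):
            counts[(a, b)] += 1
    return counts

def cooccurrence(counts, a, b):
    """Look up a pair irrespective of order; unseen pairs count as zero."""
    return counts[(a, b) if a <= b else (b, a)]

def replace_by_cooccurrence(tokens, candidates, counts, function_words=frozenset()):
    """Replace each ambiguous token by the candidate that co-occurs most with the other content words."""
    result = list(tokens)
    for i, token in enumerate(tokens):
        if token in function_words or token not in candidates:
            continue
        context = [w for j, w in enumerate(tokens)
                   if j != i and w not in function_words]
        # The MT output comes first, so max() keeps it when scores are tied.
        options = [token] + [c for c in candidates[token] if c != token]
        result[i] = max(options,
                        key=lambda c: sum(cooccurrence(counts, c, w) for w in context))
    return result

# Hypothetical in-domain headline corpus and candidate senses for the MT output of 'held'.
corpus = ["crpf jawaan giraftar",
          "utpiDan maamle mein jawaan giraftar",
          "sammelan aayojit"]
counts = build_cooccurrence(corpus)
mt_output = "crpf jawaan par aayojit utpiDan chaarj".split()
candidates = {"aayojit": ["aayojit", "giraftar"]}
corrected = replace_by_cooccurrence(mt_output, candidates, counts,
                                    function_words={"par", "mein"})
print(" ".join(corrected))
```

Because the sketch compares surface forms only, it shares the limitation noted in Section 5: inflected forms of a word are treated as unrelated tokens.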