Punjabi to Urdu Machine Translation System
Nitin Bansal (1), Ajit Kumar (2)
(1) Department of Computer Science, Punjabi University, Patiala, India
(2) Associate Professor, Multani Mal Modi College, Patiala, India
E-mail: (1) profnitinbansal@gmail.com, (2) ajit8671@gmail.com
Proceedings of the 17th International Conference on Natural Language Processing: System Demonstrations, pages 32–34
Patna, India, December 18 - 21, 2020. ©2019 NLP Association of India (NLPAI)

Abstract
Developing a Machine Translation System (MTS) for any language pair is a challenging task for several reasons. A lack of lexical resources for a language is one of the major issues that arise while developing an MTS for that language. For example, during the development of the Punjabi to Urdu MTS, many issues were recognized while preparing lexical resources for both languages. No machine-readable Punjabi to Urdu dictionary is available that can be used directly for translation, although various dictionaries are available that explain the meaning of a word. Along with this, the handling of out-of-vocabulary (OOV) words, the handling of Punjabi words that have multiple senses in Urdu, the identification of proper nouns, and the identification of collocations in the source sentences (Punjabi sentences in our case) are the issues we faced during the development of this system. MTSs have been in great demand over the last decade and are widely used in applications such as smartphones. Therefore, the development of such a system becomes more demanding, and the system must be more user friendly. MTSs are used mainly for large-scale translation and automated translation, and they act as an instrument to bridge the digital divide.

1 Introduction
India has many regional languages, so machine translation has enormous scope there. Human and machine translation each have their share of challenges. Scientifically and philosophically, the results of machine translation can be applied to areas such as artificial intelligence, linguistics, and the philosophy of language. Machine translation requires various approaches to make communication possible between two languages. These approaches can be rule-based, corpus-based, hybrid, or neural. Here, a hybrid approach is mainly a combination of two approaches, i.e. rule-based and corpus-based. The quality of machine translation systems is measured mainly using Bilingual Evaluation Understudy (BLEU), which produces a score between 0 and 1.

Among the various regional languages of India, we have chosen Punjabi and Urdu for developing the Punjabi to Urdu Machine Translation System (PUMTS). Punjabi is the mother tongue of our state, Punjab, where it is used as an official language in government offices. Urdu was also used as an official language in Punjab before independence. Thus, PUMTS helps make Punjabi understandable to Urdu-speaking communities who still want to stay in touch with the earlier Punjab. In India, these two languages are treated as resource-poor because no parallel corpus is available for this language pair. Developing a parallel corpus for the pair therefore became a challenging task for us. Further, this work also describes the types of MTSs being developed from Indian and non-Indian perspectives.

2 Methodology
An introduction to the Punjabi and Urdu languages helps in understanding the history of this language pair and how closely the two languages are related. The two languages share the same word order but differ in writing direction: Punjabi is written from left to right and Urdu from right to left. The mapping between the characters of the two languages has also been studied during the development of PUMTS. The implementation of our methodology, including the architecture followed during development, has been documented. We have proposed three approaches to develop a bilingual parallel corpus for the Punjabi and Urdu languages, and the BLEU scores indicated that one final approach to corpus development results in higher accuracy. All the algorithms developed for PUMTS follow this final corpus approach. Lastly, a Punjabi to Urdu machine transliteration system to handle out-of-vocabulary (OOV) words has also been designed and developed, and it is currently available as a web-based system. This system has been designed in two phases: first on a web-based platform using ASP.Net, and second within PUMTS, to handle OOV words during machine translation using the MOSES platform.
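The OOV fallback described above can be illustrated with a minimal Python sketch. It assumes Punjabi input in Gurmukhi script and uses a tiny, purely illustrative character table; the names `CHAR_MAP`, `transliterate`, and `translate_tokens`, as well as the `lexicon` dictionary standing in for the translation model, are hypothetical and are not taken from PUMTS, ASP.Net, or MOSES. Real transliteration also needs context-dependent rules (vowels, nasalization, gemination) that a one-to-one table cannot capture.

```python
# Toy Gurmukhi -> Urdu (Perso-Arabic) character table.
# Illustrative assumption only; NOT the mapping used in PUMTS.
CHAR_MAP = {
    "\u0A15": "\u06A9",  # GURMUKHI LETTER KA  -> ARABIC LETTER KEHEH
    "\u0A2E": "\u0645",  # GURMUKHI LETTER MA  -> ARABIC LETTER MEEM
    "\u0A32": "\u0644",  # GURMUKHI LETTER LA  -> ARABIC LETTER LAM
    "\u0A30": "\u0631",  # GURMUKHI LETTER RA  -> ARABIC LETTER REH
    "\u0A28": "\u0646",  # GURMUKHI LETTER NA  -> ARABIC LETTER NOON
    "\u0A3E": "\u0627",  # GURMUKHI VOWEL SIGN AA -> ARABIC LETTER ALEF
}


def transliterate(word, char_map=CHAR_MAP):
    """Map each character through the table; unmapped characters pass through.

    Unicode stores text in logical order, so no reversal is needed even
    though Urdu is displayed right to left.
    """
    return "".join(char_map.get(ch, ch) for ch in word)


def translate_tokens(tokens, lexicon, char_map=CHAR_MAP):
    """Use the lexicon for known tokens and transliterate OOV tokens."""
    return [
        lexicon[token] if token in lexicon else transliterate(token, char_map)
        for token in tokens
    ]
```

In this sketch a token found in `lexicon` is translated directly, and any other token is passed through the character table, which mirrors the idea of transliterating only the words the translation model does not know.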
Chart 1: Phase-wise improvement in BLEU score for PUMTS
3 Results and Discussion
Results were evaluated on corpus sizes ranging from 10,000 to 1 lakh (1,00,000) parallel sentences, after including the pre-processing and post-processing modules. The results have been compared with Google Translate so that accuracy remains comparable and the required improvements can be incorporated into PUMTS.

Human evaluation has also been conducted, with evaluators who know both languages well. Accuracy has been tested on PUMTS and Google Translate using the standard automated metrics BLEU and NIST. The data domains covered during the development of the parallel corpus are politics, sports, health, tourism, entertainment, books &
magazines, education, arts & culture, religion, and literature.

Human evaluation is still considered the most reliable and efficient method for testing a system's accuracy. However, it is impracticable in today's circumstances. Thus, we have used automatic evaluation with BLEU and NIST to quickly and inexpensively evaluate the impact of new ideas, algorithms, and data sets. During the evaluation of PUMTS, a sufficiently large bilingual parallel corpus for the Punjabi-Urdu language pair (more than 1 lakh parallel sentences) has been used with MOSES, and standard automated metric scores have been generated. Various methods have been applied to increase the system's accuracy; for example, the order of the languages has been changed during testing to analyze which direction gives better results. Moreover, the output of PUMTS has been compared with the output of Google Translate, and we have found that our system performs better than Google Translate, with an accuracy of about 82%. The following chart representation helps us get an idea of where PUMTS generates better results domain-wise.

As shown in Chart 1, the development of PUMTS started from 10,000 parallel sentences, and the MOSES system was set up to regularly test the accuracy on this data. Phase-wise testing was therefore performed, and the BLEU and NIST scores were recorded for each phase. The second phase was tested on 50,000 sentences, and the final evaluation was performed on more than 1,00,000 sentences. We can observe from the chart that there was a sharp increase in accuracy when the number of sentences was increased from 10,000 to 50,000. It has also been observed that increasing the corpus size from 50,000 to 1,00,000 sentences improves accuracy at a slower rate. This is attributed to the handling of OOV words, and the increase in corpus size also gives more chances of producing meaningful sentences.
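To make the BLEU scores discussed above more concrete, the following is a minimal Python sketch of corpus-level BLEU. It is a simplified illustration and not the scoring script used with MOSES: it assumes one reference per sentence, pre-tokenized input, uniform weights over 1-grams to 4-grams, and no smoothing. The function names `ngrams` and `corpus_bleu` are chosen here for illustration only.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Simplified corpus-level BLEU with a single reference per sentence.

    hypotheses, references: parallel lists of token lists.
    Returns a score between 0 and 1.
    """
    matches = [0] * max_n  # clipped n-gram matches per order
    totals = [0] * max_n   # hypothesis n-gram counts per order
    hyp_len = ref_len = 0

    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp, n)
            ref_counts = ngrams(ref, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum(
                min(count, ref_counts[gram]) for gram, count in hyp_counts.items()
            )
            totals[n - 1] += sum(hyp_counts.values())

    # Without smoothing, any order with zero matches gives a score of 0.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0

    # Geometric mean of the modified n-gram precisions.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # The brevity penalty lowers the score of output shorter than the reference.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)
```

Under these assumptions, output identical to the reference scores 1.0, output sharing no words with the reference scores 0.0, and output that is correct but shorter than the reference is lowered only by the brevity penalty. Phase-wise testing as described above would amount to calling such a function on the same test set after each increase in training corpus size.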