International Journal of Advanced Research in Engineering and Technology (IJARET)
Volume 12, Issue 1, January 2021, pp. 753-759, Article ID: IJARET_12_01_068
Available online at http://iaeme.com/Home/issue/IJARET?Volume=12&Issue=1
Journal Impact Factor (2020): 10.9475 (Calculated by GISI) www.jifactor.com
ISSN Print: 0976-6480 and ISSN Online: 0976-6499
DOI: 10.34218/IJARET.12.1.2021.068
© IAEME Publication Scopus Indexed
VITERBI BASED PARTS OF SPEECH TAGGING
FOR HINDI AND MARATHI
Vijayshri Khedkar
Research Scholar, Symbiosis Institute of Technology,
Symbiosis International (Deemed University), Pune, India
Pritesh Shah
Symbiosis Institute of Technology,
Symbiosis International (Deemed University), Pune, India
ABSTRACT
Machine translation has expanded immensely in recent years. It can be broken into
seven main steps: token generation, morphological analysis, lexeme analysis,
part-of-speech (POS) tagging, chunking, parsing, and word disambiguation. Natural
Language Processing (NLP) is a promising field of research that enables machines to
analyze and process the meaning behind human languages. The aim of our project is to
assign a specific grammatical class to each word of an input sequence in Hindi or
Marathi. A major part of India's population lives in rural areas, and these people are
more comfortable and better acquainted with Hindi and Marathi, each of which is
considered one of the official languages of India. However, most of the material
available online today is in English, which makes it difficult for them to understand.
Language translation eases their interaction with online portals and makes it more
effective, and NLP plays a key role in it. From speech recognition to sentiment
analysis, NLP is the backbone of this interaction. Furthermore, POS tagging is a
necessary step in the development of any NLP application. Taggers for English are
already available, so we concentrated on POS tagging of Hindi and Marathi corpora.
Although many approaches to POS tagging are available, such as rule-based tagging and
lexical analysis, we chose stochastic POS tagging for our project because of its better
results in other languages.
Key words: POS tagging, Marathi, Rule-based tagging, Viterbi Algorithm, stochastic
taggers.
Cite this Article: Vijayshri Khedkar and Pritesh Shah, Viterbi Based Parts of Speech
Tagging for Hindi and Marathi, International Journal of Advanced Research in
Engineering and Technology, 12(1), 2021, pp. 753-759.
http://iaeme.com/Home/issue/IJARET?Volume=12&Issue=1
http://iaeme.com/Home/journal/IJARET 753 editor@iaeme.com
1. INTRODUCTION
Natural Language Processing (NLP) is a field closely related to machine learning. It provides
approaches that make interaction between machines and humans less complicated [2].
Part-of-speech (POS) tagging is the process of assigning a specific grammatical class, such as
noun, pronoun, conjunction, or preposition, to each word. It is one of the elemental steps in
approaching and analyzing a natural language [3]. It has previously been defined as: “Given a
meaningful sequence of words w1...wn, the system has to assign respective POS tags t1...tn to
input sequence as the output” [4]. This can be stated mathematically as,
(1)
POS tagging is a basic tool for linguistic operations on a natural language, such as machine
translation, text recognition, and named-entity recognition. Hindi and Marathi are
morphologically rich, with many grammatical classes and verb forms [5]. Because of this rich
morphology, resolving the ambiguity of tags is an onerous task when working with Hindi [6].
For instance, the term “” may be a conjunction, a quantifier, or an intensifier, depending on
how it is used.
The contributions of this project include:
• Splitting the input text into sentences and the sentences into tokens.
• Detecting the part of speech of each token.
• Presenting the list of POS tags for the sentence.
The model works on a labeled data set of 39588 sentences and yields a precision of 92.97%
with an accuracy of 92.97%.
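The first contribution, splitting text into sentences and tokens, can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names `split_sentences` and `tokenize`, the set of sentence terminators (danda, double danda, full stop, question mark, exclamation mark) and the token pattern are all assumptions, since the paper does not list its splitting rules.

```python
import re

# Assumed sentence terminators: Devanagari danda (U+0964), double danda
# (U+0965), and the ASCII '.', '?' and '!'. A boundary is the whitespace
# that follows one of them.
_SENT_BOUNDARY = re.compile(r"(?<=[\u0964\u0965.?!])\s+")

# A token is a run of word characters plus Devanagari letters, vowel signs
# and digits (the block U+0900-U+097F without the two dandas), or else a
# single punctuation character. The explicit range is needed because \w
# does not match Devanagari vowel signs, which are combining marks.
_TOKEN = re.compile(r"[\w\u0900-\u0963\u0966-\u097F]+|[^\w\s]")


def split_sentences(text):
    """Split raw text into sentences, keeping the terminator with each one."""
    return [s for s in _SENT_BOUNDARY.split(text.strip()) if s]


def tokenize(sentence):
    """Split one sentence into word and punctuation tokens."""
    return _TOKEN.findall(sentence)


if __name__ == "__main__":
    for sentence in split_sentences("राम घर गया। सीता बाजार गई।"):
        print(tokenize(sentence))
```

A production splitter would also need to handle abbreviations, decimal numbers and joiner characters, which this sketch ignores.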
2. VITERBI ALGORITHM
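Consider an input sequence $a_1 \ldots a_n$. The tagger computes

$$\arg\max_{b_1 \ldots b_{n+1}} q(a_1 \ldots a_n, b_1 \ldots b_{n+1}) \tag{2}$$

where the arg max is taken over all sequences $b_1 \ldots b_{n+1}$ such that $b_i \in S$ for
$i = 1 \ldots n$, and $b_{n+1} = \text{STOP}$.
We assume that $q$ again takes the form

$$q(a_1 \ldots a_n, b_1 \ldots b_{n+1}) = \prod_{i=1}^{n+1} p_{\mathrm{tri}}(b_i \mid b_{i-2}, b_{i-1}) \prod_{i=1}^{n} p_{\mathrm{emit}}(a_i \mid b_i) \tag{3}$$

where $p_{\mathrm{tri}}$ denotes the trigram probability of a tag given the two preceding tags and
$p_{\mathrm{emit}}$ the emission probability of a word given its tag.
We have assumed that
$b_0 = b_{-1} = *$, and $b_{n+1} = \text{STOP}$.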
The main purpose of this algorithm is to find the optimal sequence of states for a given
sequence of observations under a Hidden Markov Model (HMM). In this context, the term
optimal refers to probability: the sequence with the maximum probability is deemed optimal
by the model. The model uses a set of possible tags ‘S’, such as {Verbs, Adjectives, Nouns,
Adverbs, Conjunctions, etc.}. Each word in each observation is assigned one of the tags
available in set ‘S’ [7]. The probability of a candidate tag sequence is obtained by multiplying
the trigram and emission probabilities along that sequence, so each candidate sequence has a
probability. The sequence with the maximum probability is deemed optimal, and it is found
with a dynamic programming approach [8] rather than by scoring every possible sequence
separately.
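The dynamic programming search described above can be sketched as follows for a trigram HMM. This is an illustrative sketch under stated assumptions, not the authors' code: the function name `viterbi`, the dictionary layout of `trans` and `emit`, the `floor` constant used for unseen events, and the toy probabilities are all hypothetical. Log-probabilities are used to avoid numerical underflow on long sentences.

```python
import math

START, STOP = "*", "STOP"


def viterbi(words, tags, trans, emit, floor=1e-12):
    """Return the most probable tag sequence for `words` under a trigram HMM.

    trans[(u, v, w)]  : probability of tag w given the two previous tags u, v
    emit[(word, tag)] : probability of `word` given `tag`
    Missing entries fall back to `floor`, an assumed stand-in for smoothing.
    """
    n = len(words)
    if n == 0:
        return []

    def log_t(u, v, w):
        return math.log(trans.get((u, v, w), floor))

    def log_e(word, tag):
        return math.log(emit.get((word, tag), floor))

    def candidates(k):
        # Positions -1 and 0 can only hold the start symbol.
        return [START] if k <= 0 else tags

    # pi[k][(u, v)]: best log-probability of a tag sequence for the first k
    # words that ends with tag u at position k-1 and tag v at position k.
    pi = [{(START, START): 0.0}]
    bp = [{}]  # back-pointers: best tag at position k-2 for each (u, v)
    for k in range(1, n + 1):
        pi.append({})
        bp.append({})
        for u in candidates(k - 1):
            for v in candidates(k):
                best, arg = -math.inf, None
                for w in candidates(k - 2):
                    score = (pi[k - 1][(w, u)] + log_t(w, u, v)
                             + log_e(words[k - 1], v))
                    if score > best:
                        best, arg = score, w
                pi[k][(u, v)] = best
                bp[k][(u, v)] = arg

    # Termination: add the transition into STOP and pick the best final pair.
    best, last = -math.inf, None
    for (u, v), score in pi[n].items():
        total = score + log_t(u, v, STOP)
        if total > best:
            best, last = total, (u, v)

    # Follow the back-pointers from the end of the sentence to the start.
    y = [None] * (n + 1)  # y[1..n] hold the tags; y[0] is a dummy slot
    y[n - 1], y[n] = last
    for k in range(n - 2, 0, -1):
        y[k] = bp[k + 2][(y[k + 1], y[k + 2])]
    return y[1:]


if __name__ == "__main__":
    # Toy model with made-up probabilities, for illustration only.
    tags = ["NNP", "VM"]
    trans = {(START, START, "NNP"): 0.8, (START, START, "VM"): 0.2,
             (START, "NNP", "VM"): 0.7, (START, "NNP", "NNP"): 0.3,
             ("NNP", "NNP", "VM"): 0.6, ("NNP", "VM", STOP): 0.9}
    emit = {("राम", "NNP"): 0.9, ("गया", "VM"): 0.9}
    print(viterbi(["राम", "गया"], tags, trans, emit))
```

For a tagset of size |S| and a sentence of n words, this search takes on the order of n·|S|³ steps, whereas enumerating every tag sequence would require |S|ⁿ evaluations.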
3. PROPOSED METHODOLOGY
Our project includes a Hindi and Marathi part-of-speech tagger which has three fundamental
steps. First, the input Hindi or Marathi text is split into sentences. In the next step, the
sentences are tokenized into words, and the third step allocates part-of-speech tags to the
words of each sentence. The system was evaluated over a data set of 39588 sentences. The data
sets used for training and validation contain 34588 and 5000 sentences respectively. Every
word in the sentences is annotated with at least one out of 24 possible tags. The system has
two consecutive phases. In the first phase, it trains the model using words whose tags are
defined (present in the training dataset). In the next phase, it labels words whose tags are
undefined (present in the testing dataset) and delivers a tag sequence ts1 ... tsn for the input
series of words w1 ... wn. The following section details the tagset that we have implemented
and the methodology that the system follows.
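The training phase can be illustrated with a small sketch that estimates trigram and emission probabilities from a tagged corpus by relative frequency. The paper does not state its estimation or smoothing method, so unsmoothed maximum-likelihood counting is an assumption here, as are the function name `estimate_hmm` and the corpus layout (a list of sentences, each a list of `(word, tag)` pairs).

```python
from collections import Counter

START, STOP = "*", "STOP"


def estimate_hmm(tagged_sentences):
    """Estimate trigram and emission probabilities by relative frequency.

    tagged_sentences: list of sentences, each a list of (word, tag) pairs.
    Returns (trans, emit) where
      trans[(u, v, w)]  = count(u, v, w) / count(u, v as a trigram context)
      emit[(word, tag)] = count(word tagged as tag) / count(tag)
    """
    trigram, context = Counter(), Counter()
    word_tag, tag_total = Counter(), Counter()

    for sentence in tagged_sentences:
        # Pad with two start symbols and one stop symbol.
        tags = [START, START] + [tag for _, tag in sentence] + [STOP]
        for i in range(2, len(tags)):
            trigram[(tags[i - 2], tags[i - 1], tags[i])] += 1
            context[(tags[i - 2], tags[i - 1])] += 1
        for word, tag in sentence:
            word_tag[(word, tag)] += 1
            tag_total[tag] += 1

    trans = {key: count / context[key[:2]] for key, count in trigram.items()}
    emit = {key: count / tag_total[key[1]] for key, count in word_tag.items()}
    return trans, emit


if __name__ == "__main__":
    # Two made-up training sentences, for illustration only.
    corpus = [[("राम", "NNP"), ("गया", "VM")],
              [("राम", "NNP"), ("आया", "VM")]]
    trans, emit = estimate_hmm(corpus)
    print(trans[(START, START, "NNP")], emit[("गया", "VM")])
```

Words that never occur in the training data receive no emission probability under this scheme, so a practical tagger needs some fallback for them, for example a dedicated tag such as UNK or a smoothing method.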
[Figure 1 components: User Interface, Splitter, Token generator, Viterbi Tagger, Trained
corpus, Word to tag mapping, Tag Generator. Input: Hindi or Marathi sentence text. Output:
Hindi or Marathi sentence text tagged with part of speech.]
Figure 1 Proposed System Architecture
We have built a tagset for the Hindi and Marathi languages that includes 24 part-of-speech
tags. The tagset is inspired by research at CDAC, Pune [9]. It also contains tags for numbers
in many formats. The entire tagset is listed in Table 1.
Table 1 Tags and Description
S.No. Tag Description
1 NN Common Noun
2 PRP Pronoun
3 NNP Proper Noun
4 PSP Postposition
5 JJ Adjective
6 INTF Intensifier
7 RP Particles
8 NEG Negative Word
9 RB Adverb
10 QF Quantifiers
11 DEM Demonstrative
12 NST Spatial Noun
13 SYM Symbol
14 ECH Echo Words
15 WQ Question Words
16 QC Cardinals
17 XC Compounds
18 CC Conjuncts
19 QO Ordinals
20 RDP Reduplication
21 INJ Interjection
22 VM Main Verb
23 VAUX Verb Auxiliary
24 UNK Unknown Words
4. EXPERIMENTS AND RESULTS
Various experiments have been performed to test the validity, results and precision of the
proposed method. A few examples of POS tagging produced by the method are given
below:
Input:
Output: ['JJ', 'NN', 'INTF', 'JJ', 'NN', 'VAUX', 'CC', 'PRP', 'NN', 'PSP', 'NN', 'RP', 'QC', 'NN',
'VM', 'VAUX']
Input:
Output: ['JJ', 'NN', 'INTF', 'JJ', 'NN', 'VAUX', 'PRP', 'NN', 'PSP', 'NN', 'RP', 'QC', 'NN', 'VM',
'VAUX']
Input: 2011 - 1102
Output: ['NN', 'PSP', 'PSP', 'XC', 'NN', 'PSP', 'NN', 'PSP', 'NNP', 'PSP', 'NNP', 'NN', 'PSP', 'NN',
'PSP', 'NN', 'PSP', 'NN', 'NN', 'VM']
Input: तुलनेत २०११ ा जनगणनेनुसार, भारतातील िबहार राातील लोकाची घनता बत चौरस बकमीवर 1102
लोक होते.
Output: ['NN', 'PSP', 'PSP', 'XC', 'NN', 'PSP', 'NN', 'PSP', 'NNP', 'PSP', 'NNP', 'NN', 'PSP',
'NN', 'PSP', 'NN', 'PSP', 'NN', 'NN', 'VM']
In these examples, the Hindi and Marathi Devanagari texts are tagged with their
corresponding part-of-speech classes according to Hindi and Marathi grammar. The Viterbi
algorithm is applied to tag each previously unseen, meaningful sequence of words.
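How such tag sequences might be scored against reference annotations is sketched below. The paper does not describe its evaluation procedure, so this is an assumption: token-level accuracy and per-tag precision are computed by comparing predicted tags with gold tags, and the function names `token_accuracy` and `per_tag_precision` are hypothetical.

```python
from collections import Counter


def token_accuracy(gold, pred):
    """Fraction of tokens whose predicted tag equals the gold tag.

    gold, pred: lists of tag sequences, aligned sentence by sentence.
    """
    pairs = [(g, p) for gs, ps in zip(gold, pred) for g, p in zip(gs, ps)]
    return sum(g == p for g, p in pairs) / len(pairs)


def per_tag_precision(gold, pred):
    """For each predicted tag, the fraction of its predictions that are correct."""
    correct, predicted = Counter(), Counter()
    for gs, ps in zip(gold, pred):
        for g, p in zip(gs, ps):
            predicted[p] += 1
            if g == p:
                correct[p] += 1
    return {tag: correct[tag] / predicted[tag] for tag in predicted}


if __name__ == "__main__":
    # Made-up gold and predicted tags, for illustration only.
    gold = [["NN", "PSP", "VM"], ["PRP", "VM"]]
    pred = [["NN", "PSP", "VM"], ["NN", "VM"]]
    print(token_accuracy(gold, pred))
    print(per_tag_precision(gold, pred))
```

When every token receives exactly one tag, precision pooled over all tags (total correct predictions divided by total predictions) is the same quantity as token-level accuracy. That would be consistent with the identical precision and accuracy figures reported earlier, although the paper does not say how its precision was computed.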