                                                                          International Journal of Advanced Research in Engineering and Technology (IJARET) 
                                                                          Volume 12, Issue 1, January 2021, pp. 753-759, Article ID: IJARET_12_01_068 
                                                                          Available online at http://iaeme.com/Home/issue/IJARET?Volume=12&Issue=1 
                                                                          Journal Impact Factor (2020): 10.9475 (Calculated by GISI) www.jifactor.com 
                                                                          ISSN Print: 0976-6480 and ISSN Online: 0976-6499 
                                                                          DOI: 10.34218/IJARET.12.1.2021.068 
                                                                          © IAEME Publication                                                                                                                  Scopus Indexed 
                                                                           
                                                                                              
                                                                               VITERBI BASED PARTS OF SPEECH TAGGING 
                                                                                                                                                                                FOR HINDI AND MARATHI 
                                                                                                                                                                                                                                                                                                                                                                           
                                                                                                                                                                                                                                                                     Vijayshri Khedkar
                                                                                                                                                                                Research Scholar, Symbiosis Institute of Technology,  
                                                                                                                                                                     Symbiosis International (Deemed University), Pune, India 
                                                                                                                                                                                                                                                                                                                                                           
                                                                                                                                                                                                                                                                                     Pritesh Shah
                                                                                                                                                                                                                               Symbiosis Institute of Technology,  
                                                                                                                                                                     Symbiosis International (Deemed University), Pune, India 
                                                                                             ABSTRACT 
Machine translation has expanded immensely, particularly in recent years. It can be
broken into seven main steps: token generation, morphological analysis, lexeme
analysis, part-of-speech (POS) tagging, chunking, parsing, and word disambiguation.
Natural Language Processing (NLP) is a promising field of research that enables
machines to analyze and process the meaning behind human languages. The aim of our
project is to assign a specific grammatical class to each word of an input sequence in
the Hindi and Marathi languages. A major part of India's population lives in rural
areas, and these people are more comfortable and better acquainted with Hindi and
Marathi. Each is considered one of the official languages of India. However, because
most of the material available online today is in English, it is difficult for them to
understand. Language translation eases their interaction with online portals and makes
it more effective, and NLP plays a key role in it. From speech recognition to sentiment
analysis, NLP is the backbone of this interaction. Furthermore, POS tagging is a
necessary step in the development of any NLP application. Tagging for English is
already available, so we concentrated on POS tagging of Hindi and Marathi corpora.
Although many approaches to POS tagging are available, such as rule-based POS tagging
and lexical analysis, we chose stochastic POS tagging for our project because of its
better results in other languages.
                                                                                             Key words: POS tagging, Marathi, Rule-based tagging, Viterbi Algorithm, stochastic 
                                                                                             taggers. 
                                                                                             Cite this Article: Vijayshri Khedkar and Pritesh Shah, Viterbi Based Parts of Speech 
                                                                                             Tagging  for  Hindi  and  Marathi,  International  Journal  of  Advanced  Research  in 
                                                                                             Engineering and Technology, 12(1), 2021, pp. 753-759.  
                                                                                             http://iaeme.com/Home/issue/IJARET?Volume=12&Issue=1 
                                                                                                                                                                                                                                                                                                                        
                                                                                               http://iaeme.com/Home/journal/IJARET                                                                                                                                                                               753                                                                                                                                 editor@iaeme.com 
                           1. INTRODUCTION 
Natural Language Processing is one of the fields of machine learning. It provides
approaches that make interaction between machines and humans less complicated [2].
Part-of-speech tagging is the process of assigning a specific grammatical class, such as
noun, pronoun, conjunction, or preposition, to each word. It is one of the elemental steps
in analyzing a natural language [3]. The task has previously been defined as: “Given a
meaningful sequence of words w1...wn, the system has to assign respective POS tags t1...tn
to input sequence as the output” [4]. Mathematically, this can be stated as
                                                                                                                                                (1) 
POS tagging is a basic tool for linguistic operations on a natural language, such as
machine translation, text recognition, and named-entity recognition. As far as morphology
is concerned, Hindi and Marathi are comparatively rich in grammatical classes, including
verb forms [5]. Because of this rich morphology, resolving the ambiguity of tags is an
onerous task when working on the Hindi language [6]. For instance, the term “” may be a
conjunction, a quantifier, or an intensifier, depending on how it is used.
The contributions of this project include:
      •   Splitting sentences into tokens and distributing them.
      •   Detecting the part of speech of each token.
      •   Presenting the list of POS tags for the sentence.
The model works on a labeled data set of 39588 sentences and yields a precision of
92.97% with an accuracy of 92.97%.
                           2. VITERBI ALGORITHM 
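Consider an input sequence $a_1 \ldots a_n$. The task is to find

$$\arg\max_{b_1 \ldots b_{n+1}} q(a_1 \ldots a_n, b_1 \ldots b_{n+1}) \tag{2}$$

where the arg max is taken over all sequences $b_1 \ldots b_{n+1}$ such that $b_i \in S$ for
$i = 1 \ldots n$, and $b_{n+1} = \text{STOP}$.
We assume that q takes the form

$$q(a_1 \ldots a_n, b_1 \ldots b_{n+1}) = \prod_{i=1}^{n+1} t(b_i \mid b_{i-2}, b_{i-1}) \prod_{i=1}^{n} e(a_i \mid b_i) \tag{3}$$

where t denotes the trigram (transition) probability of a tag given the two preceding tags
and e denotes the emission probability of a word given its tag. We have assumed that
$b_0 = b_{-1} = *$, and $b_{n+1} = \text{STOP}$.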
                                                                                                                      
The main purpose of this algorithm is to find the optimal sequence of states of a
Hidden Markov Model (HMM) for a given sequence of observations. In this context, the
term optimal refers to probability: the sequence with maximum probability is deemed
optimal by the model. The model uses a set of possible tags S = {Verbs, Adjectives,
Nouns, Adverbs, Conjunctions, etc.}. Each word in each observation is assigned one of
the tags in S [7]. Conceptually, every possible tag sequence receives a probability,
obtained by multiplying the trigram and emission probabilities along the sequence. The
sequence with maximum probability is then found with a dynamic programming approach,
which avoids listing every candidate sequence explicitly [8].
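The dynamic programming step can be sketched as follows. This is a minimal trigram-HMM Viterbi decoder written for illustration, not the authors' implementation. The function name `viterbi`, the dictionaries `trans` and `emit`, and all probability values are hypothetical. The example sentence "राम खाता है" was chosen because "खाता" can be read as a noun or as a verb.

```python
from typing import Dict, List, Tuple

START, STOP = "*", "STOP"


def viterbi(
    words: List[str],
    tags: List[str],
    trans: Dict[Tuple[str, str, str], float],
    emit: Dict[Tuple[str, str], float],
) -> Tuple[List[str], float]:
    """Most probable tag sequence for `words` under a trigram HMM.

    trans[(w, u, v)]: probability of tag v after tags w, u.
    emit[(word, tag)]: probability of word given tag.
    Missing entries count as probability 0.
    """
    n = len(words)
    if n == 0:
        return [], trans.get((START, START, STOP), 0.0)

    def candidates(k: int) -> List[str]:
        # Positions -1 and 0 hold the start symbol; 1..n hold real tags.
        return [START] if k <= 0 else tags

    # pi[(k, u, v)]: best probability of any tag sequence for words 1..k
    # whose last two tags are u (position k-1) and v (position k).
    pi = {(0, START, START): 1.0}
    back = {}
    for k in range(1, n + 1):
        for u in candidates(k - 1):
            for v in candidates(k):
                e = emit.get((words[k - 1], v), 0.0)
                best_p, best_w = -1.0, None
                for w in candidates(k - 2):
                    p = pi[(k - 1, w, u)] * trans.get((w, u, v), 0.0) * e
                    if p > best_p:
                        best_p, best_w = p, w
                pi[(k, u, v)] = best_p
                back[(k, u, v)] = best_w

    # Close the sequence with the transition into STOP.
    best_p, best_u, best_v = -1.0, None, None
    for u in candidates(n - 1):
        for v in candidates(n):
            p = pi[(n, u, v)] * trans.get((u, v, STOP), 0.0)
            if p > best_p:
                best_p, best_u, best_v = p, u, v

    # Follow the back-pointers from the end (seq is 1-indexed).
    seq = [None] * (n + 1)
    seq[n] = best_v
    if n >= 2:
        seq[n - 1] = best_u
    for k in range(n - 2, 0, -1):
        seq[k] = back[(k + 2, seq[k + 1], seq[k + 2])]
    return seq[1:], best_p


# Toy model (hypothetical probabilities, for illustration only).
tags = ["NNP", "NN", "VM", "VAUX"]
trans = {
    ("*", "*", "NNP"): 1.0,
    ("*", "NNP", "NN"): 0.25,
    ("*", "NNP", "VM"): 0.75,
    ("NNP", "NN", "VAUX"): 0.5,
    ("NNP", "VM", "VAUX"): 0.5,
    ("NN", "VAUX", "STOP"): 1.0,
    ("VM", "VAUX", "STOP"): 1.0,
}
emit = {
    ("राम", "NNP"): 1.0,
    ("खाता", "NN"): 0.5,
    ("खाता", "VM"): 0.5,
    ("है", "VAUX"): 1.0,
}

best_tags, best_prob = viterbi(["राम", "खाता", "है"], tags, trans, emit)
print(best_tags, best_prob)
```

For a sentence of n words and a tagset of size |S|, this table-filling approach takes on the order of n·|S|³ steps, whereas scoring every candidate sequence separately would require |S|ⁿ evaluations.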
                           3. PROPOSED METHODOLOGY 
Our project includes a Hindi and Marathi part-of-speech tagger that has three fundamental
steps. First, the input Hindi or Marathi text is split into sentences. Next, the sentences
are tokenized into words. The third step assigns part-of-speech tags to the words of each
sentence. The system was evaluated on a data set of 39588 sentences, of which 34588 were
used for training and 5000 for validation. Every word in the sentences is annotated with
at least one of 24 possible tags. The system has two consecutive phases. In the first
phase, it trains the model using defined words (those present in the training data set).
In the next phase, it labels undefined words (those present in the testing data set) and
delivers a tag sequence ts1 ... tsn for the input series of words w1 ... wn. The following
details the tagset that we have implemented and the methodology that the system follows.
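The paper does not describe how the model parameters are estimated in the training phase. One common choice, assumed here purely for illustration, is relative-frequency (maximum-likelihood) counting over the tagged training sentences. The function name `estimate_parameters` and the two-sentence toy corpus are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

START, STOP = "*", "STOP"
TaggedSentence = List[Tuple[str, str]]  # list of (word, tag) pairs


def estimate_parameters(
    corpus: List[TaggedSentence],
) -> Tuple[Dict[Tuple[str, str, str], float], Dict[Tuple[str, str], float]]:
    """Relative-frequency estimates of trigram and emission probabilities."""
    trigram_counts: Counter = Counter()
    context_counts: Counter = Counter()
    pair_counts: Counter = Counter()
    tag_counts: Counter = Counter()
    for sentence in corpus:
        # Pad the tag sequence with two start symbols and one STOP symbol.
        tags = [START, START] + [tag for _, tag in sentence] + [STOP]
        for i in range(2, len(tags)):
            trigram_counts[(tags[i - 2], tags[i - 1], tags[i])] += 1
            context_counts[(tags[i - 2], tags[i - 1])] += 1
        for word, tag in sentence:
            pair_counts[(word, tag)] += 1
            tag_counts[tag] += 1
    # P(v | w, u) = count(w, u, v) / count(w, u)
    trans = {
        key: count / context_counts[(key[0], key[1])]
        for key, count in trigram_counts.items()
    }
    # P(word | tag) = count(word, tag) / count(tag)
    emit = {
        key: count / tag_counts[key[1]]
        for key, count in pair_counts.items()
    }
    return trans, emit


# Hypothetical two-sentence training corpus.
corpus = [
    [("राम", "NNP"), ("खाता", "VM"), ("है", "VAUX")],
    [("सीता", "NNP"), ("गई", "VM")],
]
trans, emit = estimate_parameters(corpus)
```

With plain counting, a word never seen in training has zero emission probability for every tag. The tagset includes an UNK tag, but how unseen words are handled is not stated in the source, so any smoothing scheme would be a further assumption.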
                    
[Figure 1: block diagram with the components Input (Hindi or Marathi sentence text), User Interface, Splitter, Token Generator, Viterbi Tagger, Trained Corpus, Tag Generator, Word-to-tag Mapping, and Output (Hindi or Marathi sentence text tagged with part of speech)]
Figure 1 Proposed System Architecture
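The Splitter and Token Generator stages of Figure 1 might look like the following minimal sketch. The punctuation set, the function names `split_sentences` and `tokenize`, and the sample text are assumptions for illustration; the paper does not specify its splitting rules.

```python
import re
from typing import List

# Punctuation treated as separate tokens (an assumed, incomplete set):
# Devanagari danda and double danda plus common ASCII marks.
_PUNCT = "।॥,.;:!?()\"'-"
_TOKEN = re.compile(
    "[^\\s" + re.escape(_PUNCT) + "]+|[" + re.escape(_PUNCT) + "]"
)
# Split on whitespace that follows a sentence-final mark.
_SENT_END = re.compile(r"(?<=[।॥.?!])\s+")


def split_sentences(text: str) -> List[str]:
    """Split running text into sentences at danda or ASCII end marks."""
    return [s for s in _SENT_END.split(text.strip()) if s]


def tokenize(sentence: str) -> List[str]:
    """Split a sentence into words and single punctuation tokens."""
    return _TOKEN.findall(sentence)


sentences = split_sentences("राम घर गया। सीता बाजार गई।")
tokens = [tokenize(s) for s in sentences]
print(tokens)
```

The token pattern deliberately avoids `\w`: in Python's `re` module, `\w` does not match Devanagari vowel signs, which are combining marks, so a `\w+` pattern would cut words apart. Splitting at every "." is also a simplification that would break decimal numbers and abbreviations.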
We have built a tagset for the Hindi and Marathi languages that includes 24 part-of-speech
tags. The tagset is inspired by research at CDAC, Pune [9]. It also contains tags for
numbers in several formats. The entire tagset is given in Table 1.
Table 1 Tags and Description
S.No.   Tag     Description
1       NN      Common Noun
2       PRP     Pronoun
3       NNP     Proper Noun
4       PSP     Postposition
5       JJ      Adjective
6       INTF    Intensifier
7       RP      Particles
8       NEG     Negative Word
9       RB      Adverb
10      QF      Quantifiers
11      DEM     Demonstrative
12      NST     Spatial Noun
13      SYM     Symbol
14      ECH     Echo Words
15      WQ      Question Words
16      QC      Cardinals
17      XC      Compounds
18      CC      Conjuncts
19      QO      Ordinals
20      RDP     Reduplication
21      INJ     Interjection
22      VM      Main Verb
23      VAUX    Verb Auxiliary
24      UNK     Unknown Words
                4. EXPERIMENTS AND RESULTS 
Various experiments have been performed to test the validity, results, and precision of
the proposed method. A few observations of POS tagging from the method under discussion
are stated below:
                Input:                   
                Output: ['JJ', 'NN', 'INTF', 'JJ', 'NN', 'VAUX', 'CC', 'PRP', 'NN', 'PSP', 'NN', 'RP', 'QC', 'NN', 
                'VM', 'VAUX'] 
                Input:              
                Output: ['JJ', 'NN', 'INTF', 'JJ', 'NN', 'VAUX', 'PRP', 'NN', 'PSP', 'NN', 'RP', 'QC', 'NN', 'VM', 
                'VAUX'] 
                Input:        2011           - 1102   
                   
                Output: ['NN', 'PSP', 'PSP', 'XC', 'NN', 'PSP', 'NN', 'PSP', 'NNP', 'PSP', 'NNP', 'NN', 'PSP', 'NN', 
                'PSP', 'NN', 'PSP', 'NN', 'NN', 'VM'] 
                Input: तुलनेत २०११ ा जनगणनेनुसार, भारतातील िबहार राातील लोकाची घनता बत चौरस बकमीवर 1102 
                लोक होते. 
                 Output: ['NN', 'PSP', 'PSP', 'XC', 'NN', 'PSP', 'NN', 'PSP', 'NNP', 'PSP', 'NNP', 'NN', 'PSP', 
                'NN', 'PSP', 'NN', 'PSP', 'NN', 'NN', 'VM'] 
In these examples, the Hindi and Marathi Devanagari texts are tagged according to Hindi
and Marathi grammar with their corresponding part-of-speech classes. For tagging, the
Viterbi algorithm is applied to the previously unseen input sequence of words.
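The source reports an accuracy figure without stating how it is computed. One common definition, assumed here, is token-level accuracy: the fraction of tokens whose predicted tag equals the reference tag. The function name `token_accuracy` and the tag lists below are hypothetical and do not reproduce the paper's results.

```python
from typing import List


def token_accuracy(gold: List[List[str]], predicted: List[List[str]]) -> float:
    """Fraction of tokens whose predicted tag matches the reference tag."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must contain the same sentences")
    correct = 0
    total = 0
    for gold_tags, pred_tags in zip(gold, predicted):
        if len(gold_tags) != len(pred_tags):
            raise ValueError("tag sequences must have equal length")
        total += len(gold_tags)
        correct += sum(g == p for g, p in zip(gold_tags, pred_tags))
    if total == 0:
        raise ValueError("no tokens to evaluate")
    return correct / total


# Hypothetical reference and predicted tags for one four-token sentence.
gold = [["NN", "PSP", "VM", "VAUX"]]
predicted = [["NN", "PSP", "NN", "VAUX"]]
accuracy = token_accuracy(gold, predicted)
```

In this toy case three of the four tags agree, so the function returns 0.75.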
                 