                                 International Journal of Computational Intelligence Systems, Vol.1, No. 2 (May, 2008), 116–126
LANGUAGE IDENTIFICATION OF KANNADA, HINDI AND ENGLISH TEXT WORDS THROUGH VISUAL DISCRIMINATING FEATURES

M.C. PADMA
Assistant Professor, Dept. of Computer Science & Engineering
PES College of Engineering, Mandya-571401
Karnataka, India
Email: padmapes@gmail.com

DR. P.A. VIJAYA
Professor, Dept. of Electronics & Communication Engineering
Malnad College of Engineering
Hassan-573201
Karnataka, India
Email: pavmkv@gmail.com

Received: 21-09-2007
Revised: 29-10-2008

In a multilingual country like India, a document may contain text words in more than one language. In such a multilingual environment, a multilingual Optical Character Recognition (OCR) system is needed to read the multilingual documents. It is therefore necessary to identify the different language regions of a document before feeding it to the OCRs of the individual languages. The objective of this paper is to propose a procedure based on visual clues to identify the Kannada, Hindi and English text portions of an Indian multilingual document.

Keywords: Document Image Processing, Multi-lingual Document, Language Identification, Horizontal Lines, Vertical Lines, Feature Extraction.

1.  Introduction

Language identification is an important topic in pattern recognition and in image processing based automatic document analysis and recognition. The objective of language identification is to translate human identifiable documents to machine identifiable codes [1]. The world we live in is becoming increasingly interconnected; electronic libraries have become more pervasive [2] and, at the same time, increasingly automated, including the task of presenting a text in any language as automatically translated text in any other language. Identification of the language in a document image is of primary importance for selecting a specific OCR system when processing multilingual documents [3]. Language identification may seem an elementary and simple issue for humans in the real world, but it is difficult for a machine, primarily because different scripts (a script could be a common medium for different languages) are made up of differently shaped patterns that produce different character sets [4].

OCR is of special significance for a multilingual country like India, where the text portion of a document usually contains information in more than one language. A document containing text information in more than one language is called a multilingual document. For such multilingual documents, it is essential to identify the language of each text portion of the document before its contents can be analyzed. Although a great number of OCR techniques have been developed over the years [5, 6], almost all existing works on OCR make the important implicit assumption that the language of the document to be processed is known beforehand [2]. Individual OCR tools have been developed to deal best with only one
                                                              Published by Atlantis Press                                               116
specific language [7]. In an automated environment, document processing systems relying on OCR would clearly need human intervention to select the appropriate OCR package, which is inefficient, undesirable and impractical [4]. A pre-OCR language identification system would enable the correct OCR system to be selected in order to achieve the best character interpretation of the document [7]. This area has not been very widely researched to date, despite its growing importance to the document image processing community and the progression towards the “paperless office” [7]. Keeping this drawback in mind, this paper attempts to solve the more fundamental problem of identifying the language of a text from a multilingual document before its contents are automatically read.

Language identification is one of the vision application problems. Generally, the human visual system identifies the language in a document using visible characteristic features such as texture, horizontal lines and vertical lines, which are visually perceivable and appeal to visual sensation. This human visual perception capability has been the motivation for the development of the proposed system. In this context, this paper attempts to simulate the human visual system and to identify the language based on visual clues, without reading the contents of the document.

In a multilingual country like India (India has 18 regional languages derived from 12 different scripts; a script could be a common medium for different languages [8]), documents such as bus reservation forms, passport application forms, examination question papers, bank challans, language translation books and money-order forms may contain text words in more than one language. For such an environment, a multilingual OCR system is needed to read the multilingual documents. To make a multilingual OCR system successful, it is necessary to separate the different language regions of the document before feeding them to individual OCR systems. In this direction, multilingual document segmentation has strong direct application potential, especially in a multilingual country like India.

In the context of Indian languages, some amount of research work has been reported [2, 4, 8, 9]. Further, there is a growing demand for automatically processing documents in every state in India, including Karnataka. Under the three-language formula [8], adopted by most of the Indian states, a document in a state may be printed in its respective official regional language, the national language Hindi and also in English. Accordingly, a document produced in Karnataka, a state in India, may be printed in its official regional language Kannada, the national language Hindi and also in English. For such an environment, a multilingual OCR system is needed to read the multilingual documents. To be successful, such a multilingual OCR system has to work in two stages: (i) identification and separation of the different language portions of the document and (ii) feeding of the individual language regions to the appropriate OCR system. In this paper, we focus on the first stage of the multilingual OCR system and present procedures for identification and separation of the Kannada, Hindi and English text portions of multilingual documents produced in Karnataka. In the present case, the task could be called either script or language identification, since the three languages Kannada, Hindi and English belong to three different scripts.

1.1.  Previous work

The literature survey reveals that some amount of work has been carried out in script/language identification. Peake and Tan [7] have proposed a method for automatic script and language identification from document images using multiple channel (Gabor) filters and gray level co-occurrence matrices for seven languages: Chinese, English, Greek, Korean, Malayalam, Persian and Russian. Tan [2] has developed a rotation invariant texture feature extraction method for automatic script identification for six languages: Chinese, Greek, English, Russian, Persian and Malayalam. In the context of Indian languages, some amount of research work on script/language identification has been reported [8, 10, 11, 13]. Pal and Choudhuri [8] have proposed an automatic technique for separating the text lines of 12 Indian scripts (English, Devanagari, Bangla, Gujarati, Kannada, Kashmiri, Malayalam, Oriya, Punjabi, Tamil, Telugu and Urdu) using ten triplets formed by grouping English and Devanagari with any one of the other scripts. Santanu Choudhuri et al. [3] have proposed a method for identification of Indian languages by combining a Gabor filter based technique and a direction distance histogram
classifier considering Hindi, English, Malayalam, Bengali, Telugu and Urdu. Basavaraj Patil and Subbareddy [9] have developed a character script class identification system for machine printed bilingual documents in English and Kannada scripts using a probabilistic neural network. Pal and Choudhuri [10] have proposed an automatic separation of Bangla, Devanagari and Roman words in multilingual multi-script Indian documents. Nagabhushan et al. [13] have proposed a fuzzy statistical approach to Kannada vowel recognition based on invariant moments. Pal et al. [12] have suggested a word-wise script identification model for a document containing English, Devanagari and Telugu text. Chanda and Pal [11] have proposed an automatic technique for word-wise identification of Devanagari, English and Urdu scripts from a single document. Spitz [18] has proposed a technique for distinguishing Han and Latin based scripts on the basis of spatial relationships of features related to the character structures. Pal et al. [19] have developed a script identification technique for Indian languages by employing new features based on the water reservoir principle, contour tracing, jump discontinuity, and left and right profiles. Ramachandra et al. [20] have proposed a method based on rotation-invariant texture features using a multichannel Gabor filter for identifying six Indian languages (Bengali, Kannada, Malayalam, Oriya, Telugu and Marathi). Hochberg et al. [21] have presented a system that automatically identifies the script form using cluster-based templates. Gopal et al. [22] have presented a scheme to identify different Indian scripts through hierarchical classification, which uses features extracted from the responses of a multi-channel log-Gabor filter. Our survey of previous research work in the area of document script/language identification shows that most of it deals with scripts/languages used in other countries and only a little with those of our country, and hardly any attempts focus on the three languages Kannada, Hindi and English used in Karnataka, an Indian state.

In one of our earlier works [4], it was assumed that a given document contains text lines in one of the three languages Kannada, Hindi and English. In one of our previous papers [14], the results of detailed investigations were presented on the applicability of horizontal and vertical projections and segmentation methods to identify the language of a document, considering specifically the three languages Kannada, Hindi and English. It is quite natural that documents produced in the border regions of Karnataka may also be printed in the regional languages of the neighboring states, such as Telugu, Tamil, Malayalam and Urdu. The system [4] was unable to identify text words in Telugu, Tamil, Malayalam and Urdu, and hence these text words were misclassified as whichever of the three languages is nearest in visual appearance. For example, Telugu is misclassified as Kannada and Tamil is misclassified as English. If the document contains text words in languages other than the anticipated ones, our previous algorithm fails by misclassifying those text words.

Keeping the drawback of the previous method [15] in mind, we propose a system that more accurately identifies and separates the Kannada, Hindi and English portions of a document and also classifies portions in any other language into a fourth class, OTHERS, since our intention is to identify only Kannada, Hindi and English. The system works in four stages: in the first stage Hindi is identified, in the second stage Kannada is identified, in the third stage English is identified, and in the fourth and last stage all languages other than Kannada, Hindi and English are grouped into the class OTHERS without identifying the individual language.

This paper is organized as follows. Section 2 describes some discriminating features in the characters of Kannada, Hindi and English text words. Section 3 discusses the two models proposed for identifying the three languages Kannada, Hindi and English. The experimental details and the results obtained are presented in Section 4. Conclusions are given in Section 5.

2.  Some Visual Discriminating Features of Kannada, Hindi and English Text Words

Feature extraction is an integral part of any recognition system. The aim of feature extraction is to describe the pattern by means of the minimum number of features or attributes that are effective in discriminating pattern
classes [13]. The new algorithms presented in this paper are inspired by the simple observation that every script/language defines a finite set of text patterns, each having a distinct visual appearance [1]. The character shape descriptors take into account any feature that appears to be distinct for the language [1], and hence every language could be identified based on its visual discriminating features.

The presence and absence of the four discriminating features of Kannada, Hindi and English text words are given in Table-1.

2.1.  Some visual discriminating features of Hindi language

In the Hindi (Devanagari) language, many characters have a horizontal line at the upper part. This line is called sirorekha in Devanagari [8]; we shall call it the head-line. When two or more characters sit side by side to form a word, the character head-line segments mostly join one another, producing one continuous head-line and only one component for each text word. Since the characters are connected through their head-line portions, a Hindi word appears as a single component and cannot be segmented further into blocks, which could be used as a visual discriminating feature to recognize the Hindi language. We can also observe that most Hindi characters have vertical line-like structures. Because two or more characters are connected together through their head-line portions, the width of the block is much larger than the height of the text line. Some typical Hindi words are given below:

[Sample Hindi word images]

2.2.  Some visual discriminating features of English language

It has been found that a distinct characteristic of most English characters is the existence of vertical line-like structures [8] and uniformly sized characters, with each character having only one component (except “i” and “j” in lower-case).

2.3.  Some visual discriminating features of Kannada language

Most Kannada characters have horizontal line-like structures. The Kannada character set has 50 basic characters, out of which the first 14 are vowels and the remaining characters are consonants [11]. A consonant combined with a vowel forms a modified compound character that has more than one component and is much larger in size than the corresponding basic character. A document in the Kannada language is thus made up of a collection of basic and compound characters, resulting in equal and unequal sized characters [11], with some characters having more than one component, which could be expected to help in identifying the text words of the Kannada language. Some typical Kannada words are given below:

[Sample Kannada word images]

Table-1. Presence and absence of discriminating features of Kannada, Hindi and English text words. (Yes means presence and No means absence of that feature. F1: Horizontal lines; F2: Vertical lines; F3: Variable sized blocks; F4: Blocks with more than one component.)

Text words    F1     F2     F3     F4
Kannada       Yes    No     Yes    Yes
Hindi         Yes    Yes    Yes    No
English       No     Yes    No     No

2.4.  Zonalization of Kannada, Hindi and English Text Lines

Pal and Choudhuri [8] have proposed that text lines of some Indian languages might be partitioned into three zones. In this paper, we have adopted the zonalization proposed by Pal and Choudhuri [8], which is useful in this method for feature extraction. A sample text line in the English, Hindi and Kannada languages, partitioned/zonalized into three zones, is shown in Figure-1. Related terminologies used in partitioning the text lines are summarized below:

An imaginary line where the first uppermost black pixels of the characters of a text line lie is called an upper line.
An imaginary line where the first lowermost black pixels of the characters of a text line lie is called a lower line.
An imaginary line where the maximum number of uppermost black pixels of the characters of a text line lie is called a mean line.
An imaginary line, where
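The upper, lower and mean lines, whose definitions are given in full above, can be illustrated with a short Python sketch. This is only an illustration under stated assumptions, not the authors' implementation: the function name `zone_lines` is hypothetical, the text line is assumed to be already binarized and passed as a nested list in which a truthy cell is a black pixel, and the uppermost black pixel is taken per pixel column rather than per segmented character.

```python
from collections import Counter


def zone_lines(line_img):
    """Return (upper, lower, mean) row indices for one binarized text line.

    line_img is a list of equal-length rows; a truthy cell is a black pixel.
    Assumption: the uppermost black pixel is taken per pixel column, not
    per segmented character.
    """
    inked_rows = [r for r, row in enumerate(line_img) if any(row)]
    if not inked_rows:
        raise ValueError("text line contains no black pixels")
    # Upper line: topmost row with ink. Lower line: bottommost row with ink.
    upper, lower = inked_rows[0], inked_rows[-1]

    # Row of the uppermost black pixel in every column that contains ink.
    tops = []
    for col in range(len(line_img[0])):
        for r in range(upper, lower + 1):
            if line_img[r][col]:
                tops.append(r)
                break

    # Mean line: the row on which most of those uppermost pixels fall.
    mean = Counter(tops).most_common(1)[0][0]
    return upper, lower, mean


# Toy 6x5 line: one tall stroke (column 1) and one descender (column 2).
sample = [
    [0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [1, 1, 1, 0, 1],
    [1, 0, 1, 0, 1],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0],
]
print(zone_lines(sample))  # (1, 4, 2)
```

In the toy line, the tall stroke sets the upper line at row 1 and the descender sets the lower line at row 4, while three of the four inked columns start at row 2, which therefore becomes the mean line. Working per column keeps the sketch independent of any character segmentation step; a per-character variant would need the segmented blocks as input.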