International Journal of Computer Applications (0975 – 8887)
Volume 39 – No. 6, February 2012

Shirorekha Chopping Integrated Tesseract OCR Engine for Enhanced Hindi Language Recognition

Nitin Mishra, Dept. of Phy. & Comp. Sc., Dayalbagh Edu. Institute, Dayalbagh, Agra, India
C. Patvardhan, Dept. of Electrical Engg., Dayalbagh Edu. Institute, Dayalbagh, Agra, India
C. Vasantha Lakshmi, Dept. of Phy. & Comp. Sc., Dayalbagh Edu. Institute, Dayalbagh, Agra, India
Sarika Singh, Dept. of Phy. & Comp. Sc., Dayalbagh Edu. Institute, Dayalbagh, Agra, India

ABSTRACT
The Tesseract OCR Engine is one of the most efficient open source OCR engines currently available. Tesseract OCR 3.01 can recognize Hindi, but its performance still needs improvement. Recognition accuracy for Hindi is quite low even for printed text, because the conjunct character combinations of Hindi are not easily separable due to partial overlapping. The proposed approach solves this problem, so that Devanagari conjunct characters can easily be segmented and recognized using the Tesseract OCR Engine. This paper presents a complete methodology to improve Hindi recognition accuracy. It also compares the approach with other available Devanagari OCR engines on the basis of recognition accuracy, processing time, font variations and database size.

General Terms
Pattern Recognition

Keywords
Tesseract, Hindi, OCR, Shirorekha Chopping, Character Segmentation

1. INTRODUCTION
Today, Tesseract is considered one of the most accurate open source OCR engines available. The Tesseract OCR Engine was one of the best 3 engines in the 1995 UNLV Accuracy Test. Between 1995 and 2006, however, there was little activity in Tesseract, until it was open sourced by HP and UNLV in 2005. It was re-released to the open source community in August 2006 by Google [1]. Tesseract can also be trained for new languages and scripts [2]. A complete overview of the Tesseract OCR engine can be found in [3]. While Tesseract was originally developed for English, it has since been extended to recognize French, Italian, Catalan, Czech, Danish, Polish, Bulgarian, Russian, Greek, Korean, Spanish, Japanese, Dutch, Chinese, Indonesian, Swedish, German, Thai, Arabic and Hindi, among others.

Training the Tesseract OCR Engine for Hindi requires in-depth knowledge of the Devanagari script in order to collect the character set [4]. Moreover, the engine requires not just training on the collected dataset but also handling of the character segmentation and clubbing issues that arise from script specific features [5], i.e. Shirorekha, maatra etc. Hindi has an enormous number of character combinations [6]; training all possible combinations of Hindi characters is not a good technique. It is highly desirable to choose a Smart Database that has all basic characters, half characters and the minimal set of conjunct character combinations that may occur in some word, and that leaves out all unfavorable combinations. The segmentation issues related to Shirorekha based scripts are presented in [7].

The proposed Hindi Language Database consists of basic vowels, consonants, extensions, special symbols, punctuation marks, English numerals, Devanagari numerals and a minimal set of favorable vowel-consonant combinations, bi-consonant combinations and bi-consonant-vowel combinations. Tesseract based research has shown robust results on Bangla and Kannada [8, 9], but no efficient recognition results have yet been shown for Hindi. This paper presents an improvement in printed Devanagari script recognition using the Tesseract OCR Engine.

Table 1: General Vowels
Vowel | Dependent sign | Transliteration
अ | - | a
आ | ◌ा | aa/A
इ | ◌ि | e/i
ई | ◌ी | ee/ii
उ | ◌ु | u
ऊ | ◌ू | oo/uu
ए | ◌े | e
ऐ | ◌ै | ai
ओ | ◌ो | o
औ | ◌ौ | ou
अं | ◌ं | aM
अः | ◌ः | aH

Table 2: Other Vowels
ॠ (r^^), ॡ (l^^), ॐ (AUM)

Table 3: Consonants
क (ka), ख (kha), ग (ga), घ (gha), ङ (nga)
च (cha), छ (chha), ज (ja), झ (jha), ञ (nja)
ट (Ta), ठ (Tha), ड (Da), ढ (Dha), ण (Na)
त (ta), थ (tha), द (da), ध (dha), न (na)
प (pa), फ (Pha/fa), ब (ba), भ (bha), म (ma)
य (ya), र (ra), ल (la), व (va/wa), श (Sha)
ष (shh), स (sa), ह (ha), क्ष (ksh), त्र (tra)
ज्ञ (jnja)

Table 4: Dot+Consonants (Extensions)
ऩ (.na), ऱ (.ra), ऴ (.La), क़ (.ka), ख़ (.kha), ग़ (.ga)
ज़ (.ja), ड़ (.Da), ढ़ (.Dha), फ़ (.fa), य़ (.ya)

Table 5: Special Symbols
Anusvara ◌ं, Visarga ◌ः, Chandra Bindu ◌ँ, Chandra ◌ॅ
Nukta ◌़, Virama ◌्, Udatta ◌॑, Anudatta ◌॒
Purna virama ।, Deergha virama ॥, Avagraha ऽ, Grave Accent ◌॓
Acute Accent ◌॔

Table 6: Punctuation Marks and Other Symbols
“ ? ; % * / ( ) \
= { } [ ] , - : !

Table 7: Numerals
Devanagari: ० १ २ ३ ४ ५ ६ ७ ८ ९
English: 0 1 2 3 4 5 6 7 8 9

2. METHODOLOGY
As Fig 1 shows, the proposed approach can be divided into two major components, described below. The blocks of the diagram are:

Training Data Generation: Smart Hindi Database Selection, Training Image Generation, Box File Generation, Train File Generation, Character Set File Generation, Font Properties Selection, Feature Extraction, Clustering, Dictionary Data Preparation, Post Processing Ambiguity Removal, Training Data Compaction.

Test Data Processing: Shirorekha Chopping Based Preprocessing, Binarization, Noise Elimination, Blob Detection, Skew Detection and Correction, Character Segmentation, Matching, Post Processing, Result Generation, Recognizing the Test Image.

2.1 Training Data Generation
The basic guidelines for preparing training data are clearly explained in [10] and are followed here to prepare the customized training data. The process has the phases described below.

2.1.1 Smart Hindi database selection
The training database consists of 15 vowels, 36 consonants, 11 extensions, 13 special symbols, 18 punctuation marks and other symbols, 10 English numerals, 10 Devanagari numerals, and a minimal set of 218 vowel-consonant combinations, 276 bi-consonant combinations and 179 bi-consonant-vowel combinations. This gives a total of 786 character combinations, set in the Mangal font at 18 pt. The coarse classification of Hindi characters is presented in [11].

2.1.2 Training image generation
This phase creates a text image in a single font with sufficiently spaced out characters. Tesseract recommends preparing a new image file for each new font.

2.1.3 Box file generation
The bounding box information for all characters in the training image is generated, specifying the Devanagari script components in the box file. The default bounding boxes can easily be edited using box file editors such as the cowboxer tool.

2.1.4 Train file generation
Box file editors also allow editing the Unicode characters assigned to the corresponding bounding boxes.

2.1.5 Character set file generation
The character set file specifies properties of the Unicode characters, such as uppercase, lowercase, digit and punctuation mark. Since Devanagari does not distinguish upper and lower case characters, only digits and punctuation marks have to be specified.

2.1.6 Font properties selection
Font properties such as italic, bold, fixed and serif must be specified before training the data. In this work only normal fonts have been considered.

2.1.7 Feature extraction
This phase extracts the shape features of the characters from the training data image.

2.1.8 Clustering
This phase clusters the character shape features into prototypes.

2.1.9 Dictionary data preparation
Tesseract may use up to 5 types of dictionary files, which are converted into Directed Acyclic Word Graph (DAWG) files.
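The box file described in Section 2.1.3 is a plain text file, which makes it easy to inspect or post-process outside a box file editor. The following is a minimal sketch, not part of the paper's toolchain, that assumes the six-field line layout used by Tesseract 3.x box files (symbol, left, bottom, right, top, page, separated by spaces). The `Box` class and the function names are hypothetical helpers introduced only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Box:
    """One bounding box entry (hypothetical helper type, not a Tesseract API)."""
    symbol: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line: str) -> Box:
    """Parse one box line, assuming 'symbol left bottom right top page'."""
    # Split from the right so that the symbol itself may contain spaces.
    fields = line.strip().rsplit(" ", 5)
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, got {len(fields)}: {line!r}")
    symbol, *numbers = fields
    left, bottom, right, top, page = (int(n) for n in numbers)
    return Box(symbol, left, bottom, right, top, page)


def parse_box_text(text: str) -> List[Box]:
    """Parse the full contents of a box file, skipping blank lines."""
    return [parse_box_line(l) for l in text.splitlines() if l.strip()]


if __name__ == "__main__":
    # Made-up coordinates, for illustration only.
    sample = "क 12 30 40 62 0\nक्ष 48 30 90 62 0\n"
    for box in parse_box_text(sample):
        print(box.symbol, box.right - box.left, box.top - box.bottom)
```

A parser like this could be used to check that every symbol of the selected database actually appears in the box file before training starts.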
Fig 1: Block Level Diagram

2.1.10 Post processing ambiguity removal
Editing the unicharambigs file removes the intrinsic ambiguity between two similar looking characters, or their combinations, by means of a substitution rule.

2.1.11 Training data compaction
Finally, all the generated files are compacted into a single file.

Fig 2 lists all the resources and commands used in the experiments. The Mangal font was used in the training image.

OS used: Ubuntu 10.04
Tesseract OCR version used: 3.01
Training image used: hin.mangal.exp1.tif
Commands used for Training Data Generation:
tesseract hin.mangal.exp1.tif hin.mangal.exp1 batch.nochop makebox
tesseract hin.mangal.exp1.tif hin.mangal.exp1 nobatch box.train
unicharset_extractor hin.mangal.exp1.box
cp unicharset hin.unicharset
echo mangal 0 0 0 0 0 > font_properties
mftraining -F font_properties -U hin.unicharset hin.mangal.exp1.tr
cntraining hin.mangal.exp1.tr
mv Microfeat hin.Microfeat
mv normproto hin.normproto
mv pffmtable hin.pffmtable
mv mfunicharset hin.mfunicharset
mv inttemp hin.inttemp
wordlist2dawg frequent_words_list hin.freq-dawg hin.unicharset
combine_tessdata hin.
Fig 2: Resources and Commands used

2.2 Test Data Processing
This component has two sub-components, described below.

2.2.1 Shirorekha Chopping Algorithm
In the preprocessing phase, horizontal and vertical histograms are generated for each line of text identified in the test image. The Shirorekha of the text is chopped wherever the distance between the bottom of a valley and the x-axis of the corresponding vertical histogram falls below a threshold T, which depends on the font size. The motivation behind Shirorekha Chopping is that good segmentation techniques can increase OCR performance [12].

Font used: Mangal
Font size: 18
Threshold T = 18/8 = 2.25
Fig 3: Shirorekha Chopping based on Font size specific threshold

The dots in Fig 3 represent the chopping points on the Shirorekha for the corresponding word in the test image.

Fig 4: Shirorekha Chopping in Test Image

Fig 4 illustrates the Shirorekha Chopping. The short lines highlight the valleys at which the distance between the bottom of the valley and the x-axis of the corresponding vertical histogram goes below the threshold T. The Shirorekha is chopped at these valleys. After preprocessing is complete, the Shirorekha Chopped test image shown in Fig 5 is obtained.

Fig 5: Shirorekha Chopped Test Image

The Shirorekha Chopped test image is now easily segmented using the inbuilt segmentation technique of the Tesseract OCR Engine, as shown in Fig 6.

Fig 6: Shirorekha Chopping based Character Segmentation

2.2.2 Recognizing the Test Image
In this phase, the preprocessed test image is recognized using the training data.

Test image used: test.tif
Commands used for Test Data Processing:
tesseract test.tif result -l hin

3. EXPERIMENTAL RESULTS
The recognition accuracy, processing time and database size, with preprocessing and font variations, were tested against Google’s hin.traineddata [13] and Parichit’s hin.traineddata [14]. The test image sample is shown in Fig 7, and the test results can be compared in Fig 8.

Fig 7: Test image

Fig 8: Experimental Results Comparison
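The chopping rule of Section 2.2.1 can be illustrated with a short sketch. This is one possible reading of the description, not the authors' implementation. It assumes a binarized text line given as rows of 0/1 pixels (1 = ink), takes the Shirorekha to be the rows whose ink count is at least 80 % of the densest row (the 0.8 fraction is an assumption made here, as are all function and parameter names), and uses the threshold T = font size / 8 from Fig 3.

```python
from typing import List

Image = List[List[int]]  # rows of 0/1 pixels, 1 = ink


def chop_shirorekha(line: Image, font_size: float,
                    divisor: float = 8.0,
                    headline_fraction: float = 0.8) -> Image:
    """Return a copy of a text-line image with the Shirorekha chopped.

    Sketch only: headline_fraction is an assumed heuristic for locating
    the Shirorekha rows; divisor=8 follows 'Threshold = 18/8' in Fig 3.
    """
    if not line or not line[0]:
        return [row[:] for row in line]

    height, width = len(line), len(line[0])
    threshold = font_size / divisor

    # Horizontal histogram: ink pixels per row.
    row_hist = [sum(row) for row in line]
    # Vertical histogram: ink pixels per column.
    col_hist = [sum(line[y][x] for y in range(height)) for x in range(width)]

    # Assumed heuristic: the Shirorekha is the band of densest rows.
    peak = max(row_hist)
    if peak == 0:
        return [row[:] for row in line]
    headline_rows = [y for y, count in enumerate(row_hist)
                     if count >= headline_fraction * peak]

    chopped = [row[:] for row in line]
    for x, count in enumerate(col_hist):
        # A valley whose height is below T holds little more than the
        # headline itself, so the headline is cut in that column.
        if 0 < count < threshold:
            for y in headline_rows:
                chopped[y][x] = 0
    return chopped


if __name__ == "__main__":
    # Two character bodies joined only by a 2-pixel-thick headline.
    body = [1, 1, 1, 0, 0, 0, 1, 1, 1]
    image = [[1] * 9, [1] * 9, body[:], body[:], body[:], body[:]]
    for row in chop_shirorekha(image, font_size=18):
        print("".join("#" if p else "." for p in row))
```

With an 18 pt font the threshold is 2.25, so columns whose only ink is a two-pixel headline are cut, while columns that cross a character body or a vertical stroke are left intact.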
After a number of tests, the final results were obtained, which are described below.

Table 8: Font Variation Tolerance Comparison
Training data | Recognition rate with Mangal as testing font | Recognition rate with Krutidev as testing font
Google’s hin.traineddata | 45.6 % | 44.8 %
Parichit’s hin.traineddata | 23.4 % | 21.2 %
Proposed hin.traineddata | 94.9 % | 86.9 %

Table 9: Average Recognition Rate Comparison
Training data | Average recognition rate | Preprocessing used on test image
Google’s hin.traineddata | 45.2 % | No preprocessing
Parichit’s hin.traineddata | 22.3 % | No preprocessing
Proposed hin.traineddata | 90.9 % | Shirorekha Chopping

Table 10: Processing Time Comparison
Training data | Processing time | Total characters in test image
Google’s hin.traineddata | 2000 ms | 94
Parichit’s hin.traineddata | 1500 ms | 94
Proposed hin.traineddata | 1000 ms | 94

Table 11: Training Data Size Comparison
Training data | Training data size | Training font
Google’s hin.traineddata | 13.8 MB | -
Parichit’s hin.traineddata | 13.1 MB | -
Proposed hin.traineddata | 7.5 MB | Mangal

4. CONCLUSIONS
There is a significant improvement in recognition rate, processing time and training database size after integrating Shirorekha Chopping with the Tesseract OCR Engine. Table 8 shows higher accuracy when the testing font is the same as the training font and lower accuracy when it differs, but the font variation tolerance is still considerably better than that of the existing training data. Table 9 shows that the average recognition rate is considerably improved by Shirorekha Chopping. The proposed Shirorekha Chopping based preprocessing approach not only improves the recognition rate but also allows training only the two or more touching conjunct characters, along with basic characters and isolated half characters. The single touching conjunct characters may be left out, since Shirorekha Chopping easily segments them into the basic components that were trained. This leads to a comparatively smaller training database (Table 11). The proposed approach also runs faster than Google’s and Parichit’s training data (Table 10). Extension to multiple fonts is in progress as future work.
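The averages in Table 9 appear to be the means of the two per-font recognition rates in Table 8. The short check below recomputes them from the published figures and also derives the processing time per character from Table 10. The dictionary keys and variable names are introduced here for illustration only; the numbers are those of Tables 8 and 10.

```python
# Recognition rates from Table 8: (Mangal test font, Krutidev test font), in %.
recognition_rates = {
    "Google": (45.6, 44.8),
    "Parichit": (23.4, 21.2),
    "Proposed": (94.9, 86.9),
}

# Processing times from Table 10, in milliseconds, for a 94-character image.
processing_ms = {"Google": 2000, "Parichit": 1500, "Proposed": 1000}
total_characters = 94

# Mean of the two per-font rates, rounded to one decimal as in Table 9.
average_rate = {
    name: round((mangal + krutidev) / 2, 1)
    for name, (mangal, krutidev) in recognition_rates.items()
}

# Average processing time per character, rounded to one decimal.
ms_per_character = {
    name: round(ms / total_characters, 1)
    for name, ms in processing_ms.items()
}

if __name__ == "__main__":
    for name in recognition_rates:
        print(f"{name}: {average_rate[name]} % average, "
              f"{ms_per_character[name]} ms per character")
```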