A Segmentation-free Approach to Recognise Printed Sinhala Script

H. L. Premaratne
University of Colombo School of Computing, Sri Lanka
Lalith.Premaratne@ide.hh.se, hlp@mail.cmb.ac.lk

J. Bigun
School of Information Science, Computer and Electrical Engineering
Halmstad University, S-301 18 Halmstad, Sweden
Josef.Bigun@ide.hh.se

Abstract

The majority of character recognition algorithms, such as those using ANNs, need the script to be segmented prior to recognition. In contrast to Western scripts, Brahmi-descended South Asian scripts such as Sinhala contain modifier symbols, which make segmentation a difficult task that needs to be addressed as a separate issue. Further, the change of shape of the basic character (in violation of the modification rules) during the modification process makes some modified Sinhala characters impossible to segment. The proposed method, which uses Linear Symmetry to examine the correlation between characters in the script and the testing alphabet, recognises characters directly within the image of the script. A similar method is used to resolve confusing characters. Experiments show highly favourable results not only for the basic characters of the alphabet but also for the modifier symbols. A novel but simple method using Linear Symmetry for skew correction has also been proposed.

Key Words: Linear Symmetry, Recognition, Segmentation, Skew Correction

1. INTRODUCTION

1.1 Alphabet and the Modification Process

The Sinhala script, used by over 80% of the 18.4 million population of Sri Lanka, is descended from the ancient Brahmi script and has evolved independently over many centuries. The Sinhala language is unique to Sri Lanka, and the Sinhala characters, which are generally round in shape, differ from those of all the other Brahmi-descended scripts in South Asia. The Sinhala alphabet consists of 18 vowels, 41 consonants and 17 modifier symbols. A vowel may appear only as the first character of a word, and a consonant is modified using one or more of the modifier symbols to produce the required vocal sound. The total number of different modifications from the entire alphabet, including the basic characters, is nearly 400. Although each character possesses a characteristic shape that distinguishes it from the others, some characters resemble one or more of the other characters in appearance. Some examples are given in Figure 1.

Modification of a character is carried out by simply adding one or more modifier symbols before, after, above or below the character without affecting its general shape. However, in most printed scripts this rule is violated for a specific subset of 10 characters of the alphabet, to give a better appearance (Figure 2). Also, in some modifications, the joint between the character and the modifier symbol is smoothed to make the modified character appear as a single symbol.

1.2 Characteristics of the Script

A single line of script is organised in three horizontal layers. The middle layer, which contributes approximately 50% of the total line height, mainly holds fifteen (15) basic characters and nine (9) modifier symbols. Twenty-two (22) other basic characters occupy the middle layer and the upper layer, with approximately 75% and 25% of the total height of each character in each layer respectively. The middle and the lower layers hold the remaining eight (8) characters, with approximately 75% and 25% of the total height of each character in each layer respectively. Four (4) modifiers occupy the upper layer while the remaining five (5) modifiers are assigned to the lower layer. The upper and the lower layers are of equal height, each having 25% of the total line height (Figure 4).
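The three-layer layout of a text line (upper and lower layers of about 25% of the line height each, middle layer of about 50%) can be turned into pixel row ranges. The following is only an illustrative sketch: the function name `layer_bands`, the rounding to whole pixels and the example coordinates are assumptions, and the proportions are the approximate ones quoted in the text, so real fonts will deviate.

```python
def layer_bands(top, height):
    """Split a text line into upper/middle/lower half-open row ranges.

    Assumes the approximate 25% / 50% / 25% split described in the text.
    `top` is the first pixel row of the line, `height` its height in pixels.
    """
    upper_end = top + round(height * 0.25)
    middle_end = top + round(height * 0.75)
    return {
        "upper": (top, upper_end),
        "middle": (upper_end, middle_end),
        "lower": (middle_end, top + height),
    }


def layers_touched(bands, first_row, last_row):
    """Names of the layers that a glyph spanning first_row..last_row overlaps."""
    return [name for name, (start, end) in bands.items()
            if first_row < end and last_row >= start]


# Hypothetical text line starting at pixel row 100 and 40 pixels high.
bands = layer_bands(top=100, height=40)
tall_glyph = layers_touched(bands, first_row=102, last_row=128)
```

A glyph that reports both `upper` and `middle` would belong to the group of characters that extend above the middle layer; the two layer boundaries computed here are the same ones the skew-detection step relies on.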
1.3 The OCR Technology and Recent Developments

Optical Character Recognition (OCR) is the process of converting typed or printed documents into machine-readable code. The original typed or printed document, scanned to form an image file, is the input to the OCR software system. The result of scanning is a picture represented as light intensities on a rectangular grid of points, which does not yet identify individual characters. The OCR system in turn recognises each character or symbol in the image file and makes it available in a suitable text editor, where it can be edited or modified.

Most OCR systems use Artificial Neural Networks (ANNs) as the major tool. In addition to the features identified in the rectangular grid that encloses a single character, other features of the character, such as curvature features and transition counts, are also used. In the case of handwriting recognition, some common approaches are ANNs, mathematical morphology, shape analysis and hidden Markov models (HMM). Each of these approaches has its own strengths and weaknesses. Researchers have achieved a significant improvement in performance by combining two or more of these methods. The majority of alphabets contain confusing characters that closely resemble each other. Resolving this problem, especially in the case of handwriting recognition, is a critical issue.

Research on South and South-East Asian scripts lags behind that on European scripts for various reasons. The main reason is the complexity of the scripts. In Asian alphabets, the number of characters is high and the generation of a vocal sound by modifying a character using modifier symbols is complex. Extensive research has been done on a few scripts used by very large populations. Some of this research has been initiated in developed countries due to the high exposure to such research.

At present, OCR software for languages such as Sindhi, Bengali and Thai is available as commercial products. Research on the Devanagari and Tamil scripts has achieved tremendous progress. To the best of our knowledge, little or no research has been done on the recognition of printed Sinhala script.

2. RECOGNITION PROCESS

2.1 Theory

The theory used in the recognition process is the orientation field tensor, which has been used effectively in many applications over the past few years. A local neighbourhood with ideal local orientation is characterised by the fact that the gray value only changes in one direction. In all other directions it is constant. Since the gray values are constant along lines, local orientation is also denoted as linear symmetry [1]. The linear symmetry is also represented in the form of a vector. Since the direction of a simple neighbourhood is different from the direction of a gradient, which is strictly cyclic, representation of the linear symmetry needs the doubling of the angle of orientation. The vector that represents the linear symmetry is composed of two quantities. One is the orientation angle and the other is the certainty measure.

2.1.1 Mathematical representation

The local orientation is determined using the following three steps [1].

i. Select a local neighbourhood from the image using a window function.
ii. Fourier transform the windowed image.
iii. Determine the local orientation by fitting a straight line to the spectral density distribution.

When fitting a straight line, the sum of the squares of the distances of the data points is minimised. Since the minimisation of $d_i$ is the same as the maximisation of $s_i$, equation (2) is obtained.

The orientation is obtained as the eigenvector of the largest eigenvalue of $J$. $J$ can be rotated so that it is diagonalised. The rotation matrix is in fact the eigenvector matrix given in equation (1).

Comparison of the diagonal elements on both sides of equation (1) gives

$$\lambda_1 + \lambda_2 = J_{xx} + J_{yy}$$

$$\lambda_1 - \lambda_2 = (J_{xx} - J_{yy})\cos 2\varphi + 2J_{xy}\sin 2\varphi$$

$$\lambda_1 - \lambda_2 = \left\langle (J_{xx} - J_{yy},\ 2J_{xy}),\ (\cos 2\varphi,\ \sin 2\varphi)^T \right\rangle = \left\langle I_{20},\ (\cos 2\varphi,\ \sin 2\varphi)^T \right\rangle = \left\langle I_{20},\ \frac{I_{20}}{|I_{20}|} \right\rangle = |I_{20}|$$

$$\therefore\ \lambda_1 - \lambda_2 = |I_{20}|$$

Define $\nabla f = \partial f/\partial x + i\,(\partial f/\partial y)$; then

$$I_{20} = \iint (\nabla f)^2 \,dx\,dy = \iint \left[ \left(\frac{\partial f}{\partial x}\right)^2 - \left(\frac{\partial f}{\partial y}\right)^2 + 2i\,\frac{\partial f}{\partial x}\frac{\partial f}{\partial y} \right] dx\,dy = \iint (\omega_x + i\omega_y)^2 (\omega_x - i\omega_y)^0\, |F|^2 \,d\omega_x\,d\omega_y = (\lambda_1 - \lambda_2)\exp(2i\varphi)$$

$$I_{11} = \iint (\omega_x + i\omega_y)^1 (\omega_x - i\omega_y)^1\, |F|^2 \,d\omega_x\,d\omega_y = \iint (\omega_x^2 + \omega_y^2)\,|F|^2 \,d\omega_x\,d\omega_y = \iint \left[ \left(\frac{\partial f}{\partial x}\right)^2 + \left(\frac{\partial f}{\partial y}\right)^2 \right] dx\,dy = \lambda_1 + \lambda_2$$

The angle of $I_{20}$ represents twice the inclination angle of the fitted orientation if linear symmetry exists, and $I_{11}$ represents the sum of the best and the worst total errors.

The Linear Symmetry algorithm that extracts the tensor is characterised by the fact that it delivers a dense orientation field along with certainties. In the case of high confidence in the existence of an orientation, the linear orientation represents the least change of gray values in one direction and the maximal change in the orthogonal direction. Hence a Linear Symmetry Tensor for an image is constructed by averaging the orientation of the local neighbourhood for each pixel of the image.

2.1.2 Implementation

The LS Tensor for an image is built as explained in the following steps.

Four 1-D filters are generated: dx (Gaussian derivative kernel), dy (= -dx'), gx (Gaussian kernel) and gy (= gx').

The two derivative convolutions of the original image with respect to x and y, dxf (= convolution(gy, convolution(dx, Image))) and dyf (= convolution(gx, convolution(dy, Image))), are constructed using the above pairs of filters.

The LS Tensor (complex) is then given by

LS = (dxf + j*dyf)^2, where j = √(-1)

The correlation between the character being tested and the image is calculated using the formula

absolute(convolution(conjugate(LS Tensor of Character), LS Tensor of Image))

2.2 Determination of Skew Angle

Almost all recognition algorithms need the text lines in the input image to be horizontal. Therefore, any skew in the input image needs to be corrected prior to recognition. Experiments show that the recognition algorithm proposed in this thesis tolerates a skew of +1° to -1°. The accuracy of recognition deviates considerably with increasing skew. Therefore a robust method for skew correction needs to be incorporated.

Careful observation of a line of Sinhala script shows that the boundary between the upper and the middle layers and the boundary between the middle and the lower layers (fig. 8) possess the highest amount of energy in the horizontal direction. The horizontal projection of a sample script clearly agrees with this concept. This is due to the fact that any character in the alphabet touches at least one of these boundaries. Therefore, tracing the appearance of one of these boundaries in a skewed script could be used to determine the skew angle. Although any straightforward method to detect a boundary line could have been used, a more appropriate method using the Linear Symmetry (LS) tensor has been proposed.

The Linear Symmetry tensor [1], which gives information for each pixel of the image on how it is organised with respect to the orientation within a local neighbourhood, can effectively be used to determine the orientation of the script. In general, the orientation angle of the resultant of all the vectors representing the LS for each pixel of the image provides a close approximation to the skew angle. In order to improve the accuracy, the interference to the final result from the following components should be eliminated.

i. Edges of the image.
ii. Background of the image, which consists of pixels having random orientations of low confidence.
iii. Other pixels (within the text area) having orientations of low confidence.

The results obtained for the LS tensor derived in section 3.3.2 yield the skew angle within +1° to -1° accuracy, which is well within the required accuracy for the recognition algorithm.

2.3 Recognition Procedure

2.3.1 Testing Database

The recognition process is based on the examination of the correlation of characters in the script with each character of the alphabet through a filtering operation. The testing alphabet, which consists of all the characters (including the modifier symbols), is built by extracting characters from an LS tensor. Each character in the testing alphabet is filtered (one at a time) through the LS tensor of the script in order to identify its occurrences in the entire script. The plot of correlation at each pixel (Fig. 10) shows that each occurrence of the character being tested gives a strong correlation. A suitable threshold that separates the required character from the rest of the characters in the script is then determined. This procedure is conducted for every character of the alphabet. During this process, it has been observed that a total of 35 characters, amounting to 60% of the alphabet, separate from all the other characters with a clear threshold (Fig. 10(a)), while the remaining 40% are confused with one or more characters of similar shape (Fig. 10(b)). Eight (8) such confusing groups have been identified. Once all the different confusing groups are identified, another level of filtering is carried out to separate each character within a confusing group. The secondary level of filtering examines the correlation of a distinct segment from one character with all the members in the group (Fig. 11). A suitable (secondary) threshold that separates each character from the rest is then determined. A further level of filtering is carried out if the confusion still occurs.

The structure of the testing database is as follows.

Character Identifier
LS Tensor of character
Primary Threshold
Flag to indicate confusing status
Secondary Threshold (for confusing characters)
Tertiary Threshold (for confusing characters)

2.3.2 Recognition

The image is initially pre-processed to remove the background noise. The image is then scaled (if necessary) to match the average height of a character to that of the testing alphabet. Recognition of a script is performed by filtering the LS tensor of each character of the testing alphabet with the LS tensor of the script. In each filtering cycle, all the occurrences of the character being tested are identified. If the testing character is a confusing one, the secondary level of filtering is carried out in order to determine the acceptance or rejection of the identified character. A tertiary level of filtering is carried out similarly.

It has been observed that, in addition to the highest value of correlation, usually produced at the centre of the character, a few more relatively high values are also produced at the neighbouring pixels. This is due to the fact that the template of the testing character nearly coincides with the character at the pixels neighbouring its centre. This would result in recognising the same character in the image more than once. Therefore, once the filtering has been performed, non-maxima in a small neighbourhood (e.g. 3x3) are suppressed in order to eliminate multiple acceptance of the same character.

The recognition algorithm is as follows:

Input image
Input database-of-characters   /* Alphabet */
Pre-process image
Perform Horizontal-projection
Extract Line-data
Construct LS-tensor
Read character
While not-end-of-alphabet do
    Filter character with the LS Tensor   /* Primary Filtering */
    Suppress non-maximums
    While not-end-of-image do
        Segment occurrences above threshold
        If confusing-character
            Determine relative threshold
            Perform secondary-filtering   /* and tertiary-filtering if necessary */
        End-If
        Store image-coordinates of each occurrence
    End-While   *** not-end-of-image ***
    Update output array   /* with ASCII value, row, column no. */
    Read character
End-While   *** not-end-of-alphabet ***
Sort output on Column No. within the Row No.

Since a character is identified directly within the image of the script, the need to segment individual characters does not arise. Symbols such as the comma, full stop and question mark are also recognised with the same accuracy.
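The primary filtering described in Sections 2.1.2 and 2.3.2 (build the complex LS image, correlate a character template with it, threshold, then suppress non-maxima in a 3x3 neighbourhood) can be sketched in dependency-free Python. This is an illustration under stated assumptions, not the authors' implementation: all function names, the kernel size (`sigma`, `radius`), the zero padding, the toy page and the threshold of 99% of the peak are hypothetical, and a direct (unflipped) correlation sum is used where the text writes a convolution call. Secondary and tertiary filtering for confusing characters are omitted.

```python
import math

def gaussian_pair(sigma=1.0, radius=2):
    """1-D Gaussian kernel g and its derivative d, sampled on [-radius, radius]."""
    xs = range(-radius, radius + 1)
    g = [math.exp(-x * x / (2.0 * sigma ** 2)) for x in xs]
    norm = sum(g)
    g = [v / norm for v in g]
    d = [-x * v / sigma ** 2 for x, v in zip(xs, g)]
    return g, d

def conv_rows(img, k):
    """Convolve each row with the 1-D kernel k, treating outside pixels as 0."""
    r, w = len(k) // 2, len(img[0])
    return [[sum(k[j + r] * row[x - j] for j in range(-r, r + 1) if 0 <= x - j < w)
             for x in range(w)] for row in img]

def conv_cols(img, k):
    """Same as conv_rows, but along columns (via two transposes)."""
    t = [list(c) for c in zip(*img)]
    return [list(c) for c in zip(*conv_rows(t, k))]

def ls_tensor(img, sigma=1.0, radius=2):
    """Complex LS image (dxf + j*dyf)**2, following Section 2.1.2."""
    g, d = gaussian_pair(sigma, radius)
    dxf = conv_cols(conv_rows(img, d), g)  # derivative in x, smoothing in y
    dyf = conv_rows(conv_cols(img, d), g)  # derivative in y, smoothing in x
    return [[complex(a, b) ** 2 for a, b in zip(ra, rb)] for ra, rb in zip(dxf, dyf)]

def correlate(template, field):
    """|sum(conj(template) * window)| for every window that fits in field."""
    th, tw = len(template), len(template[0])
    rows, cols = len(field) - th + 1, len(field[0]) - tw + 1
    return [[abs(sum(template[u][v].conjugate() * field[y + u][x + v]
                     for u in range(th) for v in range(tw)))
             for x in range(cols)] for y in range(rows)]

def find_peaks(score, threshold):
    """Keep positions above threshold that are the maximum of their 3x3 block."""
    hits = []
    for y, row in enumerate(score):
        for x, s in enumerate(row):
            block = [score[j][i]
                     for j in range(max(0, y - 1), min(len(score), y + 2))
                     for i in range(max(0, x - 1), min(len(row), x + 2))]
            if s >= threshold and s == max(block):
                hits.append((y, x))
    return hits

# Toy script: the same L-shaped "character" drawn twice on a blank page.
glyph = [[1, 0, 0, 0, 0]] * 4 + [[1, 1, 1, 1, 1]]
page = [[0.0] * 30 for _ in range(20)]
for top, left in [(3, 3), (10, 18)]:
    for u, row in enumerate(glyph):
        for v, value in enumerate(row):
            page[top + u][left + v] = float(value)

field = ls_tensor(page)
# Template: LS values around the first occurrence (glyph plus a 2-pixel margin).
template = [row[1:10] for row in field[1:10]]
score = correlate(template, field)
peak = max(max(row) for row in score)
hits = find_peaks(score, 0.99 * peak)
```

Each entry of `hits` is the top-left corner of a matching window, so the glyph drawn at (3, 3) is reported at (1, 1) because of the 2-pixel margin included in the template. In practice the threshold would be stored per character in the testing database rather than derived from the peak of the current page.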