241x Filetype PDF File size 0.57 MB Source: aclanthology.org
Finite State Automata and Arabic Writing
Michel Fanton
CERTAL-INALCO 1
73 rue Broca
F75013 Paris France
email : certal2@ext.jussieu.fr
Abstract isolated final medial initial
Arabic writing has specific features, which im-
ply computational overload for any arabicized or present only small variations : for example
software. Finite state automata are well known letter ~r* (s)
to give efficient solutions for translation prob- isolated final medial initial
lems which can be formalized as regular lan-
guages. These automata are as more easily built
that their alphabet have been reduced through a Letters which cannot be bound to the next
careful linguistic analysis. This reduction makes one have only two shapes, for example letters
it possible to write directly an automaton with-
out going through the intermediate stage of con- (d) and .~ (w and fi)
textual rules, which have to be translated into isolated final isolated final
an automaton for the sake of efficiency. This
paper presents two Moore automata, the first
one, taken as an example, gives a solution to the During the seventies and the beginning of the
choice of right shape for a letter to be printed eighties, hard controversies took place within
or displayed (usually known as contextual anal- the Arabs concerned with these questions, lin-
ysis), the second one studies the more complex guists and computer scientists. Finally in 1983
problem of determining the right carrying letter the ASMO (Arab Society for Normalization
for hamza. Every arabicized software has to face which unfortunately does not exist any more),
these questions and finite state automata are influenced by Pr. Lakhdar-Ghazal from IERA
certainly a good answer to them. (Rabat Morocco) chose to give a unique code to
INTRODUCTION all shapes of one particular letter. This is cer-
tainly a good choice from a linguistic point of
Arabic writing has specific features, which im- view, but even so, compromises had to be made
ply computational overload for any arabicized to take into account writing habits that con-
software. The first one, well known now for flicted with it. Letter hamza is the most notice-
many years, is the fact that Arabic printing tries able example of such a compromise for reasons
to imitate handwriting. Because of this, conso- we shall explain later.
nants and long vowels can have four or only two 1 CONTEXTUAL ANALYSIS
shapes depending of their ability to be bound to
the following letter and of where they appear in Whatever be the choice made for coding, from
the word. a typesetting or a computational point of view,
These shapes can be very different : for example there must be different codes for the different
letter o 2 (h) shapes of a letter. So every arabicized software
has to use two systems for coding : the reduced
ICERTAL : Centre d'l~tudes et de Recherche en code we have just introduced and the extended
Traitement Automatique des Langues, INALCO : In- code in which the different shapes have different
stitut National des Langues et Civilisations Orientales
~the Arabic parts of this paper have been typeset using Klaus Lagally's ArabTEX
26
codes. Up to UNICODE, no normalization exists Mealy automata are certainly a better choice
for the second one. So every arabicized software when bidirectional applications are considered.
has to solve the problem of choosing the right As the question is to identify succession of sym-
shape of every printed or displayed letter. bols of a certain type we found it clearer to use
1.1 Rules for letter shape a Moore automaton.
determination 1.3 A Moore automaton for contextual
This determination, frequently known as con- analysis
textual analysis can be summarized into the fol- 1.3.1 Source language of the automaton
lowing set of unformal rules: It follows from the determination rules that we
1. At the beginning of a word: only need to know what particular letter we are
dealing with only at the output stage. All we
• If the letter is a binding letter it takes have to know is wether it is a binding or a non
the INITIAL shape. binding letter 4. The alphabet of the automaton
• If it is a non binding one it takes the should be A = (#} [J L where L is the set of
ISOLATED shape. arabic letters present in the reduced code and
# the word boundaries. The set of letters will
2. In the middle of a word (there is at least then be partitioned into three sets •
one letter following the current one):
(a) If the letter is a binding letter then A--+ A'- {{#},N,B}
• If it follows a binding letter it takes N being the set of non binding letters and B the
the MEDIAL shape. set of binding letters. If we denote respectively
• If it follows a non binding letter it n and b an arbitrary element of each of these
takes the INITIAL shape. sets, the source language of the automaton can
(b) If the letter is a non binding letter be reduced to:
• If it follows a binding letter it takes A1 = {#, n, b}
the FINAL shape.
• If it follows a non binding letter it L1 = {#(n Vb)'#}
takes the ISOLATED shape.
3. At the end of a word (for both types of where V denotes disjunction and • is the Kleene
letters) star
• If it follows a binding letter it takes the 1.3.2 Grammar and automaton for L1
FINAL shape. Language L1 can be generated by the simple
grammar :
• If it follows a non binding letter it
takes the ISOLATED shape. m -+ #A#
1.2 Moore and Mealy automata A A( lb)
Moore automata are state assigned output ma- or the as simple automaton :
chines : the output function assigns output initial states = {1}
symbols to each state. They differ from Mealy final states: -- {5}
automata, transition assigned finite state ma- transitions
chines, where output symbols are associated 4As this question has only been taken as an example,
with transitions between states. Mealy au- the alphabet has been oversimplified. A full working
tomata are sometimes called finite transducers. automaton should cope, as far as arabic is concerned,
The two machine types have been demonstrated with two additional problems : hamza on the line to
to produce the same input-output mappings 3. which no preceding letter can be bound to and l~m alif
ligature. It should also give a proper treatment of non
3see (Aho and Unman, 1972) and (Hopcroft and Ull- arabic letters and symbols. But this would not affect the
man, 1979) for a full account of these matters here described method.
27
This automaton is clearly nondeterministic.
b n # output This is due to the fact that a letter from B
100 2 0 can appear in final or isolated shape when sit-
2340 # uated at the end of a word, in initial or medial
334 5 b shape when another letter follows it. Because
434 5 n of this nondeterministic feature, every transi-
5000 # tion should appear as a set. When this set is a
singleton, the "only" state has been put without
1.3.3 Target language of the automaton braces for an easier reading.
The alphabet for the target language L2, given It can be easily augmented to take account of
what has been said before and using the same occasional short vowels or shadda 5 (') that could
method of partioning and then reducing the al- occur : the transitions to add would force the
phabet could be at first sight: automaton to loop onto the same state, what-
ever be it since vowels or shadda can only ap-
A2 = {#,I,i,m,f} pear after a consonant and do not influence its
shape.
where I denotes a letter in isolated shape, i, m 1.3.5 PROLOG test program
and f stand for initial, medial and final shape. This program is a straightforward translation of
But letters from N have only two shapes final the above described grammar and automaton.
and isolated. Moreover isolated and final shapes The predicate test allows to limit the genera-
of letters from B can only appear at the end of a tion of inputs to a given length. In the results
word, which is not the case for the correspond- we chose to limit the length of the input to 6
ing shapes of letters from N. So, the following included word boundaries.
modified version of A2 will be prefered :
X
A2 = {#,In, Ib, i,m,f~,fb} 7, generation of elements of LI
where In stands for isolated shape of a letter X
from N and so on. With these symbols the tar- m--> [#],a,[#].
get language L2 can be described by the regular a --> ([n];l'b]).
expression : a --> a,([n];[b]).
X
L2 = {#(I~im'fnI~)'(Ib V fb V E)#} 7, translation automaton
Y.
where E denotes as usual the empty string. init ial_stat e ( 1).
1.3.4 Translation automaton final_state (8).
The translation process of a sequence of LI into tr(1,#,l) tr(4,b,5)
a legal sequence of L2 can be operated through tr(1 ,n,2) tr(4,b,4)
the following automaton : tr(1 ,b,3) tr(4,n,6)
initial states = {1} tr(1,b,7) tr(5,#,8)
final states = {8} tr(2,#,8) tr(6,#,8)
transitions : tr(2,n,2) tr(6,n,2)
n b # output tr(2,b,3) tr(6,b,3)
1 2 {3,7} # tr(2,b,7) tr(6,b,7)
2 2 {a,7} 8 I. tr(3,n,6) tr(7,#,8)
3 6 {4,5} @ i tr(3,b,5)
4 6 {4,5} @ m tr(3,b,4)
5 0 0 8 A output (I, #). output (5,fb).
6 2 {3,7} S f. output(2, 'In'). output(6,fn).
7 0 0 S Ib
8 o o o # 5sign denoting a double letter
28
input output
output(3,i). output(7,'Ib'). [#,b,b,n,n,#]
output(4,m). output(8,#). [#,i,m,fn,In,#]
[#,b,b,n,b,#] [#,i,m,fn,Ib,#]
forme(Input,Output):- [#,b,b,b,n,#] [#,i,m,m,fn,#]
initial_state(Is), [#,b,b,b,b,#] [#,i,m,m,fb,#]
path(Is,Fs,lnput,Output),
final_state(Fs). 2 WRITING OF LETTER HAMZA
The hamza can be written in five different man-
path(S,S,[],[]). $
path(SI,S2,[XIXs],[YIYs]):- ners (I, !, 3, ~, ') depending mainly upon:
tr(SI,X,S), • its position within the word
output(S,Y),
path(S,S2,Xs,Ys). • the preceding and the following vowel
test(L):- As the choice made for coding, was to adhere
m(M,[]), to a linguistic point of view, there should have
length(M,L1), been only one code for all these shapes and car-
((LI > L,!,nl,fail);true), rying consonants. But, as it has just been said,
printing_form(M,F), to determine the correct writing of hamza, one
nl,write(M),tab(1),write(F),fail. has to know the surrounding vowels, and it is of
test(_). common knowledge that the Arabs do not usu-
ally write short vowels. These essential data be-
1.3.6 Program results ing missing, no algorithm can take place to ful-
fil this task for a common usage such as display
input output a text on a screen. Thus, the ASMO decided
[#,n,#] [#,In,#] to have distinct codes for the different carriers
[#,b,#] [#,Ib,#] of hamza, but not of course for their different
[#,n,n,#] [#,In,In,#] shapes which can be determined as seen before.
[#,n,b,#] [#,In,Ib,#] So why is this question of any interest ? If we
[#,b,n,#] [#,i,fn,#] consider NLP applications for Arabic, it could
[#,b,b,#] [#,i,fb,#] worth considering this problem at generation
[#,n,n,n,#] [#,In,In,In,#] stage. For instance many vowel alternations
[#,n,n,b,#] [#,In,In,Ib,#] occur in the conjugation of verbs, and when a
[#,n,b,n,#] [#,In,i,fn,#] hamza is present in the verb root, the hamza
[#,n,b,b,#] [#,In,i,fb,#] writing will vary accordingly.
[#,b,n,n,#] [#,i,fn,In,#] For example the verb I~ qara'a - he (has)
[#,b,n,b,#] [#,i,fn,Ib,#]
[#,b,b,n,#] [#,i,m,fn,#] read-changes to 5.~." yaqra'fna-they read
[#,b,b,b,#] [#,i,m,fb,#] o
[#,n,n,n,n,#] [#,In,In,In,In,#] (present) - and to ~.z~ quri'a - it (has) been
[#,n,n,n,b,#] [#,In,In,In,Ib,#] read. And at the generation stage vowels are
[#,n,n,b,n,#] [#,In,In,i,fn,#] known even if we decided not to write them.
[#,n,n,b,b,#] [#,In,In,i,fb,#] The only alternative would be to put all the
[#,n,b,n,n,#] [#,In,i,fn,In,#] forms in a dictionary. At CERTAL, our philos-
[#,n,b,n,b,#] [#,In,i,fn,Ib,#] ophy is to use all the possible means to reduce
[#,n,b,b,n,#] [#,In,i,m,fn,#] the size of dictionaries. Hence this question ap-
[#,n,b,b,b,#l [#,In,i,m,fb,#] peared to us worth studying.
[#,b,n,n,n,#l [#,i,fn ,In ,In,#] 2.1 Rules of hamza writing
[#,b,n,n,b,#] [#,i,fn,In,Ib,#]
[#,b,n,b,n,#] [#,i,fn,i,fn,#] 1. When a hamza is at the beginning of a word
[#,b,n,b,b,#] [#,i,fn,i,fb,#] it is written
29
no reviews yet
Please Login to review.