Bel-Arabi: Advanced Arabic Grammar Analyzer

Michael Nawar Ibrahim, Mahmoud N. Mahmoud and Dina A. El-Reedy

Manuscript received February 5, 2014; revised March 24, 2014.
M. N. Nawar, M. N. Mahmoud and D. A. El-Reedy are with the Computer Engineering Department, University of Cairo, Giza, 12613 Egypt (e-mail: michael.nawar@eng.cu.edu.eg; mah.nabil@ieee.org; dina.elreedy@gmail.com).

Abstract: This paper proposes a framework to automate the grammar analysis of Arabic sentences (إعراب الجمل). Grammar analysis is considered one of the complex tasks in the Natural Language Processing (NLP) field, since it determines the relation between a noun and a verb at the level of the sentence, between a noun and the letter before or after it, or between a noun and a character at the level of the preposition. Constructing a rule-based, high-accuracy grammar analyzer is a complex, resource-intensive task. We therefore propose a hybrid system that combines learning-based and rule-based approaches, provides acceptable accuracy and can be implemented simply. The results of the proposed framework are promising, and it has the potential to be further improved.

Index Terms: Arabic natural language processing, case ending diacritization, grammar analyzer.

I.  INTRODUCTION

Arabic grammar analysis is the process of determining the grammatical role and the case-ending diacritization of each word in an Arabic sentence. Grammar analysis is distinct from parsing, since it assigns additional information such as the case-ending diacritization of each word. The grammatical role of a word is determined by the relation between the word and its dependents. Grammar analyses are flatter than regular parse tree structures because they lack finite verb phrase constituents. Once the grammar analysis of a sentence is completed, many problems become much easier to solve, such as automatic diacritization, Arabic sentence correction and accurate translation.

As an example of the grammar analysis task, consider the sentence "الأولاد يلعبون في حديقة المدرسة مع بعضهم". The output of the framework for this sentence is shown in Table I.

TABLE I: EXAMPLE OF GRAMMAR ANALYSIS

Word in Arabic    Transliterated Word    Grammatical Role            Sign
الأولاد            Alawlad                Subject                     Nominative with damah
يلعبون            ylEbwn                 Present verb                Nominative with existing noon
في                fy                     Uninflected particle        -
الحديقة           AlHadyqp               Genitive noun               Genitive with kasrah
مع                mE                     Uninflected circumstance    -
بعض               bED                    Possessive                  Genitive with kasrah
هم                hm                     Uninflected pronoun         -

The proposed framework is divided into five main components. Three of them, the Stemmer, the Part of Speech Tagger (POS tagger) and the Base Phrase chunker, are learning-based. The learning-based components use a "Conditional Random Field" classifier [1]. The remaining two components, the Morphological Analyzer and the Arabic Grammar Database, are rule-based.

The proposed framework covers the basic grammar rules for verbal and nominal sentences. However, it has the following limitations:

First, the system assumes that the sentence has been written correctly, both morphologically and grammatically; grammar correction is not included at present.

Second, by the nature of Arabic verbs, an undiacritized verb can be in the passive or the active voice. For example, (ضرب, "drb") can be read as ضُرِبَ (doreb, "beaten") or ضَرَبَ (darab, "beat"). The system assumes that the verb is in the active voice.

Third, the grammar analyzer does not prevent errors related to incorrect semantic use; that is, the semantics of the sentence are not verified.

Evaluating the Bel-Arabi framework is not a simple matter, due to the absence of standard data for the Arabic grammar analysis task. We have therefore generated 600 sentences for the evaluation of the system.

This paper is organized as follows. Sections II and III present an overview of Arabic natural language processing and of existing Arabic NLP systems. Section IV discusses previous work in the field of Arabic grammar analysis. Section V explains the proposed framework. Section VI presents the data collected for the evaluation and the evaluation process. Finally, concluding remarks are presented in the last section.

II.  ARABIC NLP AND DATA

There are three main categories of the Arabic language: classical Arabic, the language of the Qur'an; Modern Standard Arabic (MSA), a simplified form of classical Arabic found in news and written documents; and dialectal Arabic, which differs from one country to another. One variety of the latter is the colloquial language used daily by Egyptians.

In general, Arabic is a morphologically very rich language in which each word can include number, gender, aspect, case,
mood, voice, person, and state. The basic Arabic word form can be attached to a set of clitics representing object pronouns, possessive pronouns, particles and single-letter conjunctions. These features increase the ambiguity of the Arabic word. In general, three types of clitics can be attached to an Arabic stem, ordered by their closeness to the stem according to the following formula:

     {[proclitic1] {[proclitic2] {Stem [Affix] [Enclitic]}}}

where proclitic1 denotes the highest-level clitics, which represent conjunctions and are attached at the beginning of the word, such as the conjunctions (و, w, 'and') and (ف, f, 'then'). Proclitic2 represents particles: (ب, b, 'with/in'), (ل, l, 'to/for') and (ك, k, 'as/such'). Enclitics represent pronominal clitics and are attached to the stem directly or to the affix, such as the pronouns (ه, h, 'his') and (هم, hm, 'their/them').

The following is an example of the different morphological segments in the word وبقدراته (wbqdrAth), which has the stem (قدر, qdr, 'power'), the proclitic conjunction (و, w, 'and'), the proclitic particle (ب, b, 'with/in'), the affix (ات, At, plural marker) and the cliticized pronoun (ه, h, 'his').

The set of proclitics considered in this work comprises the prepositional particles {b, l, k}, meaning {by/with, to, as} respectively, and the conjunctions {w, f}, meaning {and, then} respectively. An Arabic word may have a conjunction, a preposition and a determiner cliticized to its beginning. The set of possible enclitics comprises the pronouns (and possessive pronouns) {y, nA, k, kmA, km, knA, kn, h, hA, hmA, hnA, hm, hn}, meaning respectively: my (mine), our (ours), your (yours), your (yours) [masc. dual], your (yours) [masc. pl.], your (yours) [fem. dual], your (yours) [fem. pl.], him (his), her (hers), their (theirs) [masc. dual], their (theirs) [fem. dual], their (theirs) [masc. pl.], their (theirs) [fem. pl.]. An Arabic word may have only a single enclitic at the end. We define a token as a (stem + affixes), a proclitic, an enclitic, or a punctuation mark.

III.  ARABIC NLP SYSTEMS

For the last two decades, work on Arabic language processing has concentrated on morphological analysis, and many working systems have been built in this field [2]-[4]. Few systems have been developed for more complicated NLP tasks.

One of the developed NLP systems is MADA and TOKAN [5], [6], a suite of tools for morphological disambiguation, POS tagging, diacritization, lexicalization, lemmatization, stemming and other tasks. MADA and TOKAN address several specific natural language processing tasks for Arabic. MADA is a system for Morphological Analysis and Disambiguation for Arabic, and TOKAN is a general tokenizer for MADA-disambiguated text. In simple words, MADA together with TOKAN provides one solution to several Arabic NLP problems.

Another system developed for several Arabic NLP problems is AMIRA [7]. AMIRA is a toolkit for Arabic tokenization, POS tagging, base phrase chunking and named entity recognition, and it is the successor to ASVMTools. The AMIRA toolkit includes a clitic tokenizer (TOK), a part-of-speech tagger (POS) and a base phrase chunker (BPC), i.e. a shallow syntactic parser. The technology of AMIRA is based on supervised learning with no explicit dependence on knowledge of deep morphology; hence, in contrast to systems such as MADA, it relies on surface data to learn generalizations. In general, the tools use a unified framework that casts each component problem as a classification problem.

Another large group interested in Arabic NLP is RDI Egypt. RDI has been one of the leading regional and international players in the R&D of Arabic human language technologies for the last 10 years. RDI provides an automatic Arabic diacritizer [8], an Arabic morphological analyzer [9], an Arabic part-of-speech tagger [10], an Arabic lexical semantic analyzer [11], a text-to-speech system, an Arabic text search engine and Arabic lexical dictionaries.

Finally, the Stanford Natural Language Processing Group, a group of natural language processing research scientists, postdocs, programmers and students, is developing Arabic NLP tools. The Arabic NLP products developed so far are a word segmenter [12], a state-of-the-art part-of-speech tagger [13] and a high-performance probabilistic parser [14]. The data set used is the Penn Arabic Treebank [15].

IV.  ARABIC GRAMMAR ANALYSIS CURRENT RESEARCH

Despite the importance of Arabic grammar analysis, few researchers have tried to solve it. Two main techniques are used for the grammar analysis of Arabic: rule-based techniques and parsing techniques.

Al Daoud et al. [16] propose a framework to automate the grammar analysis of Arabic sentences in general; it focuses on simple verbal sentences, but it can be extended to any Arabic sentence. This system assumes that the entered sentences are lexically and grammatically correct, and that every verb is in the active voice.

Attia [2], [3] investigates different methodologies to manage the problem of morphological and syntactic ambiguities in Arabic. He built an Arabic parser using the Xerox Linguistics Environment, which allows writing grammar rules and notations that follow the LFG formalism. Attia tested his approach on short sentences randomly selected from a corpus of news articles and reported a performance of 92%.

Habash et al. [17] construct the Columbia Arabic Treebank (CATiB), a database of syntactic analyses of Arabic sentences. CATiB contrasts with previous approaches to Arabic treebanking in its emphasis on speed, with some constraints on linguistic richness. Two basic ideas inspire the CATiB approach: no annotation of redundant information, and the use of representations and terminology inspired by traditional Arabic syntax. The task of grammar analysis can therefore be done by applying a simple parsing approach.

Dukes et al. [18] constructed the Quranic Arabic Dependency Treebank (QADT), an annotated linguistic resource consisting of 77,430 words of Quranic Arabic. This project differs from other Arabic treebanks by providing a deep computational linguistic model based on historical traditional Arabic grammar.

Most of the related work reported in this study concentrated on short sentences and used hand-crafted grammars, which are time-consuming to produce and difficult to scale to unrestricted data. These approaches also used traditional parsing techniques, such as top-down and bottom-up parsers, demonstrated on simple verbal sentences or short nominal sentences.
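
The clitic formula of Section II can be made concrete with a small sketch. The code below is not part of Bel-Arabi; it is a naive illustration, assuming Buckwalter-style transliteration, that enumerates every split of a word consistent with the formula using the proclitic and enclitic sets listed in Section II. The function name candidate_segmentations and the min_stem parameter are hypothetical. The stem and its affix are kept together as one unit, matching the token definition (stem + affixes).

```python
from itertools import product

# Clitic inventories as listed in Section II (Buckwalter transliteration).
# The empty string stands for "this optional slot is not filled".
PROCLITIC1 = ("", "w", "f")          # conjunctions: and, then
PROCLITIC2 = ("", "b", "l", "k")     # particles: by/with, to, as
ENCLITICS = ("", "y", "nA", "k", "kmA", "km", "knA", "kn",
             "h", "hA", "hmA", "hnA", "hm", "hn")


def candidate_segmentations(word, min_stem=2):
    """Return all (proclitic1, proclitic2, stem+affix, enclitic) splits.

    This only enumerates splits that fit the formula; it does not decide
    which split is correct. min_stem is an assumed lower bound on the
    length of the stem+affix part, used to discard degenerate splits.
    """
    results = []
    for p1, p2, enc in product(PROCLITIC1, PROCLITIC2, ENCLITICS):
        prefix = p1 + p2
        if not word.startswith(prefix):
            continue
        rest = word[len(prefix):]
        if enc and not rest.endswith(enc):
            continue
        core = rest[:len(rest) - len(enc)] if enc else rest
        if len(core) >= min_stem:
            results.append((p1, p2, core, enc))
    return results


if __name__ == "__main__":
    for split in candidate_segmentations("wbqdrAth"):
        print(split)
```

For the example word wbqdrAth the enumeration includes the intended split (w, b, qdrAt, h) alongside several others, including the unsegmented word itself. This over-generation is the ambiguity that motivates a learned stemmer rather than plain prefix and suffix stripping.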
                   
V.  THE PROPOSED FRAMEWORK

The proposed framework takes a sentence as input and assigns each token an appropriate tag, case and sign, as follows:

Arabic tags: {present verb (فعل مضارع), imperative verb (فعل أمر), past verb (فعل ماضي), doer (فاعل), direct object (مفعول به), cognate accusative (مفعول مطلق), cognate accusative delegate (نائب للمفعول المطلق), subject (مبتدأ), predicate (خبر), delayed subject (مبتدأ مؤخر), ena subject (اسم إن), ena predicate (خبر إن), kan subject (اسم كان), kan predicate (خبر كان), kad subject (اسم كاد), kad predicate (خبر كاد), apposition (بدل), adjective (نعت), incorporeal emphasis (توكيد معنوي), verbal emphasis (توكيد لفظي), conjunction (معطوف), possessive (مضاف إليه), genitive noun (اسم مجرور), specifier (تمييز), exception (مستثنى), vocative (منادى), circumstance (ظرف), pronoun (ضمير), particle ena (حرف ناسخ), accusative particle (حرف نصب), jussive particle (حرف جزم), preposition (حرف جر), exception particle (حرف استثناء), coordinating conjunction (حرف عطف), vocative particle (أداة نداء), realization particle (حرف تحقيق), diminishing particle (حرف تقليل), punctuation (علامة ترميز), particle (حرف)}.

Arabic cases: {nominative (مرفوع), accusative (المنصوبات), genitive (مجرور), jussive (مجزوم), and uninflected (مبني)}.

Arabic signs: {fatha (الفتحة), removing noon (حذف النون), removing weak ending letter (حذف حرف العلة), kasra (الكسرة), damah (الضمة), sukun (السكون), waw and noon (الواو والنون), ya' and noon (الياء والنون), alef and noon (الألف والنون)}.

For each token in the sentence, knowing its POS tag, its BP chunk and its morphological features, such as token definiteness, we use a rule-based system to determine the tag, case and sign of each word in the sentence.

The grammar analyzer input and features can be characterized as follows:
Input: A complete sentence of Arabic words.
Context: The whole sentence.
Features: To extract the grammatical role of the words of the sentence, we use the stemmer, the POS tagger, the BP chunker and a morphological analyzer that extracts extra morphological features of the words in the sentence.

  A.  The Architecture of the Framework

The framework is presented in Fig. 1. The Arabic grammar analyzer module uses the stemmer to separate the proclitics and enclitics of each word. The POS tagger then assigns an adequate POS tag to each token, and the base phrase chunker groups words belonging to the same phrase. Additional morphological information is extracted for each word using the morphological analyzer. Finally, the module applies the Arabic grammar rules to assign a tag, case and sign to each word.

Fig. 1. Proposed Framework Architecture.

  B.  Framework Components Description

  1)  Morphological analyzer
The morphological analyzer is based on BAMA-v2.0 (Buckwalter Arabic Morphological Analyzer version 2.0) [19], and it contains additional features such as the extraction of the pattern of the word. For example, the pattern of (كاتب, "kAtb") is (فاعل, "fAEl") and the pattern of (مكتب, "mktb") is (مفعل, "mfEl"). It can also be used to extract the root of the word. For example, the root of (كاتب, "kAtb") is (كتب, "ktb") and the root of (مكتب, "mktb") is (كتب, "ktb"). The morphological analyzer also determines whether a word is definite or not, masculine or feminine, and plural, dual or singular.

  2)  Stemmer
The stream of characters in a natural language text must be broken up into distinct meaningful units (or tokens) before any language processing. The stemmer is responsible for defining word boundaries and demarcating clitics, multiword expressions, abbreviations and numbers.

In this task, the classifier takes raw text as input, without any processing, and assigns each character the appropriate tag from the following tag set: {B-PRE1, B-PRE2, B-WRD, I-WRD, B-SUFF, I-SUFF}, where I denotes inside a segment, B denotes the beginning of a segment, PRE1 and PRE2 are proclitic tags, SUFF is an enclitic, and WRD is the stem plus any affixes and/or the determiner Al. These tags are similar to the tags used by Diab et al. [20].

The classifier training and testing data can be characterized as follows:
Input: A sequence of transliterated Arabic characters processed from left to right with break markers for word boundaries.
Context: A fixed-size window of -5/+5 characters centered at the character in focus.
Features: All characters and previous tag decisions within the context, and the characters corresponding to the word patterns within the context.

  3)  Part of Speech Tagger
POS tagging is the task of marking up a word in a text as corresponding to a particular part of speech, based on both its definition and its context. There are basically two difficulties in POS tagging. The first is the ambiguity of words: many words in a language have more than one part of speech. The second difficulty arises from unknown words, that is, words about which the tagger has no knowledge.
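
The two difficulties can be illustrated with a toy example. The sketch below is not the Bel-Arabi tagger; it assumes a tiny hand-made lexicon in Buckwalter transliteration (the entries and the function names lookup, char_ngrams and shared_ngrams are illustrative only). It shows a token with several possible tags, a token missing from the lexicon, and how character n-grams up to length 4, which the tagger described in this section uses as features, still give an unknown token some overlap with known ones.

```python
# Toy lexicon: token -> set of possible POS tags (illustrative entries only).
LEXICON = {
    "ktb": {"VBD", "NNS"},   # kataba 'wrote' or kutub 'books'
    "Elm": {"VBD", "NN"},    # Ealima 'knew' or Eilm 'knowledge'
    "fy": {"IN"},            # preposition 'in'
}


def lookup(token):
    """Return the set of tags listed for a token, or an empty set if unknown."""
    return LEXICON.get(token, set())


def char_ngrams(token, max_n=4):
    """All character n-grams of the token with 1 <= n <= max_n."""
    return [token[i:i + n]
            for n in range(1, max_n + 1)
            for i in range(len(token) - n + 1)]


def shared_ngrams(a, b, min_n=3, max_n=4):
    """N-grams of length >= min_n that two tokens have in common."""
    common = set(char_ngrams(a, max_n)) & set(char_ngrams(b, max_n))
    return sorted(g for g in common if len(g) >= min_n)


if __name__ == "__main__":
    print(lookup("ktb"))                  # ambiguous: more than one tag
    print(lookup("mktb"))                 # unknown: empty set
    print(shared_ngrams("mktb", "ktb"))   # overlap with a known token
```

A lookup alone cannot choose between the tags of an ambiguous token and returns nothing for an unknown one, which is why a sequence classifier over context and sub-word features is used instead.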
In this task, the POS tagger takes tokenized text as input and assigns each token an appropriate POS tag from the Arabic Treebank collapsed POS tag set, which comprises 24 tags: {ABBREV, CC, CD, CONJ+NEG_PART, DT, FW, IN, JJ, NN, NNP, NNPS, NNS, NO_FUNC, NUMERIC_COMMA, PRP, PRP$, PUNC, RB, UH, VBD, VBN, VBP, WP, WRB}.

The classifier training and testing data can be characterized as follows:
Input: A sequence of transliterated Arabic tokens processed from left to right with break markers for word boundaries.
Context: A window of -2/+2 tokens centered at the focus token.
Features: Every character N-gram, N<=4, that occurs in the focus token, the 5 tokens themselves, the POS tag decisions for previous tokens within the context, and the patterns of the words within the context.

  4)  Base Phrase Chunker
Chunking is the task of recovering only a limited amount of syntactic information in order to identify phrases in natural language sentences. It is the process of grouping consecutive words together to form phrases, and is also called shallow parsing. Chunking does not provide information on how the phrases attach to each other. The structures generally specified by shallow parsers include phrasal heads and their immediate and unambiguous dependents, and these structures are usually non-recursive.

In this task, the BP chunker takes tokenized text as input and assigns each token an appropriate base phrase chunk tag from the Arabic Treebank collapsed BPC tag set. Six types of chunked phrases are recognized using a phrase BIO tagging scheme: Inside (I) a phrase, Outside (O) a phrase, and Beginning (B) of a phrase. The six chunk phrases identified for Arabic are PP, PRT, NP, SBAR, INTJ and VP. The task is thus a 12-way classification task (since there are I and B tags for each chunk phrase type except PRT, and a single O tag).

VI.  EVALUATION OF THE FRAMEWORK

For the evaluation of the Bel-Arabi advanced Arabic grammar analyzer, the data used for the evaluation are discussed first, followed by the evaluation measures and the results.

  A.  The Evaluation Data

For the evaluation of this framework, we have generated 600 sentences, consisting of 3452 tokens. The sentence lengths and tags are distributed as shown in Tables III and IV respectively; the distributions of cases and signs are tabulated separately.

TABLE III: GRAMMAR ANALYSIS TEST SENTENCES LENGTH DISTRIBUTION

Sentence Length    Count
2                  25
3                  76
4                  87
5                  113
6                  81
7                  85
8                  60
9                  43
10                 22
11                 3
12                 5

TABLE IV: GRAMMAR ANALYSIS TAGS

Tag                Count
present verb       193
past verb          105
imperative verb    15
doer               191
direct object      227
subject            299
predicate          157
delayed subject    20
ena subject        51
ena predicate      35
kan subject        49
               The  classifier  training  and  testing  data  could  be                       kan predicate                     38 
                                                                                               kan subject                      26 
             characterized as follow:                                                          apposition                       147 
             Input: A sequence of transliterated Arabic tokens processed                        adjective                       155 
             from left-to-right with break markers for word boundaries.                        conjuction                       95 
             Context:  A  window  of  -2/+2  tokens  centered  at  the  focus                  possessive                       287 
                                                                                              genitive noun                     183 
             token.                                                                             specifier                       35 
             Features: Every character N-gram, N<=4 that occurs in the                        circumstance                      66 
             focus token, the 5 tokens themselves, POS tag decisions for                        pronoun                         216 
             previous tokens within context and the previous Base phrase                 coordinating conjunction               101 
                                                                                                particle                        217 
             tag .                                                                             Other Tags                       544 
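
The label set and the feature description above can be illustrated with a short Python sketch. It is not the authors' implementation: the function names, the feature-key format, the padding symbol, the POS tag strings, and the choice to give PRT a single `B-PRT` label (one way to arrive at 12 classes) are all assumptions made for illustration.

```python
# Hypothetical sketch of the chunker's label set and per-token features.
CHUNK_TYPES = ["PP", "PRT", "NP", "SBAR", "INTJ", "VP"]


def chunk_labels():
    """Enumerate 12 classes: B and I for each type except PRT, plus O.

    Assumption: PRT gets a single label, written here as B-PRT.
    """
    labels = ["O"]
    for chunk_type in CHUNK_TYPES:
        labels.append("B-" + chunk_type)
        if chunk_type != "PRT":
            labels.append("I-" + chunk_type)
    return labels


def char_ngrams(token, max_n=4):
    """Return every character n-gram (1 <= n <= max_n) found in the token."""
    return {
        token[i:i + n]
        for n in range(1, max_n + 1)
        for i in range(len(token) - n + 1)
    }


def chunk_features(tokens, i, prev_pos, prev_bp, pad="<PAD>"):
    """Build a feature dict for the focus token tokens[i].

    prev_pos holds POS decisions already made for tokens[0..i-1];
    prev_bp is the Base Phrase tag assigned to tokens[i-1], or None.
    """
    feats = {}
    # Character n-grams of the focus token.
    for gram in char_ngrams(tokens[i]):
        feats["ng=" + gram] = 1
    # The five tokens in the -2/+2 window, padded at sentence edges.
    for off in range(-2, 3):
        j = i + off
        feats["tok[%+d]" % off] = tokens[j] if 0 <= j < len(tokens) else pad
    # POS decisions for the previous tokens inside the window.
    for off in (-2, -1):
        j = i + off
        feats["pos[%+d]" % off] = prev_pos[j] if 0 <= j < len(prev_pos) else pad
    # The previous Base Phrase tag.
    feats["bp[-1]"] = prev_bp if prev_bp is not None else pad
    return feats


# Usage with the transliterated example sentence; POS tags are made up.
sentence = ["Alawlad", "ylebwn", "fy", "alhadyqp"]
features = chunk_features(sentence, 2, ["NOUN", "VERB"], "B-VP")
```

A feature dictionary of this shape could be passed to any multi-class classifier; the sketch deliberately leaves the classifier itself unspecified.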
TABLE V: GRAMMAR ANALYSIS CASES
  Case           Count
  nominative      1081
  accusative       557
  jussive           58
  genitive         602
  uninflected     1154

  5)  Arabic Grammar Rules Database
  It consists of about four hundred Arabic grammar rules. After
features such as the POS tag, the BP tag, and the pattern have
been extracted, the rules are applied to the sentence and assign
a tag, a case, and a sign to each token. After all the rules have
been executed, any token that remains without a tag is given a
default one. An example of an Arabic grammar rule: any noun after
a preposition is a genitive noun. Another example: any noun after
a vocative particle is a vocative.
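
As a rough illustration of how such a rule database might be applied, the Python sketch below encodes the two example rules and the default-tag fallback. It is not the authors' system, and the following are assumptions: the `Token` class, the POS labels (`NOUN`, `PREP`, `VOC_PART`), the default tag name, the first-matching-rule-wins policy, and the simplification that a genitive noun always takes kasrah. The case and sign of the vocative are left unset because the text does not state them.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Token:
    text: str
    pos: str                      # POS tag (hypothetical label set)
    bp: str                       # Base Phrase chunk tag
    tag: Optional[str] = None     # grammatical role assigned by a rule
    case: Optional[str] = None
    sign: Optional[str] = None


def noun_after_preposition(tokens, i):
    """Any noun after a preposition is a genitive noun."""
    if i > 0 and tokens[i].pos == "NOUN" and tokens[i - 1].pos == "PREP":
        # Simplifying assumption: the genitive sign is always kasrah.
        return ("genitive noun", "genitive", "kasrah")
    return None


def noun_after_vocative(tokens, i):
    """Any noun after a vocative particle is a vocative."""
    if i > 0 and tokens[i].pos == "NOUN" and tokens[i - 1].pos == "VOC_PART":
        # Case and sign are not given in the text, so they stay unset.
        return ("vocative", None, None)
    return None


RULES = [noun_after_preposition, noun_after_vocative]
DEFAULT = ("other", None, None)   # placeholder default tag


def analyze(tokens):
    """Run every rule over the sentence, then fill in the default tag."""
    for rule in RULES:
        for i, tok in enumerate(tokens):
            if tok.tag is None:   # assumption: first matching rule wins
                result = rule(tokens, i)
                if result is not None:
                    tok.tag, tok.case, tok.sign = result
    for tok in tokens:
        if tok.tag is None:
            tok.tag, tok.case, tok.sign = DEFAULT
    return tokens


# Usage: the prepositional phrase from the example sentence.
phrase = analyze([Token("fy", "PREP", "B-PP"), Token("alhadyqp", "NOUN", "B-NP")])
```

Keeping each rule as a small function that returns either an assignment or nothing makes it straightforward to grow the list toward the several hundred rules the section describes.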