Spelling and Grammar Correction for Danish in SCARRIE

Patrizia Paggio
Center for Sprogteknologi
Copenhagen (DK)
patrizia@cst.ku.dk

Abstract

This paper reports on work carried out to develop a spelling and grammar corrector for Danish, addressing in particular the issue of how a form of shallow parsing is combined with error detection and correction for the treatment of context-dependent spelling errors. The syntactic grammar for Danish used by the system has been developed with the aim of dealing with the most frequent error types found in a parallel corpus of unedited and proofread texts specifically collected by the project's end users. By focussing on certain grammatical constructions and certain error types, it has been possible to exploit the linguistic 'intelligence' provided by syntactic parsing and yet keep the system robust and efficient. The system described is thus superior to other existing spelling checkers for Danish in its ability to deal with context-dependent errors.

1 Introduction

In her much-quoted and still relevant review of technologies for automatic word correction (Kukich, 1992), Kukich observes that "research in context-dependent spelling correction is in its infancy" (p. 429), and that the task of treating context-dependent errors is still an elusive one due to the complexity of the linguistic knowledge often necessary to analyse the context in sufficient depth to find and correct such errors. But progress in parsing technology and the growing speed of computers seem to have made the task less of a chimera. The '90s have in fact seen a renewed interest in grammar checking, and proposals have been made for systems covering English (Bernth, 1997) and other languages such as Italian (Bolioli et al., 1992), Spanish and Greek (Bustamante and León, 1996), Czech (Holan et al., 1997) and Swedish (Hein, 1998).

This paper describes the prototype of a spelling and grammar corrector for Danish which combines traditional spelling checking functionalities with the ability to carry out compound analysis and to detect and correct certain types of context-dependent spelling errors (hereafter simply "grammar errors"). Grammar correction is carried out by parsing the text, making use of feature overriding and error weights to accommodate the errors. Although a full parse of each sentence is attempted, the grammar has been developed with the aim of dealing only with the most frequent error types found in a parallel corpus of unedited and proofread texts specifically collected by the project's end users. By focussing on certain grammatical constructions and certain error types, it has been possible to exploit the linguistic 'intelligence' provided by syntactic parsing and yet keep the system robust and efficient. The system described is thus superior to other existing spelling checkers for Danish in its ability to deal with certain types of grammar errors.

We begin by giving an overview of the system's components in Section 2. In Section 3 we describe the error types we want to deal with. Section 4 gives an overview of the grammar: in particular, the methods adopted for treating feature mismatches and structural errors are explained. Finally, in Section 5 evaluation results are presented and a conclusion is drawn.

2 The prototype

The prototype is a system for high-quality proofreading for Danish which has been developed in the context of a collaborative EU-project [1]. Together with the Danish prototype, the project has also produced similar systems for Swedish and Norwegian, all of them tailored to meet the specific needs of the Scandinavian publishing industry. They all provide writing support in the form of word and grammar checking.

The Danish version of the system [2] constitutes a further development of the CORRie prototype (Vosse, 1992) (Vosse, 1994), adapted to deal with the Danish language, and to the needs of the project's end users. The system processes text in batch mode and produces an annotated output text where errors are flagged and replacements suggested where possible. Text correction is performed in two steps: first the system deals with spelling errors and typos resulting in invalid words, and then with grammar errors.

Invalid words are identified on the basis of dictionary lookup. The dictionary presently consists of 251,000 domain-relevant word forms extracted from a collection of 68,000 newspaper articles. A separate idiom list allowing for the identification of multi-word expressions is also available. Among the words not found in the dictionary or the idiom list, those occurring most frequently in the text (where frequency is assessed relative to the length of the text) are taken to be new words or proper names [3]. The remaining unknown words are passed on to the compound analysis grammar, which is a set of regular expressions covering the most common types of compound nominals in Danish. This is an important feature, as in Danish compounding is very productive, and compounds are written as single words.

Words still unknown at this point are taken to be spelling errors. The system flags them as such and tries to suggest a replacement. The algorithm used is based on trigram and triphone analysis (van Berkel and Smedt, 1988), and takes into account the orthographic strings corresponding to the invalid word under consideration and its possible replacement, as well as the phonetic representations of the same two words. Phonetic representations are generated by a set of grapheme-to-phoneme rules (Hansen, 1999) the aim of which is to assign phonetically motivated misspellings and their correct counterparts identical or similar phonetic representations.

Then the system tries to identify context-dependent spelling errors. This is done by parsing the text. Parsing results are passed on to a corrector to find replacements for the errors found. The parser is an implementation of the Tomita algorithm with a component for error recognition whose job is to keep track of error weights and feature mismatches as described in (Vosse, 1991). Each input sentence is assigned the analysis with the lowest error weight. If the error is due to a feature mismatch, the offending feature is overridden, and if a dictionary entry satisfying the grammar constraints expressed by the context is found in the dictionary, it is offered as a replacement. If the structure is incomplete, on the other hand, an error message is generated. Finally, if the system identifies an error as a split-up or a run-on, it will suggest either a possible concatenation, or a sequence of valid words into which the misspelt word can be split up.

[1] Main contractors in the consortium were: WordFinder Software AB (Sweden), Center for Sprogteknologi (Denmark), Department of Linguistics at Uppsala University (Sweden), Institutt for lingvistikk og litteraturvitenskab at the University of Bergen (Norway), and Svenska Dagbladet (Sweden). A number of subcontractors also contributed to the project. Subcontractors in Denmark were: Munksgaard International Publishers, Berlingske Tidende, Det Danske Sprog- og Litteraturselskab, and Institut for Almen og Anvendt Sprogvidenskab at the University of Copenhagen.

[2] In addition to the author of the present paper, the Danish SCARRIE team at CST consisted of Claus Povlsen, Bart Jongejan and Bradley Music.

[3] The system also checks whether a closely matching alternative can be found in the dictionary, to avoid mistaking a consistently misspelt word for a new word.

3 The errors

To ensure the coverage of relevant error types, a set of parallel unedited and proofread texts provided by the Danish end users has been collected. This text collection consists of newspaper and magazine articles published in 1997 for a total of 270,805 running words. The articles have been collected in their raw version, as well as in the edited version provided by the publisher's own proofreaders. Although not very large in number of words, the corpus consists of excerpts from 450 different articles to ensure a good spread of lexical domains and error types. The corpus has been used to construct test suites for progress evaluation, and also to guide grammar development. The aim set for
grammar development was then to enable the system to identify and analyse the grammatical constructions in which errors typically occur, whilst to some extent disregarding the remainder of the text.

    Error type                      No.      %
    Context independent errors      386     38
    Context dependent errors        308     30
    Punctuation problems            212     21
    Style problems                   89      9
    Graphical problems               24      2
    Total                          1019    100

Figure 1: Error distribution in the Danish corpus

The errors occurring in the corpus have been analysed according to the taxonomy in (Rambell, 1997). Figure 1 shows the distribution of the various error types into the five top-level categories of the taxonomy. As can be seen, grammar errors account for 30% of the errors. Of these, 70% fall into one of the following categories (Povlsen, 1998):

  •  Too many finite verbal forms or missing finite verb
  •  Errors in nominal phrases:
       -  agreement errors,
       -  wrong determination,
       -  genitive errors,
       -  errors concerning pronouns;
  •  Split-ups and run-ons.

Another way of grouping the errors is by the kind of parsing failure they generate: they can then be viewed as either feature mismatches, or as structural errors. Agreement errors are typical examples of feature mismatches. In the following nominal phrase, for example:

    (1)  de *interessant projekter
         (the interesting projects)

the error can be formalised as a mismatch between the definiteness of the determiner de (the) and the indefiniteness of the adjective interessant (interesting). Adjectives have in fact both an indefinite and a definite form in Danish.

The sentence below, on the other hand, is an example of structural error.

    (2)  i sin tid *skabet han skulpturer over atomkraften
         (during his time wardrobe/created he sculptures about nuclear power)

Since the finite verb skabte (created) has been misspelt as skabet (the wardrobe), the syntactic structure corresponding to the sentence is missing a verbal head.

Run-ons and split-ups are structural errors of a particular kind, having to do with leaves in the syntactic tree. In some cases they can only be detected on the basis of the context, because the misspelt word has the wrong category or bears some other grammatical feature that is incorrect in the context. Examples are given in (3) and (4) below, which like the preceding examples are taken from the project's corpus. In both cases, the error would be a valid word in a different context. More specifically, rigtignok (indeed) is an adverb, whilst rigtig nok (actually correct) is a modified adjective; and inden for (inside) is a preposition, whilst indenfor (indoors) is an adverb. In both examples the correct alternative is indicated in parentheses.

    (3)  ... studerede min gruppe *rigtig nok (rigtignok) under temaoverskrifter
         (studied my group indeed on the basis of topic headings)

    (4)  *indenfor (inden for) de gule mure
         (inside the yellow walls)

Although the system has a facility for identifying and correcting split-ups and run-ons based on a complex interaction between the dictionary, the idiom list, the compound grammar and the syntactic grammar, this facility has not been fully developed yet, and will therefore not be described any further here. More details can be found in (Paggio, 1999).

4 The grammar

The grammar is an augmented context-free grammar consisting of rewrite rules where symbols are associated with features. Error weights and error messages can also be attached to either rules or single features. The rules are applied by unification, but in cases where one or more features do not unify, the offending features will be overridden.
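
The idea of unifying features but overriding the ones that clash can be illustrated with a small Python sketch. It is not the SCARRIE implementation: the function `unify_with_override`, the `UnifyResult` record, the feature names and the default cost of 1 per override are all assumptions made for illustration, and the feature values given for interessant are hypothetical.

```python
from dataclasses import dataclass, field

WILDCARD = "_"  # matches any value, as in AP(def _ _)


@dataclass
class UnifyResult:
    features: tuple                                  # feature values after unification
    overridden: list = field(default_factory=list)   # (name, found, required) triples
    weight: int = 0                                  # accumulated override cost


def unify_with_override(required, found, names, costs=None):
    """Unify two feature tuples; override clashing values instead of failing.

    `costs` maps a feature name to the (hypothetical) price of overriding it;
    features that are not listed cost 1.
    """
    costs = costs or {}
    result = UnifyResult(features=())
    merged = []
    for name, req, got in zip(names, required, found):
        if req == WILDCARD or req == got:
            merged.append(got)        # compatible: keep the lexical value
        elif got == WILDCARD:
            merged.append(req)        # underspecified entry: take the rule value
        else:
            merged.append(req)        # clash: the rule wins, the entry is overridden
            result.overridden.append((name, got, req))
            result.weight += costs.get(name, 1)
    result.features = tuple(merged)
    return result


NAMES = ("definiteness", "gender", "number")
# Rule side AP(def _ _) against a hypothetical entry for "interessant".
ap = unify_with_override(("def", "_", "_"), ("indef", "common", "sing"), NAMES)
print(ap.features, ap.overridden, ap.weight)
```

In this sketch the only clash is on definiteness, so the result records a single override. A corrector could then look up a dictionary form carrying the overridden value and offer it as the replacement.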
In the current version of the grammar, only the structures relevant to the error types we want the system to deal with - in other words nominal phrases and verbal groups - are accounted for in detail. The analysis produced is thus a kind of shallow syntactic analysis where the various sentence constituents are attached under the topmost S node as fragments.

For example, adjective phrases can be analysed as fragments, as shown in the following rule:

    Fragment ->
        AP "?Fragment AP rule":2

To indicate that the fragment analysis is not optimal, it is associated with an error weight, as well as an error message to be used for debugging purposes (the message is not visible to the end user). The weight penalises parse trees built by applying the rule. The rule is used e.g. to analyse an AP following a copula verb as in:

    (5)  De projekter er ikke interessante.
         (Those projects are not interesting)

The main motivation for implementing a grammar based on the idea of fragments was efficiency. Furthermore, the fragment strategy could be implemented very quickly. However, as will be clarified in Section 5, this strategy is sometimes responsible for bad flags.

4.1 Feature mismatches

As an alternative to the fragment analysis, APs can be attached as daughters in NPs. This is of course necessary for the treatment of agreement in NPs, one of the error types targeted in our application. This is shown in the following rule:

    NP(def Gender PersNumber) ->
        Det(def Gender PersNumber)
        AP(def _ _)
        N(indef Gender:9- PersNumber)

The rule will parse a correct definite NP such as:

    (6)  de interessante projekter
         (the interesting projects)

but also

    (7)  de *interessant projekter
    (8)  de interessante *projekterne

The feature overriding mechanism makes it possible for the system to suggest interessante as the correct replacement in (7), and projekter in (8). Let us see how this is done in more detail for example (7). The parser tries to apply the NP rule to the input string. The rule states that the adjective phrase must be definite (AP(def _ _)). But the dictionary entry corresponding to interessant bears the feature 'indef'. The parser will override this feature and build an NP according to the constraints expressed by the rule. At this point, a new dictionary lookup is performed, and the definite form of the adjective can be suggested as a replacement.

Weights are used to control rule interaction as well as to establish priorities among features that may have to be overridden. For example in our NP rule, a weight has been attached to the Gender feature in the N node. The weight expresses the fact that it costs more to override gender on the head noun than on the determiner or adjective. The rationale behind this is the fact that if there is a gender mismatch, the parser should not try to find an alternative form of the noun (which does not exist), but if necessary override the gender feature either on the adjective or the determiner.

4.2 Capturing structural errors in grammar rules

To capture structural errors, the formalism allows the grammar writer to write so-called error rules. The syntax of error rules is very similar to that used in 'normal' rules, the only difference being that an error rule must have an error weight and an error message attached to it. The purpose of the weight is to ensure that error rules are applied only if 'normal' rules are not applicable. The error message can serve two purposes. Depending on whether it is stated as an implicit or an explicit message (i.e. whether it is preceded by a question mark or not), it will appear in the log file where it can be used for debugging purposes, or in the output text as a message to the end user.

The following is an error rule example.

    VGroup(_ finite Tense) ->
        V(_ finite:4 Tense)
        V(_ finite:4 _)
        "Sequence of two finite verbs":4
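
As a rough illustration of how error weights and the two kinds of message might interact, the Python sketch below picks the analysis with the lowest error weight and routes its messages to either the log or the end user. The `Analysis` record, the helper names and the candidate weights are assumptions introduced here; only the two behaviours (lowest weight wins, a leading question mark marks an implicit message) come from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Analysis:
    description: str      # human-readable label for the parse
    weight: int           # total error weight accumulated by the parse
    messages: tuple = ()  # messages attached to the rules that were applied


def best_analysis(analyses):
    """Return the analysis with the lowest error weight (first one wins ties)."""
    return min(analyses, key=lambda a: a.weight)


def route_messages(messages):
    """Split messages into (log, user): a leading '?' marks an implicit message."""
    log, user = [], []
    for message in messages:
        if message.startswith("?"):
            log.append(message[1:])   # implicit: debugging log only
        else:
            user.append(message)      # explicit: shown in the output text
    return log, user


# Hypothetical candidates; the total weights are made up for illustration.
with_normal = [
    Analysis("normal VGroup", 0),
    Analysis("error-rule VGroup", 4, ("Sequence of two finite verbs",)),
]
only_error = with_normal[1:]

print(best_analysis(with_normal).description)
chosen = best_analysis(only_error)
print(route_messages(chosen.messages))
print(route_messages(("?Fragment AP rule",)))
```

Because the error rule carries a positive weight, it is selected only when no unpenalised analysis competes with it, which matches the stated purpose of the weight.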