Spelling and Grammar Correction for Danish in SCARRIE

Patrizia Paggio
Center for Sprogteknologi
Copenhagen (DK)
patrizia@cst.ku.dk

Abstract

This paper describes the prototype of a spelling and grammar corrector for Danish which combines traditional spelling checking functionalities with the ability to carry out compound analysis and to detect and correct certain types of context-dependent spelling errors (hereafter simply "grammar errors"). Grammar correction is carried out by parsing the text, making use of feature overriding and error weights to accommodate the errors. Although a full parse of each sentence is attempted, the grammar has been developed with the aim of dealing only with the most frequent error types found in a parallel corpus of unedited and proofread texts specifically collected by the project's end users. By focussing on certain grammatical constructions and certain error types, it has been possible to exploit the linguistic 'intelligence' provided by syntactic parsing and yet keep the system robust and efficient. The system described is thus superior to other existing spelling checkers for Danish in its ability to deal with certain types of grammar errors.

1 Introduction

In her much-quoted and still relevant review of technologies for automatic word correction (Kukich, 1992), Kukich observes that "research in context-dependent spelling correction is in its infancy" (p. 429), and that the task of treating context-dependent errors is still an elusive one due to the complexity of the linguistic knowledge often necessary to analyse the context in sufficient depth to find and correct such errors. But progress in parsing technology and the growing speed of computers seem to have made the task less of a chimera. The '90s have in fact seen a renewed interest in grammar checking, and proposals have been made for systems covering English (Bernth, 1997) and other languages such as Italian (Bolioli et al., 1992), Spanish and Greek (Bustamante and León, 1996), Czech (Holan et al., 1997) and Swedish (Hein, 1998).

This paper reports on work carried out to develop a spelling and grammar corrector for Danish, addressing in particular the issue of how a form of shallow parsing is combined with error detection and correction for the treatment of context-dependent spelling errors. The syntactic grammar for Danish used by the system has been developed with the aim of dealing with the most frequent error types found in a parallel corpus of unedited and proofread texts specifically collected by the project's end users. By focussing on certain grammatical constructions and certain error types, it has been possible to exploit the linguistic 'intelligence' provided by syntactic parsing and yet keep the system robust and efficient. The system described is thus superior to other existing spelling checkers for Danish in its ability to deal with context-dependent errors.

We begin by giving an overview of the system's components in Section 2. In Section 3 we describe the error types we want to deal with; Section 4 gives an overview of the grammar: in particular, the methods adopted for treating feature mismatches and structural errors are explained. Finally, in Section 5 evaluation results are presented and a conclusion is drawn.

2 The prototype

The prototype is a system for high-quality proofreading for Danish which has been developed in the context of a collaborative EU-project¹. Together with the Danish prototype, the project has also produced similar systems for Swedish and Norwegian, all of them tailored to meet the specific needs of the Scandinavian publishing industry. They all provide writing support in the form of word and grammar checking.

The Danish version of the system² constitutes a further development of the CORRie prototype (Vosse, 1992) (Vosse, 1994), adapted to deal with the Danish language, and to the needs of the project's end users. The system processes text in batch mode and produces an annotated output text where errors are flagged and replacements suggested where possible. Text correction is performed in two steps: first the system deals with spelling errors and typos resulting in invalid words, and then with grammar errors.

Invalid words are identified on the basis of dictionary lookup. The dictionary presently consists of 251,000 domain-relevant word forms extracted from a collection of 68,000 newspaper articles. A separate idiom list allowing for the identification of multi-word expressions is also available. Among the words not found in the dictionary or the idiom list, those occurring most frequently in the text (where frequency is assessed relative to the length of the text) are taken to be new words or proper names³. The remaining unknown words are passed on to the compound analysis grammar, which is a set of regular expressions covering the most common types of compound nominals in Danish. This is an important feature, as in Danish compounding is very productive, and compounds are written as single words.

Words still unknown at this point are taken to be spelling errors. The system flags them as such and tries to suggest a replacement. The algorithm used is based on trigram and triphone analysis (van Berkel and Smedt, 1988), and takes into account the orthographic strings corresponding to the invalid word under consideration and its possible replacement, as well as the phonetic representations of the same two words. Phonetic representations are generated by a set of grapheme-to-phoneme rules (Hansen, 1999) the aim of which is to assign phonetically motivated misspellings and their correct counterparts identical or similar phonetic representations.

Then the system tries to identify context-dependent spelling errors. This is done by parsing the text. Parsing results are passed on to a corrector to find replacements for the errors found. The parser is an implementation of the Tomita algorithm with a component for error recognition whose job is to keep track of error weights and feature mismatches as described in (Vosse, 1991). Each input sentence is assigned the analysis with the lowest error weight. If the error is due to a feature mismatch, the offending feature is overridden, and if a dictionary entry satisfying the grammar constraints expressed by the context is found in the dictionary, it is offered as a replacement. If the structure is incomplete, on the other hand, an error message is generated. Finally, if the system identifies an error as a split-up or a run-on, it will suggest either a possible concatenation, or a sequence of valid words into which the misspelt word can be split up.

¹ Main contractors in the consortium were: WordFinder Software AB (Sweden), Center for Sprogteknologi (Denmark), Department of Linguistics at Uppsala University (Sweden), Institutt for lingvistikk og litteraturvitenskab at the University of Bergen (Norway), and Svenska Dagbladet (Sweden). A number of subcontractors also contributed to the project. Subcontractors in Denmark were: Munksgaard International Publishers, Berlingske Tidende, Det Danske Sprog- og Litteraturselskab, and Institut for Almen og Anvendt Sprogvidenskab at the University of Copenhagen.

² In addition to the author of the present paper, the Danish SCARRIE team at CST consisted of Claus Povlsen, Bart Kongejan and Bradley Music.

³ The system also checks whether a closely matching alternative can be found in the dictionary, to avoid mistaking a consistently misspelt word for a new word.

3 The errors

To ensure the coverage of relevant error types, a set of parallel unedited and proofread texts provided by the Danish end users has been collected. This text collection consists of newspaper and magazine articles published in 1997 for a total of 270,805 running words. The articles have been collected in their raw version, as well as in the edited version provided by the publisher's own proofreaders. Although not very large in number of words, the corpus consists of excerpts from 450 different articles to ensure a good spread of lexical domains and error types. The corpus has been used to construct test suites for progress evaluation, and also to guide grammar development. The aim set for grammar development was then to enable the system to identify and analyse the grammatical constructions in which errors typically occur, whilst to some extent disregarding the remainder of the text.

The errors occurring in the corpus have been analysed according to the taxonomy in (Rambell, 1997). Figure 1 shows the distribution of the various error types into the five top-level categories of the taxonomy. As can be seen, grammar errors account for 30% of the errors. Of these, 70% fall into one of the following categories (Povlsen, 1998):

• Too many finite verbal forms or missing finite verb
• Errors in nominal phrases:
  - agreement errors,
  - wrong determination,
  - genitive errors,
  - errors concerning pronouns;
• Split-ups and run-ons.

    Error type                    No.     %
    Context independent errors    386    38
    Context dependent errors      308    30
    Punctuation problems          212    21
    Style problems                 89     9
    Graphical problems             24     2
    Total                        1019   100

Figure 1: Error distribution in the Danish corpus

Another way of grouping the errors is by the kind of parsing failure they generate: they can then be viewed as either feature mismatches, or as structural errors. Agreement errors are typical examples of feature mismatches. In the following nominal phrase, for example:

(1) de *interessant projekter
    (the interesting projects)

the error can be formalised as a mismatch between the definiteness of the determiner de (the) and the indefiniteness of the adjective interessant (interesting). Adjectives have in fact both an indefinite and a definite form in Danish.

The sentence below, on the other hand, is an example of structural error.

(2) i sin tid *skabet han skulpturer over atomkraften
    (during his time wardrobe/created he sculptures about nuclear power)

Since the finite verb skabte (created) has been misspelt as skabet (the wardrobe), the syntactic structure corresponding to the sentence is missing a verbal head.

Run-ons and split-ups are structural errors of a particular kind, having to do with leaves in the syntactic tree. In some cases they can only be detected on the basis of the context, because the misspelt word has the wrong category or bears some other grammatical feature that is incorrect in the context. Examples are given in (3) and (4) below, which like the preceding examples are taken from the project's corpus. In both cases, the error would be a valid word in a different context. More specifically, rigtignok (indeed) is an adverb, whilst rigtig nok (actually correct) is a modified adjective; and inden for (inside) is a preposition, whilst indenfor (indoors) is an adverb. In both examples the correct alternative is indicated in parentheses.

(3) ... studerede min gruppe *rigtig nok (rigtignok) under temaoverskrifter
    (studied my group indeed on the basis of topic headings)

(4) *indenfor (inden for) de gule mure
    (inside the yellow walls)

Although the system has a facility for identifying and correcting split-ups and run-ons based on a complex interaction between the dictionary, the idiom list, the compound grammar and the syntactic grammar, this facility has not been fully developed yet, and will therefore not be described any further here. More details can be found in (Paggio, 1999).

4 The grammar

The grammar is an augmented context-free grammar consisting of rewrite rules where symbols are associated with features. Error weights and error messages can also be attached to either rules or single features. The rules are applied by unification, but in cases where one or more features do not unify, the offending features will be overridden.

In the current version of the grammar, only the structures relevant to the error types we want the system to deal with - in other words nominal phrases and verbal groups - are accounted for in detail. The analysis produced is thus a kind of shallow syntactic analysis where the various sentence constituents are attached under the topmost S node as fragments.

For example, adjective phrases can be analysed as fragments, as shown in the following rule:

    Fragment ->
        AP "?Fragment AP rule":2

To indicate that the fragment analysis is not optimal, it is associated with an error weight, as well as an error message to be used for debugging purposes (the message is not visible to the end user). The weight penalises parse trees built by applying the rule. The rule is used e.g. to analyse an AP following a copula verb as in:

(5) De projekter er ikke interessante.
    (Those projects are not interesting)

The main motivation for implementing a grammar based on the idea of fragments was efficiency. Furthermore, the fragment strategy could be implemented very quickly. However, as will be clarified in Section 5, this strategy is sometimes responsible for bad flags.

4.1 Feature mismatches

As an alternative to the fragment analysis, APs can be attached as daughters in NPs. This is of course necessary for the treatment of agreement in NPs, one of the error types targeted in our application. This is shown in the following rule:

    NP(def Gender PersNumber) ->
        Det(def Gender PersNumber)
        AP(def _ _)
        N(indef Gender:9 PersNumber)

The rule will parse a correct definite NP such as:

(6) de interessante projekter
    (the interesting projects)

but also

(7) de *interessant projekter
(8) de interessante *projekterne

The feature overriding mechanism makes it possible for the system to suggest interessante as the correct replacement in (7), and projekter in (8). Let us see how this is done in more detail for example (7). The parser tries to apply the NP rule to the input string. The rule states that the adjective phrase must be definite (AP(def _ _)). But the dictionary entry corresponding to interessant bears the feature 'indef'. The parser will override this feature and build an NP according to the constraints expressed by the rule. At this point, a new dictionary lookup is performed, and the definite form of the adjective can be suggested as a replacement.

Weights are used to control rule interaction as well as to establish priorities among features that may have to be overridden. For example in our NP rule, a weight has been attached to the Gender feature in the N node. The weight expresses the fact that it costs more to override gender on the head noun than on the determiner or adjective. The rationale behind this is the fact that if there is a gender mismatch, the parser should not try to find an alternative form of the noun (which does not exist), but if necessary override the gender feature either on the adjective or the determiner.

4.2 Capturing structural errors in grammar rules

To capture structural errors, the formalism allows the grammar writer to write so-called error rules. The syntax of error rules is very similar to that used in 'normal' rules, the only difference being that an error rule must have an error weight and an error message attached to it. The purpose of the weight is to ensure that error rules are applied only if 'normal' rules are not applicable. The error message can serve two purposes. Depending on whether it is stated as an implicit or an explicit message (i.e. whether it is preceded by a question mark or not), it will appear in the log file where it can be used for debugging purposes, or in the output text as a message to the end user.

The following is an error rule example.

    VGroup(_ finite Tense) ->
        V(_ finite:4 Tense)
        V(_ finite:4 _)
        "Sequence of two finite verbs":4
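
To make the interplay of feature overriding, error weights and dictionary lookup from Section 4.1 more concrete, the following is a minimal sketch in Python. It is not the SCARRIE implementation: the toy lexicon, the names Entry, NP_RULE, OVERRIDE_COST and parse_np, and the uniform override cost of 1 are all assumptions made for illustration, and the rule is reduced to the definiteness feature only (gender and number agreement are left out).

```python
from dataclasses import dataclass
from typing import Dict, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Entry:
    """One dictionary word form (hypothetical, simplified feature set)."""
    form: str
    lemma: str
    cat: str           # "Det", "A" or "N"
    definiteness: str  # "def" or "indef"


# Toy lexicon covering only the forms used in examples (6)-(8).
LEXICON = [
    Entry("de", "de", "Det", "def"),
    Entry("interessant", "interessant", "A", "indef"),
    Entry("interessante", "interessant", "A", "def"),
    Entry("projekter", "projekt", "N", "indef"),
    Entry("projekterne", "projekt", "N", "def"),
]

# Definite NP rule of Section 4.1, reduced to definiteness:
# NP -> Det(def) AP(def) N(indef)
NP_RULE = [("Det", "def"), ("A", "def"), ("N", "indef")]

# Assumed cost of overriding one feature; the real grammar attaches
# its own weights to rules and to individual features.
OVERRIDE_COST = 1


def lookup(form: str) -> Optional[Entry]:
    """Return the lexicon entry for a word form, if any."""
    return next((e for e in LEXICON if e.form == form), None)


def find_form(lemma: str, cat: str, definiteness: str) -> Optional[str]:
    """Second dictionary lookup: a form of the lemma with the required feature."""
    return next(
        (e.form for e in LEXICON
         if e.lemma == lemma and e.cat == cat and e.definiteness == definiteness),
        None,
    )


def parse_np(words: Sequence[str]) -> Optional[Tuple[int, Dict[str, Optional[str]]]]:
    """Apply NP_RULE, overriding mismatched features.

    Returns (error weight, {flagged word: suggested replacement}),
    or None when the rule cannot apply at all (a structural failure).
    """
    if len(words) != len(NP_RULE):
        return None
    weight = 0
    suggestions: Dict[str, Optional[str]] = {}
    for word, (cat, required) in zip(words, NP_RULE):
        entry = lookup(word)
        if entry is None or entry.cat != cat:
            return None
        if entry.definiteness != required:
            # Feature mismatch: override it, pay the cost, look for a replacement.
            weight += OVERRIDE_COST
            suggestions[word] = find_form(entry.lemma, cat, required)
    return weight, suggestions
```

In this sketch a correct NP such as (6) receives weight 0, while (7) and (8) each receive one override and a suggested replacement. A parser in the spirit of the paper would compare such weights across competing analyses and keep the one with the lowest total.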