A New Annotation Scheme for the Sejong Part-of-speech Tagged Corpus
                                                                   Jungyeul Park                                                                  Francis Tyers
                                                        Department of Linguistics                                                      Department of Linguistics
                                                             University at Buffalo                                                            Indiana University
                                                    jungyeul@buffalo.edu                                                              ftyers@indiana.edu
Abstract

In this paper we present a new annotation scheme for the Sejong part-of-speech tagged corpus based on Universal Dependencies style annotation. By using a new annotation scheme, we can produce Sejong-style morphological analysis and part-of-speech tagging results which have been the de facto standard for Korean language processing. We also explore the possibility of doing named-entity recognition and semantic-role labelling for Korean using the new annotation scheme.

프랑스의      프랑스/NNP+의/JKG               peurangseu-ui    ‘France-GEN’
세계적인      세계/NNG+적/XSN+이/VCP+ㄴ/ETM    segye-jeok-i-n   ‘world class-REL’
의상          의상/NNG                        uisang           ‘fashion’
디자이너      디자이너/NNG                    dijaineo         ‘designer’
엠마누엘      엠마누엘/NNP                    emmanuel         ‘Emanuel’
웅가로가      웅가로/NNP+가/JKS               unggaro-ga       ‘Ungaro-NOM’
실내          실내/NNG                        silnae           ‘interior’
장식용        장식용/NNG                      jangsikyong      ‘decoration’
직물          직물/NNG                        jikmul           ‘textile’
디자이너로    디자이너/NNG+로/JKB             dijaineo-ro      ‘designer-AJT’
나섰다.       나서/VV+었/EP+다/EF+./SF        naseo-eoss-da.   ‘become-PAST-IND-.’

Figure 1: Examples in the Sejong POS tagged corpus: ‘The world class French fashion designer Emanuel Ungaro became a designer of interior textile decorations.’ (See Table 1 for POS tag information in the Sejong corpus)
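As a reading aid for the analyses in Figure 1, the following minimal Python sketch splits a Sejong-style analysis string such as 프랑스/NNP+의/JKG into (morpheme, tag) pairs. The function name parse_sejong is hypothetical and is not part of any Sejong tooling. The sketch assumes that morphemes are joined by "+", that each tag follows the last "/" of its morpheme, and that tags consist of upper-case ASCII letters only.

```python
import re

# Hypothetical helper, introduced only for illustration.
# Assumed layout (as in Figure 1): form/TAG+form/TAG+...
# A morpheme is matched lazily up to a "/TAG" that is followed by "+" or by
# the end of the string, so a literal "+" or "." morpheme is still handled.
_MORPHEME = re.compile(r"(.+?)/([A-Z]+)(?:\+|$)")


def parse_sejong(analysis: str) -> list[tuple[str, str]]:
    """Split a Sejong-style analysis into (morpheme, tag) pairs."""
    return _MORPHEME.findall(analysis)


if __name__ == "__main__":
    for morpheme, tag in parse_sejong("나서/VV+었/EP+다/EF+./SF"):
        print(morpheme, tag)
```

The regular expression would mis-split a foreign-script morpheme that itself contains a sequence like "/AB+", so real corpus files may need a stricter tokeniser.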
1 Introduction

In 1998 the Ministry of Culture and Tourism of Korea launched the 21st Century Sejong Project to promote Korean language information processing. The project is named after Sejong the Great, who conceived and led the invention of hangul, the Korean alphabet. The corpus was released in 2003 and was continually updated until 2011, producing the largest corpus of Korean to date. It includes several types of texts: historical, contemporary, and parallel texts. The section of contemporary corpora contains both oral and written texts. In this paper we focus on the contemporary written text which is annotated for morphology. This is referred to as the Sejong part-of-speech tagged corpus.

The contents of the Sejong POS-tagged corpus represent a variety of sources: newswire text, magazine articles on various subjects and topics, several book excerpts, and crawled texts from the internet. The current version of the morphologically annotated POS-tagged corpus consists of 279 files with over 802K sentences and 9.2M eojeols.¹ The current annotation scheme in the Sejong corpus is exclusively based on the eojeol concept. The corpus uses the Sejong tagset that contains 44 POS tags for the entire annotated corpus. Figure 1 shows an example of the annotation in the Sejong POS-tagged corpus.

¹ An eojeol is a word separated by blank spaces.

As the Sejong corpus is the largest annotated corpus of Korean and as it uses a segmentation scheme based on eojeols, most Korean language processing systems have subsequently been developed using this as their basic segmentation scheme. There are many language processing systems based on the eojeol-segmentation schemes, for example: POS tagging (Hong, 2009; Na, 2015; Park et al., 2016) and dependency parsing (Oh, 2009; Oh and Cha, 2010; Park et al., 2013).

There are, however, different segmentation granularity levels (that is, ways to tokenise words in sentences) for Korean which have been independently proposed in previous work as basic units.

This paper explores the Sejong POS-tagged corpus to define a new annotation method for end-to-end morphological analysis and POS tagging. Many upstream applications for Korean language processing are based on a segmentation scheme in which all morphemes are separated. For example, Choi et al. (2012) and Park et al. (2016) present work on phrase-structure parsing, and work on statistical machine translation (SMT) is presented by
Proceedings of the 13th Linguistic Annotation Workshop, pages 195–202, Florence, Italy, August 1, 2019. © 2019 Association for Computational Linguistics
Park et al. (2016, 2017), etc. This is done in order to avoid data sparsity, because longer segmentation granularity can combine words in an exponential way.

We propose a new approach to annotation using a morphologically separated word based on the approach for annotating multiword tokens (MWT) in the CoNLL-U format.² Using the new annotation scheme, we can also explore tasks beyond POS tagging such as named-entity recognition (NER) and semantic role labelling (SRL). While there are a number of papers looking at NER for Korean (Chung et al., 2003; Yun, 2007), and SRL (Kim et al., 2014),³ these tasks have hardly been discussed in previous literature on Korean language processing. They have been considered difficult to handle with the current annotation scheme of the Sejong POS corpus because of the limitations of the current eojeol-based annotation and the agglutinative characteristics of the language. For example, for NER, having postpositions attached to the last word in the phrase they modify can make it more difficult to identify the named entity. The annotation scheme we propose (see Figure 3) is also different from the current annotation scheme in Universal Dependencies for Korean morphology, which represents combined morphemes for eojeols (see Figure 4).

² http://universaldependencies.org/format.html
³ There is also Penn Korean PropBank (https://catalog.ldc.upenn.edu/LDC2006T03)

2 CoNLL-U Format for Korean

We use CoNLL-U style Universal Dependency (UD) annotation for Korean morphology. We first review the current approaches to annotating Korean in UD and their potential limitations. The CoNLL-U format is a revised version of the previous CoNLL-X format, which contains ten fields from word index to dependency relation to the head. This paper concerns only the morphological annotation: word form, lemma, universal POS tag and language-specific POS tag (Sejong POS tag). The other fields will be annotated either by an underscore, which represents not being available, or by dummy information so that the output is well-formed for input into applications that process the CoNLL-U format such as UDPipe (Straka and Straková, 2017).

Sejong POS (S)                              description                    Universal POS (U)
NNG, NNP, NNB, NR, XR                       noun related                   NOUN
NNP                                         proper noun                    PROPN
NP                                          pronoun                        PRON
MAG                                         adverb                         ADV
MAJ                                         conjunctive adverb             CONJ
MM                                          determiner                     DET
VV, VX, VCN, VCP                            verb related                   VERB
VA                                          adjective                      ADJ
EP, EF, EC, ETN, ETM                        verbal endings                 PART
JKS, JKC, JKG, JKO, JKB, JKV, JKQ, JX, JC   postpositions (case markers)   ADP
XPN, XSN, XSA, XSV                          suffixes                       PART
IC                                          interjection                   INTJ
SF, SP, SE, SO, SS                          punctuation marks              PUNCT
SW                                          special characters             X
SH, SL                                      foreign characters             X
SN                                          number                         NUM
NA, NF, NV                                  unknown words                  X

Table 1: POS tags in the Sejong corpus and their 1-to-1 mapping to Universal POS tags

2.1 Universal POS tags and their mapping

To facilitate future research and to standardize best practices, Petrov et al. (2012) proposed a tagset of Universal POS categories. The current Universal POS tag mapping for Sejong POS tags is based on a handful of POS patterns of eojeols. However, combinations of words in Korean are very productive and exponential. Therefore, the number of POS patterns of the word does not converge even though the number of words increases. For example, the Sejong treebank contains about 450K words and almost 5K POS patterns. We also test with the Sejong morphologically analysed corpus which contains 9.2M eojeols. The number of POS patterns does not converge and it increases up to over 50K. The wide range of POS patterns is mainly due to the fine-grained morphological analysis, which shows all possible segmentations divided into lexical and functional morphemes. These various POS patterns might indicate useful morpho-syntactic information for Korean. To benefit from the detailed annotation scheme in the Sejong treebank, Oh et al. (2011) predicted function labels (phrase-level tags) using POS patterns that improve dependency parsing results. Table 1 shows the summary of the Sejong POS tagset and its detailed mapping to the Universal POS tags. Note that we convert the XR (non-autonomous lexical root) into the NOUN because they are mostly considered nouns or a part of a noun: e.g., minju/XR (‘democracy’).

2.2 MWTs in UD

Multiword token (MWT) annotation has been accommodated in the CoNLL-U format, in which MWTs are indexed with ranges from the first token in the word to the last token in the word, e.g. 1-2. These have a value in the word form field, but have an underscore in all the remaining fields. This
multiword token is then followed by a sequence of words (or morphemes). For example, a Spanish MWT vámonos (‘let’s go’) from the sentence vámonos al mar (‘let’s go to the sea’) is represented in the CoNLL-U format as in Figure 2a.⁴ Vámonos, which is the first-person plural present imperative of ir (‘go’), consists of vamos and nos in MWT-style annotation. In this way, we annotate the Korean eojeol as MWTs. Figure 2b shows that naseossda (‘became’) in Korean can also be represented as MWTs, and all morphemes including a verb stem and inflectional-modal suffixes are separated. Sag et al. (2002) defined the various kinds of MWTs, and Salehi et al. (2016) presented an approach to determine MWT types even with no explicit prior knowledge of MWT patterns in a given language. Çöltekin (2016) describes a set of heuristics for determining when to annotate individual morphemes as features or separate syntactic words in Turkish. The two main criteria are (1) does the word enter into a labelled syntactic relation with another word in the sentence (e.g. obviating the need for a special relation for derivation); and (2) does the addition of the morpheme entail a possible feature clash (e.g. two different values for the Number feature in the same syntactic word).

⁴ The example is copied from http://universaldependencies.org/format.html

1-2     vámonos
1       vamos       ir (‘go’)
2       nos         nosotros (‘us’)
...
(a) vámonos (‘let’s go’)

...
18-20   naseossda
18      naseo       naseo (‘become’)
19      eoss        eoss (‘PAST’)
20      da          da (‘IND’)
(b) naseossda (‘became’)

Figure 2: Examples of MWTs in UD

3 A New Annotation Scheme

This section describes a new annotation scheme for Korean. We propose a conversion method for the existing UD-style annotation of the Sejong POS tagged corpus to the new scheme.

3.1 Conversion scheme

The conversion is straightforward. For one-morpheme words, we convert them into word index, word form, lemma, universal POS tag and Sejong POS tag. For multiple-morpheme words, we convert them as described in §2.2: word index ranges and word form followed by lines of morpheme form, lemma, universal POS tag and Sejong POS tag. For the lemma of suffixes, we use the Penn Korean treebank-style (Han et al., 2002) suffix normalisation as described in Table 2. The whole conversion table is provided in Appendix A. Figure 3 shows an example of the proposed CoNLL-U format for the Sejong POS tagged corpus. As previously proposed for Korean Universal Dependencies, we separate punctuation marks from the word in order to tokenize them, which is the only difference from the original Sejong corpus, which is exclusively based on the eojeol (that is, punctuation is attached to the word that precedes it). One of the main problems in the Sejong POS tagged corpus is ambiguous annotation of symbols usually tagged with SF, SP, SE, SO, SS, SW. For example, the full stop in naseo/VV + eoss/EP + da/EF + ./SF (‘became’) and the decimal point in 3/SN + ./SF + 14/SN (‘3.14’) are not distinguished from each other. We use heuristic rules to identify whether symbols are punctuation marks, and tokenize them. Appendix B details and discusses the tokenisation problem, and how we can further process other symbols.

                 wordform   lemma
verbal ending    ㄴ          은
                 ㄹ지        을지
case marker      가          이       (‘NOM’)
                 를          을       (‘ACC’)
                 는          은       (‘AUX’)

Table 2: Suffix normalisation examples

3.2 Experiments and Results

For our experiments, we automatically convert the Sejong POS-tagged corpus into CoNLL-U style annotation with MWT annotation for eojeols. We evaluate tokenisation, morphological analysis, and POS tagging results using UDPipe (Straka and Straková, 2017). We use the proposed corpus division of the Sejong POS tagged corpus for experiments as described in Appendix C. We obtain 99.88% f1 score for segmentation and 94.75% accuracy for POS tagging for language-specific POS tags (Sejong tag sets). Previously, Na (2015) obtained 97.90% and 94.57% for segmentation and POS tagging respectively using the same Sejong corpus. While we outperform the previous results
                      # sent id = BTAA0001-00000012
                      # text = 프랑스의세계적인의상디자이너엠마누엘웅가로가실내장식용직물디자이너로나섰다.
                        1-2      프랑스의                                                                peurangseu-ui (‘France-GEN’)
                        1        프랑스             프랑스          PROPN      NNP                         peurangseu (‘France’)
                        2        의               의            ADP        JKG                         -ui (‘-GEN’)
                        3-6      세계적인                                                                segye-jeok-i-n (‘world class-REL’)
                        3        세계              세계           NOUN       NNG                         segye (‘world’)
                        4        적               적            PART       XSN                         -jeok (‘-SUF’)
                        5        이               이            VERB       VCP                         -i (‘-COP’)
                        6        ㄴ               은            PART       ETM                         -n (‘-REL’)
                        7        의상              의상           NOUN       NNG                         uisang (‘fashion’)
                        8        디자이너            디자이너 NOUN NNG                                       dijaineo (‘designer’)
                        9        엠마누엘            엠마누엘 PROPN NNP                                      emmanuel(‘Emanuel’)
                        10-11    웅가로가                                                                unggaro-ga (‘Ungaro-NOM’)
                        10       웅가로             웅가로          PROPN      NNP                         unggaro (‘Ungaro’)
                        11       가               가            ADP        JKS                         -ga (‘-NOM’)
                        12       실내              실내           NOUN       NNG                         silnae (‘interior’)
                        13-14    장식용                                                                 jangsikyong (‘decoration’)
                        13       장식              장식           NOUN       NNG                         jangsik (‘decoration’)
                        14       용               용            PART       XSN                         -yong (‘usage’)
                        15       직물              직물           NOUN       NNG                         jikmul (‘textile’)
                        16-17    디자이너로                                                               dijaineo-ro (‘designer-AJT’)
                        16       디자이너            디자이너 NOUN NNG                                       dijaineo (‘designer’)
                        17       로               로            ADP        JKB                         -ro (‘-AJT’)
                        18-20    나섰다                                              SpaceAfter=No      naseo-eoss-da (‘become-PAST-IND)
                        18       나서              나서           VERB       VV                          naseo (‘become’)
                        19       었               었            PART       EP                          -eoss (‘PAST’)
                        20       다               다            PART       EF                          -da (‘-IND)
                        21       .               .            PUNCT      SF
                     Figure 3: The proposed CoNLL-U style annotation with multi-word tokens (MWT) for morphological analysis and
                     POS tagging: a glossed example is provided in Figure 1.
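
The conversion from the annotation in Figure 3 back to Sejong-style output can be illustrated with a minimal sketch. It assumes the standard ten-column, tab-separated CoNLL-U layout (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC) and fills the columns not shown in the figure with `_`. The function name `conllu_to_sejong` and the `SAMPLE` rows are hypothetical and introduced only for illustration; this is not the authors' conversion tool.

```python
def conllu_to_sejong(lines):
    """Turn CoNLL-U rows with MWT ranges into (word, Sejong-style analysis) pairs."""
    result = []
    pending = None  # (surface form, last id in the range, morphemes collected so far)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        tok_id, form, lemma, xpos = cols[0], cols[1], cols[2], cols[4]
        if "-" in tok_id:
            # A range line such as "10-11" carries only the surface word.
            _, end = tok_id.split("-")
            pending = (form, int(end), [])
            continue
        morph = f"{lemma}/{xpos}"
        if pending is not None:
            pending[2].append(morph)
            if int(tok_id) == pending[1]:
                # Last morpheme of the range: emit the whole word.
                result.append((pending[0], "+".join(pending[2])))
                pending = None
        else:
            # A word that consists of a single morpheme.
            result.append((form, morph))
    return result


# Rows 10-12 and 18-21 of Figure 3; unshown columns are filled with "_".
SAMPLE = ["\t".join(r) for r in [
    ["10-11", "웅가로가", "_", "_", "_", "_", "_", "_", "_", "_"],
    ["10", "웅가로", "웅가로", "PROPN", "NNP", "_", "_", "_", "_", "_"],
    ["11", "가", "가", "ADP", "JKS", "_", "_", "_", "_", "_"],
    ["12", "실내", "실내", "NOUN", "NNG", "_", "_", "_", "_", "_"],
    ["18-20", "나섰다", "_", "_", "_", "_", "_", "_", "_", "SpaceAfter=No"],
    ["18", "나서", "나서", "VERB", "VV", "_", "_", "_", "_", "_"],
    ["19", "었", "었", "PART", "EP", "_", "_", "_", "_", "_"],
    ["20", "다", "다", "PART", "EF", "_", "_", "_", "_", "_"],
    ["21", ".", ".", "PUNCT", "SF", "_", "_", "_", "_", "_"],
]]

for word, analysis in conllu_to_sejong(SAMPLE):
    print(word, analysis, sep="\t")
```

Each multi-word token is emitted once, with its morphemes joined by the plus sign as in the Sejong corpus, while every morpheme remains individually addressable in the CoNLL-U rows.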
including Na (2015), it would not be fair to make a direct comparison because the previous results used a different size of the Sejong corpus⁵ and a different division of the corpus. Jung et al. (2018) reported a 97.08% F1 score (instead of accuracy) for their results. Their score is measured over the entire sequence of morphemes because of their seq2seq model, whereas our accuracy is based on a word-level measurement.

3.3    Comparison with the current UD annotation

There are currently two Korean treebanks available in UD v2.2: the Google Korean Universal Dependency Treebank (McDonald et al., 2013) and the KAIST Korean Universal Dependency Treebank (Chun et al., 2018). For the lemma and language-specific POS tag fields, they concatenate annotations using the plus sign, as shown in Figure 4. We note that the Sejong and KAIST tag sets are used as language-specific POS tags, respectively. However, while the current CoNLL-U style UD annotation for Korean can simulate and yield the POS tagging annotation of the Sejong corpus, it cannot deal with NER or SRL tasks as we propose in §4. For example, a word like peurangseuui (‘of France’) is segmented and analysed into peurangseu/PROPER NOUN and ui/GEN. The current UD annotation for Korean makes the lemma peurangseu+ui and the language-specific POS tag NNP+JKG, from which we can produce the Sejong style POS tagging annotation peurangseu/NNP+ui/JKG. While a named entity peurangseu (‘France’) should be recognised independently, UD annotation for Korean does not have any way to identify entities by themselves without case markers. In addition, as we described in §2.1, the number of POS patterns of the word, which is used in the language-specific POS tag field, does not converge. Recall that the language-specific POS tag is the sequence of concatenated POS tags such as NNP+JKG or NNG+XSN+VCP+ETM. The number of these POS patterns is exponential because of the agglutinative nature of words in Korean. However, it can be a serious problem for system implementation if we want to deal with the entire Sejong corpus

⁵ Previous work often used cross validation or a corpus split without specific corpus-splitting guidelines. This makes it difficult to correctly compare the POS tagging results. For future reference and to be able to reproduce the results, we propose an explicit-split method for the Sejong POS tagged corpus in Appendix C.
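
The plus-sign concatenation used in the current UD annotation for Korean, and the growth in word-level POS patterns it implies, can be sketched as follows. This is a minimal illustration under stated assumptions: the function names `ud_to_sejong` and `count_pos_patterns` are hypothetical, the example words are taken from the glossed example in this paper, and the split on `+` assumes that no morpheme is itself a literal plus sign.

```python
from collections import Counter


def ud_to_sejong(lemma, xpos):
    """Pair a concatenated lemma with its concatenated language-specific POS tag."""
    morphemes = lemma.split("+")
    tags = xpos.split("+")
    if len(morphemes) != len(tags):
        raise ValueError(f"mismatched fields: {lemma!r} vs {xpos!r}")
    return "+".join(f"{m}/{t}" for m, t in zip(morphemes, tags))


def count_pos_patterns(words):
    """Count how often each whole-word POS pattern (e.g. NNP+JKG) occurs."""
    return Counter(xpos for _, xpos in words)


# (lemma, language-specific POS tag) pairs in the concatenated style.
words = [
    ("프랑스+의", "NNP+JKG"),
    ("세계+적+이+ㄴ", "NNG+XSN+VCP+ETM"),
    ("웅가로+가", "NNP+JKS"),
]

for lemma, xpos in words:
    print(ud_to_sejong(lemma, xpos))

# Every distinct tag sequence is a separate label at the word level.
print(len(count_pos_patterns(words)), "distinct POS patterns")
```

Because each distinct tag sequence counts as its own word-level label, the label inventory grows with every new combination of morphemes, whereas tagging the morphemes individually keeps it bounded by the size of the tag set.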