A Sub-Character Architecture for Korean Language Processing

Karl Stratos
Toyota Technological Institute at Chicago
stratos@ttic.edu

Abstract

We introduce a novel sub-character architecture that exploits a unique compositional structure of the Korean language. Our method decomposes each character into a small set of primitive phonetic units called jamo letters from which character- and word-level representations are induced. The jamo letters divulge syntactic and semantic information that is difficult to access with conventional character-level units. They greatly alleviate the data sparsity problem, reducing the observation space to 1.6% of the original while increasing accuracy in our experiments. We apply our architecture to dependency parsing and achieve dramatic improvement over strong lexical baselines.

[Figure 1: Korean sentence "산을 갔다" (I went to the mountain) decomposed to words, characters, and jamos: 산을 갔다 → 산 · 을 · 갔 · 다 → ㅅㅏㄴ / ㅇㅡㄹ / ㄱㅏㅆ / ㄷㅏ∅.]

1 Introduction

Korean is generally recognized as a language isolate: that is, it has no apparent genealogical relationship with other languages (Song, 2006; Campbell and Mixco, 2007). A unique feature of the language is that each character is composed of a small, fixed set of basic phonetic units called jamo letters. Despite the important role jamo plays in encoding the syntactic and semantic information of words, it has been neglected in existing modern Korean processing algorithms. In this paper, we bridge this gap by introducing a novel compositional neural architecture that explicitly leverages the sub-character information.

Specifically, we perform Unicode decomposition on each Korean character to recover its underlying jamo letters and construct character- and word-level representations from these letters. See Figure 1 for an illustration of the decomposition. The decomposition is deterministic; this is a crucial departure from previous work that uses language-specific sub-character information such as radicals (graphical components of Chinese characters). The radical structure of a Chinese character does not follow any systematic process, requiring an incomplete dictionary mapping between characters and radicals to take advantage of this information (Sun et al., 2014; Yin et al., 2016). In contrast, our Unicode decomposition does not need any supervision and can extract correct jamo letters for all possible Korean characters.

Our jamo architecture is fully general and can be plugged into any Korean processing network. For a concrete demonstration of its utility, in this work we focus on dependency parsing. McDonald et al. (2013) note that "Korean emerges as a very clear outlier" in their cross-lingual parsing experiments on the universal treebank, implying a need to tailor a model for this language isolate. Because of its compositional morphology, Korean suffers extreme data sparsity at the word level: 2,703 out of 4,698 word types (> 57%) in the held-out portion of our treebank are OOV. This makes the language challenging for simple lexical parsers even when augmented with a large set of pre-trained word representations.

While such data sparsity can also be alleviated by incorporating more conventional character-level information, we show that incorporating jamo is an effective and economical new approach to combating the sparsity problem for Korean. In experiments, we decisively improve the LAS of the lexical BiLSTM parser of Kiperwasser and Goldberg (2016) from 82.77 to 91.46 while reducing the size of the input space by 98.4% when we replace words with jamos. As a point of reference, a strong feature-rich parser using gold POS tags obtains 88.61.

To summarize, we make the following contributions.

• To our knowledge, this is the first work that leverages jamo in end-to-end neural Korean processing. To this end, we develop a novel sub-character architecture based on deterministic Unicode decomposition.

• We perform extensive experiments on dependency parsing to verify the utility of the approach. We show a clear performance boost with a drastically smaller set of parameters. Our final model outperforms strong baselines by a large margin.

• We release an implementation of our jamo architecture, which can be plugged into any Korean processing network.[1]

[1] https://github.com/karlstratos/koreannet
2 Related Work

We make a few additional remarks on related work to better situate this paper. Our work follows the successful line of research on incorporating sub-lexical information into neural models. Various character-based architectures have been proposed: for instance, Ma and Hovy (2016) and Kim et al. (2016) use CNNs over characters, whereas Lample et al. (2016) and Ballesteros et al. (2015) use bidirectional LSTMs (BiLSTMs). Both approaches have been shown to be profitable; we employ a BiLSTM-based approach.

Many previous works have also considered morphemes to augment lexical models (Luong et al., 2013; Botha and Blunsom, 2014; Cotterell et al., 2016). Sub-character models are substantially rarer; an extreme case is considered by Gillick et al. (2016), who process text as a sequence of bytes. We believe that such byte-level models are too general and that there are opportunities to exploit the natural sub-character structure of certain languages such as Korean and Chinese.

There exists a line of work on exploiting the graphical components of Chinese characters called radicals (Sun et al., 2014; Yin et al., 2016). For instance, 足 (foot) is the radical of 跑 (run). While related, our work on Korean is distinguished in critical ways and should not be thought of as just an extension to another language. First, as mentioned earlier, the compositional structure is fundamentally different between Chinese and Korean: the mapping between radicals and characters in Chinese is nondeterministic and can only be loosely approximated by an incomplete dictionary, whereas the mapping between jamos and Korean characters is deterministic (Section 3.1), allowing for systematic decomposition of all possible Korean characters. Second, the previous work on Chinese radicals was concerned with learning word embeddings; we develop an end-to-end compositional model for a downstream task: parsing.

3 Method

3.1 Jamo Structure of the Korean Language

Let W denote the set of word types and C the set of character types. In many languages, a character c ∈ C is the most basic unit that is meaningful. In Korean, each character is further composed of a small fixed set of phonetic units called jamo letters J, where |J| = 51. The jamo letters are categorized as head consonants J_h, vowels J_v, or tail consonants J_t. The composition is completely systematic: given any character c ∈ C, there exist c_h ∈ J_h, c_v ∈ J_v, and c_t ∈ J_t such that their composition yields c. Conversely, any c_h ∈ J_h, c_v ∈ J_v, and c_t ∈ J_t can be composed to yield a valid character c ∈ C.

As an example, consider the word 갔다 (went). It is composed of two characters, 갔, 다 ∈ C, each of which is in turn composed of three jamo letters:

• 갔 ∈ C is composed of ㄱ ∈ J_h, ㅏ ∈ J_v, and ㅆ ∈ J_t.

• 다 ∈ C is composed of ㄷ ∈ J_h, ㅏ ∈ J_v, and an empty letter ∅ ∈ J_t.

The tail consonant can be empty; we assume a special symbol ∅ ∈ J_t to denote an empty letter. Figure 1 illustrates the decomposition of a Korean sentence down to jamo letters.

Note that the number of possible characters is combinatorial in the number of jamo letters, loosely upper bounded by 51³ = 132,651. This upper bound is loose because certain combinations are invalid: for instance, ㅁ ∈ J_h ∩ J_t but ㅁ ∉ J_v, whereas ㅏ ∈ J_v but ㅏ ∉ J_h ∪ J_t. The combinatorial nature of Korean characters motivates the compositional architecture below. For completeness, we also describe the entire forward pass of the transition-based BiLSTM parser of Kiperwasser and Goldberg (2016) that we use in our experiments.
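The deterministic composition described above can be checked directly with standard Hangul Unicode arithmetic: every valid (head, vowel, tail) index triple maps to exactly one precomposed syllable. The sketch below is ours, added for illustration; it is not code from the paper, and the index conventions are the standard Unicode Hangul orderings.

```python
# Minimal sketch (ours, not from the paper): the deterministic mapping from
# jamo index triples to precomposed Hangul syllables.
# Indices: L = head consonant (0-18), V = vowel (0-20), T = tail (0-27, 0 = empty).

def compose(L: int, V: int, T: int = 0) -> str:
    """Compose jamo indices into a single precomposed Hangul character."""
    assert 0 <= L < 19 and 0 <= V < 21 and 0 <= T < 28
    return chr(0xAC00 + 588 * L + 28 * V + T)

# 갔 = ㄱ (L=0) + ㅏ (V=0) + ㅆ (T=20); 다 = ㄷ (L=3) + ㅏ (V=0) + ∅ (T=0).
assert compose(0, 0, 20) == "갔"
assert compose(3, 0) == "다"
# Every triple is valid at this positional level: 19 * 21 * 28 = 11,172 syllables.
```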
3.2 Jamo Architecture

The parameters associated with the jamo layer are

• an embedding e^l ∈ R^d for each letter l ∈ J
• U^J, V^J, W^J ∈ R^{d×d} and b^J ∈ R^d

Given a Korean character c ∈ C, we perform Unicode decomposition (Section 3.3) to recover the underlying jamo letters c_h, c_v, c_t ∈ J. We compose the letters to induce a representation of c as

$$h^c = \tanh\left(U^J e^{c_h} + V^J e^{c_v} + W^J e^{c_t} + b^J\right)$$

This representation is then concatenated with a character-level lookup embedding, and the result is fed into an LSTM to produce a word representation. We use an LSTM (Hochreiter and Schmidhuber, 1997) simply as a mapping φ : R^{d_1} × R^{d_2} → R^{d_2} that takes an input vector x and a state vector h and outputs a new state vector h′ = φ(x, h). The parameters associated with this layer are

• an embedding e^c ∈ R^{d′} for each c ∈ C
• a forward LSTM φ^f : R^{d+d′} × R^d → R^d
• a backward LSTM φ^b : R^{d+d′} × R^d → R^d
• U^C ∈ R^{d×2d} and b^C ∈ R^d

Given a word w ∈ W and its character sequence c_1 ... c_m ∈ C, we compute

$$f^c_i = \phi^f\left([h^{c_i}; e^{c_i}],\; f^c_{i-1}\right) \quad \forall i = 1 \ldots m$$
$$b^c_i = \phi^b\left([h^{c_i}; e^{c_i}],\; b^c_{i+1}\right) \quad \forall i = m \ldots 1$$

and induce a representation of w as

$$h^w = \tanh\left(U^C [f^c_m; b^c_1] + b^C\right)$$

Lastly, this representation is concatenated with a word-level lookup embedding (which can be initialized with pre-trained word embeddings), and the result is fed into a BiLSTM network. The parameters associated with this layer are

• an embedding e^w ∈ R^{d_W} for each w ∈ W
• a two-layer BiLSTM Φ that maps h_1 ... h_n ∈ R^{d+d_W} to z_1 ... z_n ∈ R^{d*}
• a feedforward network for predicting transitions

Given a sentence w_1 ... w_n ∈ W, the final d*-dimensional word representations are given by

$$(z_1, \ldots, z_n) = \Phi\left([h^{w_1}; e^{w_1}], \ldots, [h^{w_n}; e^{w_n}]\right)$$

The parser then uses the feedforward network to greedily predict transitions based on the words that are active in the system. The model is trained end-to-end by optimizing a max-margin objective. Since this part is not a contribution of this paper, we refer to Kiperwasser and Goldberg (2016) for details.

By setting the embedding dimension of jamos d, characters d′, or words d_W to zero, we can configure the network to use any combination of these units. We report these experiments in Section 4.
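To make the jamo composition concrete, here is a minimal sketch of the jamo layer in isolation, written with NumPy rather than the DyNet library used in the paper's released implementation (Section 4). The parameter values, the toy jamo subset, and the dimension d = 64 are illustrative assumptions, not values from the paper.

```python
import numpy as np

d = 64  # jamo embedding dimension (illustrative choice, not from the paper)
rng = np.random.default_rng(0)

# Hypothetical jamo-layer parameters: one embedding per jamo letter, plus the
# composition weights U^J, V^J, W^J and bias b^J from Section 3.2.
jamo_embedding = {j: rng.normal(size=d) for j in "ㄱㄷㅏㅆ∅"}  # toy subset of J
U_J = rng.normal(size=(d, d))
V_J = rng.normal(size=(d, d))
W_J = rng.normal(size=(d, d))
b_J = np.zeros(d)

def char_representation(c_h: str, c_v: str, c_t: str) -> np.ndarray:
    """h^c = tanh(U^J e^{c_h} + V^J e^{c_v} + W^J e^{c_t} + b^J)."""
    return np.tanh(U_J @ jamo_embedding[c_h]
                   + V_J @ jamo_embedding[c_v]
                   + W_J @ jamo_embedding[c_t]
                   + b_J)

h_gat = char_representation("ㄱ", "ㅏ", "ㅆ")  # representation of 갔
h_da  = char_representation("ㄷ", "ㅏ", "∅")   # representation of 다 (empty tail)
```

In the full model, each h^c is further concatenated with the character lookup embedding e^c, run through the character-level BiLSTM to obtain h^w, and finally combined with the word embedding e^w as input to the sentence-level BiLSTM.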
3.3 Unicode Decomposition

Our architecture requires dynamically extracting the jamo letters of any given Korean character. This is achieved by simple Unicode manipulation. For any Korean character c ∈ C with Unicode value U(c), let $\bar{U}(c) = U(c) - 44032$ and $T(c) = \bar{U}(c) \bmod 28$. Then the Unicode values U(c_h), U(c_v), and U(c_t) corresponding to the head consonant, vowel, and tail consonant are obtained by

$$U(c_h) = 1 + \left\lfloor \frac{\bar{U}(c)}{588} \right\rfloor + \texttt{0x10FF}$$
$$U(c_v) = 1 + \frac{(\bar{U}(c) - T(c)) \bmod 588}{28} + \texttt{0x1160}$$
$$U(c_t) = T(c) + \texttt{0x11A7}$$

where c_t is set to ∅ if T(c) = 0.
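These formulas translate directly into a few lines of code. The sketch below is ours for illustration (the paper's implementation instead uses the python-jamo toolkit, Section 4); the final assertion cross-checks the arithmetic against Python's built-in NFD normalization, which performs the same canonical Hangul decomposition.

```python
import unicodedata

def decompose(c: str):
    """Map a precomposed Hangul syllable c to (c_h, c_v, c_t) per Section 3.3.
    The tail is None when T(c) = 0, i.e., the empty letter ∅."""
    U_bar = ord(c) - 44032                      # Ū(c) = U(c) − 44032
    T = U_bar % 28                              # T(c) = Ū(c) mod 28
    head = chr(1 + U_bar // 588 + 0x10FF)       # U(c_h)
    vowel = chr(1 + ((U_bar - T) % 588) // 28 + 0x1160)  # U(c_v)
    tail = chr(T + 0x11A7) if T > 0 else None   # U(c_t); ∅ if T(c) = 0
    return head, vowel, tail

# 갔 → ㄱ (U+1100), ㅏ (U+1161), ㅆ (U+11BB); agrees with Unicode NFD.
assert "".join(decompose("갔")) == unicodedata.normalize("NFD", "갔")
assert decompose("다")[2] is None  # 다 has an empty tail
```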
3.4 Why Use Jamo Letters?

The most obvious benefit of using jamo letters is alleviating data sparsity by flattening the combinatorial space of Korean characters. We discuss some additional explicit benefits. First, jamo letters often indicate the syntactic properties of words. For example, a tail consonant ㅆ strongly implies that the word is a past-tense verb, as in 갔다 (went), 왔다 (came), and 했다 (did). Thus a jamo-level model can identify unseen verbs more effectively than word- or character-level models. Second, jamo letters dictate the sound of a character. For example, 갔 is pronounced as got because the head consonant ㄱ is associated with the sound g, the vowel ㅏ with o, and the tail consonant ㅆ with t. This is clearly critical for speech recognition and synthesis and has indeed been investigated in the speech community (Lee et al., 1994; Sakti et al., 2010). While speech processing is not our focus, the phonetic signals can capture useful lexical correlations (e.g., for onomatopoeic words).

4 Experiments

Data. We use the publicly available Korean treebank in the universal treebank version 2.0 (McDonald et al., 2013).[2] The dataset comes with a train/development/test split; data statistics are shown in Table 1. Since the test portion is significantly smaller than the dev portion, we report performance on both.

As expected, we observe severe data sparsity with words: 24,814 out of 31,060 elements in the vocabulary appear only once in the training data. On the dev set, about 57% of word types and 3% of character types are OOV. Upon Unicode decomposition, we obtain the following 48 jamo types, none of which is OOV in the dev set:

ㄱ ㄲ ㄳ ㄴ ㄵ ㄶ ㄷ ㄸ ㄹ ㄺ ㄻ ㄼ ㅀ ㅁ ㅂ ㅃ ㅄ ㅅ ㅆ ㅇ ㅈ ㅉ ㅊ ㅋ ㅌ ㅍ ㅎ ㅏ ㅐ ㅑ ㅒ ㅓ ㅔ ㅕ ㅖ ㅗ ㅘ ㅙ ㅚ ㅛ ㅜ ㅝ ㅞ ㅟ ㅠ ㅡ ㅢ ㅣ

Table 1: Treebank statistics. Upper: number of trees in the split. Lower: number of unit types in the training portion. For simplicity, we include non-Korean symbols (e.g., @, 正, a) as characters/jamos.

                         Training   Development   Test
  # projective trees        5,425           603    299
  # non-projective trees       12             0      0

          #        # Ko   Examples
  word    31,060      –
  char     1,772   1,315   @, 正, a
  jamo       500      48   ㄱ, ㄳ, ㄼ, ㅏ, ㅠ, ㅢ, @, 正, a

Implementation and baselines. We implement our jamo architecture using the DyNet library (Neubig et al., 2017) and plug it into the BiLSTM parser of Kiperwasser and Goldberg (2016).[3] For Korean syllable manipulation, we use the freely available python-jamo toolkit by Joshua Dong.[4] We train the parser for 30 epochs and use the dev portion for model selection. We compare our approach to the following baselines:

• McDonald13: a cross-lingual parser originally reported in McDonald et al. (2013).

• Yara: a beam-search transition-based parser of Rasooli and Tetreault (2015) based on the rich non-local features of Zhang and Nivre (2011). We use beam width 64 and 5-fold jackknifing on the training portion to provide POS tag features. We also report results with gold POS tags.

• K&G16: the basic BiLSTM parser of Kiperwasser and Goldberg (2016) without the sub-lexical architecture introduced in this work.

• Stack LSTM: a greedy transition-based parser based on stack LSTM representations. Dyer15 denotes the word-level variant (Dyer et al., 2015); Ballesteros15 denotes the character-level variant (Ballesteros et al., 2015).

For pre-trained word embeddings, we apply the spectral algorithm of Stratos et al. (2015) to a 2015 Korean Wikipedia dump to induce 285,933 embeddings of dimension 100.

Parsing accuracy. Table 2 shows the main result. The baseline test LAS of the original cross-lingual parser of McDonald13 is 55.85. Yara achieves 85.17 with predicted POS tags and 88.61 with gold POS tags. The basic BiLSTM model of K&G16 obtains 82.77 with pre-trained word embeddings (78.95 without). The stack LSTM parser is comparable to K&G16 at the word level.

[2] https://github.com/ryanmcd/uni-dep-tb
[3] https://github.com/elikip/bist-parser
[4] https://github.com/JDongian/python-jamo