A Sub-Character Architecture for Korean Language Processing

Karl Stratos
Toyota Technological Institute at Chicago
stratos@ttic.edu

Abstract

We introduce a novel sub-character architecture that exploits a unique compositional structure of the Korean language. Our method decomposes each character into a small set of primitive phonetic units called jamo letters from which character- and word-level representations are induced. The jamo letters divulge syntactic and semantic information that is difficult to access with conventional character-level units. They greatly alleviate the data sparsity problem, reducing the observation space to 1.6% of the original while increasing accuracy in our experiments. We apply our architecture to dependency parsing and achieve dramatic improvement over strong lexical baselines.

[Figure 1: Korean sentence "산을 갔다" (I went to the mountain) decomposed to words, characters, and jamos: 산을 갔다 → 산 · 을 · 갔 · 다 → ㅅㅏㄴ / ㅇㅡㄹ / ㄱㅏㅆ / ㄷㅏ∅.]

1 Introduction

Korean is generally recognized as a language isolate: that is, it has no apparent genealogical relationship with other languages (Song, 2006; Campbell and Mixco, 2007). A unique feature of the language is that each character is composed of a small, fixed set of basic phonetic units called jamo letters. Despite the important role jamo plays in encoding the syntactic and semantic information of words, it has been neglected in existing modern Korean processing algorithms. In this paper, we bridge this gap by introducing a novel compositional neural architecture that explicitly leverages the sub-character information.

Specifically, we perform Unicode decomposition on each Korean character to recover its underlying jamo letters and construct character- and word-level representations from these letters. See Figure 1 for an illustration of the decomposition. The decomposition is deterministic; this is a crucial departure from previous work that uses language-specific sub-character information such as radicals (graphical components of Chinese characters). The radical structure of a Chinese character does not follow any systematic process, requiring an incomplete dictionary mapping between characters and radicals to take advantage of this information (Sun et al., 2014; Yin et al., 2016). In contrast, our Unicode decomposition does not need any supervision and can extract correct jamo letters for all possible Korean characters.

Our jamo architecture is fully general and can be plugged into any Korean processing network. For a concrete demonstration of its utility, in this work we focus on dependency parsing. McDonald et al. (2013) note that "Korean emerges as a very clear outlier" in their cross-lingual parsing experiments on the universal treebank, implying a need to tailor a model for this language isolate. Because of its compositional morphology, Korean suffers extreme data sparsity at the word level: 2,703 out of 4,698 word types (> 57%) in the held-out portion of our treebank are OOV. This makes the language challenging for simple lexical parsers even when augmented with a large set of pre-trained word representations.

While such data sparsity can also be alleviated by incorporating more conventional character-level information, we show that incorporating jamo is an effective and economical new approach to combating the sparsity problem for Korean. In experiments, we decisively improve the LAS of the lexical BiLSTM parser of Kiperwasser and Goldberg (2016) from 82.77 to 91.46 while reducing the size of the input space by 98.4% when we replace words with jamos. As a point of reference, a strong feature-rich parser using gold POS tags obtains 88.61.

To summarize, we make the following contributions.

• To our knowledge, this is the first work that leverages jamo in end-to-end neural Korean processing. To this end, we develop a novel sub-character architecture based on deterministic Unicode decomposition.

• We perform extensive experiments on dependency parsing to verify the utility of the approach. We show a clear performance boost with a drastically smaller set of parameters. Our final model outperforms strong baselines by a large margin.

• We release an implementation of our jamo architecture, which can be plugged into any Korean processing network.[1]

[1] https://github.com/karlstratos/koreannet
2 Related Work

We make a few additional remarks on related work to better situate this paper. Our work follows the successful line of research on incorporating sub-lexical information into neural models. Various character-based architectures have been proposed: for instance, Ma and Hovy (2016) and Kim et al. (2016) use CNNs over characters, whereas Lample et al. (2016) and Ballesteros et al. (2015) use bidirectional LSTMs (BiLSTMs). Both approaches have been shown to be profitable; we employ a BiLSTM-based approach.

Many previous works have also considered morphemes to augment lexical models (Luong et al., 2013; Botha and Blunsom, 2014; Cotterell et al., 2016). Sub-character models are substantially rarer; an extreme case is considered by Gillick et al. (2016), who process text as a sequence of bytes. We believe that such byte-level models are too general and that there are opportunities to exploit the natural sub-character structure of certain languages such as Korean and Chinese.

There exists a line of work on exploiting the graphical components of Chinese characters called radicals (Sun et al., 2014; Yin et al., 2016). For instance, 足 (foot) is the radical of 跑 (run). While related, our work on Korean is distinguished in critical ways and should not be thought of as just an extension to another language. First, as mentioned earlier, the compositional structure is fundamentally different between Chinese and Korean: the mapping between radicals and characters in Chinese is nondeterministic and can only be loosely approximated by an incomplete dictionary, whereas the mapping between jamos and Korean characters is deterministic (Section 3.1), allowing for systematic decomposition of all possible Korean characters. Second, the previous work on Chinese radicals was concerned with learning word embeddings; we develop an end-to-end compositional model for a downstream task: parsing.

3 Method

3.1 Jamo Structure of the Korean Language

Let W denote the set of word types and C the set of character types. In many languages, a character c ∈ C is the most basic unit that is meaningful. In Korean, each character is further composed of a small fixed set of phonetic units called jamo letters J, where |J| = 51. The jamo letters are categorized as head consonants J_h, vowels J_v, or tail consonants J_t. The composition is completely systematic: given any character c ∈ C, there exist c_h ∈ J_h, c_v ∈ J_v, and c_t ∈ J_t such that their composition yields c. Conversely, any c_h ∈ J_h, c_v ∈ J_v, and c_t ∈ J_t can be composed to yield a valid character c ∈ C.

As an example, consider the word 갔다 (went). It is composed of two characters, 갔, 다 ∈ C, each of which is in turn composed of three jamo letters:

• 갔 ∈ C is composed of ㄱ ∈ J_h, ㅏ ∈ J_v, and ㅆ ∈ J_t.

• 다 ∈ C is composed of ㄷ ∈ J_h, ㅏ ∈ J_v, and an empty letter ∅ ∈ J_t.

The tail consonant can be empty; we assume a special symbol ∅ ∈ J_t to denote an empty letter. Figure 1 illustrates the decomposition of a Korean sentence down to jamo letters.

Note that the number of possible characters is combinatorial in the number of jamo letters, loosely upper bounded by 51³ = 132,651. This upper bound is loose because certain combinations are invalid: for instance, ㅁ ∈ J_h ∩ J_t but ㅁ ∉ J_v, whereas ㅏ ∈ J_v but ㅏ ∉ J_h ∪ J_t. The combinatorial nature of Korean characters motivates the compositional architecture below. For completeness, we also describe the entire forward pass of the transition-based BiLSTM parser of Kiperwasser and Goldberg (2016) that we use in our experiments.
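The deterministic composition described above can be checked directly with standard Hangul Unicode arithmetic: every valid (head, vowel, tail) index triple maps to exactly one precomposed syllable. The sketch below is ours, added for illustration; it is not code from the paper, and the index conventions are the standard Unicode Hangul orderings.

```python
# Minimal sketch (ours, not from the paper): the deterministic mapping from
# jamo index triples to precomposed Hangul syllables.
# Indices: L = head consonant (0-18), V = vowel (0-20), T = tail (0-27, 0 = empty).

def compose(L: int, V: int, T: int = 0) -> str:
    """Compose jamo indices into a single precomposed Hangul character."""
    assert 0 <= L < 19 and 0 <= V < 21 and 0 <= T < 28
    return chr(0xAC00 + 588 * L + 28 * V + T)

# 갔 = ㄱ (L=0) + ㅏ (V=0) + ㅆ (T=20); 다 = ㄷ (L=3) + ㅏ (V=0) + ∅ (T=0).
assert compose(0, 0, 20) == "갔"
assert compose(3, 0) == "다"
# Every triple is valid at this positional level: 19 * 21 * 28 = 11,172 syllables.
```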
3.2 Jamo Architecture

The parameters associated with the jamo layer are

• an embedding e^l ∈ R^d for each letter l ∈ J
• U^J, V^J, W^J ∈ R^{d×d} and b^J ∈ R^d

Given a Korean character c ∈ C, we perform Unicode decomposition (Section 3.3) to recover the underlying jamo letters c_h, c_v, c_t ∈ J. We compose the letters to induce a representation of c as

$$h^c = \tanh\left(U^J e^{c_h} + V^J e^{c_v} + W^J e^{c_t} + b^J\right)$$

This representation is then concatenated with a character-level lookup embedding, and the result is fed into an LSTM to produce a word representation. We use an LSTM (Hochreiter and Schmidhuber, 1997) simply as a mapping φ : R^{d_1} × R^{d_2} → R^{d_2} that takes an input vector x and a state vector h and outputs a new state vector h′ = φ(x, h). The parameters associated with this layer are

• an embedding e^c ∈ R^{d′} for each c ∈ C
• a forward LSTM φ^f : R^{d+d′} × R^d → R^d
• a backward LSTM φ^b : R^{d+d′} × R^d → R^d
• U^C ∈ R^{d×2d} and b^C ∈ R^d

Given a word w ∈ W and its character sequence c_1 ... c_m ∈ C, we compute

$$f^c_i = \phi^f\left([h^{c_i}; e^{c_i}],\; f^c_{i-1}\right) \quad \forall i = 1 \ldots m$$
$$b^c_i = \phi^b\left([h^{c_i}; e^{c_i}],\; b^c_{i+1}\right) \quad \forall i = m \ldots 1$$

and induce a representation of w as

$$h^w = \tanh\left(U^C [f^c_m; b^c_1] + b^C\right)$$

Lastly, this representation is concatenated with a word-level lookup embedding (which can be initialized with pre-trained word embeddings), and the result is fed into a BiLSTM network. The parameters associated with this layer are

• an embedding e^w ∈ R^{d_W} for each w ∈ W
• a two-layer BiLSTM Φ that maps h_1 ... h_n ∈ R^{d+d_W} to z_1 ... z_n ∈ R^{d*}
• a feedforward network for predicting transitions

Given a sentence w_1 ... w_n ∈ W, the final d*-dimensional word representations are given by

$$(z_1, \ldots, z_n) = \Phi\left([h^{w_1}; e^{w_1}], \ldots, [h^{w_n}; e^{w_n}]\right)$$

The parser then uses the feedforward network to greedily predict transitions based on the words that are active in the system. The model is trained end-to-end by optimizing a max-margin objective. Since this part is not a contribution of this paper, we refer to Kiperwasser and Goldberg (2016) for details.

By setting the embedding dimension of jamos d, characters d′, or words d_W to zero, we can configure the network to use any combination of these units. We report these experiments in Section 4.
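To make the jamo composition concrete, here is a minimal sketch of the jamo layer in isolation, written with NumPy rather than the DyNet library used in the paper's released implementation (Section 4). The parameter values, the toy jamo subset, and the dimension d = 64 are illustrative assumptions, not values from the paper.

```python
import numpy as np

d = 64  # jamo embedding dimension (illustrative choice, not from the paper)
rng = np.random.default_rng(0)

# Hypothetical jamo-layer parameters: one embedding per jamo letter, plus the
# composition weights U^J, V^J, W^J and bias b^J from Section 3.2.
jamo_embedding = {j: rng.normal(size=d) for j in "ㄱㄷㅏㅆ∅"}  # toy subset of J
U_J = rng.normal(size=(d, d))
V_J = rng.normal(size=(d, d))
W_J = rng.normal(size=(d, d))
b_J = np.zeros(d)

def char_representation(c_h: str, c_v: str, c_t: str) -> np.ndarray:
    """h^c = tanh(U^J e^{c_h} + V^J e^{c_v} + W^J e^{c_t} + b^J)."""
    return np.tanh(U_J @ jamo_embedding[c_h]
                   + V_J @ jamo_embedding[c_v]
                   + W_J @ jamo_embedding[c_t]
                   + b_J)

h_gat = char_representation("ㄱ", "ㅏ", "ㅆ")  # representation of 갔
h_da  = char_representation("ㄷ", "ㅏ", "∅")   # representation of 다 (empty tail)
```

In the full model, each h^c is further concatenated with the character lookup embedding e^c, run through the character-level BiLSTM to obtain h^w, and finally combined with the word embedding e^w as input to the sentence-level BiLSTM.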
3.3 Unicode Decomposition

Our architecture requires dynamically extracting the jamo letters of any given Korean character. This is achieved by simple Unicode manipulation. For any Korean character c ∈ C with Unicode value U(c), let $\bar{U}(c) = U(c) - 44032$ and $T(c) = \bar{U}(c) \bmod 28$. Then the Unicode values U(c_h), U(c_v), and U(c_t) corresponding to the head consonant, vowel, and tail consonant are obtained by

$$U(c_h) = 1 + \left\lfloor \frac{\bar{U}(c)}{588} \right\rfloor + \texttt{0x10FF}$$
$$U(c_v) = 1 + \frac{(\bar{U}(c) - T(c)) \bmod 588}{28} + \texttt{0x1160}$$
$$U(c_t) = T(c) + \texttt{0x11A7}$$

where c_t is set to ∅ if T(c) = 0.
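These formulas translate directly into a few lines of code. The sketch below is ours for illustration (the paper's implementation instead uses the python-jamo toolkit, Section 4); the final assertion cross-checks the arithmetic against Python's built-in NFD normalization, which performs the same canonical Hangul decomposition.

```python
import unicodedata

def decompose(c: str):
    """Map a precomposed Hangul syllable c to (c_h, c_v, c_t) per Section 3.3.
    The tail is None when T(c) = 0, i.e., the empty letter ∅."""
    U_bar = ord(c) - 44032                      # Ū(c) = U(c) − 44032
    T = U_bar % 28                              # T(c) = Ū(c) mod 28
    head = chr(1 + U_bar // 588 + 0x10FF)       # U(c_h)
    vowel = chr(1 + ((U_bar - T) % 588) // 28 + 0x1160)  # U(c_v)
    tail = chr(T + 0x11A7) if T > 0 else None   # U(c_t); ∅ if T(c) = 0
    return head, vowel, tail

# 갔 → ㄱ (U+1100), ㅏ (U+1161), ㅆ (U+11BB); agrees with Unicode NFD.
assert "".join(decompose("갔")) == unicodedata.normalize("NFD", "갔")
assert decompose("다")[2] is None  # 다 has an empty tail
```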
3.4 Why Use Jamo Letters?

The most obvious benefit of using jamo letters is alleviating data sparsity by flattening the combinatorial space of Korean characters. We discuss some additional explicit benefits. First, jamo letters often indicate the syntactic properties of words. For example, a tail consonant ㅆ strongly implies that the word is a past-tense verb, as in 갔다 (went), 왔다 (came), and 했다 (did). Thus a jamo-level model can identify unseen verbs more effectively than word- or character-level models. Second, jamo letters dictate the sound of a character. For example, 갔 is pronounced as got because the head consonant ㄱ is associated with the sound g, the vowel ㅏ with o, and the tail consonant ㅆ with t. This is clearly critical for speech recognition and synthesis and has indeed been investigated in the speech community (Lee et al., 1994; Sakti et al., 2010). While speech processing is not our focus, the phonetic signals can capture useful lexical correlations (e.g., for onomatopoeic words).

4 Experiments

Data. We use the publicly available Korean treebank in the universal treebank version 2.0 (McDonald et al., 2013).[2] The dataset comes with a train/development/test split; data statistics are shown in Table 1. Since the test portion is significantly smaller than the dev portion, we report performance on both.

As expected, we observe severe data sparsity with words: 24,814 out of 31,060 elements in the vocabulary appear only once in the training data. On the dev set, about 57% of word types and 3% of character types are OOV. Upon Unicode decomposition, we obtain the following 48 jamo types, none of which is OOV in the dev set:

ㄱ ㄲ ㄳ ㄴ ㄵ ㄶ ㄷ ㄸ ㄹ ㄺ ㄻ ㄼ ㅀ ㅁ ㅂ ㅃ ㅄ ㅅ ㅆ ㅇ ㅈ ㅉ ㅊ ㅋ ㅌ ㅍ ㅎ ㅏ ㅐ ㅑ ㅒ ㅓ ㅔ ㅕ ㅖ ㅗ ㅘ ㅙ ㅚ ㅛ ㅜ ㅝ ㅞ ㅟ ㅠ ㅡ ㅢ ㅣ

Table 1: Treebank statistics. Upper: number of trees in the split. Lower: number of unit types in the training portion. For simplicity, we include non-Korean symbols (e.g., @, 正, a) as characters/jamos.

                         Training   Development   Test
  # projective trees        5,425           603    299
  # non-projective trees       12             0      0

          #        # Ko   Examples
  word    31,060      –
  char     1,772   1,315   @, 正, a
  jamo       500      48   ㄱ, ㄳ, ㄼ, ㅏ, ㅠ, ㅢ, @, 正, a

Implementation and baselines. We implement our jamo architecture using the DyNet library (Neubig et al., 2017) and plug it into the BiLSTM parser of Kiperwasser and Goldberg (2016).[3] For Korean syllable manipulation, we use the freely available python-jamo toolkit by Joshua Dong.[4] We train the parser for 30 epochs and use the dev portion for model selection. We compare our approach to the following baselines:

• McDonald13: a cross-lingual parser originally reported in McDonald et al. (2013).

• Yara: a beam-search transition-based parser of Rasooli and Tetreault (2015) based on the rich non-local features of Zhang and Nivre (2011). We use beam width 64 and 5-fold jackknifing on the training portion to provide POS tag features. We also report results with gold POS tags.

• K&G16: the basic BiLSTM parser of Kiperwasser and Goldberg (2016) without the sub-lexical architecture introduced in this work.

• Stack LSTM: a greedy transition-based parser based on stack LSTM representations. Dyer15 denotes the word-level variant (Dyer et al., 2015); Ballesteros15 denotes the character-level variant (Ballesteros et al., 2015).

For pre-trained word embeddings, we apply the spectral algorithm of Stratos et al. (2015) to a 2015 Korean Wikipedia dump to induce 285,933 embeddings of dimension 100.

Parsing accuracy. Table 2 shows the main result. The baseline test LAS of the original cross-lingual parser of McDonald13 is 55.85. Yara achieves 85.17 with predicted POS tags and 88.61 with gold POS tags. The basic BiLSTM model of K&G16 obtains 82.77 with pre-trained word embeddings (78.95 without). The stack LSTM parser is comparable to K&G16 at the word level.

[2] https://github.com/ryanmcd/uni-dep-tb
[3] https://github.com/elikip/bist-parser
[4] https://github.com/JDongian/python-jamo