CompuTerm 2004 Poster Session - 3rd International Workshop on Computational Terminology

Term Extraction from Korean Corpora via Japanese

Atsushi Fujii, Tetsuya Ishikawa
Graduate School of Library, Information and Media Studies, University of Tsukuba
1-2 Kasuga, Tsukuba 305-8550, Japan
{fujii,ishikawa}@slis.tsukuba.ac.jp

Jong-Hyeok Lee
Division of Electrical and Computer Engineering, Pohang University of Science and Technology
Advanced Information Technology Research Center
San 31 Hyoja-dong Nam-gu, Pohang 790-784, Republic of Korea
jhlee@postech.ac.kr

Abstract

This paper proposes a method to extract foreign words, such as technical terms and proper nouns, from Korean corpora and produce a Japanese-Korean bilingual dictionary. Specific words have been imported into multiple countries simultaneously, if they are influential across cultures, and the pronunciation of a source word is similar in the different languages. Our method extracts words in Korean corpora that are phonetically similar to Katakana words, which can easily be identified in Japanese corpora. We also show the effectiveness of our method by means of experiments.

1 Introduction

Reflecting the rapid growth in science and technology, new words are progressively being created. However, due to the limitations of manual compilation, new words are often out-of-dictionary words, and they decrease the quality of human language technology, such as natural language processing, information retrieval, machine translation, and speech recognition. To resolve this problem, a number of automatic methods to extract monolingual and bilingual lexicons from corpora have been proposed for various languages.

In this paper, we focus on extracting foreign words (or loanwords) in Korean. Technical terms and proper nouns are often imported from foreign languages and are spelled out (or transliterated) in the Korean alphabet, called Hangul. A similar trend can be observed in Japanese and Chinese. In Japanese, foreign words are spelled out in a special phonetic alphabet (or phonogram) called Katakana. Thus, foreign words can be extracted from Japanese corpora with high accuracy, because Katakana characters are seldom used to write conventional Japanese words, except for proper nouns.

However, extracting foreign words from Korean corpora is more difficult, because in Korean both conventional and foreign words are written in Hangul characters. This problem remains a challenging issue in computational linguistics research.

It is often the case that specific words have been imported into multiple countries simultaneously, because the source words (or concepts) are usually influential across cultures. Thus, it is feasible that a large number of foreign words in Korean are also foreign words in Japanese. In addition, the foreign words in Korean and Japanese that correspond to the same source word are phonetically similar. For example, the English word "system" has been imported into both Japanese and Korean; the romanized words are /sisutemu/ in Japanese and /siseutem/ in Korean.

Motivated by these assumptions, we propose a method to extract foreign words from Korean corpora by means of Japanese. In brief, our method performs as follows. First, foreign words in Japanese are collected, for which Katakana words in corpora and existing lexicons can be used. Second, the words that are phonetically similar to the Katakana words are extracted from Korean corpora. Finally, the extracted Korean words are compiled in a lexicon with the corresponding Japanese words.

In summary, our method can extract foreign words in Korean and produce a Japanese-Korean bilingual lexicon in a single framework.

2 Methodology

2.1 Overview

Figure 1 exemplifies our extraction method, which produces a Japanese-Korean bilingual lexicon using a Korean corpus and a Japanese corpus and/or lexicon. The Japanese and Korean corpora do not have to be parallel or comparable. However, it is desirable that both corpora are associated with the same domain. For the Japanese resource, the corpus and the lexicon can be used alternatively or together. Note that compiling a Japanese monolingual lexicon is less expensive than compiling a bilingual lexicon. In addition, new Katakana words can easily be extracted from a number of on-line resources, such as the World Wide Web. Thus, the use of Japanese lexicons does not decrease the utility of our method.

First, we collect Katakana words from the Japanese resources. This can be performed systematically by means of a Japanese character code, such as EUC-JP or SJIS.

Second, we represent the Korean corpus and the Japanese Katakana words in the Roman alphabet (i.e., romanization), so that the phonetic similarity can easily be computed. However, we use different romanization methods for Japanese and Korean.

Third, we extract candidates of foreign words from the romanized Korean corpus. An alternative method is to first perform morphological analysis on the corpus, extract candidate words based on morphemes and parts-of-speech, and romanize the extracted words. Our general model does not constrain which method should be used in this third step. However, because the accuracy of analysis often decreases for the new words to be extracted, we experimentally adopt the former method.

Finally, we compute the phonetic similarity between each combination of the romanized Hangul and Katakana words, and select the combinations whose score is above a predefined threshold. As a result, we can obtain a Japanese-Korean bilingual lexicon consisting of foreign words.

It may be argued that English lexicons or corpora could be used as the source information instead of Japanese resources. However, because not all English words have been imported into Korean, the extraction accuracy would decrease due to extraneous words.

Figure 1: Overview of our extraction method.

2.2 Romanizing Japanese

Because the number of phones represented by Japanese Katakana characters is limited, we manually produced the correspondence between each phone and its Roman representation. The numbers of Katakana characters and combined phones are 73 and 109, respectively. We also defined a symbol to represent a long vowel. In Japanese, the Hepburn and Kunrei systems are commonly used for romanization purposes. We use the Hepburn system, because its representation is more similar to that of Korean than the Kunrei system's.

However, specific Japanese phones, such as /ti/, do not exist in Korean. Thus, to adapt the Hepburn system to Korean, /ti/ and /tu/ are converted to /chi/ and /chu/, respectively.

2.3 Romanizing Korean

The number of Korean Hangul characters is much greater than that of Japanese Katakana characters. Each Hangul character is a combination of components, and the pronunciation of each character is determined by its components. In Korean, there are three component types: the first consonant, the vowel, and the last consonant. The numbers of these components are 19, 21, and 27, respectively. The last consonant is optional. Thus, the number of combined characters is 11,172. However, to transliterate imported words, the official guideline suggests that only seven consonants be used as the last consonant.

In EUC-KR, which is a standard coding system for Korean text, 2,350 common characters are coded independently of their pronunciation. Therefore, if we targeted corpora represented in EUC-KR, each of the 2,350 characters would have to be mapped to its Roman representation. We instead use Unicode, in which Hangul characters are sorted according to their pronunciation. Figure 2 depicts a fragment of the Unicode table for Korean, in which each line corresponds to a combination of the first consonant and the vowel, and each column corresponds to the last consonant. The number of columns is 28, i.e., the number of last consonants plus the case in which no last consonant is used. From this figure, the following rules can be found:

• the first consonant changes every 21 lines, which corresponds to the number of vowels,
• the vowel changes every line (i.e., every 28 characters) and repeats every 21 lines,
• the last consonant changes every column.

Based on these rules, each character and its pronunciation can be identified from the three component types. Thus, we manually mapped only the 68 components to the Roman alphabet.

Figure 2: A fragment of the Unicode table for Korean Hangul characters.

We use the official romanization system for Korean, but specific Korean phones are adapted to Japanese. For example, /j/ and /l/ are converted to /z/ and /r/, respectively. It should be noted that the adaptation is not invertible and thus is needed in both the J-to-K and K-to-J directions.

For example, the English word "cheese", which has been imported into both Korean and Japanese as a foreign word, is romanized as /chiseu/ in Korean and /ti:zu/ in Japanese. Here, /:/ is the symbol representing a Japanese long vowel. Using the adaptation, these expressions are converted to /chizu/ and /chi:zu/, respectively, which are more similar to each other than the original strings.
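The Unicode regularities observed in Section 2.3 reduce Hangul decomposition to integer arithmetic. The following sketch, assuming the precomposed-syllable block U+AC00..U+D7A3, recovers the three component indices of a syllable and maps them through a deliberately tiny romanization table; the table entries here are an illustrative subset, not the paper's full hand-built 68-component mapping.

```python
# Decompose precomposed Hangul syllables using the chart regularities the
# paper notes: the first consonant changes every 21 * 28 = 588 code points,
# the vowel every 28, and the last consonant every 1.

HANGUL_BASE = 0xAC00
NUM_FIRSTS, NUM_VOWELS, NUM_FINALS = 19, 21, 28  # 28 = 27 finals + "no final"

def decompose(syllable: str) -> tuple[int, int, int]:
    """Return (first consonant, vowel, last consonant) indices."""
    index = ord(syllable) - HANGUL_BASE
    if not 0 <= index < NUM_FIRSTS * NUM_VOWELS * NUM_FINALS:
        raise ValueError(f"not a precomposed Hangul syllable: {syllable!r}")
    first = index // (NUM_VOWELS * NUM_FINALS)
    vowel = (index % (NUM_VOWELS * NUM_FINALS)) // NUM_FINALS
    last = index % NUM_FINALS
    return first, vowel, last

# Illustrative subset of a component-to-Roman mapping (only the entries
# needed for the demo word below).
FIRSTS = {9: "s", 16: "t"}            # 9 = ㅅ, 16 = ㅌ
VOWELS = {5: "e", 18: "eu", 20: "i"}  # 5 = ㅔ, 18 = ㅡ, 20 = ㅣ
FINALS = {0: "", 16: "m"}             # 0 = no final, 16 = ㅁ

def romanize(word: str) -> str:
    out = []
    for ch in word:
        f, v, l = decompose(ch)
        out.append(FIRSTS[f] + VOWELS[v] + FINALS[l])
    return "".join(out)
```

For instance, `romanize("시스템")` yields `"siseutem"`, the paper's romanization of the Korean borrowing of "system".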
2.4 Extracting term candidates from Korean corpora

To extract candidates of foreign words from a Korean corpus, we first extract phrases. This can be performed systematically, because Korean sentences are segmented on a phrase-by-phrase basis.

Second, because foreign words are usually nouns, we use hand-crafted rules to remove post-position suffixes (e.g., Josa) and extract nouns from the phrases.

Third, we discard nouns that include last consonants not recommended for transliteration purposes in the official guideline. Although the guideline suggests other rules for transliteration, existing foreign words in Korean are not necessarily regulated by these rules.

Finally, we consult a dictionary to discard existing Korean words, because our purpose is to extract new words. For this purpose, we experimentally use the dictionary for the SuperMorph-K morphological analyzer (http://www.omronsoft.com/), which includes approximately 50,000 Korean words.

2.5 Computing Similarity

Given romanized Japanese and Korean words, we compute the similarity between the two strings and select the pairs whose score is above a threshold as translations. We use a DP (dynamic programming) matching method to identify the number of differences (i.e., insertions, deletions, and substitutions) between two strings on an alphabet-by-alphabet basis.

In principle, if two strings are associated with a smaller number of differences, the similarity between them becomes greater. For this purpose, a Dice-style coefficient can be used. However, while the use of consonants in transliteration is usually the same across languages, the use of vowels can vary significantly depending on the language. For example, the English word "system" is romanized as /sisutemu/ and /siseutem/ in Japanese and Korean, respectively. Thus, differences in consonants between two strings should be penalized more heavily than differences in vowels.

In view of the above discussion, we compute the similarity between two romanized words by Equation (1):

  similarity = 1 − 2·(α·dc + dv) / (α·c + v)    (1)

Here, dc and dv denote the numbers of differences in consonants and vowels, respectively, and α is a parametric constant used to control the importance of the consonants. We experimentally set α = 2. In addition, c and v denote the numbers of all consonants and vowels in the two strings. The similarity ranges from 0 to 1.

3 Experimentation

3.1 Evaluating Extraction Accuracy

We collected 111,166 Katakana words (word types) from multiple Japanese lexicons, most of which were technical term dictionaries.

We used the Korean document set in the NTCIR-3 Cross-lingual Information Retrieval test collection (http://research.nii.ac.jp/ntcir/index-en.html). This document set consists of 66,146 newspaper articles of the Korean Economic Daily published in 1994. We randomly selected 50 newspaper articles and used them for our experiment. We asked a graduate student, not an author of this paper, to identify the foreign words in the target text. As a result, 124 foreign word types (205 word tokens) were identified, fewer than we had expected. This was partially because newspaper articles generally do not contain as many foreign words as technical publications.

We manually classified the identified words and used only the words that were imported into both Japan and Korea from other languages. We discarded the foreign words in Korean that were imported from Japan, because these words are often spelled out in non-Katakana characters, such as Kanji (Chinese characters). A sample of these words includes "Tokyo (the capital of Japan)", "Heisei (the current Japanese era name)", and "enko (personal connection)". In addition, we discarded the foreign proper nouns for which the human subject was not able to identify the source word. As a result, we obtained 67 target word types. Examples of the original English words for these target words are as follows:

  digital, group, dollar, re-engineering, line, polyester, Asia, service, class, card, computer, brand, liter, hotel.

Thus, our method can potentially be applied to roughly a half of the foreign words in Korean text.

We used the Japanese words to extract plausible foreign words from the target Korean corpus. We first romanized the corpus and extracted nouns by removing post-position suffixes. As a result, we obtained 3,106 words, including all of the 67 target words. By discarding the words in the dictionary for SuperMorph-K, 958 words, including 59 target words, remained.

For each of the remaining 958 words, we computed the similarity with each of the 111,166 Japanese words. For evaluation purposes, we varied a threshold for the similarity and investigated the relation between precision and recall. Recall is the ratio of the number of target foreign words extracted by our method to the total number of target foreign words.
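As a minimal sketch of the ranking step, Equation (1) in Section 2.5 can be computed with a standard edit-distance DP in which a consonant difference costs α (= 2) and a vowel difference costs 1, so that the DP directly returns α·dc + dv. Two details are assumptions on my part rather than specifics from the paper: a consonant-vowel mismatch is charged the heavier (consonant) penalty, and the long-vowel symbol /:/ is treated as a vowel.

```python
# Weighted edit distance realizing Equation (1):
#   similarity = 1 - 2*(alpha*dc + dv) / (alpha*c + v)

ALPHA = 2
VOWEL_SYMBOLS = set("aeiou:")  # ':' is the long-vowel mark (assumed a vowel)

def weight(ch: str) -> int:
    return 1 if ch in VOWEL_SYMBOLS else ALPHA

def similarity(s: str, t: str) -> float:
    m, n = len(s), len(t)
    # d[i][j]: minimum weighted cost of insertions, deletions, and
    # substitutions turning s[:i] into t[:j] (alphabet-by-alphabet).
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + weight(s[i - 1])
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + weight(t[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # Assumption: a mismatch takes the heavier of the two penalties.
            sub = 0 if s[i - 1] == t[j - 1] else max(weight(s[i - 1]), weight(t[j - 1]))
            d[i][j] = min(d[i - 1][j] + weight(s[i - 1]),   # delete from s
                          d[i][j - 1] + weight(t[j - 1]),   # insert from t
                          d[i - 1][j - 1] + sub)            # substitute
    denom = sum(weight(ch) for ch in s + t)  # alpha*c + v over both strings
    return 1 - 2 * d[m][n] / denom
```

For the paper's "system" example, `similarity("sisutemu", "siseutem")` comes out to 1 − 4/24 ≈ 0.83: the optimal alignment differs only in two vowels (dv = 2, dc = 0), and the weighted length α·c + v over both strings is 24.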
Precision is the ratio of the number of target foreign words extracted by our method to the total number of words obtained by our method.

Table 1 shows the precision and recall for different methods. While we varied the threshold for the similarity, we also varied the number of Korean words corresponding to a single Katakana word (N). By decreasing the value of the threshold and increasing the number of words extracted, the recall can be improved, but the precision decreases. As Table 1 shows, the precision and recall are in an extreme trade-off relation. For example, when the recall was 69.5%, the precision was only 1.2%.

Table 1: Precision/Recall for term extraction.

        Threshold for similarity
        >0.9       >0.7       >0.5
N=1     50.0/8.5   12.7/40.7  4.1/47.5
N=10    50.0/8.5   7.4/47.5   1.2/69.5

We manually analyzed the words that were not extracted by our method. Of the 59 target words, 12 compound words consisting of both conventional and foreign words were not extracted. However, our method did extract compound words consisting of only foreign words. In addition, the three words that did not have counterparts in the input Japanese words were not extracted.

3.2 Application-Oriented Evaluation

During the first experiment, we determined a specific threshold value for the similarity between Katakana and Hangul words and selected the pairs whose similarity was above the threshold. As a result, we obtained 667 Korean words, which were used to enhance the dictionary for the SuperMorph-K morphological analyzer.

We performed morphological analysis on the 50 articles used in the first experiment, which included 1,213 sentences and 9,557 word tokens. We then investigated the degree to which the analytical accuracy is improved by means of the additional dictionary. Here, accuracy is the ratio of the number of correct word segmentations to the total number of segmentations generated by SuperMorph-K. The same human subject as in the first experiment identified the correct word segmentations for the input articles.

First, we focused on the accuracy of segmenting foreign words. The accuracy was improved from 75.8% to 79.8% by means of the additional dictionary. The accuracy for all words changed from 94.6% to 94.8%.

In summary, the additional dictionary was effective for analyzing foreign words and had no side effect on the overall accuracy. At the same time, we concede that we need larger-scale experiments to draw firmer conclusions.

4 Related Work

A number of corpus-based methods to extract bilingual lexicons have been proposed (Smadja et al., 1996). In general, these methods use statistics obtained from a parallel or comparable bilingual corpus and extract word or phrase pairs that are strongly associated with each other. In contrast, our method uses a monolingual Korean corpus and a Japanese lexicon independent of the corpus, which can be obtained more easily than parallel or comparable bilingual corpora.

Jeong et al. (1999) and Oh and Choi (2001) independently explored statistical approaches to detect foreign words in Korean text. Although the detection accuracy is reasonably high, these methods require a training corpus in which conventional and foreign words are annotated. Our approach does not require annotated corpora, but its detection accuracy is not high enough, as shown in Section 3.1. A combination of both approaches is expected to compensate for the drawbacks of each.

5 Conclusion

We proposed a method to extract foreign words, such as technical terms and proper nouns, from Korean corpora and produce a Japanese-Korean bilingual dictionary. Specific words, which have been imported into multiple countries, are usually spelled out in special phonetic alphabets, such as Katakana in Japanese and Hangul in Korean.

Because extracting foreign words spelled out in Katakana from Japanese lexicons and corpora can be performed with high accuracy, our method extracts words in Korean corpora that are phonetically similar to Japanese Katakana words. Our method does not require parallel or comparable bilingual corpora, nor human annotation of such corpora.

We also performed experiments in which we extracted foreign words from Korean newspaper articles and used the resultant dictionary for morphological analysis. We found that our method did not correctly extract compound Korean words consisting of both conventional and foreign words. Future work includes larger-scale experiments to further investigate the effectiveness of our method.

References

Kil Soon Jeong, Sung Hyon Myaeng, Jae Sung Lee, and Key-Sun Choi. 1999. Automatic identification and back-transliteration of foreign words for information retrieval. Information Processing & Management, 35:523-540.

Jong-Hoon Oh and Key-Sun Choi. 2001. Automatic extraction of transliterated foreign words using hidden Markov model. In Proceedings of ICCPOL-2001, pages 433-438.

Frank Smadja, Kathleen R. McKeown, and Vasileios Hatzivassiloglou. 1996. Translating collocations for bilingual lexicons: A statistical approach. Computational Linguistics, 22(1):1-38.