CompuTerm 2004 Poster Session - 3rd International Workshop on Computational Terminology 71
Term Extraction from Korean Corpora via Japanese

Atsushi Fujii, Tetsuya Ishikawa
Graduate School of Library, Information and Media Studies
University of Tsukuba
1-2 Kasuga, Tsukuba 305-8550, Japan
{fujii,ishikawa}@slis.tsukuba.ac.jp

Jong-Hyeok Lee
Division of Electrical and Computer Engineering,
Pohang University of Science and Technology,
Advanced Information Technology Research Center
San 31 Hyoja-dong Nam-gu, Pohang 790-784, Republic of Korea
jhlee@postech.ac.kr

Abstract

This paper proposes a method to extract foreign words, such as technical terms and proper nouns, from Korean corpora and produce a Japanese-Korean bilingual dictionary. Specific words have been imported into multiple countries simultaneously, if they are influential across cultures. The pronunciation of a source word is similar in different languages. Our method extracts words in Korean corpora that are phonetically similar to Katakana words, which can easily be identified in Japanese corpora. We also show the effectiveness of our method by means of experiments.

1 Introduction

Reflecting the rapid growth in science and technology, new words have progressively been created. However, due to the limitations of manual compilation, new words are often out-of-dictionary words and decrease the quality of human language technology, such as natural language processing, information retrieval, machine translation, and speech recognition. To resolve this problem, a number of automatic methods to extract monolingual and bilingual lexicons from corpora have been proposed for various languages.

In this paper, we focus on extracting foreign words (or loanwords) in Korean. Technical terms and proper nouns are often imported from foreign languages and are spelled out (or transliterated) in the Korean alphabet system called Hangul. A similar trend can be observed in Japanese and Chinese. In Japanese, foreign words are spelled out in a special phonetic alphabet (or phonogram) called Katakana. Thus, foreign words can be extracted from Japanese corpora with high accuracy, because Katakana characters are seldom used to write conventional Japanese words, except for proper nouns.

However, extracting foreign words from Korean corpora is more difficult, because in Korean both conventional and foreign words are written with Hangul characters. This problem remains a challenging issue in computational linguistic research.

It is often the case that specific words have been imported into multiple countries simultaneously, because the source words (or concepts) are usually influential across cultures. Thus, it is feasible that a large number of foreign words in Korean are also foreign words in Japanese.

In addition, the foreign words in Korean and Japanese corresponding to the same source word are phonetically similar. For example, the English word “system” has been imported into both Japanese and Korean. The romanized words are /sisutemu/ and /siseutem/ in Japanese and Korean, respectively.

Motivated by these assumptions, we propose a method to extract foreign words in Korean corpora by means of Japanese. In brief, our method works as follows. First, foreign words in Japanese are collected, for which Katakana words in corpora and existing lexicons can be used. Second, the words that are phonetically similar to Katakana words are extracted from Korean corpora. Finally, the extracted Korean words are compiled in a lexicon with the corresponding Japanese words.

In summary, our method can extract foreign words in Korean and produce a Japanese-Korean bilingual lexicon in a single framework.

2 Methodology

2.1 Overview

Figure 1 exemplifies our extraction method, which produces a Japanese-Korean bilingual lexicon using a Korean corpus and a Japanese corpus and/or lexicon. The Japanese and Korean corpora do not have to be parallel or comparable. However, it is desirable that both corpora are associated with the same domain. For the Japanese resource, the corpus and the lexicon can be used either alternatively or together. Note that compiling a Japanese monolingual lexicon is less expensive than compiling a bilingual lexicon. In addition, new Katakana words can easily be extracted from a number of on-line resources, such as the World Wide Web. Thus, the use of Japanese lexicons does not decrease the utility of our method.

First, we collect Katakana words from Japanese resources. This can be performed systematically by means of a Japanese character code, such as EUC-JP and SJIS.

Second, we represent the Korean corpus and the Japanese Katakana words in the Roman alphabet (i.e., romanization), so that the phonetic similarity can easily be computed. However, we use different romanization methods for Japanese and Korean.
Third, we extract candidates of foreign words from the romanized Korean corpus. An alternative method is to first perform morphological analysis on the corpus, extract candidate words based on morphemes and parts-of-speech, and romanize the extracted words. Our general model does not constrain which method is used in the third step. However, because the accuracy of morphological analysis often decreases for the new words to be extracted, we experimentally adopt the former method.

Finally, we compute the phonetic similarity between each combination of the romanized Hangul and Katakana words, and select the combinations whose score is above a predefined threshold. As a result, we obtain a Japanese-Korean bilingual lexicon consisting of foreign words.

It may be argued that English lexicons or corpora can be used as source information, instead of Japanese resources. However, because not all English words have been imported into Korean, the extraction accuracy will decrease due to extraneous words.

Figure 1: Overview of our extraction method.

2.2 Romanizing Japanese

Because the number of phones represented by Japanese Katakana characters is limited, we manually produced the correspondence between each phone and its Roman representation. The numbers of Katakana characters and combined phones are 73 and 109, respectively. We also defined a symbol to represent a long vowel. In Japanese, the Hepburn and Kunrei systems are commonly used for romanization purposes. We use the Hepburn system, because its representation is closer to that of Korean than the Kunrei system is.

However, specific Japanese phones, such as /ti/, do not exist in Korean. Thus, to adapt the Hepburn system to Korean, /ti/ and /tu/ are converted to /chi/ and /chu/, respectively.

2.3 Romanizing Korean

The number of Korean Hangul characters is much greater than that of Japanese Katakana characters. Each Hangul character is a combination of more than one consonant. The pronunciation of each character is determined by its component consonants.

In Korean, there are three types of consonant, i.e., the first consonant, vowel, and last consonant. The numbers of these consonants are 19, 21, and 27, respectively. The last consonant is optional. Thus, the number of combined characters is 11,172. However, to transliterate imported words, the official guideline suggests that only seven consonants be used as the last consonant. In EUC-KR, which is a standard coding system for Korean text, 2,350 common characters are coded independently of their pronunciation. Therefore, if we target corpora represented in EUC-KR, each of the 2,350 characters has to be mapped individually to its Roman representation.

We use Unicode, in which Hangul characters are sorted according to their pronunciation. Figure 2 depicts a fragment of the Unicode table for Korean, in which each line corresponds to a combination of the first consonant and vowel and each column corresponds to the last consonant. The number of columns is 28, i.e., the number of the last consonants plus the case in which the last consonant is not used. From this figure, the following rules can be found:
• the first consonant changes every 21 lines, which corresponds to the number of vowels,
• the vowel changes every line (i.e., 28 characters) and repeats every 21 lines,
• the last consonant changes every column.

Based on these rules, each character and its pronunciation can be identified by the three consonant types. Thus, we manually mapped only the 68 consonants to the Roman alphabet.

Figure 2: A fragment of the Unicode table for Korean Hangul characters.

We use the official romanization system for Korean, but specific Korean phones are adapted to Japanese. For example, /j/ and /l/ are converted to /z/ and /r/, respectively.

It should be noted that the adaptation is not invertible and thus is needed for both the J-to-K and K-to-J directions.
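The Unicode layout rules described in Section 2.3 can be illustrated with a short sketch. This is not the authors' implementation: the function names `decompose` and `romanize` are hypothetical, the romanization tables are a simplified rendering of the Revised Romanization of Korean (assumed here to be the "official" system), and the Japanese-oriented adaptation (e.g., /j/ to /z/) is left out.

```python
# Minimal sketch, assuming Unicode precomposed Hangul syllables (U+AC00..U+D7A3).
# Tables are a simplified Revised Romanization (an assumption, not the paper's tables).
INITIALS = ["g", "kk", "n", "d", "tt", "r", "m", "b", "pp", "s",
            "ss", "", "j", "jj", "ch", "k", "t", "p", "h"]          # 19 first consonants
VOWELS = ["a", "ae", "ya", "yae", "eo", "e", "yeo", "ye", "o", "wa", "wae",
          "oe", "yo", "u", "wo", "we", "wi", "yu", "eu", "ui", "i"]  # 21 vowels
FINALS = ["", "k", "k", "k", "n", "n", "n", "t", "l", "k", "m", "l", "l", "l",
          "p", "l", "m", "p", "p", "t", "t", "ng", "t", "t", "k", "t", "p", "t"]  # 27 + "none"

HANGUL_BASE = 0xAC00
SYLLABLE_COUNT = 19 * 21 * 28  # 11,172 combined characters


def decompose(ch):
    """Return (first consonant, vowel, last consonant) indices of one syllable."""
    index = ord(ch) - HANGUL_BASE
    if not 0 <= index < SYLLABLE_COUNT:
        raise ValueError(f"not a precomposed Hangul syllable: {ch!r}")
    # The first consonant changes every 21 lines of 28 characters each.
    initial, rest = divmod(index, 21 * 28)
    # The vowel changes every line (28 characters); the last consonant every column.
    vowel, final = divmod(rest, 28)
    return initial, vowel, final


def romanize(word):
    """Romanize a string of Hangul syllables character by character."""
    parts = []
    for ch in word:
        initial, vowel, final = decompose(ch)
        parts.append(INITIALS[initial] + VOWELS[vowel] + FINALS[final])
    return "".join(parts)


print(romanize("시스템"))  # the Korean spelling of "system"
```

Because only the three small tables are hand-written, the sketch covers all 11,172 syllables without a per-character mapping, which is the advantage the section attributes to Unicode over EUC-KR.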
For example, the English word “cheese”, which has been imported to both Korean and Japanese as a foreign word, is romanized as /chiseu/ in Korean and /ti:zu/ in Japanese. Here, /:/ is the symbol representing a Japanese long vowel. Using the adaptation, these expressions are converted to /chizu/ and /chi:zu/, respectively, which look more similar to each other than the original strings do.

2.4 Extracting term candidates from Korean corpora

To extract candidates of foreign words from a Korean corpus, we first extract phrases. This can be performed systematically, because Korean sentences are segmented on a phrase-by-phrase basis.

Second, because foreign words are usually nouns, we use hand-crafted rules to remove post-position suffixes (e.g., Josa) and extract nouns from phrases.

Third, we discard nouns including the last consonants that are not recommended for transliteration purposes in the official guideline. Although the guideline suggests other rules for transliteration, existing foreign words in Korean do not necessarily follow these rules.

Finally, we consult a dictionary to discard existing Korean words, because our purpose is to extract new words. For this purpose, we experimentally use the dictionary for the SuperMorph-K morphological analyzer¹, which includes approximately 50,000 Korean words.

¹ http://www.omronsoft.com/

2.5 Computing Similarity

Given romanized Japanese and Korean words, we compute the similarity between the two strings and select the pairs whose score is above a threshold as translations. We use a DP (dynamic programming) matching method to identify the number of differences (i.e., insertions, deletions, and substitutions) between two strings, on an alphabet-by-alphabet basis.

In principle, if two strings are associated with a smaller number of differences, the similarity between them becomes greater. For this purpose, a Dice-style coefficient can be used.

However, while the use of consonants in transliteration is usually the same across languages, the use of vowels can vary significantly depending on the language. For example, the English word “system” is romanized as /sisutemu/ and /siseutem/ in Japanese and Korean, respectively. Thus, the differences in consonants between two strings should be penalized more than the differences in vowels.

In view of the above discussion, we compute the similarity between two romanized words by Equation (1).

$$1 - \frac{2 \cdot (\alpha \cdot d_c + d_v)}{\alpha \cdot c + v} \qquad (1)$$

Here, $d_c$ and $d_v$ denote the numbers of differences in consonants and vowels, respectively, and $\alpha$ is a parametric constant used to control the importance of the consonants. We experimentally set $\alpha = 2$. In addition, $c$ and $v$ denote the numbers of all consonants and vowels in the two strings. The similarity ranges from 0 to 1.

3 Experimentation

3.1 Evaluating Extraction Accuracy

We collected 111,166 Katakana words (word types) from multiple Japanese lexicons, most of which were technical term dictionaries.

We used the Korean document set in the NTCIR-3 Cross-lingual Information Retrieval test collection². This document set consists of 66,146 newspaper articles of the Korean Economic Daily published in 1994.

² http://research.nii.ac.jp/ntcir/index-en.html

We randomly selected 50 newspaper articles and used them for our experiment. We asked a graduate student, who is not an author of this paper, to identify foreign words in the target text. As a result, 124 foreign word types (205 word tokens) were identified, which was fewer than we had expected. This was partially due to the fact that newspaper articles generally do not contain as many foreign words as technical publications do.

We manually classified the extracted words and used only the words that were imported to both Japan and Korea from other languages. We discarded foreign words in Korea imported from Japan, because these words were often spelled out in non-Katakana characters, such as Kanji (Chinese characters). A sample of these words includes “Tokyo (the capital of Japan)”, “Heisei (the current Japanese era name)”, and “enko (personal connection)”. In addition, we discarded the foreign proper nouns for which the human subject was not able to identify the source word. As a result, we obtained 67 target word types. Examples of original English words for these words are as follows:

digital, group, dollar, re-engineering, line, polyester, Asia, service, class, card, computer, brand, liter, hotel.

Thus, our method can potentially be applied to roughly half of the foreign words in Korean text (67 of the 124 word types identified).

We used the Japanese words to extract plausible foreign words from the target Korean corpus. We first romanized the corpus and extracted nouns by removing post-position suffixes. As a result, we obtained 3,106 words including all the 67 target words. After discarding the words in the dictionary for SuperMorph-K, 958 words including 59 target words remained.

For each of the remaining 958 words, we computed its similarity to each of the 111,166 Japanese words. For evaluation purposes, we varied the threshold for the similarity and investigated the relation between precision and recall. Recall is the ratio of the number of target foreign words extracted by our method to the total number of target foreign
words. Precision is the ratio of the number of target foreign words extracted by our method to the total number of words obtained by our method.

Table 1 shows the precision and recall for different settings. While we varied the threshold for the similarity, we also varied the number of Korean words allowed to correspond to a single Katakana word (N). By decreasing the value of the threshold and increasing the number of words extracted, the recall can be improved but the precision decreases. In Table 1, the precision and recall are in an extreme trade-off relation. For example, when the recall was 69.5%, the precision was only 1.2%.

We manually analyzed the words that were not extracted by our method. Out of the 59 target words, 12 compound words consisting of both conventional and foreign words were not extracted. However, our method extracted compound words consisting of only foreign words. In addition, the three words that did not have counterparts in the input Japanese words were not extracted.

Table 1: Precision/Recall (%) for term extraction.

         Threshold for similarity
         >0.9        >0.7         >0.5
N=1      50.0/8.5    12.7/40.7    4.1/47.5
N=10     50.0/8.5    7.4/47.5     1.2/69.5

3.2 Application-Oriented Evaluation

During the first experiment, we determined a specific threshold value for the similarity between Katakana and Hangul words and selected the pairs whose similarity was above the threshold. As a result, we obtained 667 Korean words, which were used to enhance the dictionary for the SuperMorph-K morphological analyzer.

We performed morphological analysis on the 50 articles used in the first experiment, which included 1,213 sentences and 9,557 word tokens. We also investigated the degree to which the analytical accuracy is improved by means of the additional dictionary. Here, accuracy is the ratio of the number of correct word segmentations to the total number of segmentations generated by SuperMorph-K. The same human subject as in the first experiment identified the correct word segmentations for the input articles.

First, we focused on the accuracy of segmenting foreign words. The accuracy improved from 75.8% to 79.8% by means of the additional dictionary. The accuracy for all words changed from 94.6% to 94.8% with the additional dictionary.

In summary, the additional dictionary was effective for analyzing foreign words and had no side effect on the overall accuracy. At the same time, we concede that we need larger-scale experiments to draw firmer conclusions.

4 Related Work

A number of corpus-based methods to extract bilingual lexicons have been proposed (Smadja et al., 1996). In general, these methods use statistics obtained from a parallel or comparable bilingual corpus and extract word or phrase pairs that are strongly associated with each other. In contrast, our method uses a monolingual Korean corpus and a Japanese lexicon independent of the corpus, which are easier to obtain than parallel or comparable bilingual corpora.

Jeong et al. (1999) and Oh and Choi (2001) independently explored a statistical approach to detect foreign words in Korean text. Although the detection accuracy is reasonably high, these methods require a training corpus in which conventional and foreign words are annotated. Our approach does not require annotated corpora, but the detection accuracy is not high enough, as shown in Section 3.1. A combination of both approaches is expected to compensate for the drawbacks of each approach.

5 Conclusion

We proposed a method to extract foreign words, such as technical terms and proper nouns, from Korean corpora and produce a Japanese-Korean bilingual dictionary. Specific words, which have been imported into multiple countries, are usually spelled out in special phonetic alphabets, such as Katakana in Japanese and Hangul in Korean.

Because extracting foreign words spelled out in Katakana from Japanese lexicons and corpora can be performed with high accuracy, our method extracts words in Korean corpora that are phonetically similar to Japanese Katakana words. Our method requires neither parallel or comparable bilingual corpora nor human annotation of these corpora.

We also performed experiments in which we extracted foreign words from Korean newspaper articles and used the resultant dictionary for morphological analysis. We found that our method did not correctly extract compound Korean words consisting of both conventional and foreign words. Future work includes larger-scale experiments to further investigate the effectiveness of our method.

References

Kil Soon Jeong, Sung Hyon Myaeng, Jae Sung Lee, and Key-Sun Choi. 1999. Automatic identification and back-transliteration of foreign words for information retrieval. Information Processing & Management, 35:523–540.

Jong-Hoon Oh and Key-Sun Choi. 2001. Automatic extraction of transliterated foreign words using hidden Markov model. In Proceedings of ICCPOL-2001, pages 433–438.

Frank Smadja, Kathleen R. McKeown, and Vasileios Hatzivassiloglou. 1996. Translating collocations for bilingual lexicons: A statistical approach. Computational Linguistics, 22(1):1–38.
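As a closing illustration, the similarity of Equation (1) in Section 2.5 can be sketched as a weighted edit distance. This is one possible reading rather than the authors' implementation. Several choices below are assumptions: the DP minimizes the weighted sum α·d_c + d_v directly, a substitution is charged the larger weight of the two letters involved, the long-vowel symbol ":" is counted as a vowel, and negative scores are clamped to 0. The names `weight`, `weighted_differences`, and `similarity` are hypothetical.

```python
# Minimal sketch of Equation (1) under the assumptions stated above.
VOWEL_SYMBOLS = set("aeiou:")  # ':' (long vowel) treated as a vowel: an assumption


def weight(ch, alpha):
    """Cost of one difference: 1 for a vowel, alpha for a consonant."""
    return 1.0 if ch in VOWEL_SYMBOLS else alpha


def weighted_differences(a, b, alpha=2.0):
    """Minimum of alpha*d_c + d_v over all alignments (DP matching)."""
    # prev[j] holds the cost of turning the processed prefix of a into b[:j].
    prev = [0.0]
    for cb in b:
        prev.append(prev[-1] + weight(cb, alpha))
    for ca in a:
        cur = [prev[0] + weight(ca, alpha)]
        for j, cb in enumerate(b, start=1):
            # Substitution is charged the larger of the two weights (assumption).
            sub = 0.0 if ca == cb else max(weight(ca, alpha), weight(cb, alpha))
            cur.append(min(prev[j] + weight(ca, alpha),     # deletion of ca
                           cur[j - 1] + weight(cb, alpha),  # insertion of cb
                           prev[j - 1] + sub))              # match or substitution
        prev = cur
    return prev[-1]


def similarity(a, b, alpha=2.0):
    """Dice-style similarity of two romanized words, clamped to [0, 1]."""
    total = sum(weight(ch, alpha) for ch in a + b)  # alpha*c + v over both strings
    if total == 0:
        return 1.0
    return max(0.0, 1.0 - 2.0 * weighted_differences(a, b, alpha) / total)


print(round(similarity("sisutemu", "siseutem"), 3))
```

With α = 2, the "system" pair differs only in two vowels, so it keeps a high score, whereas a pair differing in two consonants would be penalized twice as much. A threshold such as those in Table 1 would then be applied to the returned value.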