The Transliteration from Alphabet Queries to Japanese Product Names

Rieko Tsuji(a), Yoshinori Nemoto(a), Wimvipa Luangpiensamut(a), Yuji Abe(a), Takeshi Kimura(a), Kanako Komiya(a), Koji Fujimoto(b), Yoshiyuki Kotani(a)

(a) Department of Computer and Information Science, Tokyo University of Agriculture and Technology / 2-24-16 Nakamachi Koganei-shi Tokyo JAPAN
(b) Tensor Consulting / 2-10-1 Koujimachi Chiyoda-ku Tokyo JAPAN

{Riekon.m, wimvipa, kittykimura}@gmail.com, 50012646127@st.tuat.ac.jp, wizdomowl@yahoo.co.jp, koji.fujimoto@tensor.co.jp, {kkomiya, kotani}@cc.tuat.ac.jp

Copyright 2012 by Rieko Tsuji, Yoshinori Nemoto, Wimvipa Luangpiensamut, Yuji Abe, Takeshi Kimura, Kanako Komiya, Koji Fujimoto, and Yoshiyuki Kotani. 26th Pacific Asia Conference on Language, Information and Computation, pages 456-462.

Abstract

There are cases where non-Japanese buyers are unable to find the products they want through Japanese shopping Web sites because those sites require Japanese queries. We propose to transliterate the inputs of non-Japanese users, i.e., search queries written in the English alphabet, into Japanese Katakana to solve this problem. In this research, the training data consisted of pairs of a non-Japanese search query that failed to retrieve the right match on a Japanese shopping website and its transcription given by volunteers. Since this corpus includes noise for transliteration, such as free translations, we used two different filters to remove the query pairs that are not transliterations, in order to improve the quality of the training data. In addition, we compared three methods, BIGRAM, HMM, and CRF, on these data to investigate which is the best for query transliteration. The experiment revealed that the HMM was the best.

1 Introduction

In recent years, e-commerce has come into wide use throughout the world, enabling people to purchase products from foreign countries. However, it is sometimes not easy for foreign buyers to find the products they want because of the language difference. In our case, the alphabetic queries input by non-Japanese buyers should be translated into Japanese in order to show the product pages they want to find.

There are many cases in which non-Japanese people get no results, or the wrong results, from their search queries; these can be classified into three types. The first is the case where non-Japanese people write Japanese product names in the Latin alphabet; we expected that this case could be solved by transliteration. The second is the case where non-Japanese people write English product names; this could be solved by translation. The third covers the rest, for example, proper nouns such as the names of animation characters, and misspellings. Among these, we expected the first case to be the most frequent, because 53.7% of the queries in the corpus could be fully transliterated. Hence, we propose transliteration from alphabetic queries to Japanese product names, e.g., from lunchbox to ランチボックス (translation into English: lunchbox, pronunciation in Japanese: ranchibokkusu).

Much research on transliteration has been carried out on clean data; however, as far as we know, there has been no research on the transliteration of noisy query data. Thus, we investigated which method is the best for query transliteration, using parallel data consisting of the alphabetic queries that did not retrieve any products when non-Japanese people searched (i.e., the Alphabet Queries) and the Japanese queries transcribed from them (i.e., the Correct Queries).
We refer to this parallel data as the pair corpus; Table 1 shows examples of it. Here, the Alphabet Queries are keywords that were actually used by non-Japanese users on a Japanese website, and the Correct Queries were transcribed by volunteers. However, some of the pairs were not transliterated into Japanese phonograms, i.e., Katakana or Hiragana; the corpus also contains free translations and Chinese characters. Instead of manually editing the raw data, we automatically filtered out such word pairs using two filters: the Chinese character filter (CF) and the Chinese character and alphabet filter (CAF). The experiments revealed that the HMM worked best, giving a precision of 0.448 under the looser evaluation when the CF was used.

2 Related Works

Many works on transliteration have been accomplished so far, including phonemic, orthographic, and rule-based approaches, as well as approaches that use machine learning. For example, Aramaki et al. (2009) presented a discriminative transliteration model using the CRF for English-to-Japanese transliteration. For other languages, Wang et al. (2011) worked on English-Korean transliteration, comparing four methods: grapheme substring-based, phoneme substring-based, rule-based, and a mixture of them. Jing et al. (2011) developed an English-Chinese transliteration system that consists of many-to-many alignment and the CRF (conditional random fields) using accessor variety.

However, as far as we know, transliteration of noisy query data has not been attempted so far. Hence, we propose to transliterate the Alphabet Queries into the Correct Queries using the pair corpus, and we compared three transliteration methods to investigate which is the best for query transliteration.

It is also possible to use dictionary-based approaches; however, the pair corpus includes many new words, such as the titles of comics and the names of animation characters, that are not listed in dictionaries. Therefore, the dictionary-based approach is not as powerful for transliteration as it is for translation. Thus, we employed the phonemic approach, and a probabilistic method or machine learning was used for the transliteration from phonemes to Japanese product names (i.e., the Correct Queries).

3 Transformation from the Alphabet Query to Phoneme

We employed the phonemic approach: the Alphabet Queries were first transformed into phonemes and then transliterated. The transliteration was carried out as follows:

1. Transform the Alphabet Queries into phonemes using an English-phoneme dictionary (Section 3.1)
2. Filter the Correct Queries to clean the noisy data (Section 3.2)
3. Calculate the translation probabilities from phonemes to Japanese characters (Section 3.3)
4. Align the phonemes and the Japanese characters (Section 3.4)
5. Transliterate the phoneme queries into Japanese words using a probabilistic method or machine learning (Section 3.5)

The remainder of this section describes these five steps. Steps one to four constitute the generation phase of the training data, and step five is the transliteration phase.

3.1 Transform the Alphabet Queries

The CMU Pronunciation Dictionary (CMUdict; http://www.speech.cs.cmu.edu/cgi-bin/cmudict) was used for the transformation from the Alphabet Queries to phonemes. Accordingly, we targeted only the alphabetic queries for which at least one phoneme could be obtained. We obtained 2,833 Alphabet Queries after this process.
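The paper does not give code for this lookup step. As a minimal sketch, assuming NLTK's bundled copy of CMUdict (the helper name and the out-of-vocabulary handling are illustrative, not from the paper):

```python
# Sketch of Step 1 (Section 3.1): mapping an Alphabet Query to CMUdict
# phonemes, one token at a time. Uses NLTK's copy of the CMU
# Pronouncing Dictionary.
import nltk
from nltk.corpus import cmudict

nltk.download("cmudict", quiet=True)
PRON = cmudict.dict()  # word -> list of possible phoneme sequences

def query_to_phonemes(query):
    """Return one phoneme sequence per token, or None for OOV tokens."""
    phoneme_seqs = []
    for token in query.lower().split():
        prons = PRON.get(token)
        # Take the first listed pronunciation; CMUdict may give several.
        phoneme_seqs.append(prons[0] if prons else None)
    return phoneme_seqs

print(query_to_phonemes("document"))
# e.g. [['D', 'AA1', 'K', 'Y', 'AH0', 'M', 'EH0', 'N', 'T']]
```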
3.2 Filter

Since the pair corpus is noisy, the training data were narrowed down and refined using the following two filters:

1. Chinese character filter (CF)
2. Chinese character and alphabet filter (CAF)

These two filters were compared to balance the quality and the amount of the training data. The CF filtered out the pairs whose Correct Queries contain Chinese characters, and the CAF filtered out the pairs whose Correct Queries contain Chinese characters or alphabetic characters. In other words, the pairs that pass the CAF have Correct Queries written only in Katakana and Hiragana.

Table 1 lists examples of the pair corpus and the characteristics of the Alphabet and Correct Queries. Here, we focused on the character types of the Correct Queries because of the characteristics of the pair corpus.

Alphabet Query (type of query)          | Correct Query (translation into English, pronunciation in Japanese) | transliteration (L) or translation (T) | Character types of Correct Query
----------------------------------------|----------------------------------------------------------------------|----------------------------------------|---------------------------------
Doraemon (animation character's name)   | ドラえもん (Doraemon, doraemon)                                       | L                                      | Katakana, Hiragana
Miyazaki (person's name)                | ジブリ (GHIBRI, ziburi)                                               | T                                      | Katakana
AKB48 poster (pop group's name, poster) | AKB48 ポスター (AKB48 poster, eikeibii48 posutaa)                     | L                                      | Katakana, Alphabet
Ufm rod (brand name, rod)               | Ufm ロッド (Ufm rod, uefuemu roddo)                                   | L                                      | Katakana, Alphabet
Tokyo adidas (place name, brand name)   | 東京 adidas (Tokyo adidas, toukyou adidasu)                           | L                                      | Chinese character, Alphabet
Dress Tokyo (general noun, place name)  | 原宿 ドレス (Harajuku dress, harajuku doresu)                         | L, T                                   | Chinese character, Katakana

Table 1: Examples of the pair corpus and the characteristics of the Alphabet and Correct Queries

As shown in the table, although we want to use only the transliteration pairs as the training data, it is not easy to distinguish them (the pair corpus consists only of the Alphabet and Correct Queries). The first problem was that some Correct Queries are written not only in Japanese phonograms, i.e., Katakana or Hiragana, but also in ideograms, i.e., Chinese characters, which can be pronounced in many ways (cf. Tokyo-東京 (Tokyo, toukyou)). Thus, we carried out the filtering by character type to obtain as many transliteration pairs as possible. We expected this process to improve the quality of the training data because, in many cases, Correct Queries written in Katakana are transliterations. However, we have to keep in mind that a Correct Query in Katakana can still be a free translation, as shown in the second row of Table 1 (cf. Miyazaki-ジブリ (translation into English: GHIBRI, pronunciation in Japanese: ziburi, meaning: a film studio name)).

Therefore, we filtered out the pairs whose Correct Queries contain Chinese characters (CF; this removes the last two rows of Table 1). We also filtered out the pairs whose Correct Queries contain alphabetic or Chinese characters, to refine the pair corpus further (CAF; this removes the last four rows of Table 1). However, if we filter out too many query pairs to improve the quality of the training data, we may not be able to obtain enough training data for the probabilistic methods or machine learning. Namely, we used the two kinds of filters to find out which of them is better for query transliteration. We could use 78.5% and 25.2% of the pair corpus to calculate the translation probabilities when using the CF and the CAF, respectively.
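A minimal sketch of the two filters, assuming the standard Unicode ranges for Hiragana, Katakana, and CJK unified ideographs (the range boundaries and function names are our assumptions, not the authors' code):

```python
# Sketch of the CF and CAF filters (Section 3.2), based on character
# types of the Correct Query.
import re

HAS_KANJI = re.compile(r"[\u4E00-\u9FFF]")  # CJK unified ideographs
HAS_LATIN = re.compile(r"[A-Za-z]")

def cf_keep(correct_query):
    """Chinese character filter: drop pairs whose Correct Query
    contains a Chinese character."""
    return not HAS_KANJI.search(correct_query)

def caf_keep(correct_query):
    """Chinese character and alphabet filter: additionally drop pairs
    whose Correct Query contains Latin letters, keeping kana only."""
    return cf_keep(correct_query) and not HAS_LATIN.search(correct_query)

pairs = [("Doraemon", "ドラえもん"), ("Tokyo adidas", "東京 adidas"),
         ("AKB48 poster", "AKB48 ポスター")]
print([a for a, c in pairs if cf_keep(c)])   # drops 東京 adidas
print([a for a, c in pairs if caf_keep(c)])  # also drops AKB48 ポスター
```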
3.3 Calculation of Translation Probabilities

The translation probabilities from the phonemes of the Alphabet Queries, obtained as described in Section 3.1, to the Correct Queries, filtered as described in Section 3.2, were calculated on the filtered pair corpus. We used the GIZA++ toolkit (Och and Ney, 2003; http://www.fjoch.com/GIZA++.html) to calculate them. Here, we set the phonemes as the source language and the Japanese characters as the target language (a sketch of the input preparation appears at the end of this section).

3.4 Alignment

The alignment of phonemes and Japanese characters, which is necessary before the transliteration, was carried out for each query pair. The Dijkstra algorithm was used to compute the alignments (a dynamic-programming sketch appears at the end of this section). Figure 1 shows the alignment of the phonemes of document and its transcribed word ドキュメント (document, dokyumento). In Figure 1, the horizontal axis represents the phonemes of the Alphabet Query and the vertical axis represents the Correct Query. We used the negative logarithm of the translation probabilities (calculated in Section 3.3) as the costs of the alignment. Also, we set the logarithm of 10^-20 as the cost when no translation probability was obtained (the horizontal and vertical directions in Figure 1 are such cases).

[Figure 1: The alignment of the phonemes of document and its transcribed word ドキュメント (document, dokyumento)]

Figure 2 shows the result of the alignment when the Alphabet Query was document and the Correct Query was ドキュメント (document, dokyumento). NULLJ and NULLP in Figure 2 represent alignments in the horizontal and vertical directions, respectively.

[D - ド (do)]
[AA1 - NULLJ]
[K - キ (ki)]
[Y - ュ (yu)]
[AH0 - NULLJ]
[M - メ (me)]
[EH0 - NULLJ]
[N - ン (n)]
[T - ト (to)]

Figure 2: The result of the alignment of the phonemes of document and ドキュメント (document, dokyumento)

3.5 Transliteration

The transliteration was carried out using a probabilistic method or machine learning. We compared the following three approaches, which were applied based on the alignments obtained in Section 3.4:

1. BIGRAM: the bigram model
2. HMM: the hidden Markov model
3. CRF: the CRF model

method        | BASE                   | BIGRAM                                      | HMM                                         | CRF
system output | フャブーンク (fabuunku) | ファブリック (faburikku: the correct answer) | ファブリック (faburikku: the correct answer) | フブック (fubukku)
evaluation    | 1                      | 3                                           | 3                                           | 2

Table 2: The system outputs and their evaluations when the input was "fabric" (Alphabet Query)

We used NLTK (http://www.nltk.org/) for BIGRAM and the HMM, and adopted the CRF++ toolkit (http://crfpp.googlecode.com/svn/trunk/doc/index.html) for the CRF. We trained the CRF models with unigram, bigram, and trigram features, where s_i denotes the phoneme at relative position i (a corresponding CRF++ template sketch appears at the end of this section):

Unigram: s-2, s-1, s0, s1, and s2
Bigram: s-1 s0 and s0 s1
Trigram: s-2 s-1 s0, s-1 s0 s1, and s0 s1 s2

We set the parameters to f=50 and c=2. We set f=50 because the kinds of features were highly variable.
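The paper does not show how the pair corpus is fed to GIZA++. A minimal sketch of the bitext preparation for Section 3.3, assuming one query pair per line, phonemes as source tokens, and one Japanese character per target token (the file names and helper are illustrative; GIZA++'s own preprocessing, e.g. plain2snt, would still be run on these files):

```python
# Sketch of GIZA++ input preparation (Section 3.3). File names and the
# helper are illustrative assumptions, not the authors' code.
def write_bitext(pairs, src_path="queries.ph", tgt_path="queries.ja"):
    """Write parallel files: phonemes (source) and Japanese characters
    (target), one query pair per line, tokens separated by spaces."""
    with open(src_path, "w", encoding="utf-8") as src, \
         open(tgt_path, "w", encoding="utf-8") as tgt:
        for phonemes, japanese in pairs:
            src.write(" ".join(phonemes) + "\n")
            # Treat every Japanese character as one target token.
            tgt.write(" ".join(japanese) + "\n")

write_bitext([(["D", "AA1", "K", "Y", "AH0", "M", "EH0", "N", "T"],
               "ドキュメント")])
```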
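Section 3.4 finds a shortest path with Dijkstra's algorithm over the grid of Figure 1. Since every move in that grid goes forward, the same shortest path can be computed with a simple dynamic program; the sketch below assumes a cost function returning -log P(character | phoneme) from the Section 3.3 probabilities, with -log(10^-20) as the floor cost, and is our reconstruction rather than the authors' implementation:

```python
# Sketch of the alignment step (Section 3.4). Diagonal moves align a
# phoneme with a character, horizontal moves align a phoneme with
# NULLJ, and vertical moves align NULLP with a character.
import math

FLOOR = -math.log(1e-20)  # cost when no translation probability exists

def align(phonemes, chars, cost):
    n, m = len(phonemes), len(chars)
    best = [[math.inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if best[i][j] == math.inf:
                continue
            moves = []
            if i < n and j < m:  # phoneme i aligned to character j
                moves.append((i + 1, j + 1, cost(phonemes[i], chars[j])))
            if i < n:            # phoneme i aligned to NULLJ
                moves.append((i + 1, j, FLOOR))
            if j < m:            # NULLP aligned to character j
                moves.append((i, j + 1, FLOOR))
            for ni, nj, c in moves:
                if best[i][j] + c < best[ni][nj]:
                    best[ni][nj] = best[i][j] + c
                    back[ni][nj] = (i, j)
    # Recover the alignment path from the backpointers.
    path, ij = [], (n, m)
    while back[ij[0]][ij[1]] is not None:
        pi, pj = back[ij[0]][ij[1]]
        src = phonemes[pi] if ij[0] > pi else "NULLP"
        tgt = chars[pj] if ij[1] > pj else "NULLJ"
        path.append((src, tgt))
        ij = (pi, pj)
    return list(reversed(path))

demo_cost = lambda p, j: 1.0  # stand-in; use -log P(j|p) in practice
print(align(list("ab"), list("アブ"), demo_cost))
# [('a', 'ア'), ('b', 'ブ')]
```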
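The CRF feature set of Section 3.5 maps directly onto a CRF++ feature template. The template below is our reconstruction, not the authors' file; %x[k,0] denotes the phoneme at relative position k (s_k above). Assuming f and c correspond to CRF++'s -f (feature frequency cutoff) and -c (regularization) options, training would be invoked as `crf_learn -f 50 -c 2 template train.data model`.

```
U00:%x[-2,0]
U01:%x[-1,0]
U02:%x[0,0]
U03:%x[1,0]
U04:%x[2,0]
U05:%x[-1,0]/%x[0,0]
U06:%x[0,0]/%x[1,0]
U07:%x[-2,0]/%x[-1,0]/%x[0,0]
U08:%x[-1,0]/%x[0,0]/%x[1,0]
U09:%x[0,0]/%x[1,0]/%x[2,0]
```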