The Transliteration from Alphabet Queries to Japanese Product Names

Rieko Tsuji(a), Yoshinori Nemoto(a), Wimvipa Luangpiensamut(a), Yuji Abe(a), Takeshi Kimura(a), Kanako Komiya(a), Koji Fujimoto(b), Yoshiyuki Kotani(a)

(a) Department of Computer and Information Science, Tokyo University of Agriculture and Technology / 2-24-16 Nakamachi Koganei-shi Tokyo JAPAN
(b) Tensor Consulting / 2-10-1 Koujimachi Chiyoda-ku Tokyo JAPAN

{Riekon.m, wimvipa, kittykimura}@gmail.com, 50012646127@st.tuat.ac.jp, wizdomowl@yahoo.co.jp, koji.fujimoto@tensor.co.jp, {kkomiya, kotani}@cc.tuat.ac.jp

Copyright 2012 by Rieko Tsuji, Yoshinori Nemoto, Wimvipa Luangpiensamut, Yuji Abe, Takeshi Kimura, Kanako Komiya, Koji Fujimoto, and Yoshiyuki Kotani. 26th Pacific Asia Conference on Language, Information and Computation, pages 456-462.

Abstract

There are cases where non-Japanese buyers are unable to find the products they want through Japanese shopping Web sites because those sites require Japanese queries. We propose to transliterate the inputs of non-Japanese users, i.e., search queries written in the English alphabet, into Japanese Katakana to solve this problem. In this research, the training data consisted of pairs of a non-Japanese search query that failed to retrieve the right match on a Japanese shopping website and its transcription given by volunteers. Since this corpus includes noise for transliteration, such as free translations, we used two different filters to remove the query pairs that are not transliterations, in order to improve the quality of the training data. In addition, we compared three methods, BIGRAM, HMM, and CRF, on these data to investigate which is the best for query transliteration. The experiment revealed that the HMM was the best.

1 Introduction

In recent years, e-commerce has come into wide use throughout the world, enabling people to purchase products from foreign countries. However, it is sometimes not easy for foreign buyers to find the products they want because of the language difference. In our case, the alphabetic queries input by non-Japanese buyers should be translated into Japanese in order to show the product pages they want to find.

There are many cases in which non-Japanese people get no results, or the wrong results, from their search queries; these can be classified into three types. The first is the case where non-Japanese people write Japanese product names in the Latin alphabet; we expected that this case could be solved by transliteration. The second is the case where non-Japanese people write English product names; this could be solved by translation. The third covers the rest, for example, proper nouns such as the names of animation characters, and misspellings. Among these, we expected the first case to be the most frequent, because 53.7% of the queries in the corpus could be fully transliterated. Hence, we propose transliteration from alphabetic queries to Japanese product names, e.g., from lunchbox to ランチボックス (translation into English: lunchbox, pronunciation in Japanese: ranchibokkusu).

Much research on transliteration has been carried out on clean data; however, as far as we know, there has been no research on the transliteration of noisy query data. Thus, we investigated which method is the best for query transliteration, using parallel data consisting of the alphabetic queries that did not retrieve any products when non-Japanese people searched (i.e., the Alphabet Queries) and the Japanese queries transcribed from them (i.e., the Correct Queries).
We refer to this parallel data as the pair corpus; Table 1 shows examples of it. Here, the Alphabet Queries are keywords that were actually used by non-Japanese users on a Japanese website, and the Correct Queries were transcribed by volunteers. However, some of the pairs were not transliterated into Japanese phonograms, i.e., Katakana or Hiragana; the corpus also contains free translations and Chinese characters. Instead of manually editing the raw data, we automatically filtered out such word pairs using two filters: the Chinese character filter (CF) and the Chinese character and alphabet filter (CAF). The experiments revealed that the HMM worked best, giving a precision of 0.448 under the looser evaluation when the CF was used.

2 Related Works

Many works on transliteration have been accomplished so far, including phonemic, orthographic, and rule-based approaches, as well as approaches that use machine learning. For example, Aramaki et al. (2009) presented a discriminative transliteration model using the CRF for English-to-Japanese transliteration. For other languages, Wang et al. (2011) worked on English-Korean transliteration, comparing four methods: grapheme substring-based, phoneme substring-based, rule-based, and a mixture of them. Jing et al. (2011) developed an English-Chinese transliteration system that consists of many-to-many alignment and the CRF (conditional random fields) using accessor variety.

However, as far as we know, transliteration of noisy query data has not been attempted so far. Hence, we propose to transliterate the Alphabet Queries into the Correct Queries using the pair corpus, and we compared three transliteration methods to investigate which is the best for query transliteration.

It is also possible to use dictionary-based approaches; however, the pair corpus includes many new words, such as the titles of comics and the names of animation characters, that are not listed in dictionaries. Therefore, the dictionary-based approach is not as powerful for transliteration as it is for translation. Thus, we employed the phonemic approach, and a probabilistic method or machine learning was used for the transliteration from phonemes to Japanese product names (i.e., the Correct Queries).

3 Transformation from the Alphabet Query to Phoneme

We employed the phonemic approach: the Alphabet Queries were first transformed into phonemes and then transliterated. The transliteration was carried out as follows:

1. Transform the Alphabet Queries into phonemes using an English-phoneme dictionary (Section 3.1)
2. Filter the Correct Queries to clean the noisy data (Section 3.2)
3. Calculate the translation probabilities from phonemes to Japanese characters (Section 3.3)
4. Align the phonemes and the Japanese characters (Section 3.4)
5. Transliterate the phoneme queries into Japanese words using a probabilistic method or machine learning (Section 3.5)

The remainder of this section describes these five steps. Steps one to four constitute the generation phase of the training data, and step five is the transliteration phase.

3.1 Transform the Alphabet Queries

The CMU Pronunciation Dictionary (CMUdict; http://www.speech.cs.cmu.edu/cgi-bin/cmudict) was used for the transformation from the Alphabet Queries to phonemes. Accordingly, we targeted only the alphabetic queries for which at least one phoneme could be obtained. We obtained 2,833 Alphabet Queries after this process.
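The paper does not give code for this lookup step. As a minimal sketch, assuming NLTK's bundled copy of CMUdict (the helper name and the out-of-vocabulary handling are illustrative, not from the paper):

```python
# Sketch of Step 1 (Section 3.1): mapping an Alphabet Query to CMUdict
# phonemes, one token at a time. Uses NLTK's copy of the CMU
# Pronouncing Dictionary.
import nltk
from nltk.corpus import cmudict

nltk.download("cmudict", quiet=True)
PRON = cmudict.dict()  # word -> list of possible phoneme sequences

def query_to_phonemes(query):
    """Return one phoneme sequence per token, or None for OOV tokens."""
    phoneme_seqs = []
    for token in query.lower().split():
        prons = PRON.get(token)
        # Take the first listed pronunciation; CMUdict may give several.
        phoneme_seqs.append(prons[0] if prons else None)
    return phoneme_seqs

print(query_to_phonemes("document"))
# e.g. [['D', 'AA1', 'K', 'Y', 'AH0', 'M', 'EH0', 'N', 'T']]
```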
3.2 Filter

Since the pair corpus is noisy, the training data were narrowed down and refined using the following two filters:

1. Chinese character filter (CF)
2. Chinese character and alphabet filter (CAF)

These two filters were compared to balance the quality and the amount of the training data. The CF filtered out the pairs whose Correct Queries contain Chinese characters, and the CAF filtered out the pairs whose Correct Queries contain Chinese characters or alphabetic characters. In other words, the pairs that pass the CAF have Correct Queries written only in Katakana and Hiragana.

Table 1 lists examples of the pair corpus and the characteristics of the Alphabet and Correct Queries. Here, we focused on the character types of the Correct Queries because of the characteristics of the pair corpus.

Alphabet Query (type of query)          | Correct Query (translation into English, pronunciation in Japanese) | transliteration (L) or translation (T) | Character types of Correct Query
----------------------------------------|----------------------------------------------------------------------|----------------------------------------|---------------------------------
Doraemon (animation character's name)   | ドラえもん (Doraemon, doraemon)                                       | L                                      | Katakana, Hiragana
Miyazaki (person's name)                | ジブリ (GHIBRI, ziburi)                                               | T                                      | Katakana
AKB48 poster (pop group's name, poster) | AKB48 ポスター (AKB48 poster, eikeibii48 posutaa)                     | L                                      | Katakana, Alphabet
Ufm rod (brand name, rod)               | Ufm ロッド (Ufm rod, uefuemu roddo)                                   | L                                      | Katakana, Alphabet
Tokyo adidas (place name, brand name)   | 東京 adidas (Tokyo adidas, toukyou adidasu)                           | L                                      | Chinese character, Alphabet
Dress Tokyo (general noun, place name)  | 原宿 ドレス (Harajuku dress, harajuku doresu)                         | L, T                                   | Chinese character, Katakana

Table 1: Examples of the pair corpus and the characteristics of the Alphabet and Correct Queries

As shown in the table, although we want to use only the transliteration pairs as the training data, it is not easy to distinguish them (the pair corpus consists only of the Alphabet and Correct Queries). The first problem was that some Correct Queries are written not only in Japanese phonograms, i.e., Katakana or Hiragana, but also in ideograms, i.e., Chinese characters, which can be pronounced in many ways (cf. Tokyo-東京 (Tokyo, toukyou)). Thus, we carried out the filtering by character type to obtain as many transliteration pairs as possible. We expected this process to improve the quality of the training data because, in many cases, Correct Queries written in Katakana are transliterations. However, we have to keep in mind that a Correct Query in Katakana can still be a free translation, as shown in the second row of Table 1 (cf. Miyazaki-ジブリ (translation into English: GHIBRI, pronunciation in Japanese: ziburi, meaning: a film studio name)).

Therefore, we filtered out the pairs whose Correct Queries contain Chinese characters (CF; this removes the last two rows of Table 1). We also filtered out the pairs whose Correct Queries contain alphabetic or Chinese characters, to refine the pair corpus further (CAF; this removes the last four rows of Table 1). However, if we filter out too many query pairs to improve the quality of the training data, we may not be able to obtain enough training data for the probabilistic methods or machine learning. Namely, we used the two kinds of filters to find out which of them is better for query transliteration. We could use 78.5% and 25.2% of the pair corpus to calculate the translation probabilities when using the CF and the CAF, respectively.
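A minimal sketch of the two filters, assuming the standard Unicode ranges for Hiragana, Katakana, and CJK unified ideographs (the range boundaries and function names are our assumptions, not the authors' code):

```python
# Sketch of the CF and CAF filters (Section 3.2), based on character
# types of the Correct Query.
import re

HAS_KANJI = re.compile(r"[\u4E00-\u9FFF]")  # CJK unified ideographs
HAS_LATIN = re.compile(r"[A-Za-z]")

def cf_keep(correct_query):
    """Chinese character filter: drop pairs whose Correct Query
    contains a Chinese character."""
    return not HAS_KANJI.search(correct_query)

def caf_keep(correct_query):
    """Chinese character and alphabet filter: additionally drop pairs
    whose Correct Query contains Latin letters, keeping kana only."""
    return cf_keep(correct_query) and not HAS_LATIN.search(correct_query)

pairs = [("Doraemon", "ドラえもん"), ("Tokyo adidas", "東京 adidas"),
         ("AKB48 poster", "AKB48 ポスター")]
print([a for a, c in pairs if cf_keep(c)])   # drops 東京 adidas
print([a for a, c in pairs if caf_keep(c)])  # also drops AKB48 ポスター
```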
3.3 Calculation of Translation Probabilities

The translation probabilities from the phonemes of the Alphabet Queries, obtained as described in Section 3.1, to the Correct Queries, filtered as described in Section 3.2, were calculated on the filtered pair corpus. We used the GIZA++ toolkit (Och and Ney, 2003; http://www.fjoch.com/GIZA++.html) to calculate them. Here, we set the phonemes as the source language and the Japanese characters as the target language (a sketch of the input preparation appears at the end of this section).

3.4 Alignment

The alignment of phonemes and Japanese characters, which is necessary before the transliteration, was carried out for each query pair. The Dijkstra algorithm was used to compute the alignments (a dynamic-programming sketch appears at the end of this section). Figure 1 shows the alignment of the phonemes of document and its transcribed word ドキュメント (document, dokyumento). In Figure 1, the horizontal axis represents the phonemes of the Alphabet Query and the vertical axis represents the Correct Query. We used the negative logarithm of the translation probabilities (calculated in Section 3.3) as the costs of the alignment. Also, we set the logarithm of 10^-20 as the cost when no translation probability was obtained (the horizontal and vertical directions in Figure 1 are such cases).

[Figure 1: The alignment of the phonemes of document and its transcribed word ドキュメント (document, dokyumento)]

Figure 2 shows the result of the alignment when the Alphabet Query was document and the Correct Query was ドキュメント (document, dokyumento). NULLJ and NULLP in Figure 2 represent alignments in the horizontal and vertical directions, respectively.

[D - ド (do)]
[AA1 - NULLJ]
[K - キ (ki)]
[Y - ュ (yu)]
[AH0 - NULLJ]
[M - メ (me)]
[EH0 - NULLJ]
[N - ン (n)]
[T - ト (to)]

Figure 2: The result of the alignment of the phonemes of document and ドキュメント (document, dokyumento)

3.5 Transliteration

The transliteration was carried out using a probabilistic method or machine learning. We compared the following three approaches, which were applied based on the alignments obtained in Section 3.4:

1. BIGRAM: the bigram model
2. HMM: the hidden Markov model
3. CRF: the CRF model

method        | BASE                   | BIGRAM                                      | HMM                                         | CRF
system output | フャブーンク (fabuunku) | ファブリック (faburikku: the correct answer) | ファブリック (faburikku: the correct answer) | フブック (fubukku)
evaluation    | 1                      | 3                                           | 3                                           | 2

Table 2: The system outputs and their evaluations when the input was "fabric" (Alphabet Query)

We used NLTK (http://www.nltk.org/) for BIGRAM and the HMM, and adopted the CRF++ toolkit (http://crfpp.googlecode.com/svn/trunk/doc/index.html) for the CRF. We trained the CRF models with unigram, bigram, and trigram features, where s_i denotes the phoneme at relative position i (a corresponding CRF++ template sketch appears at the end of this section):

Unigram: s-2, s-1, s0, s1, and s2
Bigram: s-1 s0 and s0 s1
Trigram: s-2 s-1 s0, s-1 s0 s1, and s0 s1 s2

We set the parameters to f=50 and c=2. We set f=50 because the kinds of features were highly variable.
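The paper does not show how the pair corpus is fed to GIZA++. A minimal sketch of the bitext preparation for Section 3.3, assuming one query pair per line, phonemes as source tokens, and one Japanese character per target token (the file names and helper are illustrative; GIZA++'s own preprocessing, e.g. plain2snt, would still be run on these files):

```python
# Sketch of GIZA++ input preparation (Section 3.3). File names and the
# helper are illustrative assumptions, not the authors' code.
def write_bitext(pairs, src_path="queries.ph", tgt_path="queries.ja"):
    """Write parallel files: phonemes (source) and Japanese characters
    (target), one query pair per line, tokens separated by spaces."""
    with open(src_path, "w", encoding="utf-8") as src, \
         open(tgt_path, "w", encoding="utf-8") as tgt:
        for phonemes, japanese in pairs:
            src.write(" ".join(phonemes) + "\n")
            # Treat every Japanese character as one target token.
            tgt.write(" ".join(japanese) + "\n")

write_bitext([(["D", "AA1", "K", "Y", "AH0", "M", "EH0", "N", "T"],
               "ドキュメント")])
```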
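Section 3.4 finds a shortest path with Dijkstra's algorithm over the grid of Figure 1. Since every move in that grid goes forward, the same shortest path can be computed with a simple dynamic program; the sketch below assumes a cost function returning -log P(character | phoneme) from the Section 3.3 probabilities, with -log(10^-20) as the floor cost, and is our reconstruction rather than the authors' implementation:

```python
# Sketch of the alignment step (Section 3.4). Diagonal moves align a
# phoneme with a character, horizontal moves align a phoneme with
# NULLJ, and vertical moves align NULLP with a character.
import math

FLOOR = -math.log(1e-20)  # cost when no translation probability exists

def align(phonemes, chars, cost):
    n, m = len(phonemes), len(chars)
    best = [[math.inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if best[i][j] == math.inf:
                continue
            moves = []
            if i < n and j < m:  # phoneme i aligned to character j
                moves.append((i + 1, j + 1, cost(phonemes[i], chars[j])))
            if i < n:            # phoneme i aligned to NULLJ
                moves.append((i + 1, j, FLOOR))
            if j < m:            # NULLP aligned to character j
                moves.append((i, j + 1, FLOOR))
            for ni, nj, c in moves:
                if best[i][j] + c < best[ni][nj]:
                    best[ni][nj] = best[i][j] + c
                    back[ni][nj] = (i, j)
    # Recover the alignment path from the backpointers.
    path, ij = [], (n, m)
    while back[ij[0]][ij[1]] is not None:
        pi, pj = back[ij[0]][ij[1]]
        src = phonemes[pi] if ij[0] > pi else "NULLP"
        tgt = chars[pj] if ij[1] > pj else "NULLJ"
        path.append((src, tgt))
        ij = (pi, pj)
    return list(reversed(path))

demo_cost = lambda p, j: 1.0  # stand-in; use -log P(j|p) in practice
print(align(list("ab"), list("アブ"), demo_cost))
# [('a', 'ア'), ('b', 'ブ')]
```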
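The CRF feature set of Section 3.5 maps directly onto a CRF++ feature template. The template below is our reconstruction, not the authors' file; %x[k,0] denotes the phoneme at relative position k (s_k above). Assuming f and c correspond to CRF++'s -f (feature frequency cutoff) and -c (regularization) options, training would be invoked as `crf_learn -f 50 -c 2 template train.data model`.

```
U00:%x[-2,0]
U01:%x[-1,0]
U02:%x[0,0]
U03:%x[1,0]
U04:%x[2,0]
U05:%x[-1,0]/%x[0,0]
U06:%x[0,0]/%x[1,0]
U07:%x[-2,0]/%x[-1,0]/%x[0,0]
U08:%x[-1,0]/%x[0,0]/%x[1,0]
U09:%x[0,0]/%x[1,0]/%x[2,0]
```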