A Rule-Based Approach for Building an Artificial English-ASL Corpus

Zouhour Tmar, Achraf Othman, and Mohamed Jemni
Research Lab. LaTICE, University of Tunis, Tunisia

Abstract—A serious problem facing researchers in the field of sign language is the absence of a large parallel corpus for sign languages. The ASLG-PC12 project proposes a rule-based approach for building a large parallel corpus between English written texts and American Sign Language Gloss. We present a novel algorithm which transforms an English part-of-speech sentence into ASL gloss. This project was started at the beginning of 2010 as a part of the WebSign project, and it offers today a corpus containing more than one hundred million pairs of sentences between English and ASL gloss. It is available online for free in order to develop and design new algorithms and theories for American Sign Language processing, for example statistical machine translation and any related fields. In this paper, we present the tasks for generating ASL sentences from the Gutenberg Project corpus, which contains only English written texts.

Index Terms—Natural Language Processing, Sign Language, Parallel Corpora.

U.S. Government work not protected by U.S. copyright

I. INTRODUCTION

To develop an automatic translator, or any tool that requires a learning task for sign languages, the major problem is the collection of a parallel corpus between text and Sign Language. A parallel corpus is a large, structured set of texts aligned between source and target languages. Such corpora are used for statistical analysis [1] and hypothesis testing, checking occurrences, or validating linguistic rules on a specific universe. No standard corpus of sufficient size exists [2], yet developing statistical machine translation, without pre-treatment prior to the learning process, requires a large volume of data. In many ways, progress in sign language research is driven by the availability of data. For these reasons, we started to collect pairs of sentences between English and American Sign Language Gloss [3]. Because data is scarce, especially for ASL, while a huge amount of English written text exists, we have developed a corpus based on a collaborative approach in which experts contribute to the collection or correction of the bilingual corpus, or validate the automatic translation. This project [4] was started in 2010 as a part of the WebSign project [5] [6], which develops tools able to make information on the web accessible to deaf people [7] [8]. The main goal of our project is to develop a Web-based interpreter of Sign Language (SL). This tool would enable people who do not know Sign Language to communicate with deaf individuals, and would thereby help reduce the language barrier between deaf and hearing people. Our secondary objective is to distribute this tool on a non-profit basis to educators, students, users, and researchers, and to disseminate a call for contributions to support this project, mainly in its exploitation step, and to encourage its wide use by different communities. In this paper, we review our experiences with constructing one such large annotated parallel corpus between English written text and American Sign Language Gloss: the ASLG-PC12, a corpus consisting of over one hundred million pairs of sentences.

The paper is organized as follows. Section II presents several projects concerning sign language. In section III, we describe the gloss notation system. In section IV, we define methods and pre-processing tasks for collecting data from the Gutenberg Project [9]. We present two stages of pre-processing, in which each sentence is extracted and tokenized. Section V presents our method and algorithms for constructing the second part of the corpus in American Sign Language Gloss. The constructed texts were generated automatically by transformation rules and then corrected by human experts in ASL. We also describe the composition and the size of the corpus. Discussion and conclusion are drawn in section VI.

II. BACKGROUND

Several projects concerned with Sign Language have recorded or annotated their own corpora, but only a few of them are suitable for automatic Sign Language translation, because of the amount of data available for learning and processing. The European Cultural Heritage Online organization (ECHO) published corpora for British Sign Language [10], Swedish Sign Language [11], and the Sign Language of the Netherlands [12]. All of these corpora include several stories signed by a single signer. The American Sign Language Linguistic Research group at Boston University published a corpus in American Sign Language [13]. TV broadcast news for the hearing impaired is another source of sign language recordings. Aachen University published a German Sign Language Corpus of the Domain Weather Report [14], and [15] published a multimedia corpus in Sign Language for machine translation. In the literature, we found many related projects aiming to build corpora for Sign Language. Most of them are based on video recording, and we could not find textual data for building a translation memory. Textual data for Sign Language is not a simple written form, because signs can carry other information such as eye gaze or facial expressions. So, for our corpus, we use glosses to represent Sign Language. In the next section, we present a brief description of glosses.

III. GLOSSING SIGNS

Stokoe [16] proposed the first annotation system for describing Sign Language. Before that, signs were thought of as unanalyzed wholes, with no internal structure. The Stokoe notation system is used for writing American Sign Language using graphical symbols. Afterwards, other notation systems appeared, such as HamNoSys [17] and SignWriting.
Furthermore, glosses are used to write signs in textual form. Glossing means choosing an appropriate English word for each sign in order to write it down. It is not translating, but it is similar to translating. A gloss of a signed story can be a series of English words, written in small capital letters, that correspond to the signs in the ASL story. Some basic conventions used for glossing are as follows:
• Signs are represented with small capital letters in English.
• Lexicalized finger-spelled words are written in small capital letters and preceded by the ’♯’ symbol.
• Full finger-spelling is represented by dashes between small capital letters (for example, A-H-M-E-D).
• Non-manual signals and eye-gaze are represented on a line above the sign glosses.
In this work, we use glosses to represent Sign Language. In the next section, we describe the steps for building our corpus.

IV. PARALLEL CORPUS COLLECTION

A. Collecting data from Gutenberg

Acquisition of a parallel corpus for use in statistical analysis typically takes several pre-processing steps. In our case, there is not enough parallel data between English texts and American Sign Language. We therefore start by collecting only English data from the Gutenberg Project, in order to transform it into ASL gloss. The Gutenberg Project [9] offers over 38K free ebooks and more than 100K ebooks through its partners. Collection proceeds in the following steps:
• Obtain the raw data (by crawling all files in the FTP directory).
• Extract only English texts, because there are ebooks in languages other than English, such as German and Spanish. We found also files containing AND sequences.
• Break the text into sentences (sentence splitting task).
• Prepare the corpora (normalization, tokenization).
In the following, we describe in detail the acquisition of the Gutenberg corpus from the FTP directory. Figure 1 shows the occurrence of Zipf's Law in the Gutenberg corpora of English texts. We found that the words (the, I, and, to, of, a, in, that, was, it) are the ten most used words in the corpora. This metric also determines which words are frequently used in English. In addition, considerable work went into removing non-English texts.

Fig. 1. Occurrence of Zipf's Law in Gutenberg Corpora of English Texts; the top ten words are (the, I, and, to, of, a, in, that, was, it).

B. Sentence splitting, tokenization, chunking and parsing

Sentence splitting and tokenization require specialized tools for English texts. One problem of sentence splitting is the ambiguity of the period "." as either an end-of-sentence marker or a marker for an abbreviation. For English, we semi-automatically created a list of known abbreviations that are typically followed by a period. Issues with tokenization include the English merging of words such as in "can't" (which we transform to "can not"), and the separation of possessive markers ("the man's" becomes "the man 's"). We also use an available tool for splitting called Splitta. Its models are trained on Wall Street Journal news combined with the Brown Corpus, which is intended to be widely representative of written English. Error rates on test news data are near 0.25%. We also use the CoreNLP tool. It is a set of natural language analysis tools which can take raw English language text input and give the base forms of words and their parts of speech.

Fig. 2. An example of transformation: English input 'What did Bobby buy yesterday?'

V. ENGLISH-ASL PARALLEL CORPUS

A. Problematic

As we said at the beginning, the main problem in processing American Sign Language for statistical analysis, such as statistical machine translation, is the absence of data (corpora), especially in gloss format. By convention, the meaning of a sign is written using the corresponding word of the spoken language, to avoid complexity in understanding. For example, the phrase "Do you like learning sign language?" is glossed as "LEARN SIGN YOU LIKE?".
Here, the word "you" is replaced by the gloss "YOU" and the word "learning" is rendered as "LEARN". Our machine translation system must generate, after the learning step, the gloss sentence for an English input.

B. Ascertainment and approach

Generally, in research on statistical analysis of sign language, the corpus consists of annotated video sequences. In our case, we only need a bilingual corpus: the source language is English and the target language is American Sign Language transcribed in glosses. In this study, we started from 880 words (English and ASL glosses) coupled with transformation rules. From these rules, we generated a bilingual corpus containing 800 million words. In this corpus, we are not concerned with semantics or with the types of sign language verbs, such as "agreement" or "non-agreement" verbs. Figure 2 shows an example of transformation between a written English text and its generated sentence in ASL. The input is "What did Bobby buy yesterday ?" and the target sentence is "BOBBY BUY WHAT YESTERDAY ?". In this example, we keep the word "YESTERDAY"; some references use "PAST" instead, which indicates the past tense and that the action took place in the past. Also, the symbol "?" can be replaced by a facial animation accompanying "WHAT". Our approach is based on lemmatization of words. We keep the maximum of information in the sentence, in order to support the development of further approaches on these corpora.

Fig. 3. Steps for building ASL corpora

Fig. 4. Size of the American Sign Language Gloss Parallel Corpus 2012 (ASLG-PC12)

Statistics of the corpora are shown in figure 4. The number of sentences and tokens is huge, and building the ASL corpus takes more than one week. All parts are available to download online for free [4]. Using the transformation rules in figure 5, we build the ASL corpora following the steps shown in figure 3. The input of the system is English sentences and the output is the ASL transcription in gloss. In figure 5, only simple rules are shown; complex rules can be defined starting from these simple rules.
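As an illustration only, the pipeline of figure 3 can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the ASLG-PC12 implementation: the rule format (a tuple of source positions keyed by the part-of-speech sentence), the names `Token`, `SENTENCE_RULES`, `LEMMA_RULES`, and `to_gloss`, and the hand-assigned Penn Treebank tags are all hypothetical. The sketch looks up a rule for the whole part-of-speech sentence, falls back to a per-lemma transformation when no such rule exists, and uppercases the result.

```python
from typing import NamedTuple


class Token(NamedTuple):
    """One analysed token; in the paper this information comes from CoreNLP."""
    word: str
    lemma: str
    pos: str


# Hypothetical sentence-level rule database: a part-of-speech sentence maps to
# the source positions to emit, in target order. Omitted positions are dropped.
SENTENCE_RULES = {
    # "What did Bobby buy yesterday ?" -> "BOBBY BUY WHAT YESTERDAY ?"
    ("WP", "VBD", "NNP", "VB", "NN", "."): (2, 3, 0, 4, 5),
}

# Hypothetical lemma-level rules: None means the English word is ignored.
LEMMA_RULES = {"the": None, "a": None, "an": None, "in": None}


def to_gloss(tokens):
    """Transform an analysed English sentence into a gloss string."""
    pos_sentence = tuple(t.pos for t in tokens)
    order = SENTENCE_RULES.get(pos_sentence)
    if order is None:
        # No sentence-level rule: keep the source order, transform each lemma.
        order = range(len(tokens))
    glosses = []
    for i in order:
        lemma = tokens[i].lemma
        gloss = LEMMA_RULES.get(lemma.lower(), lemma)
        if gloss is not None:
            glosses.append(gloss.upper())  # last step: uppercase the output
    return " ".join(glosses)


question = [
    Token("What", "what", "WP"), Token("did", "do", "VBD"),
    Token("Bobby", "Bobby", "NNP"), Token("buy", "buy", "VB"),
    Token("yesterday", "yesterday", "NN"), Token("?", "?", "."),
]
print(to_gloss(question))
```

The fallback branch only drops ignorable words and uppercases lemmas, so its output is a rough gloss that would still need validation by an ASL expert.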
We can define a part-of-speech sentence for the two languages. According to figure 3, when the rule for the part-of-speech sentence S exists in the database, the check returns true and we apply the transformation directly. Of course, all complex rules must be written by experts in ASL. Figure 5 shows some transformations from English sentences to American Sign Language. We present the transformation rules made by an expert in linguistics.

Fig. 5. Example of full sentences transformation rules

In figure 3, we describe the steps to transform an English sentence into American Sign Language gloss. The input of the system is the English sentence. Using the CoreNLP tool, we generate an XML file containing morphological information about the sentence after the tokenization task. Then, we build the part-of-speech sentence and, thanks to the transformation rules database, we try to transform the input for each lemma. In some cases, the part-of-speech sentence does not exist in the database, so we transform each lemma. The transformation rule for lemmas is presented in figure 5. In the last step, we apply an uppercase script to transform the output. The transformation rule is not a direct transformation of each lemma: it can be an alignment of words and can ignore some English words like (the, in, a, an, ...).

C. Transformation Rules

Not all transformation rules used to transform the English data were verified by experts in linguistics. We validated only 800 rules and transformation rules for lemmas. We cannot validate all rules because there exists an infinite number of rules. For this reason, we developed an application that lets experts enter their rules from an English sentence, without coding. The application is a simple user interface which contains the lemma transformation rules, and in which the expert composes lemmas. After that, the expert saves the result and rebuilds the corpora. The built corpus is thus made by a collaborative approach and validated by experts.

D. Releases of the English-ASL Corpus

The initial release of this corpus consisted of data up to September 2011. The second release added data up to January 2012, increasing the size from just over 800 sentences to up to 800 million words in English. A forthcoming third release will include data up to early 2013 and will have better tokenization and more words in American Sign Language. For more details, please check the website [4].

VI. DISCUSSION AND CONCLUSION

We described the construction of the English-American Sign Language corpus. We illustrated a novel method for transforming an English written text into American Sign Language gloss. This corpus will be useful for statistical analysis of ASL or any related fields [18] [19]. We present the first corpus for ASL gloss that exceeds one hundred million sentences, available to all researchers and linguists. During the next phase of the ASLG-PC12 project, we expect to provide both a richer analysis of the existing corpus and other parallel corpora (for example French Sign Language and Arabic Sign Language). This will be done by first enriching the rules through experts. Enrichment will be achieved by automatically transforming the current transformation rules database, and then validating the results by hand.

REFERENCES

[1] A. Othman and M. Jemni, "Statistical sign language machine translation: from English written text to American Sign Language gloss," in IJCSI International Journal of Computer Science Issues, Vol. 8, Issue 5, No 3, 2011.
[2] M. Sara and W. Andy, "Joining hands: Developing a sign language machine translation system with and for the deaf community," in Proceedings of the Conference and Workshop on Assistive Technologies for People with Vision and Hearing Impairments (CVHI-2007), Granada, Spain, 28th - 31th August, 2007, vol. 415, 2007.
[3] A. Othman and M. Jemni, "English-ASL gloss parallel corpus 2012: ASLG-PC12," in 5th Workshop on the Representation and Processing of Sign Languages: Interactions between Corpus and Lexicon, Istanbul, Turkey, 2012.
[4] A. Othman, "American sign language gloss parallel corpus 2012 (ASLG-PC12)," 2012. [Online]. Available: http://www.achrafothman.net/aslsmt/
[5] M. Jemni and O. E. Ghoul, "Towards web-based automatic interpretation of written text to sign language," in First International conference on ICT and Accessibility (ICTA-2007), Hammamet, Tunisia, April 2007.
[6] ——, "An avatar based approach for automatic interpretation of text to sign language," in 9th European Conference for the Advancement of the Assistive Technologies in Europe, San Sebastian (Spain), 3-5 October, 2007.
[7] ——, "A system to make signs using collaborative approach," in ICCHP, Lecture Notes in Computer Science, Springer Berlin / Heidelberg, pp. 670-677, vol. 5105, Linz, Austria, 2008.
[8] M. Jemni, O. E. Ghoul, and N. B. Yahya, "Sign language MMS to make cell phones accessible to deaf and hard-of-hearing community," in CVHI, Euro-Assist Conference and Workshop on Assistive technology for people with Vision and Hearing impairments, Granada, Spain, 2007.
[9] Project, "Gutenberg," 2012. [Online]. Available: http://www.gutenberg.org/
[10] B. Woll, R. Sutton-Spence, and D. Waters, "ECHO data set for British Sign Language (BSL)," in Department of Language and Communication Science, City University (London), 2004.
[11] B. Bergman and J. Mesch, "ECHO data set for Swedish Sign Language (SSL)," in Department of Linguistics, University of Stockholm, 2004.
[12] O. Crasborn, E. Kooij, A. Nonhebel, and W. Emmerik, "ECHO data set for Sign Language of the Netherlands (NGT)," in Department of Linguistics, Radboud University Nijmegen, 2004.
[13] V. Athitsos, C. Neidle, S. Sclaroff, J. Nash, R. Stefan, A. Thangali, H. Wang, and Q. Yuan, "Large lexicon project: American Sign Language video corpus and sign language indexing/retrieval algorithms," in Proceedings of the 4th Workshop on the Representation and Processing of Sign Languages: Corpora and Sign Language Technologies, LREC, 2010.
[14] J. Bungeroth, D. Stein, M. Zahedi, and H. Ney, "A German sign language corpus of the domain weather report," in 5th international, 2006, p. 29.
[15] S. Morrissey, H. Somers, R. Smith, S. Gilchrist, and I. D, "4th workshop on the representation and processing of sign languages: Corpora and sign language technologies building a sign language corpus for use in machine translation," 2010.
[16] W. C. Stokoe, "10th anniversary classics sign language structure: An outline of the visual communication systems of the American deaf," 1960.
[17] S. Prillwitz and H. Zienert, "Hamburg notation system for sign language: Development of a sign writing with computer application," in Current trends in European sign language research: Proceedings of the 3rd European Congress on Sign Language Research, 1990.
[18] M. Boulares and M. Jemni, "Mobile sign language translation system for deaf community," in W4A ACM, 9th International Cross-Disciplinary Conference on Web Accessibility, Lyon, France, 2012.
[19] K. Jaballah and M. Jemni, "Toward automatic sign language recognition from web3d based scenes," in ICCHP, Lecture Notes in Computer Science, Springer Berlin / Heidelberg, pp. 205-212, vol. 6190, Vienna, Austria, 2010.