Modeling Vowels for Arabic BN Transcription

Abdel Messaoudi (visiting scientist from the Vecsys Company), Lori Lamel and Jean-Luc Gauvain
Spoken Language Processing Group, LIMSI-CNRS, BP 133, 91403 Orsay cedex, FRANCE
{abdel,gauvain,lamel}@limsi.fr

ABSTRACT

This paper describes the LIMSI Arabic Broadcast News system, which produces a vowelized word transcription. The system, evaluated in the NIST RT-04F evaluation, uses a 3-pass decoding strategy with gender- and bandwidth-specific acoustic models, a vowelized 65k word-class pronunciation lexicon and a word-class 4-gram language model. In order to explicitly represent the vowelized word forms, each non-vowelized word entry is considered as a word class regrouping all of its associated vowelized forms. Since Arabic texts are almost exclusively written without vowels, an important challenge is to use these efficiently in a system producing a vowelized output. A portion of the acoustic training data was manually transcribed with short vowels, enabling an initial set of acoustic models to be estimated in a supervised manner. The remaining audio data, for which vowels are not annotated, were used for training in an implicit manner, with the recognizer choosing the preferred form. The system was trained on a total of about 150 hours of audio data and almost 600 million words of Arabic texts, and achieved word error rates of 16.0% and 18.5% on the dev04 and eval04 data, respectively.

1. INTRODUCTION

This paper describes some recent work improving our broadcast news transcription system for Modern Standard Arabic as described in [9]. By Modern Standard Arabic we refer to the spoken version of the official written language, which is spoken in much of the Middle East and North Africa, and is used in major broadcast news shows. The Arabic language poses challenges somewhat different from the other languages (mostly Indo-European Germanic or Romance) we have worked with. Modern Standard Arabic is that which is learned in school, is used in most newspapers and is considered to be the official language in most Arabic-speaking countries. In contrast, many people speak dialects for which there is only a spoken form and no recognized written form. Arabic texts are written and read from right to left, and the vowels are generally not indicated. It is a strongly consonantal language with nominally only three vowels, each of which has a long and a short form. Arabic is a highly inflected language, with many different word forms for a given root, produced by appending articles ("the, and, to, from, with, ...") to the word beginning and possessives ("ours, theirs, ...") to the word end. The right-to-left nature of Arabic texts required modification of the text processing utilities. Written texts are by and large non-vowelized, meaning that the short vowels and gemination marks are not indicated. There are typically several possible (generally semantically linked) vowelizations for a given written word, and it is these vowelized forms that are spoken. The word-final vowel varies as a function of the word context, and this final vowel or vowel-/n/ sequence is often not pronounced.

Thus one of the challenges faced when explicitly modeling vowels in Arabic is to obtain vowelized resources, or to develop efficient ways to use non-vowelized data. It is often necessary to understand the text in order to know how to vowelize and pronounce it correctly. We investigate using the Buckwalter Arabic Morphological Analyzer to propose multiple possible vowelized word forms, and use a speech recognizer to automatically select the most appropriate one.

2. ARABIC LANGUAGE RESOURCES

The audio corpus contains about 150 hours of radio and television broadcast news data from a variety of sources including VOA, NTV from the TDT4 corpus, Cairo Radio from FBIS (recorded in 2000 and 2001 and distributed by the LDC), and Radio Elsharq (Syria), Radio Kuwait, Radio Orient (Paris), Radio Qatar, Radio Syria, BBC, Medi1, Aljazeera (Qatar), TV Syria, TV7, and ESC [9].

A portion of the audio data was collected during the period from September 1999 through October 2000, and from April 2001 through the end of 2002 [9]. These data were manually transcribed using an Arabic version of Transcriber [1] and an Arabic keyboard. The manual transcriptions are vowelized, enabling accurate modeling of the short vowels, even though these are not usually present in written texts. This is different from the approach taken by Billa et al. [2], where only the characters in the non-vowelized written form are modeled. Each Arabic character, including short vowel and geminate markers, is transliterated to a single ASCII character. Transcription conventions were developed to provide guidance for marking vowels and dealing with inflections and gemination, as well as to consistently transcribe foreign words, in particular proper names and places, which are quite common in Arabic broadcast news. Foreign words can have a variety of spoken realizations depending upon the speaker's knowledge of the language of origin and how well-known the particular word is to the target audience. These vowelized transcripts contain 580k words, with 50k distinct non-vowelized forms (85k different vowelized forms).

Vowelized transcripts were not available for the TDT4 and FBIS data. Training was based on time-aligned segmented transcripts, shared with us by BBN, which had been derived from the associated closed-captions and commercial transcripts. These transcripts have about 520k words (45k distinct non-vowelized forms).

Combining the two sources of audio transcripts results in a total of 1.1M words, of which 70k (non-vowelized) are distinct.

The written resources consist of almost 600 million words of texts from the Arabic Gigaword corpus (LDC2003T12) and some additional Arabic texts obtained from the Internet. The texts were preprocessed to remove undesirable material (tables, lists, punctuation markers) and transliterated from the original Arabic script form using a slightly extended version of the Buckwalter transliteration (T. Buckwalter, http://www.qamus.org/transliteration.htm) to improve readability.

The texts were then further processed for use in language model training. First the texts were segmented into sentences, and then normalized in order to better approximate a spoken form. Common typographical errors were also corrected. The main normalization steps are similar to those used for processing texts in the other languages [4, 6]. They consist primarily of rules to expand numerical expressions and abbreviations (km, kg, m2), and the treatment of acronyms (A. F. B. → A F B). A frequent problem when processing numbers is the use of an incorrect (but visually very similar) character in place of the comma (20r3 → 20,3). The most frequent errors that were corrected were: a missing Hamza above or below an Alif; missing (or extra) diacritic marks at word ends, below y (e.g. Alif maksoura) or above h (e.g. t marbouta); and missing or erroneous interword spacing, where either two words were glued together or the final letter of a word was glued to the next word. After processing there were a total of 600 million words, of which 2.2M are distinct.
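The following is a minimal sketch of rules of this kind, operating on Latin-transliterated text; the patterns and function name are illustrative stand-ins, not the actual LIMSI text processing utilities:

```python
import re

def normalize_line(text: str) -> str:
    """Illustrative normalization rules of the kind described above,
    shown on Latin transliterations rather than Arabic script."""
    # Spell out acronyms written with periods: "A. F. B." -> "A F B"
    text = re.sub(r'\b([A-Z])\.\s*', r'\1 ', text)
    # Repair a look-alike character used in place of the decimal comma:
    # "20r3" -> "20,3"
    text = re.sub(r'(\d)r(\d)', r'\1,\2', text)
    # Split two words erroneously glued together at a sentence boundary:
    # "growth.End" -> "growth. End"
    text = re.sub(r'([.!?])(\w)', r'\1 \2', text)
    return ' '.join(text.split())

print(normalize_line("A. F. B. reported 20r3 percent growth.End of item"))
# -> "A F B reported 20,3 percent growth. End of item"
```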
3. PRONUNCIATION LEXICON

Letter-to-sound conversion is quite straightforward when starting from vowelized texts. A grapheme-to-phoneme conversion tool was developed using a set of 37 phonemes and three non-linguistic units (silence/noise, hesitation, breath). The phonemes include the 28 Arabic consonants (including the emphatic consonants and the hamza), 3 foreign consonants (/p, v, g/), and 6 vowels (short and long /i/, /a/, /u/).

In a fully expressed vowelized pronunciation lexicon, each vowelized orthographic form of a word is treated as a distinct lexical entry. The example entries for the word "kitaAb" are shown in the top part of Figure 1. An alternative representation uses the non-vowelized orthographic form as the entry, allowing multiple pronunciations, each being associated with a particular written form. Each entry can then be thought of as a word class, containing all observed (or even all possible) vowelized forms of the word.

  Vowelized lexicon
    kitaAb      kitAb
    kitaAba     kitAba
    kitaAbi     kitAbi
    kut~aAbi    kuttAbi
  Non-vowelized lexicon
    ktAb        kitAb=kitaAb
                kitAba=kitaAba
                kitAbi=kitaAbi
                kuttAbi=kut~aAbi
    sbEyn       sabEIna=saboEiyna
                sabEIn=saboEiyn

Figure 1: Example lexical entries for the vowelized and non-vowelized pronunciation lexicons. In the non-vowelized lexicon, the pronunciation is on the left of the equal sign and the vowelized written form on the right.

This latter format is used for the 65k word lexicon, where a pronunciation graph is associated with each word so as to allow for alternate pronunciations. Since multiple vowelized forms are associated with each non-vowelized word entry, the Buckwalter Arabic Morphological Analyzer was used to propose possible forms, which were then manually verified. The morphological analyzer was also applied to words in the vowelized training data in order to propose forms that did not occur in the training data. A subset of the words, mostly proper names and technical terms, were manually vowelized. The 65k vocabulary contains 65539 words and 528,955 phone transcriptions. The OOV rate with the 65k vocabulary ranges from about 3% to 6%, depending upon the test data and reference transcript normalization (see Table 1).

The decoder was modified to handle the new-style lexicon in order to produce the vowelized orthographic form associated with each word hypothesis (instead of the non-vowelized word class).
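As a rough sketch, a word-class entry of the kind shown in Figure 1 can be represented as a map from the non-vowelized form to its (pronunciation, vowelized written form) pairs; the dict layout and helper function below are illustrative, not the actual decoder format:

```python
# Word-class lexicon: non-vowelized entry -> [(pronunciation, vowelized form)].
# The entries are those of Figure 1.
lexicon = {
    "ktAb": [
        ("kitAb",   "kitaAb"),
        ("kitAba",  "kitaAba"),
        ("kitAbi",  "kitaAbi"),
        ("kuttAbi", "kut~aAbi"),
    ],
    "sbEyn": [
        ("sabEIna", "saboEiyna"),
        ("sabEIn",  "saboEiyn"),
    ],
}

def vowelized_output(entry: str, pron: str) -> str:
    """Return the vowelized written form associated with the pronunciation
    the decoder selected for a word-class entry, as the modified decoder
    does when emitting vowelized hypotheses."""
    for p, written in lexicon[entry]:
        if p == pron:
            return written
    raise KeyError(f"{pron} not listed under {entry}")

print(vowelized_output("ktAb", "kitAba"))  # -> kitaAba
```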
4. RECOGNITION SYSTEM OVERVIEW

The LIMSI broadcast news transcription system has two main components, an audio partitioner and a word recognizer. Data partitioning is based on an audio stream mixture model [3, 4], and serves to divide the continuous stream of acoustic data into homogeneous segments, associating cluster, gender and bandwidth labels with each non-overlapping segment. For each speech segment, the word recognizer determines the sequence of words in the segment, associating start and end times and an optional confidence measure with each word. The recognizer makes use of continuous density HMMs for acoustic modeling and n-gram statistics for language modeling. Each context-dependent phone model is a tied-state left-to-right CD-HMM with Gaussian mixture observation densities, where the tied states are obtained by means of a decision tree.

Word recognition is performed in three passes, where each decoding pass generates a word lattice which is expanded with a 4-gram LM. The posterior probabilities of the lattice edges are then estimated using the forward-backward algorithm, and the 4-gram lattice is converted to a confusion network with posterior probabilities by iteratively merging lattice vertices and splitting lattice edges until a linear graph is obtained. This last step gives comparable results to the edge clustering algorithm proposed in [8]. The words with the highest posterior in each confusion set are hypothesized (a small sketch of this selection step is given after the pass descriptions below).

Pass 1: Initial Hypothesis Generation - This step generates initial hypotheses which are then used for cluster-based acoustic model adaptation. This is done via a one-pass (less than 1xRT) cross-word trigram decoding with gender-specific sets of position-dependent triphones (5700 tied states) and a trigram language model (38M trigrams and 15M bigrams). Band-limited acoustic models are used for the telephone speech segments. The trigram lattices are rescored with a 4-gram language model.

Pass 2: Word Graph Generation - Unsupervised acoustic model adaptation is performed for each segment cluster using the MLLR technique [7] with only one regression class. A lattice is generated for each segment using a bigram LM and position-dependent triphones with 11500 tied states (32 Gaussians per state).

Pass 3: Word Graph Rescoring - The word graph generated in pass 2 is rescored after carrying out unsupervised MLLR acoustic model adaptation using two regression classes.
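The selection step can be sketched as follows, assuming the confusion network has already been built from the lattice; the example words, posterior values, and the <eps> empty-word convention are illustrative:

```python
# Each confusion set maps candidate words to posterior probabilities
# (estimated with the forward-backward algorithm over the lattice).
confusion_network = [
    {"kitAb": 0.7, "kitAba": 0.2, "<eps>": 0.1},
    {"jadiyd": 0.6, "jadiyda": 0.4},
]

# Hypothesize the highest-posterior word in each confusion set,
# skipping the empty-word (epsilon) alternative.
hypothesis = []
for conf_set in confusion_network:
    word = max(conf_set, key=conf_set.get)
    if word != "<eps>":
        hypothesis.append(word)

print(" ".join(hypothesis))  # -> "kitAb jadiyd"
```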
Acoustic models

The acoustic models are context-dependent, 3-state left-to-right hidden Markov models with Gaussian mixture observation densities. Two sets of gender-dependent, position-dependent triphones are estimated using MAP adaptation of speaker-independent seed models for wideband and telephone-band speech [5]. The triphone-based context-dependent phone models are word-independent but word position-dependent. The first decoding pass uses a small set of acoustic models with about 5700 contexts and tied states. A larger set of acoustic models, used in the second and third passes, covers about 15800 phone contexts represented with a total of 11500 tied states and 32 Gaussians per state. State tying is carried out via divisive decision tree clustering, constructing one tree for each state position of each phone so as to maximize the likelihood of the training data using single Gaussian state models, penalized by the number of tied states [4]. A set of 152 questions concerns the phone position, the distinctive features (and identities) of the phone, and the neighboring phones.

A set of contrastive acoustic models was trained only on the audio data from the LDC (72 hours of data from VOA, NTV, and Cairo Radio), for which the short vowels were determined automatically. The small set of acoustic models used in the first decoding pass has 5500 contexts and tied states, and the larger set has 12000 contexts and 11500 tied states with 32 Gaussians per state.

The training data were also used to build the Gaussian mixture models with 2048 components, used for acoustic model adaptation in the first decoding pass.

Language models

The word-class n-gram language models were obtained by interpolating [10] backoff n-gram language models trained on subsets of the Arabic Gigaword corpus (LDC2003T12) and some additional Arabic texts obtained from the Internet. Component LMs were trained on the following data sets:

1. Transcriptions of the audio data, 1.1M words
2. Agence France Presse (May94-Dec02), 94M words
3. Al Hayat News Agency (Jan94-Dec01), 139M words
4. Al Nahar News Agency (Jan95-Dec02), 140M words
5. Xinhua News Agency (Jun01-May03), 17M words
6. Addustour (1999-Apr01), 22M words
7. Ahram (1998-Apr01), 39M words
8. Albayan (1998-Apr01), 61M words
9. Alhayat (1998), 18M words
10. Alwatan (1998-2000), 29M words
11. Raya (1998-Apr01), 35M words

The language model interpolation weights were tuned to minimize the perplexity on a set of development shows from November 2003 shared by BBN (a sketch of the interpolation and weight tuning follows below). For the contrast system, the transcriptions of the non-LDC audio data were removed from the language model training corpus, reducing the amount of transcripts to about 520k words.
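The paper does not detail the exact tuning procedure; the sketch below shows linear interpolation with a standard EM re-estimation of the mixture weights, which cannot increase development-set perplexity. The UnigramLM class is a toy stand-in for the real backoff n-gram components:

```python
import math

class UnigramLM:
    """Toy stand-in for a backoff n-gram component LM; the assumed
    interface is prob(word, history)."""
    def __init__(self, counts):
        total = sum(counts.values())
        self.p = {w: c / total for w, c in counts.items()}
    def prob(self, word, history):
        return self.p.get(word, 1e-7)  # tiny floor for unseen words

def interp_prob(word, history, components, weights):
    """P(w | h) under the linearly interpolated model."""
    return sum(wt * lm.prob(word, history)
               for wt, lm in zip(weights, components))

def perplexity(dev_ngrams, components, weights):
    """Perplexity of the interpolated model on (history, word) pairs."""
    ll = sum(math.log(interp_prob(w, h, components, weights))
             for h, w in dev_ngrams)
    return math.exp(-ll / len(dev_ngrams))

def em_step(dev_ngrams, components, weights):
    """One EM update: a component's new weight is its average posterior
    responsibility over the development n-grams."""
    resp = [0.0] * len(components)
    for h, w in dev_ngrams:
        probs = [wt * lm.prob(w, h) for wt, lm in zip(weights, components)]
        total = sum(probs)
        for i, p in enumerate(probs):
            resp[i] += p / total
    return [r / len(dev_ngrams) for r in resp]
```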
Table 1 gives the OOV rates and perplexities with and without normalization of the reference transcripts for the language models used in the Primary and Contrast systems. Normalization of the reference transcripts is seen to have a large effect on the OOV rate.

  Unnormalized    dev03   eval03   dev04   eval04
  %OOV              4.3      7.3     7.8      7.1
  Px Primary      272.4    305.4   416.1    458.1
  Px Contrast     271.7    306.2   422.8    462.9

  Normalized      dev03   eval03   dev04   eval04
  %OOV              3.3      4.0     4.8      6.4
  Px Primary      267.8    307.3   423.8    459.3
  Px Contrast     269.2    308.9   430.9    464.6

Table 1: OOV rates and perplexity on 4 test sets (dev03, eval03, dev04 and eval04) with the Primary and Contrast language models without (top) and with (bottom) normalization of the reference transcripts.
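The OOV measurement behind Table 1 can be sketched as follows; the `normalize` hook and the toy vocabulary are illustrative stand-ins for the reference-transcript normalization and the 65k word list:

```python
def oov_rate(reference_tokens, vocabulary, normalize=None):
    """Percentage of reference tokens not covered by the recognizer
    vocabulary; `normalize` stands in for the reference-transcript
    normalization whose effect Table 1 quantifies."""
    if normalize is not None:
        reference_tokens = [normalize(t) for t in reference_tokens]
    misses = sum(1 for t in reference_tokens if t not in vocabulary)
    return 100.0 * misses / len(reference_tokens)

vocab = {"ktAb", "sbEyn"}          # illustrative two-word "vocabulary"
ref = ["ktAb", "jdyd", "sbEyn"]    # illustrative reference tokens
print(oov_rate(ref, vocab))        # -> 33.33...
```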
5. EXPERIMENTAL RESULTS

Table 2 gives the performance of the Primary and Contrast systems on the NIST RT-03 and RT-04 development and test data sets (www.nist.gov/speech/tests/rt). The RT-03 development data was shared by BBN and consists of four 30-minute broadcasts from January 2001 (2 VOA and 2 NTV). The RT-03 evaluation data comprise one broadcast each from VOA and NTV, dating from February 2001. The RT-04 development data consist of 3 shows broadcast at the end of November 2003 from Al-Jazeera and Dubai TV. The RT-04 evaluation data are from the same sources, but from the month of December.

  Condition          dev03   eval03   dev04   eval04
  Baseline            19.3     24.7    24.4     23.8
  LDC AM              17.7     23.6    24.8        -
  Base+LDC            17.4     23.0    21.9     23.3
  +new word list      17.7     22.0    21.5     23.4
  +mllt, cmllr        16.4     21.6    20.3     21.7
  +gigaword LM        14.7     20.0    18.4     20.6
  +pron               13.2     16.6    16.0     18.5
  Contrast system     13.5     16.4    17.6     20.2

Table 2: Word error rates on the RT-03 and RT-04 dev and eval data sets for different system configurations, using the eval04 glm files distributed by NIST.

The baseline system had acoustic models trained on only the non-LDC audio data, and the language model training made use of about 200M words of newspaper texts, with most of the data coming from the years 1998-2000 and early 2001. With this system, the word error is about 20% for dev03, and 24% for the other data sets. The second entry (LDC AM) gives the word error rates with the acoustic models trained only on the LDC TDT4 and FBIS data. The word error is lower for the dev03 data, which can be attributed to the training and development data being from the same sources. The error rates are somewhat higher on the other test sets. Pooling the audio training data, as done for the primary system acoustic models, gives lower word error rates and also exhibits less variation across the test sets. The remaining entries show the effects of other changes to the system. A new word list was selected using an automatic method that did not necessarily include all words in the audio transcripts. Incorporating MLLT feature normalization and CMLLR resulted in a gain of over 1% absolute on most of the data sets. Finally, the language model and word list were updated using the Gigaword corpus, which also included more recent training texts, and pronunciation probabilities were used during the consensus network decoding stage, resulting in a word error rate of 16.0% on the dev04 data and 18.5% on eval04. This entry corresponds to our primary system submission. The results of the contrast system are shown in the last entry of the table.

6. CONCLUSIONS

This paper has reported on our recent development work on transcribing Modern Standard Arabic broadcast news data. Our acoustic models and lexicon explicitly model short vowels, even though these are removed prior to scoring. In order to be able to make use of non-vowelized audio and textual resources, the recognition lexicon entries are word classes which regroup all derived vowelized forms along with the associated phonetic forms. The resulting 65k word-class vocabulary contains 529k phone transcriptions. The explicit internal representation of vowelized word forms in the lexicon may be useful to provide an automatic (or semi-automatic) method to vowelize transcripts. Successful use of audio data without explicit vowels can reduce the cost of data transcription.

Our previous Arabic broadcast news system [9] had a word error rate of about 24% on the RT-04 dev and eval data. By improving the acoustic and language models, updating the recognizer word list and pronunciation lexicon, and revising the decoding strategy, a relative word error rate reduction of over 30% was achieved. On another set of 14 BN shows from July 2004 (about 6 hours of data from 12 sources), a word error of about 16.5% is obtained.

REFERENCES

[1] C. Barras, E. Geoffrois et al., "Transcriber: development and use of a tool for assisting speech corpora production," Speech Communication, 33(1-2):5-22, Jan 2001.
[2] J. Billa, N. Noamany et al., "Audio Indexing of Arabic Broadcast News," ICASSP'02, 1:5-8, Apr 2002.
[3] J.L. Gauvain, L. Lamel, G. Adda, "Partitioning and Transcription of Broadcast News Data," ICSLP'98, 5:1335-1338, Dec 1998.
[4] J.L. Gauvain, L. Lamel, G. Adda, "The LIMSI Broadcast News Transcription System," Speech Communication, 37(1-2):89-108, May 2002.
[5] J.L. Gauvain, C.H. Lee, "Maximum A Posteriori Estimation for Multivariate Gaussian Mixture Observations of Markov Chains," IEEE Trans. on Speech and Audio Processing, 2(2):291-298, Apr 1994.
[6] L. Lamel, J.L. Gauvain, "Automatic Processing of Broadcast Audio in Multiple Languages," Eusipco'02, Sep 2002.
[7] C.J. Leggetter, P.C. Woodland, "Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models," Computer Speech and Language, 9(2):171-185, 1995.
[8] L. Mangu, E. Brill, A. Stolcke, "Finding Consensus Among Words: Lattice-Based Word Error Minimization," Eurospeech'99, 495-498, Sep 1999.
[9] A. Messaoudi, L. Lamel, J.L. Gauvain, "Transcription of Arabic Broadcast News," ICSLP'04, Oct 2004.
[10] P.C. Woodland, T. Niesler, E. Whittaker, "Language Modeling in the HTK Hub5 LVCSR," presented at the 1998 Hub5E Workshop, Sep 1998.