Modeling Vowels for Arabic BN Transcription

Abdel Messaoudi,∗ Lori Lamel and Jean-Luc Gauvain
Spoken Language Processing Group
LIMSI-CNRS, BP 133
91403 Orsay cedex, FRANCE
{abdel,gauvain,lamel}@limsi.fr
ABSTRACT

This paper describes the LIMSI Arabic Broadcast News system, which produces a vowelized word transcription. The under-10x system, evaluated in the NIST RT-04F evaluation, uses a 3-pass decoding strategy with gender- and bandwidth-specific acoustic models, a vowelized 65k word-class pronunciation lexicon and a word-class 4-gram language model. In order to explicitly represent the vowelized word forms, each non-vowelized word entry is considered as a word class regrouping all of its associated vowelized forms.

Since Arabic texts are almost exclusively written without vowels, an important challenge is to be able to use these efficiently in a system producing a vowelized output. A portion of the acoustic training data was manually transcribed with short vowels, enabling an initial set of acoustic models to be estimated in a supervised manner. The remaining audio data, for which vowels are not annotated, were trained in an implicit manner using the recognizer to choose the preferred form. The system was trained on a total of about 150 hours of audio data and almost 600 million words of Arabic texts, and achieved word error rates of 16.0% and 18.5% on the dev04 and eval04 data, respectively.
1. INTRODUCTION

This paper describes some recent work improving our broadcast news transcription system for Modern Standard Arabic as described in [9]. By Modern Standard Arabic we refer to the spoken version of the official written language, which is spoken in much of the Middle East and North Africa, and is used in major broadcast news shows. The Arabic language poses challenges somewhat different from the other languages (mostly Indo-European Germanic or Romance) we have worked with. Modern Standard Arabic is that which is learned in school, used in most newspapers and is considered to be the official language in most Arabic-speaking countries. In contrast, many people speak in dialects for which there is only a spoken form and no recognized written form. Arabic texts are written and read from right to left and the vowels are generally not indicated. It is a strongly consonantal language with nominally only three vowels, each of which has a long and short form. Arabic is a highly inflected language, with many different word forms for a given root, produced by appending articles ("the, and, to, from, with, ...") to the word beginning and possessives ("ours, theirs, ...") on the word end. The right-to-left nature of the Arabic texts required modification to the text processing utilities. Written texts are by and large non-vowelized, meaning that the short vowels and gemination marks are not indicated. There are typically several possible (generally semantically linked) vowelizations for a given written word, which are spoken. The word-final vowel varies as a function of the word context, and this final vowel or vowel-/n/ sequence is often not pronounced.

Thus one of the challenges faced when explicitly modeling vowels in Arabic is to obtain vowelized resources, or to develop efficient ways to use non-vowelized data. It is often necessary to understand the text in order to know how to vowelize and pronounce it correctly. We investigate using the Buckwalter Arabic Morphological Analyzer to propose multiple possible vowelized word forms, and use a speech recognizer to automatically select the most appropriate one.

∗Visiting scientist from the Vecsys Company.

2. ARABIC LANGUAGE RESOURCES

The audio corpus contains about 150 hours of radio and television broadcast news data from a variety of sources including VOA, NTV from the TDT4 corpus, Cairo Radio from FBIS (recorded in 2000 and 2001 and distributed by the LDC), and Radio Elsharq (Syria), Radio Kuwait, Radio Orient (Paris), Radio Qatar, Radio Syria, BBC, Medi1, Aljazeera (Qatar), TV Syria, TV7, and ESC [9].

A portion of the audio data were collected during the period from September 1999 through October 2000, and from April 2001 through the end of 2002 [9]. These data were manually transcribed using an Arabic version of Transcriber [1] and an Arabic keyboard. The manual transcriptions are vowelized, enabling accurate modeling of the short vowels, even though these are not usually present in written texts. This is different from the approach taken by Billa et al. [2] where only characters in the non-vowelized written form are modeled. Each Arabic character, including short vowel and geminate markers, is transliterated to a single ASCII character.
Transcription conventions were developed to provide guidance for marking vowels and dealing with inflections and gemination, as well as to consistently transcribe foreign words, in particular for proper names and places, which are quite common in Arabic broadcast news. The foreign words can have a variety of spoken realizations depending upon the speaker's knowledge of the language of origin and how well-known the particular word is to the target audience. These vowelized transcripts contain 580k words, with 50k distinct non-vowelized forms (85k different vowelized forms).

Vowelized transcripts were not available for the TDT4 and FBIS data. Training was based on time-aligned segmented transcripts, shared with us by BBN, which had been derived from the associated closed-captions and commercial transcripts. These transcripts have about 520k words (45k distinct non-vowelized forms).
Combining the two sources of audio transcripts results in a total of 1.1M words, of which 70k (non-vowelized) are distinct.

The written resources consist of almost 600 million words of texts from the Arabic Gigaword corpus (LDC2003T12) and some additional Arabic texts obtained from the Internet. The texts were preprocessed to remove undesirable material (tables, lists, punctuation markers) and transliterated from the original Arabic script form, using a slightly extended version of the Buckwalter transliteration (T. Buckwalter, http://www.qamus.org/transliteration.htm), to improve readability.

The texts were then further processed for use in language model training. First the texts were segmented into sentences, and then normalized in order to better approximate a spoken form. Common typographical errors were also corrected. The main normalization steps are similar to those used for processing texts in the other languages [4, 6]. They consist primarily of rules to expand numerical expressions and abbreviations (km, kg, m2), and the treatment of acronyms (A. F. B. → A F B). A frequent problem when processing numbers is the use of an incorrect (but very similar) character in place of the comma (20r3 → 20,3). The most frequent errors that were corrected were: a missing Hamza above or below an Alif; missing (or extra) diacritic marks at word ends: below y (e.g. Alif maksoura), above h (e.g. t marbouta); and missing or erroneous inter-word spacing, where either two words were glued together or the final letter of a word was glued to the next word. After processing there were a total of 600 million words, of which 2.2M are distinct.
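To make the normalization steps concrete, the sketch below applies two of the rules described above to transliterated plain text. The regular expressions are illustrative stand-ins, not the actual LIMSI rules; the hamza, diacritic and word-spacing repairs would additionally require character-level rewrite rules and a lexicon.

    # Illustrative sketch of two normalization rules described above.
    # These regexes are examples only, not the rules actually used.
    import re

    def normalize(line):
        # Expand spelled acronyms: "A. F. B." -> "A F B" (naive: any
        # capital letter followed by a period; a real system is stricter).
        line = re.sub(r'\b([A-Z])\.\s*', r'\1 ', line)
        # Repair the frequent decimal-separator confusion: "20r3" -> "20,3".
        line = re.sub(r'(?<=\d)r(?=\d)', ',', line)
        # Glued-word splitting and diacritic repair would need a lexicon.
        return ' '.join(line.split())

    print(normalize("A. F. B. reported 20r3 km"))  # -> "A F B reported 20,3 km"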
3. PRONUNCIATION LEXICON

Letter-to-sound conversion is quite straightforward when starting from vowelized texts. A grapheme-to-phoneme conversion tool was developed using a set of 37 phonemes and three non-linguistic units (silence/noise, hesitation, breath). The phonemes include the 28 Arabic consonants (including the emphatic consonants and the hamza), 3 foreign consonants (/p,v,g/), and 6 vowels (short and long /i/, /a/, /u/).
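As an illustration of why the conversion is straightforward on vowelized forms, the toy converter below maps a vowelized transliterated word to a phoneme string, under a simplified reading of the (extended) Buckwalter conventions visible in Figure 1: a short vowel followed by its lengthening letter yields the long vowel, and the tilde geminates the preceding consonant. The real tool covers the full 37-phoneme inventory and many more rules.

    # Toy grapheme-to-phoneme converter for vowelized transliterated words.
    # Simplified reading of the conventions in Figure 1: "aA" -> long /a/,
    # "iy" -> long /i/, "uw" -> long /u/, "~" doubles the previous phone.
    LONG = {('a', 'A'): 'aa', ('i', 'y'): 'ii', ('u', 'w'): 'uu'}

    def g2p(word):
        phones, i = [], 0
        while i < len(word):
            c = word[i]
            if c == '~':                       # gemination marker (shadda)
                phones.append(phones[-1])      # double the previous phone
                i += 1
            elif i + 1 < len(word) and (c, word[i + 1]) in LONG:
                phones.append(LONG[(c, word[i + 1])])
                i += 2                         # consume the lengthening letter
            else:
                phones.append(c)               # consonants and short vowels
                i += 1
        return phones

    print(g2p('kitaAb'))    # ['k', 'i', 't', 'aa', 'b']
    print(g2p('kut~aAb'))   # ['k', 'u', 't', 't', 'aa', 'b']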
In a fully expressed vowelized pronunciation lexicon, each vowelized orthographic form of a word is treated as a distinct lexical entry. The example entries for the word "kitaAb" are shown in the top part of Figure 1. An alternative representation uses the non-vowelized orthographic form as the entry, allowing multiple pronunciations, each being associated with a particular written form. Each entry can be thought of as a word class, containing all observed (or even all possible) vowelized forms of the word. The pronunciation is on the left of the equal sign and the vowelized written form is on the right. This latter format is used for the 65k word lexicon, where a pronunciation graph is associated with each word so as to allow for alternate pronunciations. Since multiple vowelized forms are associated with each non-vowelized word entry, the Buckwalter Arabic Morphological Analyzer was used to propose possible forms that were then manually verified. The morphological analyzer was also applied to words in the vowelized training data in order to propose forms that did not occur in the training data. A subset of the words, mostly proper names and technical terms, were manually vowelized. The 65k vocabulary contains 65539 words and 528,955 phone transcriptions. The OOV rate with the 65k vocabulary ranges from about 3% to 6%, depending upon the test data and reference transcript normalization (see Table 1).

    Vowelized lexicon
      kitaAb       kitAb
      kitaAba      kitAba
      kitaAbi      kitAbi
      kut~aAbi     kuttAbi
    Non-vowelized lexicon
      ktAb         kitAb=kitaAb
                   kitAba=kitaAba
                   kitAbi=kitaAbi
                   kuttAbi=kut~aAbi
      sbEyn        sabEIna=saboEiyna
                   sabEIn=saboEiyn

Figure 1: Example lexical entries for the vowelized and non-vowelized pronunciation lexicons. In the non-vowelized lexicon, the pronunciation is on the left of the equal sign and the written form on the right.

The decoder was modified to handle the new-style lexicon in order to produce the vowelized orthographic form associated with each word hypothesis (instead of the non-vowelized word class).
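The word-class format can be sketched as a mapping from each non-vowelized entry to its (pronunciation, vowelized written form) pairs. The entries below are those of Figure 1, and the helper function mimics what the modified decoder does when it emits the vowelized orthography attached to the winning pronunciation; the function and variable names are ours, for illustration only.

    # Sketch of the word-class lexicon of Figure 1: each non-vowelized
    # entry groups (pronunciation, vowelized written form) pairs. The
    # full lexicon has 65k classes and 529k forms.
    LEXICON = {
        'ktAb': [('kitAb', 'kitaAb'), ('kitAba', 'kitaAba'),
                 ('kitAbi', 'kitaAbi'), ('kuttAbi', 'kut~aAbi')],
        'sbEyn': [('sabEIna', 'saboEiyna'), ('sabEIn', 'saboEiyn')],
    }

    def vowelized_output(word_class, chosen_pron):
        # Map a hypothesized word class plus the pronunciation selected
        # by the recognizer back to the vowelized orthographic form.
        for pron, written in LEXICON[word_class]:
            if pron == chosen_pron:
                return written
        raise KeyError(chosen_pron)

    print(vowelized_output('ktAb', 'kuttAbi'))  # -> 'kut~aAbi'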
4. RECOGNITION SYSTEM OVERVIEW

The LIMSI broadcast news transcription system has two main components, an audio partitioner and a word recognizer. Data partitioning is based on an audio stream mixture model [3, 4], and serves to divide the continuous stream of acoustic data into homogeneous segments, associating cluster, gender and bandwidth labels with each non-overlapping segment. For each speech segment, the word recognizer determines the sequence of words in the segment, associating start and end times and an optional confidence measure with each word. The recognizer makes use of continuous density HMMs for acoustic modeling and n-gram statistics for language modeling. Each context-dependent phone model is a tied-state left-to-right CD-HMM with Gaussian mixture observation densities, where the tied states are obtained by means of a decision tree.
Word recognition is performed in three passes, where each decoding pass generates a word lattice which is expanded with a 4-gram LM. Then the posterior probabilities of the lattice edges are estimated using the forward-backward algorithm, and the 4-gram lattice is converted to a confusion network with posterior probabilities by iteratively merging lattice vertices and splitting lattice edges until a linear graph is obtained. This last step gives comparable results to the edge clustering algorithm proposed in [8]. The words with the highest posterior in each confusion set are hypothesized.
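The last step, reading the hypothesis off the linear graph, amounts to keeping the highest-posterior word in each confusion set. A minimal sketch, assuming each confusion set is represented as a map from candidate words (with '-' standing for the null arc) to their posteriors; the lattice merging and splitting itself is not shown.

    # Consensus decoding over a confusion network: keep the best word of
    # each confusion set, skipping null (deletion) arcs.
    def consensus(confusion_network):
        hyp = []
        for conf_set in confusion_network:
            best = max(conf_set, key=conf_set.get)
            if best != '-':
                hyp.append(best)
        return hyp

    cn = [{'kitaAb': 0.7, 'kitaAbi': 0.3},
          {'-': 0.6, 'Al': 0.4}]
    print(consensus(cn))  # -> ['kitaAb']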
Pass 1: Initial Hypothesis Generation - This step generates initial hypotheses which are then used for cluster-based acoustic model adaptation. This is done via one pass (less than 1xRT) cross-word trigram decoding with gender-specific sets of position-dependent triphones (5700 tied states) and a trigram language model (38M trigrams and 15M bigrams). Band-limited acoustic models are used for the telephone speech segments. The trigram lattices are rescored with a 4-gram language model.

Pass 2: Word Graph Generation - Unsupervised acoustic model adaptation is performed for each segment cluster using the MLLR technique [7] with only one regression class. The lattice is generated for each segment using a bigram LM and position-dependent triphones with 11500 tied states (32 Gaussians per state).

Pass 3: Word Graph Rescoring - The word graph generated in pass 2 is rescored after carrying out unsupervised MLLR acoustic model adaptation using two regression classes.
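With a single regression class, MLLR applies one shared affine transform to all Gaussian means. The sketch below only applies a given transform; estimating the transform from the segment's adaptation statistics, which is the actual MLLR computation [7], is omitted, and the dimensions are illustrative.

    # Applying an (already estimated) MLLR transform to Gaussian means:
    # mu' = A @ mu + b, shared by all Gaussians in one regression class [7].
    import numpy as np

    def adapt_means(means, A, b):
        # means: (n_gauss, dim); A: (dim, dim); b: (dim,)
        return means @ A.T + b

    dim = 39                              # e.g. a 39-dim cepstral feature
    means = np.random.randn(1000, dim)    # stand-in for the model's means
    A, b = np.eye(dim), np.zeros(dim)     # identity transform: no adaptation
    assert np.allclose(adapt_means(means, A, b), means)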
Acoustic models

The acoustic models are context-dependent, 3-state left-to-right hidden Markov models with Gaussian mixture observation densities. Two sets of gender-dependent, position-dependent triphones are estimated using MAP adaptation of SI seed models for wideband and telephone band speech [5]. The triphone-based context-dependent phone models are word-independent but word position-dependent. The first decoding pass uses a small set of acoustic models with about 5700 contexts and tied states. A larger set of acoustic models, used in the second and third passes, covers about 15800 phone contexts represented with a total of 11500 states, and 32 Gaussians per state. State-tying is carried out via divisive decision tree clustering, constructing one tree for each state position of each phone so as to maximize the likelihood of the training data using single Gaussian state models, penalized by the number of tied states [4]. A set of 152 questions concern the phone position, the distinctive features (and identities) of the phone and the neighboring phones.

A set of contrastive acoustic models were trained only on the audio data from LDC (72 hours of data from VOA, NTV, and Cairo Radio), for which the short vowels were determined automatically. The small set of acoustic models used in the first decoding pass has 5500 contexts and tied states, and the larger set has 12000 contexts and 11500 tied states with 32 Gaussians per state.

The training data were also used to build the Gaussian mixture models with 2048 components, used for acoustic model adaptation in the first decoding pass.
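For a Gaussian mean, MAP adaptation of the SI seed models [5] reduces to an interpolation between the seed (prior) mean and the sample mean of the adaptation frames, weighted by a prior count. A minimal sketch of this standard update follows; the prior count value and dimensions are arbitrary illustrations.

    # MAP re-estimation of a Gaussian mean [5]: with little data the seed
    # mean dominates; with much data the estimate approaches the ML
    # sample mean of the frames aligned to the state.
    import numpy as np

    def map_mean(prior_mean, frames, tau=10.0):
        n = len(frames)
        sample_mean = frames.mean(axis=0) if n else prior_mean
        return (tau * prior_mean + n * sample_mean) / (tau + n)

    prior = np.zeros(39)
    frames = np.random.randn(500, 39) + 1.0   # adaptation data, shifted mean
    print(map_mean(prior, frames)[:3])        # pulled toward ~1.0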
Language models

The word-class n-gram language models were obtained by interpolation [10] of backoff n-gram language models trained on subsets of the Arabic Gigaword corpus (LDC2003T12) and some additional Arabic texts obtained from the Internet. Component LMs were trained on the following data sets:

1. Transcriptions of the audio data, 1.1M words
2. Agence France Presse (May94-Dec02), 94M words
3. Al Hayat News Agency (Jan94-Dec01), 139M words
4. Al Nahar News Agency (Jan95-Dec02), 140M words
5. Xinhua News Agency (Jun01-May03), 17M words
6. Addustour (1999-Apr01), 22M words
7. Ahram (1998-Apr01), 39M words
8. Albayan (1998-Apr01), 61M words
9. Alhayat (1998), 18M words
10. Alwatan (1998-2000), 29M words
11. Raya (1998-Apr01), 35M words

The language model interpolation weights were tuned to minimize the perplexity on a set of development shows from November 2003 shared by BBN. For the contrast system, the transcriptions of the non-LDC audio data were removed from the language model training corpus, reducing the amount of transcripts to about 520k words. Table 1 gives the OOV rates and perplexities with and without normalization of the reference transcripts for the language models used in the Primary and Contrast systems. Normalization of the reference transcripts is seen to have a large effect on the OOV rate.
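The interpolation combines the component LMs as p(w|h) = sum_i lambda_i p_i(w|h), with the weights tuned on development data. The sketch below uses the standard EM re-estimation of interpolation weights, which minimizes dev-set perplexity; the per-token component probabilities are dummy values, and this is not necessarily the exact tuning procedure used here.

    # EM tuning of LM interpolation weights on held-out data [10]-style:
    # p[i][t] is the probability component LM i assigns to dev token t.
    import math

    def tune_weights(p, iters=50):
        k, n = len(p), len(p[0])
        lam = [1.0 / k] * k
        for _ in range(iters):
            resp = [0.0] * k
            for t in range(n):
                mix = sum(lam[i] * p[i][t] for i in range(k))
                for i in range(k):
                    resp[i] += lam[i] * p[i][t] / mix  # responsibility
            lam = [r / n for r in resp]                # re-estimate weights
        return lam

    def perplexity(lam, p):
        n = len(p[0])
        ll = sum(math.log(sum(l * pi[t] for l, pi in zip(lam, p)))
                 for t in range(n))
        return math.exp(-ll / n)

    p = [[0.10, 0.20, 0.05], [0.30, 0.02, 0.08]]   # 2 LMs, 3 dev tokens
    lam = tune_weights(p)
    print(lam, perplexity(lam, p))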
5. EXPERIMENTAL RESULTS

Table 2 gives the performance of the Primary and Contrast systems on the NIST RT-03 and RT-04 development and test data sets (www.nist.gov/speech/tests/rt). The RT-03 development data was shared by BBN, and consists of four 30-minute broadcasts from January 2001 (2 VOA and 2 NTV). The RT-03 evaluation data are comprised of one broadcast each from VOA and NTV, dating from February 2001. The RT-04 development data consist of 3 shows broadcast at the end of November 2003 from Al-Jazeera and Dubai TV. The RT-04 evaluation data are from the same sources, but from the month of December.

    Unnormalized    dev03    eval03    dev04    eval04
    %OOV              4.3       7.3      7.8       7.1
    Px Primary      272.4     305.4    416.1     458.1
    Px Contrast     271.7     306.2    422.8     462.9
    Normalized      dev03    eval03    dev04    eval04
    %OOV              3.3       4.0      4.8       6.4
    Px Primary      267.8     307.3    423.8     459.3
    Px Contrast     269.2     308.9    430.9     464.6

Table 1: OOV rates and perplexity on 4 test sets (dev03, eval03, dev04 and eval04) with the Primary and Contrast language models, without (top) and with (bottom) normalization of the reference transcripts.
    Condition           dev03    eval03    dev04    eval04
    Baseline             19.3      24.7     24.4      23.8
    LDC AM               17.7      23.6     24.8        -
    Base+LDC             17.4      23.0     21.9      23.3
    +new word list       17.7      22.0     21.5      23.4
    +mllt, cmllr         16.4      21.6     20.3      21.7
    +gigaword LM         14.7      20.0     18.4      20.6
    +pron                13.2      16.6     16.0      18.5
    Contrast system      13.5      16.4     17.6      20.2

Table 2: Word error rates on the RT-03 and RT-04 dev and eval data sets for different system configurations, using the eval04 glm files distributed by NIST.
The baseline system had acoustic models trained on only the non-LDC audio data, and the language model training made use of about 200M words of newspaper texts, with most of the data coming from the years 1998-2000 and early 2001. With this system, the word error is about 20% for dev03, and 24% for the other data sets. The second entry (LDC AM) gives the word error rates with the acoustic models trained only on the LDC TDT4 and FBIS data. The word error is lower for the dev03 data, which can be attributed to the training and development data being from the same sources. The error rates are somewhat higher on the other test sets. Pooling the audio training data, as done for the primary system acoustic models, gives lower word error rates, and also exhibits less variation across the test sets. The remaining entries show the effects of other changes to the system. A new word list was selected using an automatic method that did not necessarily include all words in the audio transcripts. Incorporating MLLT feature normalization and CMLLR resulted in a gain of over 1% absolute on most of the data sets. Finally, the language model and word list were updated using the Gigaword corpus, which also included more recent training texts, and pronunciation probabilities were used during the consensus network decoding stage, resulting in a word error rate of 16.0% on the dev04 data and 18.5% on eval04. This entry corresponds to our primary system submission. The results of the contrast system are shown in the last entry of the table.

6. CONCLUSIONS

This paper has reported on our recent development work on transcribing Modern Standard Arabic broadcast news data. Our acoustic models and lexicon explicitly model short vowels, even though these are removed prior to scoring. In order to be able to make use of non-vowelized audio and textual resources, the recognition lexicon entries are word classes which regroup all derived vowelized forms along with the associated phonetic forms. The resulting 65k word-class vocabulary contains 529k phone transcriptions. The explicit internal representation of vowelized word forms in the lexicon may be useful to provide an automatic (or semi-automatic) method to vowelize transcripts. Successful use of audio data without explicit vowels can reduce the cost and effort of data transcription.

Our previous Arabic broadcast news system [9] had a word error rate of about 24% on the RT-04 dev and eval data. By improving the acoustic and language models, updating the recognizer word list and pronunciation lexicon, and the decoding strategy, a relative word error rate reduction of over 30% was achieved. On another set of 14 BN shows from July 2004 (about 6 hours of data from 12 sources), a word error of about 16.5% is obtained.

REFERENCES

[1] C. Barras, E. Geoffrois et al., "Transcriber: development and use of a tool for assisting speech corpora production," Speech Communication, 33(1-2):5-22, Jan 2001.
[2] J. Billa, N. Noamany et al., "Audio Indexing of Arabic Broadcast News," ICASSP'02, 1:5-8, Apr 2002.
[3] J.L. Gauvain, L. Lamel, G. Adda, "Partitioning and Transcription of Broadcast News Data," ICSLP'98, 5:1335-1338, Dec 1998.
[4] J.L. Gauvain, L. Lamel, G. Adda, "The LIMSI Broadcast News Transcription System," Speech Communication, 37(1-2):89-108, May 2002.
[5] J.L. Gauvain, C.H. Lee, "Maximum A Posteriori Estimation for Multivariate Gaussian Mixture Observations of Markov Chains," IEEE Trans. on Speech and Audio Processing, 2(2):291-298, Apr 1994.
[6] L. Lamel, J.L. Gauvain, "Automatic Processing of Broadcast Audio in Multiple Languages," Eusipco'02, Sep 2002.
[7] C.J. Leggetter, P.C. Woodland, "Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models," Computer Speech and Language, 9(2):171-185, 1995.
[8] L. Mangu, E. Brill, A. Stolcke, "Finding Consensus Among Words: Lattice-Based Word Error Minimization," Eurospeech'99, 495-498, Sep 1999.
[9] A. Messaoudi, L. Lamel, J.L. Gauvain, "Transcription of Arabic Broadcast News," ICSLP'04, Oct 2004.
[10] P.C. Woodland, T. Niesler, E. Whittaker, "Language Modeling in the HTK Hub5 LVCSR," presented at the 1998 Hub5E Workshop, Sep 1998.