                    Survey: Exploring Disfluencies for Speech To Text Machine Translation
                                       Nikhil Saini, Preethi Jyothi and Pushpak Bhattacharyya
                                             Department of Computer Science and Engineering
                                                   Indian Institute of Technology Bombay
                                                                  Mumbai, India
                                         {nikhilra, pjyothi, pb}@cse.iitb.ac.in
Abstract

Spoken language differs from written language in its style and structure. Disfluencies that appear in transcriptions from speech recognition systems generally hamper the performance of downstream NLP tasks. Thus, a disfluency correction system that converts disfluent text to fluent text is of great value. This survey paper discusses disfluencies present in speech and its transcriptions. We then describe methodologies to correct disfluencies present in the transcriptions of a spoken utterance via two approaches, viz., a) style transfer for disfluency correction and b) transfer learning and language model pretraining. We observe that disfluency is an inherent speech phenomenon and that its correction is crucial for downstream NLP tasks.

1   Introduction

Natural Language and Speech Processing strives to build machines that understand, respond to and generate text and voice data in the same way humans do. NLP and speech come under the umbrella of Artificial Intelligence, which is a branch of computer science. NLP and speech processing have come a long way, from rule-based systems to traditional statistical systems to machine learning and deep learning based systems. NLP and speech systems enable machines to understand the whole meaning of text or speech and the intent and sentiment of the writer or speaker.

Natural language and speech processing is the driving force behind several computer programs, such as those that translate text from one language to another, respond to spoken commands from users, correct spelling and grammar, prompt suggestions on keyboards, recommend movies and shows on streaming websites, and recommend products on e-shopping websites, as well as speech-to-text dictation systems, chat-bots, search engines, fitness apps, sleep monitoring, spam detection in email, and many more.

In Natural Language Processing (NLP), it becomes more and more critical to deal with spontaneous speech, such as dialogs between two people or even multi-party meetings. The goal of this processing can be translation, text summarization, spoken language translation, real-time audio dubbing or subtitle generation, or simply the archiving of a dialog or a meeting in a written form.

Disfluencies are disruptions to the regular flow of speech, typically occurring in conversational speech. They include filler pauses such as uh and um, word repetitions, irregular elongations, discourse markers, conjunctions, and restarts. For example, the disfluent sentence “well we’re actually uh we’re getting ready” has the fluent form “we’re getting ready”. Here, the words highlighted in green, blue and red refer to discourse, filler and restart disfluencies, respectively.

Disfluencies in the text can alter its syntactic and semantic structure, thereby adversely affecting the performance of downstream NLP tasks such as information extraction, summarization, translation, and parsing (Charniak and Johnson, 2001; Johnson and Charniak, 2004). These tasks also employ pre-trained language models that are typically trained to expect fluent text. This motivates the need for disfluency correction systems that convert disfluent text to fluent text. Prior work has predominantly focused on the problem of disfluency detection (Zayats et al., 2016; Wang et al., 2018; Dong et al., 2019). The effect is profound for pre-trained language models (Devlin et al., 2019; Edunov et al., 2018) that are typically trained to expect fluent language. Various systems, such as System User Interfaces and speech-to-speech translation systems, suffer due to disfluencies. Additionally, it is crucial to model disfluencies for different higher-level natural language processing tasks, such as information extraction, summarization, and parsing from transcribed textual inputs. In the tasks of parsing and machine translation (Rao et al., 2007), it has been observed that disfluencies adversely affect performance. Most of the existing NLP tools, such as pre-trained language models (Devlin et al., 2019) and translators (Edunov et al., 2018), are developed for well-formed fluent text without consideration of disfluency. Therefore, in spite of their very high accuracy on fluent text, they are relatively less accurate when applied to disfluent text (transcribed from speech). For example, to predict sentiment in a customer care scenario, we could potentially use pre-trained language models and sentence classifiers if we could make the transcribed text nearly fluent.

2   Disfluency

2.1   Conversational Speech

In contrast to well-formed texts such as newspapers, Wikipedia pages, blogs, books, manuscripts, formal letters/documents, etc., conversational/spontaneous speech has a very high degree of freedom and includes a very high number of utterances which are not fluent/clean. The elements that make an utterance non-fluent are termed disfluencies.

Disfluent speech and its disfluent transcriptions pose problems for various downstream NLP tasks. Almost all downstream NLP tasks deal with text which is well-formed and formatted. Therefore, it is difficult for such models to handle the irregularities present in the speech data in the form of disfluencies. Moreover, since speech is becoming increasingly important given the linguistic geography, it is of utmost importance to remove irregularities present in speech utterances so that a clean utterance can be utilized by other NLP applications like Machine Translation, Speech To Speech Translation, Summarization, Question Answering, etc.

The problems pertaining to transcripts of conversational speech can be broadly summarized as (but are not limited to):

  1. Presence of disfluent terms/phrases: Spoken utterances usually contain various disfluent terms in a single utterance, which the speaker didn’t intend to speak and which must be processed before the utterance is used in a downstream NLP task.
     Disfluent: “well we’re actually uh we’re getting ready”

  2. Incorrect grammar in the spoken utterance: Often, speakers do not care much about exact grammar when communicating via speech. This introduces irregularity in the utterance.
     Incorrect Grammar: “i are getting ready”

  3. Incomplete utterances: Automatic speech recognition systems generate transcriptions by segmenting input speech into fixed slots (say 5 seconds). This creates segments that can be the beginning, middle or end of an utterance. Downstream NLP tasks are not well suited to handling incomplete utterances. The related task is known as sentence boundary detection in ASR transcriptions.
     Incomplete utterance: “and i told her to create”

  4. Other errors introduced via ASR system: ASR systems introduce other errors due to several factors like speaker variabilities (change in voice due to age, illness, tiredness, etc.), spoken language variabilities (pronunciation variation due to dialects and co-articulation), and mismatch factors (i.e., mismatch in recording conditions between training and testing data).

2.2   Surface Structure of Disfluencies

In this section, a pattern is described which demonstrates the structure of disfluencies. These patterns are called the surface structure of disfluencies, as only the characteristics of disfluencies that are observable from the text are considered. A disfluency can be divided into three parts: the reparandum, followed by an interruption point, after which comes the interregnum, followed by the repair.

Figure 1 shows a breakdown example. The reparandum contains those words which were originally not intended to be in the utterance. Thus it consists of one or more words that will ultimately be repeated or corrected (in case of a repetition/correction) or abandoned completely (in case of a false start). The interruption point marks the offset (end) of the reparandum. It is not connected with any pause or audible phenomenon. The interregnum can consist of an editing term, a non-lexicalized pause like uh or uhm, or simply an empty pause, i.e. a short moment of silence.
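As a rough illustration of this surface structure, the sketch below represents one disfluency as three word lists split around the interruption point. It uses the Figure 1 sentence, “Let us, okay, let us take a look here.”, lowercased and without punctuation. The names `DisfluencySpan` and `render` are hypothetical and not from any corpus toolkit, and the assignment of words to the three parts is our reading of the figure. Correction is modelled simply as keeping the repair and dropping the reparandum and interregnum.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DisfluencySpan:
    """Hypothetical container for one disfluency (names are illustrative)."""
    reparandum: List[str]   # words before the interruption point
    interregnum: List[str]  # editing term or filled pause; may be empty
    repair: List[str]       # words that repeat or correct the reparandum

    def disfluent(self) -> List[str]:
        # The interruption point sits between reparandum and interregnum.
        return self.reparandum + self.interregnum + self.repair

    def fluent(self) -> List[str]:
        # Assumed correction rule: keep only the repair.
        return list(self.repair)


def render(prefix: List[str], span: DisfluencySpan, suffix: List[str],
           fluent: bool = False) -> str:
    """Join the words around the span into a single utterance string."""
    middle = span.fluent() if fluent else span.disfluent()
    return " ".join(prefix + middle + suffix)


example = DisfluencySpan(
    reparandum=["let", "us"],
    interregnum=["okay"],
    repair=["let", "us"],
)
suffix = ["take", "a", "look", "here"]

print(render([], example, suffix))               # let us okay let us take a look here
print(render([], example, suffix, fluent=True))  # let us take a look here
```

Leaving `interregnum` empty gives a disfluency in which the repair follows the reparandum directly.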
In many cases, however, the interregnum of a disfluency is empty and the repair follows directly after the reparandum. In the repair, the words from the reparandum are finally corrected or repeated (repetition/correction), or a completely new sentence is started (false start). Note that in the latter case, the extent of the repair cannot be determined.

The three terms reparandum, interregnum, and repair can be used to explain repetitions, false starts, and editing terms. The reparandum and repair can be empty in a disfluent sentence. This situation fits the criteria for three different disfluency types, viz., discourse markers, filled pauses and interjections. These three types consist only of an interregnum. Figure 2 shows a breakdown with the interregnum being empty, and Figure 3 shows a breakdown with the reparandum and repair being empty.

[Figure 1: Surface Structure of Disfluency. Example: “Let us, okay, let us take a look here.”, labelled with Reparandum, Interruption Point, Interregnum and Repair.]

[Figure 2: Disfluencies with empty interregnum. Example: “So we will, , we can take a look here.”, labelled with Reparandum, Interruption Point, Interregnum and Repair.]

[Figure 3: Disfluencies with empty reparandum and empty repair. Example: “How about, , well, next week?”, labelled with Reparandum, Interruption Point, Interregnum and Repair.]

3   Types of Disfluencies

This section describes the different types of disfluencies that can be found in disfluent text. These disfluency types are present in the Switchboard corpus. The annotation of disfluencies can vary slightly from corpus to corpus. Disfluencies can be divided into two sub-groups, viz., simpler and complex disfluencies. Filled pauses like oh, uh, um and discourse markers like yeah, well, okay, you know are considered simpler disfluencies. Sometimes, a single-word discourse marker, like the word Yeah in the sentence “Yeah, we are leaving now.”, is considered a filled pause. We differentiate between filler words and discourse markers, even in single-word occurrences, as this distinction is also present in the annotated Switchboard corpus. Now, we will look into complex disfluency types, viz., Repetition or Correction, False Start, Edit, Aside. For the distinction between the categories Repetition or Correction and False Start, it is important to consider whether the phrase which has been abandoned is repeated with only slight or no changes in the syntactical structure. The change can be in the form of Insertion, Deletion, or Substitution of words. A slight or no change identifies it as a Repetition or Correction disfluency. On the other hand, if a completely different syntactical structure with different semantics is chosen for the repair, the observed disfluency is a false start.

The disfluency classification is important and is used to determine the type of disfluencies one wants to correct in the disfluent text. It also forms the basis for the classifiers one can train to learn the disfluency type domain embeddings. Generally, the approaches do not depend on the type of disfluencies, but making explicit use of the annotated corpus and incorporating the knowledge of specific disfluency types into the models is beneficial. Table 1 describes the different disfluency types, their definitions and examples.

| Disfluency Type | Description | Constituents | Example |
|---|---|---|---|
| Filled Pause | Non-lexicalized sounds with no semantic content. | uh, um, ah, etc | We’re uh getting ready. |
| Interjection | A restricted group of non-lexicalized sounds indicating affirmation or negation. An interjection is a part of speech that demonstrates the emotion or feeling of the author. | uh-huh, mhm, mm, uh-uh, nah, oops, yikes, woops, phew, alas, blah, gee, ugh. | 1. I dropped my phone again, ugh. 2. Oops, I didn’t mean it. |
| Discourse Marker | Words that are related to the structure of the discourse in so far that they help beginning or keeping a turn or serve as acknowledgment. They do not contribute to the semantic content. These are also called linking words. | okay, so, well, you know, etc | 1. Well, this is good. 2. This is, you know, a pretty good report. |
| Restart or Correction | Exact repetition or correction of words previously uttered. A correction may involve substitutions, deletions or insertions of words. However, the correction continues with the same idea or train of thought started previously. | - | 1. This is is a bad bad situation. 2. Are you you happy? |
| False Start | An utterance is aborted and restarted with a new idea or train of thought. | - | 1. We’ll never find a day what about next month ? 2. Yes no I’m not coming. |
| Edit | Phrases of words which occur after that part of a disfluency which is repeated or corrected afterwards or even abandoned completely. They refer explicitly to the words which just previously have been said, indicating that they are not intended to belong to the utterance. | - | We need two tickets, I’m sorry, three tickets for the flight to Boston. |

Table 1: Disfluency Types, Description and Examples.

4   Approaches

In this section, we will discuss two approaches to correct disfluencies in disfluent text. The problem statement is: “Correct disfluencies present in transcribed utterances (e.g. noisy ASR output) of conversational speech (e.g. telephonic conversations, lectures delivered, etc.) by removing the “disfluent” part without changing the intended meaning of the speaker.”

4.1   Style Transfer for Disfluency Correction

  1. Architecture
     Figure 4 clearly shows the two directions of
     translation. The model obtains latent disfluent and latent fluent utterances from the non-parallel fluent and disfluent sentences, respectively, which are further reconstructed back into fluent and disfluent sentences. A back-translation-based objective is employed, followed by reconstruction for both domains, i.e. disfluent and fluent text. For every mini-batch of training, soft translations for a domain are first generated (denoted by x̄ and ȳ in Figure 4), and are subsequently translated back into their original domains to reconstruct the mini-batch of input sentences. The sum of token-level cross-entropy losses between the input and the reconstructed output serves as the reconstruction loss.

     Components in a neural model can be shared minimally, completely, or in a controlled fashion. Complete parameter sharing is used here, which treats the model as a black box for both translation directions and offers maximum simplicity.

     Advantages of Parameter Sharing:

        • In sequence to sequence tasks, sharing parameters between encoders helps to improve the accuracy when the different sources are related.
        • Similarly, when the targets are related, parameter sharing helps to improve the accuracy.
        • Parameter sharing allows the model to benefit from the learnings through the back-propagated loss of different translation directions. Since we are only operating on the English language in the source (disfluent) and target (fluent), it is imperative to utilize the benefit of parameter sharing.

     Disadvantages of Parameter Sharing:

        • Sometimes, sharing of encoders and decoders leads to burdening the parameters to learn a large representation with limited space.

     This bottleneck can be avoided by increasing the layers in both encoders and decoders. The encoders and decoders are shared for both translation directions: disfluent-to-fluent and fluent-to-disfluent. In a sequence to sequence transduction task, the encoder takes an input
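
The reconstruction loss described under Architecture (the sum of token-level cross-entropy losses between the input and the reconstructed output) can be sketched for a single sentence as follows. This is a minimal illustration under our own assumptions, not the surveyed system's implementation. The function name `reconstruction_loss` is hypothetical, the decoder output is assumed to be one probability distribution per input position, given here as a plain dictionary, and the small floor on probabilities is added only to avoid taking the logarithm of zero. An actual model would compute the same quantity over a mini-batch with a deep learning framework.

```python
import math
from typing import Dict, List


def reconstruction_loss(input_tokens: List[str],
                        predicted: List[Dict[str, float]]) -> float:
    """Sum over positions t of -log p_t(x_t), where x_t is the input token
    and p_t is the (assumed) distribution the model assigns at position t."""
    if len(input_tokens) != len(predicted):
        raise ValueError("need exactly one distribution per input token")
    loss = 0.0
    for token, dist in zip(input_tokens, predicted):
        p = dist.get(token, 0.0)
        loss += -math.log(max(p, 1e-12))  # floor avoids log(0)
    return loss


# Toy example: the reconstructed output assigns probabilities 0.5, 0.25 and 1.0
# to the three input tokens, so the loss is ln 2 + ln 4 + 0 = 3 ln 2.
tokens = ["we're", "getting", "ready"]
distributions = [
    {"we're": 0.5, "well": 0.5},
    {"getting": 0.25, "uh": 0.75},
    {"ready": 1.0},
]
loss = reconstruction_loss(tokens, distributions)
print(round(loss, 4))  # 2.0794
```

A perfect reconstruction, where every input token receives probability 1, gives a loss of zero. The loss grows as the reconstructed output moves away from the input.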