Survey: Exploring Disfluencies for Speech To Text Machine Translation

Nikhil Saini, Preethi Jyothi and Pushpak Bhattacharyya
Department of Computer Science and Engineering
Indian Institute of Technology Bombay
Mumbai, India
{nikhilra, pjyothi, pb}@cse.iitb.ac.in

Abstract

Spoken language is different from the written language in its style and structure. Disfluencies that appear in transcriptions from speech recognition systems generally hamper the performance of downstream NLP tasks. Thus, a disfluency correction system that converts disfluent to fluent text is of great value. This survey discusses disfluencies present in speech and its transcriptions. Later, we describe methodologies to correct disfluencies present in the transcriptions of a spoken utterance via various approaches, viz., a) style transfer for disfluency correction and b) transfer learning and language model pretraining. We observe that disfluency is an inherent speech phenomenon and that its correction is crucial for downstream NLP tasks.

1 Introduction

Natural Language and Speech Processing strives to build machines that understand, respond and generate text and voice data in the same way humans do. NLP and speech come under the umbrella of Artificial Intelligence, which is a branch of computer science. NLP and speech processing have come a long way from rule-based systems to traditional statistical systems to machine learning and deep learning based systems. NLP and speech systems enable machines to understand the whole meaning of text or speech and the intent and sentiment of the writer or speaker.

Natural language and speech processing is the driving force behind several computer programs, such as those which translate text from one language to another, respond to spoken commands from users, correct spelling and grammar and prompt suggestions on keyboards, recommend movies and shows on streaming websites, and recommend products on e-shopping websites, as well as speech-to-text dictation systems, chat-bots, search engines, fitness apps, sleep monitoring, spam detection in email, and many more.

In Natural Language Processing (NLP), it becomes more and more critical to deal with spontaneous speech, such as dialogs between two people or even multi-party meetings. The goal of this processing can be translation, text summarization, spoken language translation, real-time audio dubbing or subtitle generation, or simply the archiving of a dialog or a meeting in a written form.

Disfluencies are disruptions to the regular flow of speech, typically occurring in conversational speech. They include filled pauses such as uh and um, word repetitions, irregular elongations, discourse markers, conjunctions, and restarts. For example, the disfluent sentence “well we’re actually uh we’re getting ready” has its fluent form as “we’re getting ready”. Here, the words highlighted in green, blue and red refer to discourse, filler and restart disfluencies, respectively.

Disfluencies in the text can alter its syntactic and semantic structure, thereby adversely affecting the performance of downstream NLP tasks such as information extraction, summarization, translation, and parsing (Charniak and Johnson, 2001; Johnson and Charniak, 2004). These tasks also employ pre-trained language models that are typically trained to expect fluent text. This motivates the need for disfluency correction systems that convert disfluent to fluent text. Prior work has predominantly focused on the problem of disfluency detection (Zayats et al., 2016; Wang et al., 2018; Dong et al., 2019). The effect is profound for pre-trained language models (Devlin et al., 2019; Edunov et al., 2018) that are typically trained to expect fluent language. Various systems such as system user interfaces and speech-to-speech translation systems suffer due to disfluencies.

Additionally, it is crucial to model disfluencies for different higher-level natural language processing tasks such as information extraction, summarization, and parsing from transcribed textual inputs. In the tasks of parsing and machine translation (Rao et al., 2007), it has been observed that disfluencies adversely affect performance. Most of the existing NLP tools, such as pre-trained language models (Devlin et al., 2019) and translators (Edunov et al., 2018), are developed for well-formed fluent text without consideration of disfluency. Therefore, in spite of their very high accuracy on fluent text, they are relatively less accurate when applied to disfluent text (transcribed from speech). For example, to predict sentiment in a customer care scenario, we could potentially use pre-trained language models and sentence classifiers, if we could make the transcribed text nearly fluent.

2 Disfluency

2.1 Conversational Speech

In contrast to well-formed texts such as newspapers, Wikipedia pages, blogs, books, manuscripts, formal letters/documents, etc., conversational/spontaneous speech has a very high degree of freedom and includes a very high number of utterances which are not fluent/clean. The elements that make an utterance non-fluent are termed disfluencies.

Disfluent speech and its disfluent transcriptions pose problems for various downstream NLP tasks. Mainly, all downstream NLP tasks deal with text which is well-formed and formatted. Therefore, it is difficult for such models to incorporate the irregularities present in the speech data in the form of disfluencies. Moreover, since speech is becoming very important looking at the linguistic geography, it is of utmost importance to remove irregularities present in speech utterances so that a clean utterance can be utilized by other NLP applications like Machine Translation, Speech To Speech Translation, Summarization, Question Answering, etc.

The problems pertaining to transcripts of conversational speech can be broadly summarized as (but not limited to):

1. Presence of disfluent terms/phrases: Spoken utterances usually contain various disfluent terms in a single utterance, which the speaker didn’t intend to speak and which must be processed before the utterance is used in a downstream NLP task.
   Disfluent: “well we’re actually uh we’re getting ready”

2. Incorrect grammar in the spoken utterance: Often, speakers do not care much about exact grammar when communicating via speech. This introduces irregularity in the utterance.
   Incorrect grammar: “i are getting ready”

3. Incomplete utterances: Automatic speech recognition (ASR) systems generate transcriptions by segmenting input speech into fixed slots (say 5 seconds). This creates segments that can be the beginning, middle or end of an utterance. Downstream NLP tasks are not designed to handle incomplete utterances. The related task is known as sentence boundary detection in ASR transcriptions.
   Incomplete utterance: “and i told her to create”

4. Other errors introduced via the ASR system: ASR systems introduce other errors due to several factors like speaker variabilities (change in voice due to age, illness, tiredness, etc.), spoken language variabilities (pronunciation variation due to dialects and co-articulation), and mismatch factors (i.e., mismatch in recording conditions between training and testing data).

2.2 Surface Structure of Disfluencies

In this section, a pattern is described which demonstrates the structure of disfluencies. These patterns are called the surface structure of disfluencies, as only those characteristics of disfluencies that are observable from the text are considered. A disfluency can be divided into three parts: the reparandum, the interregnum, and the repair. An interruption point separates the reparandum from the interregnum.

Figure 1 shows a breakdown example. The reparandum contains those words which are originally not intended to be in the utterance. Thus it consists of one or more words that will be repeated or corrected ultimately (in case of a repetition/correction) or abandoned completely (in case of a false start). The interruption point marks the offset of the reparandum. It is not connected with any pause or audible phenomenon. The interregnum can consist of an editing term, a non-lexicalized pause like uh or uhm, or simply an empty pause, i.e. a short moment of silence. In many cases however, the interregnum of a disfluency is empty and the repair follows directly after the reparandum. In the repair, the words from the reparandum are finally corrected or repeated (repetition/correction) or a completely new sentence is started (false start). Note that in the latter case, the extension of the repair cannot be determined.

The three terms reparandum, interregnum, and repair can be used to explain repetitions, false starts, and editing terms. The reparandum and repair can be empty in a disfluent sentence. This situation fits the criteria for three different disfluency types, viz., discourse markers, filled pauses and interjections. These three types consist only of an interregnum. Figure 2 shows the breakdown when the interregnum is empty and Figure 3 shows the breakdown when the reparandum and repair are empty.

[Figure 1: Surface Structure of Disfluency. Example: “Let us, okay, let us take a look here.”, annotated with Reparandum, Interruption Point, Interregnum and Repair.]

[Figure 2: Disfluencies with empty interregnum. Example: “So we will, , we can take a look here.”, annotated with Reparandum, Interruption Point, Interregnum and Repair.]

[Figure 3: Disfluencies with empty reparandum and empty repair. Example: “How about, , well, next week?”, annotated with Reparandum, Interruption Point, Interregnum and Repair.]

3 Types of Disfluencies

This section describes the different types of disfluencies that can be found in disfluent text. These disfluency types are present in the Switchboard corpus. The annotation of disfluencies can vary slightly from corpus to corpus. Disfluencies can be divided into two sub-groups, viz., simpler and complex disfluencies. Filled pauses like oh, uh, um and discourse markers like yeah, well, okay, you know are considered simpler disfluencies. Sometimes, a single-word discourse marker, like the word Yeah in the sentence “Yeah, we are leaving now.”, is considered a filled pause. We differentiate between filler words and discourse markers, even in single-word occurrences, as this distinction is also present in the annotated Switchboard corpus.

Now, we will look into complex disfluency types, viz., Repetition or Correction, False Start, Edit, and Aside. For the distinction between the categories Repetition or Correction and False Start, it is important to consider whether the phrase which has been abandoned is repeated with only slight or no changes in the syntactical structure. The change can be in the form of insertion, deletion, or substitution of words. A slight or no change identifies it as a Repetition or Correction disfluency. On the other hand, if a completely different syntactical structure with different semantics is chosen for the repair, the observed disfluency is a False Start.

The disfluency classification is important and is used to determine the type of disfluencies one wants to correct in the disfluent text. It also forms the basis for the classifiers one can train to learn the disfluency type domain embeddings. Generally, the approaches do not depend on the type of disfluencies, but making explicit use of the annotated corpus and incorporating the knowledge of specific disfluency types into the models is beneficial. Table 1 describes the different disfluency types, their definitions and examples.

| Disfluency Type | Description | Constituents | Example |
| Filled Pause | Non-lexicalized sounds with no semantic content. | uh, um, ah, etc. | We’re uh getting ready. |
| Interjection | A restricted group of non-lexicalized sounds indicating affirmation or negation. An interjection is a part of speech that demonstrates the emotion or feeling of the author. | uh-huh, mhm, mm, uh-uh, nah, oops, yikes, woops, phew, alas, blah, gee, ugh | 1. I dropped my phone again, ugh. 2. Oops, I didn’t mean it. |
| Discourse Marker | Words that are related to the structure of the discourse in so far that they help beginning or keeping a turn or serve as acknowledgment. They do not contribute to the semantic content. These are also called linking words. | okay, so, well, you know, etc. | 1. Well, this is good. 2. This is, you know, a pretty good report. |
| Restart or Correction | Exact repetition or correction of words previously uttered. A correction may involve substitutions, deletions or insertions of words. However, the correction continues with the same idea or train of thought started previously. | - | 1. This is is a bad bad situation. 2. Are you you happy? |
| False Start | An utterance is aborted and restarted with a new idea or train of thought. | - | 1. We’ll never find a day what about next month ? 2. Yes no I’m not coming. |
| Edit | Phrases of words which occur after that part of a disfluency which is repeated or corrected afterwards or even abandoned completely. They refer explicitly to the words which just previously have been said, indicating that they are not intended to belong to the utterance. | - | We need two tickets, I’m sorry, three tickets for the flight to Boston. |

Table 1: Disfluency Types, Description and Examples.

4 Approaches

In this section, we will discuss two approaches to correct disfluencies in disfluent text. The problem statement is: “Correct disfluencies present in transcribed utterances (e.g. noisy ASR output) of conversational speech (e.g. telephonic conversations, lectures delivered, etc.) by removing the “disfluent” part without changing the intended meaning of the speaker.”

4.1 Style Transfer for Disfluency Correction

1. Architecture

Figure 4 shows the two directions of translation. The model obtains latent disfluent and latent fluent utterances from the non-parallel fluent and disfluent sentences, respectively, which are further reconstructed back into fluent and disfluent sentences. A back-translation-based objective is employed, followed by reconstruction for both domains, i.e. disfluent and fluent text. For every mini-batch of training, soft translations for a domain are first generated (denoted by x̄ and ȳ in Figure 4), and are subsequently translated back into their original domains to reconstruct the mini-batch of input sentences. The sum of token-level cross-entropy losses between the input and the reconstructed output serves as the reconstruction loss.

Components in a neural model can be shared minimally, completely, or in a controlled fashion. Here, complete parameter sharing is used, which treats the model as a black box for both translation directions and offers maximum simplicity.

Advantages of Parameter Sharing:

• In sequence-to-sequence tasks, sharing parameters between encoders helps to improve the accuracy when the different sources are related.
• Similarly, when the targets are related, parameter sharing helps to improve the accuracy.
• Parameter sharing allows the model to benefit from what is learned through the back-propagated loss of the different translation directions. Since we are only operating on the English language in both the source (disfluent) and the target (fluent), it is imperative to utilize the benefit of parameter sharing.

Disadvantages of Parameter Sharing:

• Sometimes, sharing of encoders and decoders burdens the parameters with learning a large representation in limited space.

This bottleneck can be avoided by increasing the number of layers in both encoders and decoders. The encoders and decoders are shared for both translation directions: disfluent-to-fluent and fluent-to-disfluent. In a sequence-to-sequence transduction task, the encoder takes an input
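
As an illustration of the reconstruction loss described under Architecture (the sum of token-level cross-entropy losses between the input and the reconstructed output), the following is a minimal sketch in plain Python. It is not the authors' implementation: the function names, the representation of the decoder output as one probability dictionary per position, and the small floor value `EPS` are assumptions made only for this example.

```python
import math

EPS = 1e-12  # assumed floor to avoid log(0); not taken from the paper


def reconstruction_loss(input_tokens, predicted_dists):
    """Sum of token-level cross-entropy terms for one sentence.

    input_tokens: tokens of the original input sentence.
    predicted_dists: one dict per position, mapping token -> probability
        assigned by the (hypothetical) decoder when reconstructing the input.
    """
    if len(input_tokens) != len(predicted_dists):
        raise ValueError("need exactly one distribution per input token")
    loss = 0.0
    for token, dist in zip(input_tokens, predicted_dists):
        # Cross-entropy against a one-hot target is -log p(correct token).
        loss += -math.log(max(dist.get(token, 0.0), EPS))
    return loss


def batch_reconstruction_loss(batch):
    """Sum the sentence-level losses over a mini-batch of (tokens, dists) pairs."""
    return sum(reconstruction_loss(tokens, dists) for tokens, dists in batch)


# Toy example: the decoder assigns 0.5 and 0.25 to the two correct tokens,
# so the loss is -ln(0.5) - ln(0.25) = 3 * ln(2).
toy_tokens = ["getting", "ready"]
toy_dists = [{"getting": 0.5, "going": 0.5}, {"ready": 0.25, "uh": 0.75}]
toy_loss = reconstruction_loss(toy_tokens, toy_dists)
```

In a setup like the one described, this quantity would be computed once for the disfluent domain and once for the fluent domain; how the two terms are combined or weighted is not specified in the text above.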