Processing Pdf 180523 | Nikhil Survey Disfluency Correction 2021

Survey: Exploring Disfluencies for Speech To Text Machine Translation
Nikhil Saini, Preethi Jyothi and Pushpak Bhattacharyya
Department of Computer Science and Engineering
Indian Institute of Technology Bombay
Mumbai, India
{nikhilra, pjyothi, pb}@cse.iitb.ac.in
Abstract

Spoken language is different from the written language in its style and structure. Disfluencies that appear in transcriptions from speech recognition systems generally hamper the performance of downstream NLP tasks. Thus, a disfluency correction system that converts disfluent to fluent text is of great value. This survey paper discusses disfluencies present in speech and its transcriptions. We then describe methodologies to correct disfluencies present in the transcriptions of a spoken utterance via two approaches, viz., a) style transfer for disfluency correction and b) transfer learning and language model pretraining. We observe that disfluency is an inherent speech phenomenon and that its correction is crucial for downstream NLP tasks.

1 Introduction

Natural Language and Speech Processing strives to build machines that understand, respond and generate text and voice data in the same way humans do. NLP and speech come under the umbrella of Artificial Intelligence, which is a branch of computer science. NLP and speech processing have come a long way from rule-based systems to traditional statistical systems to machine learning and deep learning based systems. NLP and speech systems enable machines to understand the whole meaning of text or speech and the intent and sentiment of the writer or speaker.

Natural language and speech processing is the driving force behind several computer programs, such as those that translate text from one language to another, respond to spoken commands from users, correct spelling and grammar, prompt suggestions on keyboards, recommend movies and shows on streaming websites, and recommend products on e-shopping websites, as well as speech-to-text dictation systems, chat-bots, search engines, fitness apps, sleep monitoring, spam detection in email, and many more.

In Natural Language Processing (NLP), it becomes more and more critical to deal with spontaneous speech, such as dialogs between two people or even multi-party meetings. The goal of this processing can be translation, text summarization, spoken language translation, real-time audio dubbing or subtitle generation, or simply the archiving of a dialog or a meeting in a written form.

Disfluencies are disruptions to the regular flow of speech, typically occurring in conversational speech. They include filler pauses such as uh and um, word repetitions, irregular elongations, discourse markers, conjunctions, and restarts. For example, the disfluent sentence "well we're actually uh we're getting ready" has the fluent form "we're getting ready". Here, "well" is a discourse marker, "uh" is a filler, and "we're actually" is a restart.

Disfluencies in the text can alter its syntactic and semantic structure, thereby adversely affecting the performance of downstream NLP tasks such as information extraction, summarization, translation, and parsing (Charniak and Johnson, 2001; Johnson and Charniak, 2004). These tasks also employ pre-trained language models that are typically trained to expect fluent text. This motivates the need for disfluency correction systems that convert disfluent to fluent text. Prior work has predominantly focused on the problem of disfluency detection (Zayats et al., 2016; Wang et al., 2018; Dong et al., 2019). The effect is profound for pre-trained language models (Devlin et al., 2019; Edunov et al., 2018) that are typically trained to expect fluent language. Various systems such as System User Interfaces and speech-to-speech translation systems suffer due to disfluencies. Additionally, it is crucial to model disfluencies for different higher-level natural language processing tasks such as information extraction, summarization, and parsing from transcribed textual inputs. In the tasks of parsing and machine translation (Rao et al., 2007), it has been observed that disfluencies adversely affect performance. Most of the existing NLP tools, such as pre-trained language models (Devlin et al., 2019) and translators (Edunov et al., 2018), are developed for well-formed fluent text without consideration of disfluency. Therefore, in spite of their very high accuracy on fluent text, they are relatively less accurate when applied to disfluent text (transcribed from speech). For example, to predict sentiment in a customer care scenario, we could potentially use pre-trained language models and sentence classifiers if we could make the transcribed text nearly fluent.

2 Disfluency

2.1 Conversational Speech

In contrast to texts which are well-formed, like those in newspapers, Wikipedia pages, blogs, books, manuscripts, formal letters/documents, etc., conversational/spontaneous speech has a very high degree of freedom and includes a very high number of utterances which are not fluent/clean. The elements that make an utterance non-fluent are termed disfluencies.

Disfluent speech and its disfluent transcriptions pose problems for various downstream NLP tasks. Most downstream NLP tasks deal with text which is well-formed and formatted. Therefore, it is difficult for such models to incorporate the irregularities present in the speech data in the form of disfluencies. Moreover, since speech is becoming increasingly important given the linguistic geography, it is of utmost importance to remove irregularities present in speech utterances so that a clean utterance can be utilized by other NLP applications like Machine Translation, Speech To Speech Translation, Summarization, Question Answering, etc.

The problems pertaining to transcripts of conversational speech can be broadly summarized as (but are not limited to):

1. Presence of disfluent terms/phrases: Spoken utterances usually contain various disfluent terms in a single utterance, which the speaker did not intend to speak and which must be processed before the utterance is used in a downstream NLP task.
   Disfluent: "well we're actually uh we're getting ready"

2. Incorrect grammar in the spoken utterance: Often, speakers do not care much about exact grammar when communicating via speech. This introduces irregularity in the utterance.
   Incorrect grammar: "i are getting ready"

3. Incomplete utterances: Automatic speech recognition (ASR) systems generate transcriptions by segmenting input speech into fixed slots (say 5 seconds). This creates segments that can be the beginning, middle or end of an utterance. Downstream NLP tasks are not equipped to handle incomplete utterances. The related task is known as sentence boundary detection in ASR transcriptions.
   Incomplete utterance: "and i told her to create"

4. Other errors introduced via the ASR system: ASR systems introduce other errors due to several factors like speaker variabilities (change in voice due to age, illness, tiredness, etc.), spoken language variabilities (pronunciation variation due to dialects and co-articulation), and mismatch factors (i.e., mismatch in recording conditions between training and testing data).

2.2 Surface Structure of Disfluencies

In this section, a pattern is described which demonstrates the structure of disfluencies. These patterns are called the surface structure of disfluencies, as only those characteristics of disfluencies that are observable from the text are considered. A disfluency can be divided into three parts: the reparandum, which ends at the interruption point, followed by the interregnum and then the repair.

Figure 1 shows a breakdown example. The reparandum contains those words which were originally not intended to be in the utterance. Thus it consists of one or more words that will ultimately be repeated or corrected (in case of a repetition/correction) or abandoned completely (in case of a false start). The interruption point marks the offset (end) of the reparandum. It is not connected with any pause or audible phenomenon. The interregnum can consist of an editing term, a non-lexicalized pause like uh or uhm, or simply an empty pause, i.e. a short moment of silence.
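
The surface structure described in this section can be sketched as a small data structure over token positions. This is only an illustration: the class name `DisfluencySpan`, its fields, and the segmentation chosen for the Figure 1 utterance are assumptions made for the example, not an annotation scheme taken from the survey.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DisfluencySpan:
    """Token-index view of one disfluency (hypothetical helper).

    reparandum  = tokens[start:interruption_point]
    interregnum = tokens[interruption_point:repair_start]
    repair      = begins at tokens[repair_start]
    """
    start: int
    interruption_point: int
    repair_start: int

    def reparandum(self, tokens: List[str]) -> List[str]:
        return tokens[self.start:self.interruption_point]

    def interregnum(self, tokens: List[str]) -> List[str]:
        # An empty list models an empty interregnum.
        return tokens[self.interruption_point:self.repair_start]

    def fluent(self, tokens: List[str]) -> List[str]:
        # Drop reparandum and interregnum; keep the repair and everything after.
        return tokens[:self.start] + tokens[self.repair_start:]


# Utterance from Figure 1, lower-cased and without punctuation.
# The index choices below are an assumed reading of that figure.
tokens = "let us okay let us take a look here".split()
span = DisfluencySpan(start=0, interruption_point=2, repair_start=3)

print(span.reparandum(tokens))        # ['let', 'us']
print(span.interregnum(tokens))       # ['okay']
print(" ".join(span.fluent(tokens)))  # let us take a look here
```

Setting `interruption_point == repair_start` gives an empty interregnum, and `start == interruption_point` gives an empty reparandum, so the same three indices cover every case of the structure.
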
In many cases, however, the interregnum of a disfluency is empty and the repair follows directly after the reparandum. In the repair, the words from the reparandum are finally corrected or repeated (repetition/correction), or a completely new sentence is started (false start). Note that in the latter case, the extent of the repair cannot be determined.

The three terms reparandum, interregnum, and repair can be used to explain repetitions, false starts, and editing terms. The reparandum and repair can also be empty in a disfluent sentence. This situation fits three different disfluency types, viz., discourse markers, filled pauses and interjections. These three types consist only of an interregnum. Figure 2 shows the breakdown of a disfluency with an empty interregnum, and Figure 3 shows the breakdown with an empty reparandum and an empty repair.

                Interruption
                Point
    Let us,     |  okay,        let us take a look here.
    Reparandum  |  Interregnum  Repair

Figure 1: Surface Structure of Disfluency

                 Interruption
                 Point
    So we will,  |               we can take a look here.
    Reparandum   |  Interregnum  Repair
                    (empty)

Figure 2: Disfluencies with empty interregnum.

                 Interruption
                 Point
    How about,   |  well,        next week?
    Reparandum   |  Interregnum  Repair
    (empty)                      (empty)

Figure 3: Disfluencies with empty reparandum and empty repair.

3 Types of Disfluencies

This section describes the different types of disfluencies that can be found in disfluent text. These disfluency types are present in the Switchboard corpus. The annotation of disfluencies can vary slightly from corpus to corpus. Disfluencies can be divided into two sub-groups, viz., simpler and complex disfluencies. Filled pauses like oh, uh, um and discourse markers like yeah, well, okay, you know are considered simpler disfluencies. Sometimes, a single-word discourse marker, like the word Yeah in the sentence "Yeah, we are leaving now.", is considered a filled pause. We differentiate between filler words and discourse markers, even in single-word occurrences, as this distinction is also present in the annotated Switchboard corpus. Now, we will look into complex disfluency types, viz., Repetition or Correction, False Start, Edit, and Aside. To distinguish the categories Repetition or Correction and False Start, it is important to consider whether the phrase which has been abandoned is repeated with only slight or no changes in the syntactical structure. The change can be in the form of Insertion, Deletion, or Substitution of words. A slight or no change identifies it as a Repetition or Correction disfluency. On the other hand, if a completely different syntactical structure with different semantics is chosen for the repair, the observed disfluency is a False Start.

The disfluency classification is important and is used to determine the type of disfluencies one wants to correct in the disfluent text. It also forms the basis for the classifiers one can train to learn the disfluency type domain embeddings. Generally, the approaches do not depend on the type of disfluencies, but making explicit use of the annotated corpus and incorporating the knowledge of specific disfluency types into the models is beneficial. Table 1 describes the different disfluency types, their definitions and examples.

4 Approaches

In this section, we will discuss two approaches to correct disfluencies in disfluent text. The problem statement is: "Correct disfluencies present in transcribed utterances (e.g. noisy ASR output) of conversational speech (e.g. telephonic conversations, lectures delivered, etc.) by removing the "disfluent" part without changing the intended meaning of the speaker."

4.1 Style Transfer for Disfluency Correction

1. Architecture
Figure 4 clearly shows the two directions of
Disfluency Type: Filled Pause
  Description:  Non-lexicalized sounds with no semantic content.
  Constituents: uh, um, ah, etc.
  Example:      We're uh getting ready.

Disfluency Type: Interjection
  Description:  A restricted group of non-lexicalized sounds indicating affirmation or negation. An interjection is a part of speech that demonstrates the emotion or feeling of the author.
  Constituents: uh-huh, mhm, mm, uh-uh, nah, oops, yikes, woops, phew, alas, blah, gee, ugh.
  Example:      1. I dropped my phone again, ugh.
                2. Oops, I didn't mean it.

Disfluency Type: Discourse Marker
  Description:  Words that are related to the structure of the discourse in so far that they help beginning or keeping a turn or serve as acknowledgment. They do not contribute to the semantic content. These are also called linking words.
  Constituents: okay, so, well, you know, etc.
  Example:      1. Well, this is good.
                2. This is, you know, a pretty good report.

Disfluency Type: Restart or Correction
  Description:  Exact repetition or correction of words previously uttered. A correction may involve substitutions, deletions or insertions of words. However, the correction continues with the same idea or train of thought started previously.
  Constituents: -
  Example:      1. This is is a bad bad situation.
                2. Are you you happy?

Disfluency Type: False Start
  Description:  An utterance is aborted and restarted with a new idea or train of thought.
  Constituents: -
  Example:      1. We'll never find a day what about next month?
                2. Yes no I'm not coming.

Disfluency Type: Edit
  Description:  Phrases of words which occur after that part of a disfluency which is repeated or corrected afterwards or even abandoned completely. They refer explicitly to the words which just previously have been said, indicating that they are not intended to belong to the utterance.
  Constituents: -
  Example:      We need two tickets, I'm sorry, three tickets for the flight to Boston.

Table 1: Disfluency Types, Description and Examples.
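
To make the simpler rows of Table 1 concrete, the sketch below removes filled pauses and immediate exact repetitions with plain string rules. It is a naive illustration and not one of the approaches surveyed here: the function name `remove_simple_disfluencies` is hypothetical, the filled-pause list is taken from the Constituents column, and legitimate repetitions (for example "had had") would be removed incorrectly.

```python
# Filled pauses listed in the Constituents column of Table 1.
FILLED_PAUSES = {"uh", "um", "ah"}

# Punctuation ignored when comparing tokens (an assumption of this sketch).
PUNCTUATION = ".,?!"


def remove_simple_disfluencies(utterance: str) -> str:
    """Drop filled pauses and immediately repeated words (rule-based sketch)."""
    kept = []
    for token in utterance.split():
        norm = token.lower().strip(PUNCTUATION)
        if norm in FILLED_PAUSES:
            continue  # filled pause: no semantic content
        if norm and kept and kept[-1].lower().strip(PUNCTUATION) == norm:
            continue  # exact repetition of the previous word
        kept.append(token)
    return " ".join(kept)


for example in [
    "We're uh getting ready.",
    "This is is a bad bad situation.",
    "Are you you happy?",
]:
    print(remove_simple_disfluencies(example))
```

False starts and edits cannot be handled this way, since recognising them requires deciding whether the speaker continued or abandoned a train of thought.
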
translation. The model obtains latent disfluent and latent fluent utterances from the non-parallel fluent and disfluent sentences, respectively, which are further reconstructed back into fluent and disfluent sentences. A back-translation-based objective is employed, followed by reconstruction for both domains, i.e. disfluent and fluent text. For every mini-batch of training, soft translations for a domain are first generated (denoted by x̄ and ȳ in Figure 4), and are subsequently translated back into their original domains to reconstruct the mini-batch of input sentences. The sum of token-level cross-entropy losses between the input and the reconstructed output serves as the reconstruction loss.

Components in a neural model can be shared minimally, completely, or in a controlled fashion. Complete parameter sharing is used here, which treats the model as a black box for both translation directions and offers maximum simplicity.

Advantages of Parameter Sharing:

• In sequence to sequence tasks, sharing parameters between encoders helps to improve the accuracy when the different sources are related.
• Similarly, when the targets are related, parameter sharing helps to improve the accuracy.
• Parameter sharing allows the model to benefit from the learnings of the different translation directions through the back-propagated loss. Since we are only operating on the English language in the source (disfluent) and target (fluent), it is imperative to utilize the benefit of parameter sharing.

Disadvantages of Parameter Sharing:

• Sometimes, sharing of encoders and decoders burdens the parameters with learning a large representation in limited space.

This bottleneck can be avoided by increasing the layers in both encoders and decoders. The encoders and decoders are shared for both translation directions: disfluent-to-fluent and fluent-to-disfluent. In a sequence to sequence transduction task, the encoder takes an input
