Survey: Exploring Disfluencies for Speech To Text Machine Translation
Nikhil Saini, Preethi Jyothi and Pushpak Bhattacharyya
Department of Computer Science and Engineering
Indian Institute of Technology Bombay
Mumbai, India
{nikhilra, pjyothi, pb}@cse.iitb.ac.in
Abstract

Spoken language is different from written language in its style and structure. Disfluencies that appear in transcriptions from speech recognition systems generally hamper the performance of downstream NLP tasks. Thus, a disfluency correction system that converts disfluent to fluent text is of great value. This survey paper discusses disfluencies present in speech and its transcriptions. Later, we describe methodologies to correct disfluencies present in the transcriptions of a spoken utterance via various approaches, viz., a) style transfer for disfluency correction and b) transfer learning and language model pretraining. We observe that disfluency is an inherent speech phenomenon and that its correction is crucial for downstream NLP tasks.

1 Introduction

Natural Language and Speech Processing strives to build machines that understand, respond to and generate text and voice data in the same way humans do. NLP and speech come under the umbrella of Artificial Intelligence, which is a branch of computer science. NLP and speech processing have come a long way from rule-based systems to traditional statistical systems to machine learning and deep learning based systems. NLP and speech systems enable machines to understand the whole meaning of text or speech and the intent and sentiment of the writer or speaker.

Natural language and speech processing is the driving force behind several computer programs, like those which translate text from one language to another, respond to spoken commands from users, correct spelling and grammar, prompt suggestions on keyboards, recommend movies and shows on streaming websites, and recommend products on e-shopping websites, as well as speech-to-text dictation systems, chat-bots, search engines, fitness apps, sleep monitoring, spam detection in email, and many more.

In Natural Language Processing (NLP), it becomes more and more critical to deal with spontaneous speech, such as dialogs between two people or even multi-party meetings. The goal of this processing can be translation, text summarization, spoken language translation, real-time audio dubbing or subtitle generation, or simply the archiving of a dialog or a meeting in a written form.

Disfluencies are disruptions to the regular flow of speech, typically occurring in conversational speech. They include filler pauses such as uh and um, word repetitions, irregular elongations, discourse markers, conjunctions, and restarts. For example, the disfluent sentence “well we’re actually uh we’re getting ready” has the fluent form “we’re getting ready”. Here, “well”, “uh” and “we’re actually” are discourse, filler and restart disfluencies, respectively.

Disfluencies in the text can alter its syntactic and semantic structure, thereby adversely affecting the performance of downstream NLP tasks such as information extraction, summarization, translation, and parsing (Charniak and Johnson, 2001; Johnson and Charniak, 2004). These tasks also employ pre-trained language models that are typically trained to expect fluent text. This motivates the need for disfluency correction systems that convert disfluent to fluent text. Prior work has predominantly focused on the problem of disfluency detection (Zayats et al., 2016; Wang et al., 2018; Dong et al., 2019). The effect is profound for pre-trained language models (Devlin et al., 2019; Edunov et al., 2018) that are typically trained to expect fluent language. Various systems such as System User Interfaces and speech-to-speech translation systems suffer due to disfluencies. Additionally, it is crucial to model disfluencies for different higher-level natural language processing tasks such as information extraction, summarization, and parsing from transcribed textual inputs. In the tasks of parsing and machine translation (Rao et al., 2007), it has been observed that disfluencies adversely affect performance. Most of the existing NLP tools, such as pre-trained language models (Devlin et al., 2019) and translators (Edunov et al., 2018), are developed for well-formed fluent text without consideration of disfluency. Therefore, in spite of their very high accuracy on fluent text, they are relatively less accurate when applied to disfluent text (transcribed from speech). For example, to predict sentiment in a customer care scenario, we could potentially use pre-trained language models and sentence classifiers, if we could make the transcribed text nearly fluent.

2 Disfluency

2.1 Conversational Speech

In contrast to well-formed texts such as newspapers, Wikipedia pages, blogs, books, manuscripts, formal letters/documents, etc., conversational/spontaneous speech has a very high degree of freedom and includes a very high number of utterances which are not fluent/clean. The elements that make an utterance non-fluent are termed disfluencies.

Disfluent speech and its disfluent transcriptions pose problems for various downstream NLP tasks. Most downstream NLP tasks deal with text which is well-formed and formatted. Therefore, it is difficult for such models to handle the irregularities present in speech data in the form of disfluencies. Moreover, given the linguistic geography, speech is becoming increasingly important, so it is of utmost importance to remove irregularities present in speech utterances so that a clean utterance can be utilized by other NLP applications like Machine Translation, Speech To Speech Translation, Summarization, Question Answering, etc.

The problems pertaining to transcripts of conversational speech can be broadly summarized as (but are not limited to):

1. Presence of disfluent terms/phrases: Spoken utterances usually contain various disfluent terms in a single utterance, which the speaker did not intend to speak and which must be processed before the utterance is used in a downstream NLP task.
   Disfluent: “well we’re actually uh we’re getting ready”

2. Incorrect grammar in the spoken utterance: Often, speakers do not care much about exact grammar when communicating via speech. This introduces irregularity in the utterance.
   Incorrect grammar: “i are getting ready”

3. Incomplete utterances: Automatic speech recognition (ASR) systems generate transcriptions by segmenting input speech into fixed slots (say 5 seconds). This creates segments that can be the beginning, middle or end of an utterance. Downstream NLP tasks are not well suited to handling incomplete utterances. The related task is known as sentence boundary detection in ASR transcriptions.
   Incomplete utterance: “and i told her to create”

4. Other errors introduced via the ASR system: ASR systems introduce other errors due to several factors like speaker variabilities (change in voice due to age, illness, tiredness, etc.), spoken language variabilities (pronunciation variation due to dialects and co-articulation), and mismatch factors (i.e., mismatch in recording conditions between training and testing data).

2.2 Surface Structure of Disfluencies

In this section, a pattern is described which demonstrates the structure of disfluencies. These patterns are called the surface structure of disfluencies, as only those characteristics of disfluencies that are observable from the text are considered. A disfluency can be divided into three parts: the reparandum, which ends at the interruption point, followed by the interregnum and then the repair.

Figure 1 shows a breakdown example. The reparandum contains those words which were originally not intended to be in the utterance. Thus it consists of one or more words that will ultimately be repeated or corrected (in case of a repetition/correction) or abandoned completely (in case of a false start). The interruption point marks the offset of the reparandum. It is not connected with any pause or audible phenomenon. The interregnum can consist of an editing term, a non-lexicalized pause like uh or uhm, or simply an empty pause, i.e. a short moment of silence.
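The surface structure described in this section can be sketched as a small data structure. The following is a minimal illustration, not a method taken from the surveyed work: the class name `DisfluencySpan`, the helper `correct`, and the way the Figure 1 sentence is split into its three parts are all assumptions made for the example. It shows that, once the parts are known, removing the reparandum and the interregnum leaves the fluent utterance.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DisfluencySpan:
    """Hypothetical container for the surface structure of one disfluency."""
    reparandum: List[str]   # words later repeated, corrected, or abandoned
    interregnum: List[str]  # editing term or filled pause; may be empty
    repair: List[str]       # words that take the place of the reparandum

    def disfluent_tokens(self) -> List[str]:
        # The interruption point sits between reparandum and interregnum.
        return self.reparandum + self.interregnum + self.repair

    def fluent_tokens(self) -> List[str]:
        # Dropping the reparandum and the interregnum keeps only the repair.
        return list(self.repair)


def correct(prefix: List[str], span: DisfluencySpan, suffix: List[str]) -> str:
    """Rebuild the utterance with the disfluent parts removed."""
    return " ".join(prefix + span.fluent_tokens() + suffix)


# Assumed segmentation of the Figure 1 sentence (illustrative only).
span = DisfluencySpan(
    reparandum=["let", "us"],
    interregnum=["okay"],
    repair=["let", "us"],
)
suffix = ["take", "a", "look", "here"]

disfluent = " ".join(span.disfluent_tokens() + suffix)
fluent = correct([], span, suffix)

print(disfluent)  # let us okay let us take a look here
print(fluent)     # let us take a look here
```

A span with only an interregnum, such as a lone filled pause, yields an empty list from `fluent_tokens`, so the same helper removes it entirely.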
In many cases, however, the interregnum of a disfluency is empty and the repair follows directly after the reparandum. In the repair, the words from the reparandum are finally corrected or repeated (repetition/correction) or a completely new sentence is started (false start). Note that in the latter case, the extent of the repair cannot be determined.

The three terms reparandum, interregnum, and repair can be used to explain repetitions, false starts, and editing terms. The reparandum and repair can be empty in a disfluent sentence. This situation fits the criteria for three different disfluency types, viz., discourse markers, filled pauses and interjections. These three types consist only of an interregnum. Figure 2 shows a breakdown with an empty interregnum and Figure 3 shows a breakdown with an empty reparandum and an empty repair.

[Figure 1: Surface Structure of Disfluency. Example: “Let us, okay, let us take a look here.” Labelled parts, in order: Reparandum, Interruption Point, Interregnum, Repair.]

[Figure 2: Disfluencies with empty interregnum. Example: “So we will, , we can take a look here.” Labelled parts, in order: Reparandum, Interruption Point, Interregnum, Repair.]

[Figure 3: Disfluencies with empty reparandum and empty repair. Example: “How about, , well, next week?” Labelled parts, in order: Reparandum, Interruption Point, Interregnum, Repair.]

3 Types of Disfluencies

This section describes the different types of disfluencies that can be found in disfluent text. These disfluency types are present in the Switchboard corpus. The annotation of disfluencies can vary slightly from corpus to corpus. Disfluencies can be divided into two sub-groups, viz., simpler and complex disfluencies. Filled pauses like oh, uh, um and discourse markers like yeah, well, okay, you know are considered simpler disfluencies. Sometimes, a single-word discourse marker, like the word Yeah in the sentence “Yeah, we are leaving now.”, is considered a filled pause. We differentiate between filler words and discourse markers, even in single-word occurrences, as this distinction is also present in the annotated Switchboard corpus. Now, we will look into complex disfluency types, viz., Repetition or Correction, False Start, Edit, and Aside. For the distinction between the categories Repetition or Correction and False Start, it is important to consider whether the phrase which has been abandoned is repeated with only slight or no changes in the syntactical structure. The change can be in the form of Insertion, Deletion, or Substitution of words. A slight change or no change identifies it as a Repetition or Correction disfluency. On the other hand, if a completely different syntactical structure with different semantics is chosen for the repair, the observed disfluency is a false start.

The disfluency classification is important and is used to determine the type of disfluencies one wants to correct in the disfluent text. It also forms the basis for the classifiers one can train to learn the disfluency type domain embeddings. Generally, the approaches do not depend on the type of disfluencies, but making explicit use of the annotated corpus and incorporating the knowledge of specific disfluency types into the models is beneficial. Table 1 describes the different disfluency types, their definitions and examples.

| Disfluency Type | Description | Constituents | Example |
|---|---|---|---|
| Filled Pause | Non-lexicalized sounds with no semantic content. | uh, um, ah, etc. | We’re uh getting ready. |
| Interjection | A restricted group of non-lexicalized sounds indicating affirmation or negation. An interjection is a part of speech that demonstrates the emotion or feeling of the author. | uh-huh, mhm, mm, uh-uh, nah, oops, yikes, woops, phew, alas, blah, gee, ugh. | 1. I dropped my phone again, ugh. 2. Oops, I didn’t mean it. |
| Discourse Marker | Words that are related to the structure of the discourse in so far that they help beginning or keeping a turn or serve as acknowledgment. They do not contribute to the semantic content. These are also called linking words. | okay, so, well, you know, etc. | 1. Well, this is good. 2. This is, you know, a pretty good report. |
| Restart or Correction | Exact repetition or correction of words previously uttered. A correction may involve substitutions, deletions or insertions of words. However, the correction continues with the same idea or train of thought started previously. | - | 1. This is is a bad bad situation. 2. Are you you happy? |
| False Start | An utterance is aborted and restarted with a new idea or train of thought. | - | 1. We’ll never find a day what about next month? 2. Yes no I’m not coming. |
| Edit | Phrases of words which occur after that part of a disfluency which is repeated or corrected afterwards or even abandoned completely. They refer explicitly to the words which just previously have been said, indicating that they are not intended to belong to the utterance. | - | We need two tickets, I’m sorry, three tickets for the flight to Boston. |

Table 1: Disfluency Types, Description and Examples.

4 Approaches

In this section, we will discuss two approaches to correct disfluencies in disfluent text. The problem statement is: “Correct disfluencies present in transcribed utterances (e.g. noisy ASR output) of conversational speech (e.g. telephonic conversations, lectures delivered, etc.) by removing the “disfluent” part without changing the intended meaning of the speaker.”

4.1 Style Transfer for Disfluency Correction

1. Architecture

Figure 4 shows the two directions of translation. The model obtains latent disfluent and latent fluent utterances from the non-parallel fluent and disfluent sentences, respectively, which are further reconstructed back into fluent and disfluent sentences. A back-translation-based objective is employed, followed by reconstruction for both domains, i.e. disfluent and fluent text. For every mini-batch of training, soft translations for a domain are first generated (denoted by $\bar{x}$ and $\bar{y}$ in Figure 4), and are subsequently translated back into their original domains to reconstruct the mini-batch of input sentences. The sum of token-level cross-entropy losses between the input and the reconstructed output serves as the reconstruction loss.

Components in a neural model can be shared minimally, completely, or in a controlled fashion. Here, complete parameter sharing is used, which treats the model as a black box for both translation directions and offers maximum simplicity.

Advantages of Parameter Sharing:

• In sequence to sequence tasks, sharing parameters between encoders helps to improve the accuracy when the different sources are related.
• Similarly, when the targets are related, parameter sharing helps to improve the accuracy.
• Parameter sharing allows the model to benefit from what is learned through the back-propagated loss of the different translation directions. Since we are only operating on the English language in the source (disfluent) and target (fluent), it is imperative to utilize the benefit of parameter sharing.

Disadvantages of Parameter Sharing:

• Sometimes, sharing of encoders and decoders burdens the parameters with learning a large representation in limited space.

This bottleneck can be avoided by increasing the number of layers in both encoders and decoders. The encoders and decoders are shared for both translation directions: disfluent-to-fluent and fluent-to-disfluent. In a sequence to sequence transduction task, the encoder takes an input