Neural Sign Language Translation

Necati Cihan Camgoz¹, Simon Hadfield¹, Oscar Koller², Hermann Ney², Richard Bowden¹
¹University of Surrey, {n.camgoz, s.hadfield, r.bowden}@surrey.ac.uk
²RWTH Aachen University, {koller, ney}@cs.rwth-aachen.de

Abstract

Sign Language Recognition (SLR) has been an active research field for the last two decades. However, most research to date has considered SLR as a naive gesture recognition problem. SLR seeks to recognize a sequence of continuous signs but neglects the underlying rich grammatical and linguistic structures of sign language that differ from spoken language. In contrast, we introduce the Sign Language Translation (SLT) problem. Here, the objective is to generate spoken language translations from sign language videos, taking into account the different word orders and grammar.

We formalize SLT in the framework of Neural Machine Translation (NMT) for both end-to-end and pretrained settings (using expert knowledge). This allows us to jointly learn the spatial representations, the underlying language model, and the mapping between sign and spoken language.

To evaluate the performance of Neural SLT, we collected the first publicly available Continuous SLT dataset, RWTH-PHOENIX-Weather 2014T¹. It provides spoken language translations and gloss level annotations for German Sign Language videos of weather broadcasts. Our dataset contains over 0.95M frames with >67K signs from a sign vocabulary of >1K and >99K words from a German vocabulary of >2.8K. We report quantitative and qualitative results for various SLT setups to underpin future research in this newly established field. The upper bound for translation performance is calculated at 19.26 BLEU-4, while our end-to-end frame-level and gloss-level tokenization networks were able to achieve 9.58 and 18.13 respectively.

¹ https://www-i6.informatik.rwth-aachen.de/~koller/RWTH-PHOENIX-2014-T/

1. Introduction

Sign Languages are the primary language of the deaf community. Despite common misconceptions, sign languages have their own specific linguistic rules [55] and do not translate the spoken languages word by word. Therefore, the numerous advances in SLR [15] and even the move to the challenging Continuous SLR (CSLR) [33, 36] problem, do not allow us to provide meaningful interpretations of what a signer is saying. This translation task is illustrated in Figure 1, where the sign language glosses give the meaning and the order of signs in the video, but the spoken language equivalent (which is what is actually desired) has both a different length and ordering.

Figure 1. Difference between CSLR and SLT.

Most of the research that has been conducted in SLR to date has approached the task as a basic gesture recognition problem, ignoring the linguistic properties of the sign language and assuming that there is a one-to-one mapping of sign to spoken words. Contrary to SLR, we propose to approach the full translation problem as a NMT task. We use state-of-the-art sequence-to-sequence (seq2seq) based deep learning methods to learn: the spatio-temporal representation of the signs, the relation between these signs (in other words the language model) and how these signs map to the spoken or written language. To achieve this we introduce new vision methods, which mirror the tokenization and embedding steps of standard NMT. We also present the first continuous SLT dataset, RWTH-PHOENIX-Weather 2014T, to allow future research to be conducted towards sign to spoken language translation. The contributions of this paper can be summarized as:

• The first exploration of the video to text SLT problem.
• The first publicly available continuous SLT dataset, PHOENIX14T, which contains video segments, gloss annotations and spoken language translations.
• A broad range of baseline results on the new corpus, including a range of different tokenization and attention schemes, in addition to parameter recommendations.

The rest of this paper is organized as follows: In Section 2 we survey the fields of sign language recognition, seq2seq learning and neural machine translation. In Section 3 we formalize the SLT task in the framework of neural machine translation and describe our pipeline.
We then introduce RWTH-PHOENIX-Weather 2014T, the first continuous SLT dataset, in Section 4. We share our quantitative and qualitative experimental results in Sections 5 and 6, respectively. Finally, we conclude our paper in Section 7 by discussing our findings and the future of the field.

2. Related Work

There are various factors that have hindered progress towards SLT. Although there have been studies such as [9], which recognized isolated signs to construct sentences, to the best of our knowledge no dataset or study exists that achieved SLT directly from videos, until now. In addition, existing linguistic work on SLT has solely dealt with text to text translation. Despite only including textual information, these have been very limited in size (averaging 3000 total words) [46, 54, 52]. The first important factor is that collection and annotation of continuous sign language data is a laborious task. Although there are datasets available from linguistic sources [51, 28] and sign language interpretations from broadcasts [14], they are weakly annotated and lack the human pose information which legacy sign language recognition methods heavily relied on. This has resulted in many researchers collecting isolated sign language datasets [63, 7] in controlled environments with limited vocabulary, thus inhibiting the end goal of SLT. The lack of a baseline dataset for SLR has rendered most research incomparable, robbing the field of competitive progress.

With the development of algorithms that were capable of learning from weakly annotated data [5, 50, 14] and the improvements in the field of human pose estimation [10, 59, 8], working on linguistic data and sign language interpretations from broadcasts became a feasible option. Following these developments, Forster et al. released RWTH-PHOENIX-Weather 2012 [20] and its extended version RWTH-PHOENIX-Weather 2014 [21], which was captured from sign language interpretations of weather forecasts. The PHOENIX datasets were created for CSLR and they provide sequence level gloss annotations. These datasets quickly became a baseline for CSLR.

Concurrently, Deep Learning (DL) [39] has gained popularity and achieved state-of-the-art performance in various fields such as Computer Vision [38], Speech Recognition [2] and more recently in the field of Machine Translation [47]. Until recently SLR methods have mainly used hand-crafted intermediate representations [33, 16] and the temporal changes in these features have been modelled using classical graph based approaches, such as Hidden Markov Models (HMMs) [58], Conditional Random Fields [62] or template based methods [5, 48]. However, with the emergence of DL, SLR researchers have quickly adopted Convolutional Neural Networks (CNNs) [40] for manual [35, 37] and non-manual [34] feature representation, and Recurrent Neural Networks (RNNs) for temporal modelling [6, 36, 17].

One of the most important breakthroughs in DL was the development of seq2seq learning approaches. Strong annotations are hard to obtain for seq2seq tasks, in which the objective is to learn a mapping between two sequences. To be able to train from weakly annotated data in an end-to-end manner, Graves et al. proposed Connectionist Temporal Classification (CTC) Loss [25], which considers all possible alignments between two sequences while calculating the error. CTC quickly became a popular loss layer for many seq2seq applications. It has obtained state-of-the-art performance on several tasks in speech recognition [27, 2] and clearly dominates handwriting recognition [26]. Computer vision researchers adopted CTC and applied it to weakly labeled visual problems, such as lip reading [3], action recognition [30], hand shape recognition [6] and CSLR [6, 17].

Another common seq2seq task is machine translation, which aims to develop methods that can learn the mapping between two languages. Although CTC is popular, it is not suitable for machine translation as it assumes source and target sequences share the same order. Furthermore, CTC assumes conditional independence within target sequences, which doesn't allow networks to learn an implicit language model. This led to the development of Encoder-Decoder Network architectures [31] and the emergence of the NMT field [47]. The main idea behind Encoder-Decoder Networks is to use an intermediary latent space to map two sequences, much like the latent space in auto-encoders [24], but applied to temporal sequences. This is done by first encoding source sequences to a fixed sized vector and then decoding target sequences from this. The first architecture proposed by Kalchbrenner and Blunsom [31] used a single RNN for both encoding and decoding tasks. Later Sutskever et al. [56] and Cho et al. [11] proposed delegating encoding and decoding to two separate RNNs.

Although encoder-decoder networks improved machine translation performance, there is still the issue of an information bottleneck caused by encoding the source sequence into a fixed sized vector, and the long term dependencies between source and target sequences. To address these issues, Bahdanau et al. [4] proposed passing additional information to the decoder using an attention mechanism. Given encoder outputs, their attention function calculates the alignment between source and target sequences. Luong et al. [44] further improved this approach by introducing additional types of attention score calculation and the input-feeding approach. Since then, various attention based architectures have been proposed for NMT, such as GNMT [60], which combines bi-directional and uni-directional encoders in a deep architecture, and [22], which introduced a convolution based seq2seq learning approach. Similar attention based approaches have been applied to various Computer Vision tasks, such as image captioning [61], lip reading [13] and action recognition [19].
                        Figure 2. An overview of our SLT approach that generates spoken language translations of sign language videos.
3. Neural Sign Language Translation

Translating sign videos to spoken language is a seq2seq learning problem by nature. Our objective is to learn the conditional probability p(y|x) of generating a spoken language sentence y = (y_1, y_2, ..., y_U) with U number of words given a sign video x = (x_1, x_2, ..., x_T) with T number of frames. This is not a straightforward task, as the number of frames in a sign video is much higher than the number of words in its spoken language translation (i.e. T ≫ U). Furthermore, the alignment between sign and spoken language sequences is usually unknown and non-monotonic. In addition, unlike other translation tasks that work on text, our source sequences are videos. This renders the use of classic sequence modeling architectures such as the RNN difficult. Instead, we propose combining CNNs with attention-based encoder-decoders to model the conditional probability p(y|x). We experiment with training our approach in an end-to-end manner to jointly learn the alignment and the translation of sign language videos to spoken language sentences. An overview of our approach can be seen in Figure 2. In the remainder of this section, we will describe each component of our architecture in detail.

3.1. Spatial and Word Embeddings:

Neural machine translation methods start with tokenization of source and target sequences and projecting them to a continuous space by using word embeddings [45]. The main idea behind using word embeddings is to transform the sparse one-hot vector representations, where each word is equidistant from each other, into a denser form, where words with similar meanings are closer. These embeddings are either learned from scratch or pretrained on larger datasets and fine-tuned during training. However, contrary to text, signs are visual. Therefore, in addition to using word embeddings for our target sequences (spoken language sentences), we need to learn spatial embeddings to represent sign videos. To achieve this we utilize 2D CNNs. Given a sign video x, our CNN learns to extract non-linear frame level spatial representations as:

    f_t = SpatialEmbedding(x_t)                                (1)

where f_t corresponds to the feature vector produced by propagating a video frame x_t through our CNN.

For word embedding, we use a fully connected layer that learns a linear projection from one-hot vectors of spoken language words to a denser space as:

    g_u = WordEmbedding(y_u)                                   (2)

where g_u is the embedded version of the spoken word y_u.
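As a concrete illustration of Equations 1 and 2, the sketch below pairs a small frame-level CNN with a learned word embedding in PyTorch. The layer sizes, the toy convolutional architecture and the dummy input shapes are illustrative assumptions rather than the configuration used in our experiments; only the interfaces (frame in, f_t out; word index in, g_u out) mirror the equations.

```python
import torch
import torch.nn as nn

class SpatialEmbedding(nn.Module):
    """Toy 2D CNN mapping one video frame x_t to a feature vector f_t (Eq. 1)."""
    def __init__(self, feature_dim=512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),             # collapse the spatial dimensions
        )
        self.fc = nn.Linear(64, feature_dim)

    def forward(self, frames):                   # frames: (T, 3, H, W)
        f = self.conv(frames).flatten(1)         # (T, 64)
        return self.fc(f)                        # (T, feature_dim) = f_{1:T}

class WordEmbedding(nn.Module):
    """Linear projection of spoken-language words to a dense space (Eq. 2)."""
    def __init__(self, vocab_size, embed_dim=300):
        super().__init__()
        # nn.Embedding on word indices is equivalent to a fully connected
        # layer applied to one-hot vectors, which is how Eq. 2 is phrased.
        self.embed = nn.Embedding(vocab_size, embed_dim)

    def forward(self, word_ids):                 # word_ids: (U,)
        return self.embed(word_ids)              # (U, embed_dim) = g_{1:U}

# Dummy usage: a 100-frame video and a 7-word sentence.
f = SpatialEmbedding()(torch.randn(100, 3, 224, 224))              # (100, 512)
g = WordEmbedding(vocab_size=3000)(torch.randint(0, 3000, (7,)))   # (7, 300)
```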
3.2. Tokenization Layer:

In NMT the input and output sequences can be tokenized at many different levels of complexity: characters, words, N-grams or phrases. Low level tokenization schemes, such as the character level, allow smaller vocabularies to be used, but greatly increase the complexity of the sequence modeling problem, and require long term relationships to be maintained. High level tokenization makes the recognition problem far more difficult due to vastly increased vocabularies, but the language modeling generally only needs to consider a small number of neighboring tokens.

As there has been no previous research on SLT, it is not clear what tokenization schemes are most appropriate for this problem. This is exacerbated by the fact that, unlike NMT research, there is no simple equivalence between the tokenizations of the input sign video and the output text. The framework developed in this paper is generic and can use various tokenization schemes on the spatial embedding sequence f_{1:T}:

    z_{1:N} = Tokenization(f_{1:T})                            (3)

In the experiments we explore both "frame level" and "gloss level" input tokenization, with the latter exploiting an RNN-HMM forced alignment approach [36]. The output tokenization is at the word level (as in most modern NMT research), but other output tokenizations could be an interesting avenue for the future.
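The sketch below shows the simplest possible realization of Equation 3 for these two input schemes: frame-level tokenization passes f_{1:T} through unchanged (N = T), while gloss-level tokenization pools the frame features inside each forced-alignment segment. The mean pooling and the hard-coded segment boundaries are illustrative assumptions, not the RNN-HMM procedure of [36].

```python
import torch

def tokenize_frame_level(f):
    """Frame-level tokenization: every frame feature is its own token (N = T)."""
    return f

def tokenize_gloss_level(f, segments):
    """Gloss-level tokenization: pool the frame features inside each
    forced-alignment segment (start, end), giving one token per gloss.
    Mean pooling is an illustrative choice, not prescribed by the paper."""
    return torch.stack([f[s:e].mean(dim=0) for s, e in segments])

f = torch.randn(100, 512)                        # f_{1:T} from the spatial embedding
z_frame = tokenize_frame_level(f)                # (100, 512)
z_gloss = tokenize_gloss_level(f, [(0, 30), (30, 55), (55, 100)])   # (3, 512)
```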
3.3. Attention-based Encoder-Decoder Networks:

To be able to generate the target sentence y from tokenized embeddings z_{1:N} of a sign video x, we need to learn a mapping function B(z_{1:N}) → y which will maximize the probability p(y|x). We propose modelling B using an attention-based encoder-decoder network, which is composed of two specialized deep RNNs. By using these RNNs we break down the task into two phases. In the encoding phase, a sign video's features are projected into a latent space in the form of a fixed size vector, later to be used in the decoding phase for generating spoken sentences.

During the encoding phase, the encoder network reads in the feature vectors one by one. Given a sequence of representations z_{1:N}, we first reverse its order in the temporal domain, as suggested by [56], to shorten the long term dependencies between the beginning of sign videos and spoken language sentences. We then feed the reversed sequence z_{N:1} to the Encoder, which models the temporal changes in video frames and compresses their cumulative representation in its hidden states as:

    o_n = Encoder(z_n, o_{n+1})                                (4)

where o_n is the hidden state produced by recurrent unit n, o_{N+1} is a zero vector, and the final encoder output o_1 corresponds to the latent embedding of the sequence, h_sign, which is passed to the decoder.
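A minimal PyTorch sketch of this encoding phase is given below. It assumes a single-layer GRU and a batch size of one; the recurrent cell type, depth and dimensions of the actual encoder are hyper-parameters not fixed by this sketch. Reversing the token sequence before the recurrence reproduces the z_{N:1} ordering of Equation 4, and the final hidden state plays the role of h_sign.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Reads the reversed tokenized embeddings z_{N:1} and returns the per-step
    outputs plus the final hidden state h_sign (Eq. 4)."""
    def __init__(self, input_dim=512, hidden_dim=1000):
        super().__init__()
        self.rnn = nn.GRU(input_dim, hidden_dim, batch_first=True)

    def forward(self, z):                        # z: (batch, N, input_dim)
        z_rev = torch.flip(z, dims=[1])          # feed z_{N:1}, as suggested by [56]
        outputs, h_sign = self.rnn(z_rev)
        # outputs are in the reversed (z_N ... z_1) order; flip back if the
        # o_{1:N} ordering is needed downstream (e.g. for attention).
        return outputs, h_sign                   # (batch, N, hidden), (1, batch, hidden)

encoder = Encoder()
o, h_sign = encoder(torch.randn(1, 100, 512))    # 100 frame-level tokens
```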
The decoding phase starts by initializing hidden states of the decoder network using the latent vector h_sign. In the classic encoder-decoder architecture [56], this latent representation is the only information source of the decoding phase. By taking its previous hidden state (h_{u-1}) and the word embedding (g_{u-1}) of the previously predicted word (y_{u-1}) as inputs, the decoder learns to generate the next word in the sequence (y_u) and update its hidden state (h_u):

    y_u, h_u = Decoder(g_{u-1}, h_{u-1})                       (5)

where h_0 = h_sign is the spatio-temporal representation of sign language video learned by the Encoder and y_0 is the special token <bos> indicating the beginning of a sentence. This procedure continues until another special token <eos>, which indicates the end of a sentence, is predicted. By generating sentences word by word, the Decoder decomposes the conditional probability p(y|x) into ordered conditional probabilities:

    p(y|x) = \prod_{u=1}^{U} p(y_u | y_{1:u-1}, h_sign)        (6)

which is used to calculate the errors by applying cross entropy loss for each word. For the end-to-end experiments, these errors are back propagated through the encoder-decoder network to the CNN and word embeddings, thus updating all of the network parameters.
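The sketch below mirrors this word-by-word decoding (Equations 5 and 6) for the attention-free case, assuming teacher forcing during training, a single GRU cell, and illustrative token indices and dimensions; it is a sketch of one training step, not our exact implementation.

```python
import torch
import torch.nn as nn

BOS = 1   # assumed index of the <bos> token; at inference time decoding would stop at <eos>

class Decoder(nn.Module):
    """Generates the sentence word by word from h_sign (Eqs. 5 and 6)."""
    def __init__(self, vocab_size, embed_dim=300, hidden_dim=1000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)      # word embedding (Eq. 2)
        self.cell = nn.GRUCell(embed_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)          # p(y_u | y_{1:u-1}, h_sign)

    def forward(self, h_sign, targets):          # targets: (U,) gold words (teacher forcing)
        h, loss = h_sign, 0.0
        prev = torch.tensor([BOS])               # y_0 = <bos>
        for u in range(targets.size(0)):
            g = self.embed(prev)                 # g_{u-1}
            h = self.cell(g, h)                  # Eq. 5: update the hidden state to h_u
            logits = self.out(h)
            loss = loss + nn.functional.cross_entropy(logits, targets[u:u+1])
            prev = targets[u:u+1]                # feed the gold word at train time
        return loss

decoder = Decoder(vocab_size=3000)
# h_sign would come from the encoder (Eq. 4); zeros are used here only for the dummy run.
loss = decoder(h_sign=torch.zeros(1, 1000), targets=torch.randint(3, 3000, (7,)))
loss.backward()                                  # gradients flow back to all parameters
```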
Attention Mechanisms:

A major drawback of using a classic encoder-decoder architecture is the information bottleneck caused by representing a whole sign language video with a fixed sized vector. Furthermore, due to the large number of frames, our networks suffer from long term dependencies and vanishing gradients. To overcome these issues, we utilize attention mechanisms to provide additional information to the decoding phase. By using attention mechanisms our networks are able to learn where to focus while generating each word, thus providing the alignment of sign videos and spoken language sentences. We employ the most prominent attention approach, proposed by Bahdanau et al. [4] and later improved by Luong et al. [44].

The main idea behind attention mechanisms is to create a weighted summary of the source sequence to aid the decoding phase. This summary is commonly known as the context vector and it will be notated as c_u in this paper. For each decoding step u, a new context vector c_u is calculated by taking a weighted sum of the encoder outputs o_{1:N} as:

    c_u = \sum_{n=1}^{N} \gamma_n^u o_n                        (7)

where \gamma_n^u represent the attention weights, which can be interpreted as the relevance of an encoder input z_n to generating the word y_u. When visualized, attention weights also help to display the alignments between sign videos and spoken language sentences learned by the encoder-decoder network. These weights are calculated by comparing the decoder hidden state h_u against each encoder output o_n as:

    \gamma_n^u = \frac{\exp(score(h_u, o_n))}{\sum_{n'=1}^{N} \exp(score(h_u, o_{n'}))}    (8)

where the scoring function depends on the attention mechanism that is being used. In this work we examine two scoring functions. The first one is a multiplication based approach proposed by Luong et al. [44] and the second is a concatenation based function proposed by Bahdanau et al. [4]. These functions are as follows:

    score(h_u, o_n) = h_u^\top W o_n                 [Multiplication]    (9)
    score(h_u, o_n) = V^\top \tanh(W [h_u; o_n])     [Concatenation]

where W and V are learned parameters. The context vector c_u is then combined with the hidden state h_u to calculate the attention vector a_u as:

    a_u = \tanh(W_c [c_u; h_u])                                (10)

Finally, we feed a_u to a fully connected layer to model the ordered conditional probability in Equation 6. Furthermore, a_u is fed to the next decoding step u+1, thus changing Equation 5 to:

    y_u, h_u = Decoder(g_{u-1}, h_{u-1}, a_{u-1})              (11)
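To make Equations 7-10 concrete, the following sketch computes a single decoding step of attention over the encoder outputs with both the multiplicative and the concatenation scoring functions. A shared hidden size for encoder and decoder and the specific dimensions are illustrative assumptions; in the full model the resulting a_u would feed the output layer and the next decoding step (Equation 11).

```python
import torch
import torch.nn as nn

hidden = 1000
W_mul = nn.Linear(hidden, hidden, bias=False)        # W in the multiplicative score (Eq. 9)
W_cat = nn.Linear(2 * hidden, hidden, bias=False)    # W in the concatenation score (Eq. 9)
v_cat = nn.Linear(hidden, 1, bias=False)             # V in the concatenation score (Eq. 9)
W_c   = nn.Linear(2 * hidden, hidden)                # W_c in the attention vector (Eq. 10)

def score_multiplicative(h_u, o):                    # h_u: (1, hidden), o: (N, hidden)
    return (W_mul(o) @ h_u.squeeze(0)).unsqueeze(0)  # (1, N): h_u^T W o_n for every n

def score_concat(h_u, o):
    h_rep = h_u.expand(o.size(0), -1)                # repeat h_u for each encoder output
    return v_cat(torch.tanh(W_cat(torch.cat([h_rep, o], dim=1)))).t()   # (1, N)

def attention_step(h_u, o, score_fn):
    gamma = torch.softmax(score_fn(h_u, o), dim=1)       # Eq. 8: attention weights gamma_n^u
    c_u = gamma @ o                                      # Eq. 7: context vector (1, hidden)
    a_u = torch.tanh(W_c(torch.cat([c_u, h_u], dim=1)))  # Eq. 10: attention vector
    return a_u, gamma

o = torch.randn(100, hidden)                         # encoder outputs o_{1:N}
h_u = torch.randn(1, hidden)                         # current decoder hidden state
a_u, gamma = attention_step(h_u, o, score_multiplicative)
```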