Automatic Assessment of English CEFR Levels Using BERT Embeddings

Veronica Juliana Schmalz (1,3), Alessio Brutti (1,2)

1. Free University of Bozen-Bolzano, Bolzano, Italy
2. Fondazione Bruno Kessler, Trento, Italy
3. KU Leuven, imec research group itec, Kortrijk, Belgium

veronicajuliana.schmalz@kuleuven.be, brutti@fbk.it

Abstract

The automatic assessment of language learners' competences represents an increasingly promising task thanks to recent developments in NLP and deep learning technologies. In this paper, we propose the use of neural models for classifying English written exams into one of the Common European Framework of Reference for Languages (CEFR) competence levels. We employ pre-trained Bidirectional Encoder Representations from Transformers (BERT) models, which provide efficient and rapid language processing on account of attention-based mechanisms and their capacity to capture long-range sequence features. In particular, we investigate augmenting the original learner's text with corrections provided by an automatic tool or by human evaluators. We consider different architectures where the texts and corrections are combined at an early stage, via concatenation before the BERT network, or as a late fusion of the BERT embeddings. The proposed approach is evaluated on two open-source datasets: the English First Cambridge Open Language Database (EFCAMDAT) and the Cambridge Learner Corpus for the First Certificate in English (CLC-FCE). The experimental results show that the proposed approach can predict the learner's competence level with remarkably high accuracy, in particular when large labelled corpora are available. In addition, we observed that augmenting the input text with corrections provides further improvement in the automatic language assessment task.

1 Introduction

Finding a system which objectively evaluates language learners' competences is a daunting task. Several aspects need to be considered, including both subjective factors, such as the age, native language and cognitive capacities of the learner, and learning-related factors, for example the amount and type of received linguistic input (James, 2005; Chapelle and Voss, 2008; Jang, 2017). Indeed, language competences are not holistic, but concern different domains, so that considering the mere formal correctness of learners' language has been shown not to represent a proper assessment procedure (Roever and McNamara, 2006; Harding and McNamara, 2017; Chapelle, 2017). Moreover, human evaluators, despite having to adhere to a predefined scale and guidelines, such as the CEFR (Council of Europe, 2001), have proved to be biased (Karami, 2013) and inaccurate (Figueras, 2012). For these reasons, new language testing methods and tools have been developed. Current state-of-the-art models, such as Transformers, make it possible to process numerous and complex linguistic data efficiently and rapidly, by means of attention-based mechanisms and deep neural networks that capture the features relevant for the targeted task. However, the creation of and access to the necessary language examination resources, including annotations and metadata, appear to date limited. In this paper, we propose using a series of BERT-base models to automatically assign CEFR levels to language learners' exams.

Our aim is to examine the possibility of providing the system with previously generated corrections, produced either by humans or automatically with a language checker. Additionally, we want to analyse the impact of the amount of data on the accuracy of the model in the classification of written exams taken from the English First Cambridge Open Language Database (EFCAMDAT) (Geertzen et al., 2013) and the Cambridge Learner Corpus for the First Certificate in English (CLC-FCE) (Yannakoudakis et al., 2011). In this way, a significant step forward could be made, both in improving the functioning of these automatic systems and in the future collection of data from other languages.
2 Related Works

Automatic language assessment methods concern the creation of fast, effective, unbiased and cross-linguistically valid systems that can both simplify assessment and render it objective. However, achieving such results represents a complex task that researchers have been addressing for years while experimenting with several methodologies and techniques. The first tools to be developed mainly dealt with written texts and exploited Part-of-Speech (PoS) tagging to grade students' essays (Burstein et al., 2013), and latent semantic analysis to evaluate the content, also providing short feedback (Landauer, 2003). Advances in AI, NLP and Automatic Speech Recognition (ASR) led to the additional emergence of systems that assess spoken language skills, such as the SpeechRater (Xi et al., 2008), which considers clarity of expression, pronunciation and fluency. To date, several other automatic language assessment tools are applied in the domain of large-scale testing, for example Criterion (Attali, 2004), Project Essay Grade (Wilson and Roscoe, 2020), MyAccess! (Chen and Cheng, 2008) and Pigai (Zhu, 2019). The first can detect grammatical and usage-based errors, as well as punctuation mistakes, and also provides feedback; however, it requires being trained on the specific topics to assess. The second system exploits a training set of human-scored essays to score unseen texts, evaluating diction, grammar and complexity from statistical and linguistic models. Similarly, MyAccess!, calibrated with a large number of essays, can score learners' texts and measure advanced features such as syntactic and lexical complexity, content development and word choice, providing detailed feedback. By contrast, Pigai exploits NLP to compare the essays submitted by students with those contained in its corpora, measuring the distance between the two (Zhu, 2019). Despite the extreme efficiency of these tools, to perform accurately they generally need large amounts of labelled and human-corrected training data. Furthermore, a standard scale is needed, which can be extended between different groups of learners. In addition, powerful computational resources and, in certain cases, significant memory are required. All these elements together constitute fundamental pre-requisites which can be difficult to fulfil. For this reason, we present an approach distinct from the previous ones which, starting from different amounts of students' original texts, provides a classification within the different CEFR levels exploiting BERT-base models and subsidiary corrections.

3 Proposed Approach

The approach we propose for the automatic assessment of the language competences of adult English language learners is based on the use of Transformer-type architectures performing multi-class classification. Among these, BERT-based models, characterised by efficient parallel training and the capacity to capture long-range sequence features, distinguish themselves for their size and amount of training data (Vaswani et al., 2017). Being pre-trained on large generic corpora with Masked Language Modelling (MLM) and Next Sentence Prediction (NSP) strategies, they can be conveniently employed in a wide range of tasks, including text classification, language understanding and machine translation.

The models we use for our experiments are grounded on the BERT-base-uncased architecture, part of the Hugging Face Transformers library released in 2019 (Wolf et al., 2020) and inspired by BERT (Devlin et al., 2018) from Google Research, which encodes input texts into low-dimensional embeddings.
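As a concrete illustration, a learner text can be encoded with this pre-trained model roughly as follows. This is a minimal sketch rather than the code used in the paper, and the example sentence is invented: each text is mapped to a sequence of 768-dimensional contextual vectors, which the architectures described below reduce to a single embedding.

```python
# Minimal sketch (not the authors' code): encoding one learner text with the
# pre-trained, frozen BERT-base-uncased model from the Transformers library.
from transformers import BertTokenizer, TFBertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = TFBertModel.from_pretrained("bert-base-uncased")
bert.trainable = False  # the pre-trained encoder is kept frozen (see Section 4.4)

text = "Yesterday I go to the cinema with my friends."  # hypothetical learner sentence
encoding = tokenizer(text, truncation=True, max_length=512, return_tensors="tf")

outputs = bert(encoding)
print(outputs.last_hidden_state.shape)  # (1, number_of_tokens, 768): one contextual vector per token
```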
Our baseline model maps these compact representations into the CEFR levels using a network with two fully connected layers. Fig. 1(a) graphically represents the architecture. Note that this approach requires training the final classifier only; retraining or fine-tuning the BERT model would probably require very large datasets, which are not always available for this task. In order to augment the input text with corrections (either automatic or human), we investigate two possible directions. The first one (Fig. 1(b)) concatenates the two texts and applies the pre-trained BERT model; the resulting embeddings are expected to encode the information related to both texts. Conversely, the second architecture extracts individual embeddings for the original texts and the corrected ones. These are then merged and processed by the classifier, as shown in Fig. 1(c).

Figure 1: Proposed architectures for CEFR prediction. a) Baseline: original learners' texts as input; b) Concatenation: model taking the original learners' texts and the corrections concatenated; c) Two-streams: model processing the original learners' texts and the corrections with separate streams.

We resort to these types of models in order to efficiently process texts while capturing long-range sequence features, thanks to parallel word processing and self-attention mechanisms. Regardless of the length of the texts, the architecture should indeed be able to accurately categorise the examinations according to the CEFR A1, A2, B1, B2 and C1 levels of competence. These, in fact, are fed to the model as labels during training, together with single contextual embeddings, or concatenated ones if corrections are included. Note that we do not provide the model with any indication about the types of errors in the original text: this information is directly extracted by the model when processing the original text together with its corrected version.
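To make the difference between the input strategies concrete, the following hedged sketch (not the authors' code) computes the pooled embeddings underlying each variant of Fig. 1: the original text alone (a), the original text and its correction encoded together as a single sequence pair (b), and two separately encoded texts whose embeddings are merged (c). The example texts are invented, mean pooling is used as in Section 4.4, and merging by concatenation is an assumption, since the merge operation is not spelled out in this excerpt.

```python
# Hedged sketch of the three input strategies in Fig. 1 (a: baseline, b: early fusion,
# c: late fusion). Example texts and the concatenation-based merge are assumptions.
import tensorflow as tf
from transformers import BertTokenizer, TFBertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = TFBertModel.from_pretrained("bert-base-uncased")
bert.trainable = False  # only the downstream classifier is trained

def pooled_embedding(*texts: str) -> tf.Tensor:
    """Mean-pooled BERT embedding of a single text or of a text pair."""
    enc = tokenizer(*texts, truncation=True, max_length=512, return_tensors="tf")
    return tf.reduce_mean(bert(enc).last_hidden_state, axis=1)  # shape (1, 768)

original = "I am agree with this opinion."   # hypothetical learner text
corrected = "I agree with this opinion."     # human or automatic correction

emb_a = pooled_embedding(original)                          # (a) baseline: original text only
emb_b = pooled_embedding(original, corrected)               # (b) early fusion: texts encoded as one pair
emb_c = tf.concat([pooled_embedding(original),
                   pooled_embedding(corrected)], axis=-1)   # (c) late fusion: merged embeddings, (1, 1536)
```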
4 Experimental Analysis

We evaluate the architectures described above, using both automatic and human corrections, on two English open-source datasets: EFCAMDAT and CLC-FCE. We also experiment with varying the amount of training material. The performance of the models is measured in terms of weighted classification accuracy.

4.1 EFCAMDAT Dataset

The EFCAMDAT dataset constitutes one of the largest language learner datasets currently available (Geertzen et al., 2013). The version we use contains 1,180,310 essays submitted by adult English learners from more than 172 different nationalities, covering 16 distinct levels compliant with the CEFR proficiency ones. Each essay has been corrected and evaluated by language instructors; in addition to the original texts, their corrected versions and annotated errors are also included.

We considered a subset of the dataset comprising 100,000 tests. Table 1 reports the distribution of the exams across the different CEFR levels, including also the average numbers of violations identified by both human evaluators and the automatic tool, normalized by the average text length. Note that the average errors per word decrease as the level of competence increases. Observe also that the automatic errors tend to be more numerous than the human ones, in particular for low competence levels. We use the official test partition composed of 1,447 essays. The development set is a 20% subset of the training set.

levels   n. exams   average length   manual errors per word   automatic errors per word
A1       37,290     40               4·10^-2                  10·10^-2
A2       36,618     67               4·10^-2                  6·10^-2
B1       18,119     92               4·10^-2                  5·10^-2
B2       6,042      129              3·10^-2                  4·10^-2
C1       1,732      170              2·10^-2                  3·10^-2

Table 1: EFCAMDAT dataset (sample of 100,000 exams): number of exams per CEFR level, mean text length (in tokens), and mean number of manually and automatically annotated errors per word.

4.2 CLC-FCE Dataset

The CLC-FCE dataset is a collection of texts produced by adult learners for English as a Second or Other Language (ESOL) examinations from the First Certificate in English (FCE) written exam, which attests a B2 CEFR level (Yannakoudakis et al., 2011). The learners' productions, consisting of two texts, have been evaluated with a score between 0 and 5.3, and the errors have been classified into 77 classes. Following the guidelines of the authors, the average score of the two texts has been mapped to CEFR levels, as shown in Table 2. Note that only 4 levels are available in this dataset and that the labels do not uniformly match the ones present in EFCAMDAT. Table 2 also reports the distribution of the texts across the 4 classes with the error partitions. We notice that, in this case, manual errors have been annotated in more detail and they are indeed more numerous than the automatic ones. In general, the number of errors is higher than what is observed in EFCAMDAT. Also for this corpus, the average amount of errors per word, both automatic and manual, decreases as the level increases. The total number of texts within the corpus is 2,469. We employed a data partition according to which 2,017 examinations constituted the training set, whereas the remaining 194 constituted the test set. Additionally, 10% of the training material was used as the validation set. From the entire corpus we had to exclude 10 texts since they were not provided with an assigned score. Despite its small size, CLC-FCE represents an important resource given its systematic analysis of errors and the human corrections provided.

scores      levels   n. exams   average length   manual errors per word   automatic errors per word
0.0 - 1.1   A2       10         220              16·10^-2                 7·10^-2
1.2 - 2.3   B1       417        205              14·10^-2                 7·10^-2
3.1 - 4.3   B2       1,414      212              9·10^-2                  6·10^-2
5.1 - 5.3   C1       265        234              6·10^-2                  4·10^-2

Table 2: CLC-FCE dataset: assigned scores and number of exams per CEFR level, mean text length (in tokens), and mean number of manually and automatically annotated errors per word.

4.3 LanguageTool

In both datasets, the content written by language learners varies according to the levels of competence they were supposed to demonstrate. In addition to the human corrections provided with the data, we have generated automatic corrections using LanguageTool (Miłkowski, 2010), a language checker capable of detecting grammatical, syntactical, orthographic and stylistic errors in order to automatically correct texts of different nature and length (Naber and others, 2003). The automatic checker is based on surface text processing, does not use a deep parser and does not require a fully formalised grammar. By means of this, we have applied the pre-defined rules for the English language to the learners' essays, generating new corrected texts for EFCAMDAT and for CLC-FCE. These were used as additional input data for the experiments.
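As an illustration of this correction step, the sketch below generates an automatic correction for a single learner sentence. The use of the language_tool_python wrapper and the example sentence are assumptions; the paper only states that LanguageTool's predefined English rules were applied to the essays.

```python
# Hedged sketch: produce a LanguageTool-corrected version of one learner text.
# The language_tool_python wrapper is an assumption; the paper does not name the
# specific interface used to run LanguageTool.
import language_tool_python

tool = language_tool_python.LanguageTool("en-US")

def auto_correct(text: str) -> str:
    """Apply LanguageTool's English rules and return the corrected text."""
    matches = tool.check(text)                        # detected rule violations
    return language_tool_python.utils.correct(text, matches)

original = "She go to school every days."             # hypothetical learner sentence
corrected = auto_correct(original)                    # corrected text used as additional input
print(len(tool.check(original)), "violations found")
```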
4.4 Implementation Details

Our models have been implemented using Keras and Hugging Face's pre-trained BERT-base-uncased architecture (Wolf et al., 2020). The models' encoder module, consisting of a Multi-Head Attention and a Feed Forward component, receives as inputs the original learners' exams, together with possible additional human or automatic corrections. The transformed contextual embeddings are obtained by applying Global Average Pooling to the outputs of the pre-trained, frozen BERT head. The classifier consists of a Dense layer of 768 units, with ReLU activation and a Dropout rate of 0.2, followed by another Dense layer with fewer units (128) and the same activation function and Dropout rate [1]. Lastly, the output layer consists of a Dense layer with Softmax as activation function, and the models' final logits correspond to the different CEFR levels within which the texts are respectively classified.

[1] https://www.kaggle.com/akensert/bert-base-tf2-0-now-huggingface-transformer
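A minimal sketch of this classification head is given below, operating on a pooled BERT embedding (768-dimensional for a single stream, or twice that when the two streams of Fig. 1(c) are concatenated). The layer sizes, activations and dropout follow the description above, while the optimizer and loss are assumptions, since the training configuration is not specified in this excerpt.

```python
# Hedged sketch of the classification head from Section 4.4: two fully connected
# layers with ReLU and dropout, followed by a softmax output over the CEFR levels.
# Optimizer and loss are assumptions (not stated in the excerpt).
import tensorflow as tf

EMB_DIM = 768      # size of the pooled BERT-base embedding (one stream)
NUM_LEVELS = 5     # A1, A2, B1, B2, C1 for EFCAMDAT (4 levels for CLC-FCE)

classifier = tf.keras.Sequential([
    tf.keras.layers.Dense(768, activation="relu", input_shape=(EMB_DIM,)),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(NUM_LEVELS, activation="softmax"),  # one output per CEFR level
])

classifier.compile(optimizer="adam",
                   loss="sparse_categorical_crossentropy",
                   metrics=["accuracy"])
classifier.summary()
```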