Automatic Assessment of English CEFR Levels Using BERT Embeddings

Veronica Juliana Schmalz (1,3), Alessio Brutti (1,2)

1. Free University of Bozen-Bolzano, Bolzano, Italy
2. Fondazione Bruno Kessler, Trento, Italy
3. KU Leuven, imec research group itec, Kortrijk, Belgium

veronicajuliana.schmalz@kuleuven.be, brutti@fbk.it

Abstract

The automatic assessment of language learners' competences represents an increasingly promising task thanks to recent developments in NLP and deep learning technologies. In this paper, we propose the use of neural models for classifying English written exams into one of the Common European Framework of Reference for Languages (CEFR) competence levels. We employ pre-trained Bidirectional Encoder Representations from Transformers (BERT) models, which provide efficient and rapid language processing on account of attention-based mechanisms and their capacity to capture long-range sequence features. In particular, we investigate augmenting the original learner's text with corrections provided by an automatic tool or by human evaluators. We consider different architectures where the texts and corrections are combined at an early stage, via concatenation before the BERT network, or as a late fusion of the BERT embeddings. The proposed approach is evaluated on two open-source datasets: the English First Cambridge Open Language Database (EFCAMDAT) and the Cambridge Learner Corpus for the First Certificate in English (CLC-FCE). The experimental results show that the proposed approach can predict the learner's competence level with remarkably high accuracy, in particular when large labelled corpora are available. In addition, we observed that augmenting the input text with corrections provides further improvement in the automatic language assessment task.

1 Introduction

Finding a system which objectively evaluates language learners' competences is a daunting task. Several aspects need to be considered, including both subjective factors, such as the age, native language and cognitive capacities of the learner, and learning-related factors, for example the amount and type of received linguistic input (James, 2005; Chapelle and Voss, 2008; Jang, 2017). Indeed, language competences are not holistic, but concern different domains, so that considering the mere formal correctness of learners' language has been shown not to represent a proper assessment procedure (Roever and McNamara, 2006; Harding and McNamara, 2017; Chapelle, 2017). Moreover, human evaluators, despite having to adhere to a predefined scale and guidelines, such as the CEFR (Council of Europe, 2001), have proved to be biased (Karami, 2013) and inaccurate (Figueras, 2012). For these reasons, new language testing methods and tools have been developed. Current state-of-the-art models, such as Transformers, make it possible to process numerous and complex linguistic data efficiently and rapidly, by means of attention-based mechanisms and deep neural networks that capture the features relevant for the targeted task. However, the creation of and access to the necessary language examination resources, including annotations and metadata, appear to date limited. In this paper, we propose using a series of BERT-base models to automatically assign CEFR levels to language learners' exams.

Our aim is to examine the possibility of providing the system with previously generated corrections, produced either by humans or automatically with a language checker. Additionally, we want to analyse the impact of the amount of data on the accuracy of the model in the classification of written exams taken from the English First Cambridge Open Language Database (EFCAMDAT) (Geertzen et al., 2013) and the Cambridge Learner Corpus for the First Certificate in English (CLC-FCE) (Yannakoudakis et al., 2011). In this way, a significant step forward could be made, both in improving the functioning of these automatic systems and in the future collection of data from other languages.
2 Related Works

Automatic language assessment methods concern the creation of fast, effective, unbiased and cross-linguistically valid systems that can both simplify assessment and render it objective. However, achieving such results represents a complex task that researchers have been addressing for years while experimenting with several methodologies and techniques. The first tools to be developed mainly dealt with written texts and exploited Part-of-Speech (PoS) tagging to grade students' essays (Burstein et al., 2013), and latent semantic analysis to evaluate the content, also providing short feedback (Landauer, 2003). Advances in AI, NLP and Automatic Speech Recognition (ASR) led to the additional emergence of systems that assess spoken language skills, such as the SpeechRater (Xi et al., 2008), which considers clarity of expression, pronunciation and fluency. To date, several other automatic language assessment tools are applied in the domain of large-scale testing, for example Criterion (Attali, 2004), Project Essay Grade (Wilson and Roscoe, 2020), MyAccess! (Chen and Cheng, 2008) and Pigai (Zhu, 2019). The first can detect grammatical and usage-based errors, as well as punctuation mistakes, and also provides feedback; however, it requires being trained on the specific topics to assess. The second system exploits a training set of human-scored essays to score unseen texts, evaluating diction, grammar and complexity from statistical and linguistic models. Similarly, MyAccess!, calibrated with a large number of essays, can score learners' texts and measure advanced features such as syntactic and lexical complexity, content development and word choice, providing detailed feedback. By contrast, Pigai exploits NLP to compare the essays submitted by students with those contained in its corpora, measuring the distance between the two (Zhu, 2019). Despite the extreme efficiency of these tools, to perform accurately they generally need large amounts of labelled and human-corrected training data. Furthermore, a standard scale is needed, which can be extended between different groups of learners. In addition, powerful computational resources and, in certain cases, significant memory are required. All these elements together constitute fundamental pre-requisites which can be difficult to fulfil. For this reason, we present an approach distinct from the previous ones which, starting from different amounts of students' original texts, provides a classification within the different CEFR levels exploiting BERT-base models and subsidiary corrections.

3 Proposed Approach

The approach we propose for the automatic assessment of the language competences of adult English language learners is based on the use of Transformer-type architectures performing multi-class classification. Among these, BERT-based models, characterised by efficient parallel training and the capacity to capture long-range sequence features, distinguish themselves for their size and amount of training data (Vaswani et al., 2017). Being pre-trained on large generic corpora with Masked Language Modelling (MLM) and Next Sentence Prediction (NSP) strategies, they can be conveniently employed in a wide range of tasks, including text classification, language understanding and machine translation.

The models we use for our experiments are grounded on the BERT-base-uncased architecture, part of the Hugging Face Transformers library released in 2019 (Wolf et al., 2020) and inspired by BERT (Devlin et al., 2018) from Google Research, which encodes input texts into low-dimensional embeddings.
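As a concrete illustration, a learner text can be encoded with this pre-trained model roughly as follows. This is a minimal sketch rather than the code used in the paper, and the example sentence is invented: each text is mapped to a sequence of 768-dimensional contextual vectors, which the architectures described below reduce to a single embedding.

```python
# Minimal sketch (not the authors' code): encoding one learner text with the
# pre-trained, frozen BERT-base-uncased model from the Transformers library.
from transformers import BertTokenizer, TFBertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = TFBertModel.from_pretrained("bert-base-uncased")
bert.trainable = False  # the pre-trained encoder is kept frozen (see Section 4.4)

text = "Yesterday I go to the cinema with my friends."  # hypothetical learner sentence
encoding = tokenizer(text, truncation=True, max_length=512, return_tensors="tf")

outputs = bert(encoding)
print(outputs.last_hidden_state.shape)  # (1, number_of_tokens, 768): one contextual vector per token
```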
Our baseline model maps these compact representations into the CEFR levels using a network with two fully connected layers. Fig. 1(a) graphically represents the architecture. Note that this approach requires training the final classifier only; retraining or fine-tuning the BERT model would probably require very large datasets, which are not always available for this task. In order to augment the input text with corrections (either automatic or human), we investigate two possible directions. The first one (Fig. 1(b)) concatenates the two texts and applies the pre-trained BERT model; the resulting embeddings are expected to encode the information related to both texts. Conversely, the second architecture extracts individual embeddings for the original texts and the corrected ones. These are then merged and processed by the classifier, as shown in Fig. 1(c).

Figure 1: Proposed architectures for CEFR prediction. a) Baseline: original learners' texts as input; b) Concatenation: model taking the original learners' texts and the corrections concatenated; c) Two-streams: model processing the original learners' texts and the corrections with separate streams.

We resort to these types of models in order to efficiently process texts while capturing long-range sequence features, thanks to parallel word processing and self-attention mechanisms. Regardless of the length of the texts, the architecture should indeed be able to accurately categorise the examinations according to the CEFR A1, A2, B1, B2 and C1 levels of competence. These, in fact, are fed to the model as labels during training, together with single contextual embeddings, or concatenated ones if corrections are included. Note that we do not provide the model with any indication about the types of errors in the original text: this information is directly extracted by the model when processing the original text together with its corrected version.
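To make the difference between the input strategies concrete, the following hedged sketch (not the authors' code) computes the pooled embeddings underlying each variant of Fig. 1: the original text alone (a), the original text and its correction encoded together as a single sequence pair (b), and two separately encoded texts whose embeddings are merged (c). The example texts are invented, mean pooling is used as in Section 4.4, and merging by concatenation is an assumption, since the merge operation is not spelled out in this excerpt.

```python
# Hedged sketch of the three input strategies in Fig. 1 (a: baseline, b: early fusion,
# c: late fusion). Example texts and the concatenation-based merge are assumptions.
import tensorflow as tf
from transformers import BertTokenizer, TFBertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = TFBertModel.from_pretrained("bert-base-uncased")
bert.trainable = False  # only the downstream classifier is trained

def pooled_embedding(*texts: str) -> tf.Tensor:
    """Mean-pooled BERT embedding of a single text or of a text pair."""
    enc = tokenizer(*texts, truncation=True, max_length=512, return_tensors="tf")
    return tf.reduce_mean(bert(enc).last_hidden_state, axis=1)  # shape (1, 768)

original = "I am agree with this opinion."   # hypothetical learner text
corrected = "I agree with this opinion."     # human or automatic correction

emb_a = pooled_embedding(original)                          # (a) baseline: original text only
emb_b = pooled_embedding(original, corrected)               # (b) early fusion: texts encoded as one pair
emb_c = tf.concat([pooled_embedding(original),
                   pooled_embedding(corrected)], axis=-1)   # (c) late fusion: merged embeddings, (1, 1536)
```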
4 Experimental Analysis

We evaluate the architectures described above, using both automatic and human corrections, on two English open-source datasets: EFCAMDAT and CLC-FCE. We also experiment with varying the amount of training material. The performance of the models is measured in terms of weighted classification accuracy.

4.1 EFCAMDAT Dataset

The EFCAMDAT dataset constitutes one of the largest language learner datasets currently available (Geertzen et al., 2013). The version we use contains 1,180,310 essays submitted by adult English learners from more than 172 different nationalities, covering 16 distinct levels compliant with the CEFR proficiency ones. Each essay has been corrected and evaluated by language instructors; in addition to the original texts, their corrected versions and annotated errors are also included.

We considered a subset of the dataset comprising 100,000 tests. Table 1 reports the distribution of the exams across the different CEFR levels, including also the average numbers of violations identified by both human evaluators and the automatic tool, normalized by the average text length. Note that the average errors per word decrease as the level of competence increases. Observe also that the automatic errors tend to be more numerous than the human ones, in particular for low competence levels. We use the official test partition composed of 1,447 essays. The development set is a 20% subset of the training set.

levels   n. exams   average length   manual errors per word   automatic errors per word
A1       37,290     40               4·10^-2                  10·10^-2
A2       36,618     67               4·10^-2                  6·10^-2
B1       18,119     92               4·10^-2                  5·10^-2
B2       6,042      129              3·10^-2                  4·10^-2
C1       1,732      170              2·10^-2                  3·10^-2

Table 1: EFCAMDAT dataset (sample of 100,000 exams): number of exams per CEFR level, mean text length (in tokens), and mean number of manually and automatically annotated errors per word.

4.2 CLC-FCE Dataset

The CLC-FCE dataset is a collection of texts produced by adult learners for English as a Second or Other Language (ESOL) examinations from the First Certificate in English (FCE) written exam, which attests a B2 CEFR level (Yannakoudakis et al., 2011). The learners' productions, consisting of two texts, have been evaluated with a score between 0 and 5.3, and the errors have been classified into 77 classes. Following the guidelines of the authors, the average score of the two texts has been mapped to CEFR levels, as shown in Table 2. Note that only 4 levels are available in this dataset and that the labels do not uniformly match the ones present in EFCAMDAT. Table 2 also reports the distribution of the texts across the 4 classes with the error partitions. We notice that, in this case, manual errors have been annotated in more detail and they are indeed more numerous than the automatic ones. In general, the number of errors is higher than what is observed in EFCAMDAT. Also for this corpus, the average amount of errors per word, both automatic and manual, decreases as the level increases. The total number of texts within the corpus is 2,469. We employed a data partition according to which 2,017 examinations constituted the training set, whereas the remaining 194 constituted the test set. Additionally, 10% of the training material was used as the validation set. From the entire corpus we had to exclude 10 texts since they were not provided with an assigned score. Despite its small size, CLC-FCE represents an important resource given its systematic analysis of errors and the human corrections provided.

scores      levels   n. exams   average length   manual errors per word   automatic errors per word
0.0 - 1.1   A2       10         220              16·10^-2                 7·10^-2
1.2 - 2.3   B1       417        205              14·10^-2                 7·10^-2
3.1 - 4.3   B2       1,414      212              9·10^-2                  6·10^-2
5.1 - 5.3   C1       265        234              6·10^-2                  4·10^-2

Table 2: CLC-FCE dataset: assigned scores and number of exams per CEFR level, mean text length (in tokens), and mean number of manually and automatically annotated errors per word.

4.3 LanguageTool

In both datasets, the content written by language learners varies according to the levels of competence they were supposed to demonstrate. In addition to the human corrections provided with the data, we have generated automatic corrections using LanguageTool (Miłkowski, 2010), a language checker capable of detecting grammatical, syntactical, orthographic and stylistic errors in order to automatically correct texts of different nature and length (Naber and others, 2003). The automatic checker is based on surface text processing, does not use a deep parser and does not require a fully formalised grammar. By means of this, we have applied the pre-defined rules for the English language to the learners' essays, generating new corrected texts for EFCAMDAT and for CLC-FCE. These were used as additional input data for the experiments.
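As an illustration of this correction step, the sketch below generates an automatic correction for a single learner sentence. The use of the language_tool_python wrapper and the example sentence are assumptions; the paper only states that LanguageTool's predefined English rules were applied to the essays.

```python
# Hedged sketch: produce a LanguageTool-corrected version of one learner text.
# The language_tool_python wrapper is an assumption; the paper does not name the
# specific interface used to run LanguageTool.
import language_tool_python

tool = language_tool_python.LanguageTool("en-US")

def auto_correct(text: str) -> str:
    """Apply LanguageTool's English rules and return the corrected text."""
    matches = tool.check(text)                        # detected rule violations
    return language_tool_python.utils.correct(text, matches)

original = "She go to school every days."             # hypothetical learner sentence
corrected = auto_correct(original)                    # corrected text used as additional input
print(len(tool.check(original)), "violations found")
```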
4.4 Implementation Details

Our models have been implemented using Keras and Hugging Face's pre-trained BERT-base-uncased architecture (Wolf et al., 2020). The models' encoder module, consisting of a Multi-Head Attention and a Feed Forward component, receives as inputs the original learners' exams, together with possible additional human or automatic corrections. The transformed contextual embeddings are obtained by applying Global Average Pooling to the outputs of the pre-trained, frozen BERT head. The classifier consists of a Dense layer of 768 units, with ReLU activation and a Dropout rate of 0.2, followed by another Dense layer with fewer units (128) and the same activation function and Dropout rate [1]. Lastly, the output layer consists of a Dense layer with Softmax as activation function, and the models' final logits correspond to the different CEFR levels within which the texts are respectively classified.

[1] https://www.kaggle.com/akensert/bert-base-tf2-0-now-huggingface-transformer
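A minimal sketch of this classification head is given below, operating on a pooled BERT embedding (768-dimensional for a single stream, or twice that when the two streams of Fig. 1(c) are concatenated). The layer sizes, activations and dropout follow the description above, while the optimizer and loss are assumptions, since the training configuration is not specified in this excerpt.

```python
# Hedged sketch of the classification head from Section 4.4: two fully connected
# layers with ReLU and dropout, followed by a softmax output over the CEFR levels.
# Optimizer and loss are assumptions (not stated in the excerpt).
import tensorflow as tf

EMB_DIM = 768      # size of the pooled BERT-base embedding (one stream)
NUM_LEVELS = 5     # A1, A2, B1, B2, C1 for EFCAMDAT (4 levels for CLC-FCE)

classifier = tf.keras.Sequential([
    tf.keras.layers.Dense(768, activation="relu", input_shape=(EMB_DIM,)),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(NUM_LEVELS, activation="softmax"),  # one output per CEFR level
])

classifier.compile(optimizer="adam",
                   loss="sparse_categorical_crossentropy",
                   metrics=["accuracy"])
classifier.summary()
```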