How to Obtain Reliable Labels for MBTI Classification from Texts?

Sanja Štajner
Symanto Research
Nuremberg, Germany
sanja.stajner@symanto.com

Seren Yenikent
Symanto Research
Nuremberg, Germany
seren.yenikent@symanto.com

Proceedings of Recent Advances in Natural Language Processing, pages 1360–1368, Sep 1–3, 2021. https://doi.org/10.26615/978-954-452-072-4_152

Abstract

Automatic detection of the Myers-Briggs Type Indicator (MBTI) from short posts has attracted noticeable attention in the last few years. Recent studies showed that this is quite a difficult task, especially on commonly used Twitter data. Obtaining MBTI labels is also difficult, as human annotation requires trained psychologists, and the automatic way of obtaining them is through long questionnaires of questionable usability for the task. In this paper, we present a method for collecting reliable MBTI labels via only four carefully selected questions that can be applied to any type of textual data.

1 Introduction

The Myers-Briggs Type Indicator (MBTI) model (Briggs-Myers and Myers, 1995) is one of the most widely used non-clinical psychometric models (Štajner and Yenikent, 2020). It classifies people into two groups across four dimensions: extraversion/introversion (E/I), sensing/intuition (S/N), thinking/feeling (T/F), and judgement/perception (J/P). This leads to a total of 16 personality types. The first three dimensions are based on the theoretical work of Carl Jung (1921), while the fourth dimension was added later by Myers and Briggs-Myers (1995). The MBTI personality framework has already been used for decades in educational and industry settings, e.g. for finding jobs that best resonate with the person's preferences for information processing (S/N and T/F dimensions), finding work organization types that best resonate with the person's preferred judgement processes (J/P dimension), thus leading to better job satisfaction, and for better matching work environments with the person's preferences (E/I dimension) to lower employee turnover (Briggs-Myers and Myers, 1995).

The original MBTI questionnaire contains 93 questions and is not freely available.¹ Due to the popularity of the MBTI framework (it is estimated that more than 2 million US adults complete the inventory every year),² there are a number of freely available alternative MBTI questionnaires on the internet, with the 16personalities test³ being one of the most popular ones. According to the Myers-Briggs Foundation⁴ and the 16personalities test website,⁵ both questionnaires satisfy the accepted standards for test validity and reliability. Nevertheless, the MBTI questionnaires have received noticeable criticism from the academic community (Pittenger, 1993; Boyle, 1995) for not relying on a scientifically proven (i.e. data-driven) background, but rather on qualitative measures such as observation and introspection. The other common criticism is the binary nature of the questionnaire, as it is known that the majority of people usually lie somewhere in the middle of the scales (Pittenger, 1993).

¹ https://www.myersbriggs.org/using-type-as-a-professional/versions-of-the-mbti-questionnaire/
² https://www.verywellmind.com/the-myers-briggs-type-indicator-2795583#the-mbti-today
³ https://www.16personalities.com/free-personality-test
⁴ myersbriggs.org
⁵ https://www.16personalities.com/articles/reliability-and-validity

Questionnaire-based personality detection has several weaknesses: it requires trained human assessors; it is prone to social desirability bias (Krumpal, 2011) and the reference-group effect (Heine et al., 2002); and it is questionable whether answering questionnaires is a natural way of showing one's personality (as opposed to free writing or behaviour "when nobody watches"). To detect MBTI typologies in a more natural way and without the necessity for trained human assessors, many studies in the last several years have attempted to build systems for automatic detection of MBTI personality types from text. Attempts have been made at automatic detection of MBTI personality types from: tweets written in English (Plank and Hovy, 2015), six other Western European languages (Verhoeven et al., 2016), and Japanese (Yamada et al., 2019); English posts collected from the Personality Cafe forum⁶ available in Kaggle;⁷ and English Reddit comments (Gjurković and Šnajder, 2018; Gjurković et al., 2020). Despite being trained on large amounts of textual data (over one million), and modelled as four binary classification tasks, the best systems performed only slightly better than the random and majority-class baselines, regardless of the architecture used.

⁶ https://www.personalitycafe.com/
⁷ https://www.kaggle.com/datasnaek/mbti-type

Some studies suggested that tweets might not contain sufficient amounts of MBTI signals (even after concatenating up to 150–200 tweets per user) due to the nature of Twitter posts (Celli and Lepri, 2018; Štajner and Yenikent, 2020, 2021). Another issue with all those studies and the obtained results might be that the systems are supervised and were trained with gold labels obtained via MBTI questionnaires that suffer from all the earlier mentioned weaknesses. In our recent study (Štajner and Yenikent, 2021), we found a low association between the MBTI types obtained via questionnaires and the MBTI signals found in the short texts written by participants (tweets and free texts on carefully chosen topics). At the same time, the inter-annotator agreement of two expert annotators assigning MBTI types based on those free texts was quite high (Štajner and Yenikent, 2021).

Contributions. To avoid all previously mentioned problems in automatic MBTI detection from texts, in this study we propose a carefully designed set of four questions with answers on a 1–5 scale (Section 3) that aim to capture the main MBTI characteristics without taking much time from participants, and that can be administered together with any open-end questions without the need for trained human assessors. The validity of our questionnaire has been assessed via expert human annotation following a previously proposed annotation methodology (Štajner and Yenikent, 2021). The agreement between the answers to the newly proposed questions and the expert human annotations was found to be similar to that between two trained annotators (Section 5.2). Another advantage of the proposed method is that it goes beyond binary typology by offering a 5-point scale for each MBTI dimension. This creates a possibility for filtering out those instances written by people who exhibit a similar amount of signals from both polarities. As it is known that many people have characteristics of both polarities across MBTI dimensions (Pittenger, 1993), such filtering of training datasets might lead to better performance of automatic systems for MBTI detection from texts by removing noise.

2 Related Work

Plank and Hovy (2015) were the first to explore the use of Twitter data for obtaining a large-scale dataset for open-vocabulary automatic detection of MBTI personality traits. They collected a corpus of 1.2M English tweets automatically labelled for gender and MBTI type. To identify the users for whom an MBTI type can be automatically assigned, the authors relied on mentions of any of the 16 MBTI types plus the word "Briggs". Additionally, each user was labelled as female or male whenever it was discernible; those users for whom the gender was not discernible were excluded from the study. For each selected Twitter user, the authors collected up to 2000 most recent tweets (to be included, each user had to have at least 100 tweets). Plank and Hovy (2015) found that the distribution of MBTI types across the selected Twitter users significantly differs from the distribution of MBTI types across the general US population. The authors further trained binary classification models (for each MBTI dimension separately) using various features and model architectures. The best systems outperformed majority-class baselines only for the I/E and T/F dimensions.

Verhoeven et al. (2016) used a similar strategy for obtaining large-scale MBTI datasets for six other languages: German, Italian, Dutch, French, Portuguese, and Spanish. As opposed to the work of Plank and Hovy (2015), the triggers for identifying users whose MBTI types can be automatically assigned were mentions of one of the 16 personality types and the word "personality" or pronouns and verb forms such as "I am" or "I have", for each of the six languages. All retrieved contexts were manually checked for whether or not they describe the personality of the writer of the post. For all users whose posts passed this check, the gender was annotated based on the user's name, handle, description, and profile picture (Verhoeven et al., 2016). Distributions of MBTI types across Twitter users of the six languages were found to be similar, with only a few exceptions (Verhoeven et al., 2016). The authors also trained binary classifiers using the dataset with 200 concatenated tweets for each user and a LinearSVC classifier with binary word and character n-gram features. As for English (Plank and Hovy, 2015), in most of the languages the best classifiers outperformed the majority-class baselines only for the E/I and T/F dimensions.

Gjurković and Šnajder (2018) compiled a large-scale MBTI dataset from English Reddit comments by relying on flairs (short introductions of users on various subreddits), which, in the case of the MBTI-related subreddits, usually contain the users' MBTI results. In the subsequent study (Gjurković et al., 2020), the dataset was further enriched with demographic information about the users (age, gender, location, and language) and the labels for two other personality models. The distribution of MBTI types in this dataset also significantly deviated from the general US population (see Figure 3 in Section 6 for a comparison of MBTI type distribution among different populations/datasets).

Automatic assignment of an MBTI type to each user in all above-mentioned studies is based on automatic extraction of contexts in which a certain MBTI type is mentioned. Without manual inspection of each such mention, which was only reported for the study by Verhoeven et al. (2016), the assigned labels might not be reliable, as they may refer to someone else mentioned in the tweet and not the writer of the tweet, or they might be a part of a larger phrase, e.g. "I think/believe I am an INTP" or "I expect to get ESFJ as the result if I do personality assessment".

To the best of our knowledge, the only study in which MBTI labels were obtained by explicitly asking participants to report their MBTI type, if they had done an MBTI personality test in the past, is our recent study (Štajner and Yenikent, 2021). The Amazon Mechanical Turk workers were also asked to describe their favourite type of vacations and preferred hobbies in a minimum of 300 characters each. We found that this type of text (responses to carefully selected open-end questions) contains more MBTI signals than tweets (even if concatenated together for each user). We further proposed detailed guidelines for MBTI personality annotation from textual data, and showed that expert human annotators have a high level of agreement among themselves on the obtained textual answers when following the provided guidelines. At the same time, we found that the annotators have a low level of agreement with the MBTI types reported by participants (based on their previous MBTI personality testing via popular questionnaires), which might be an indication that MBTI results obtained via questionnaires do not resonate well with the MBTI signals found in more natural textual forms.

The current study aims to overcome previously reported issues by proposing four questions with answers on a 1–5 scale to obtain MBTI labels that better resonate with the expert human MBTI annotations on short texts.

3 Questionnaire

The whole questionnaire consisted of one optional question ("You might have obtained your MBTI type in the past via questionnaires. If you know your MBTI type, please type it here"), four compulsory demographic questions, four compulsory questions with answers on a 1–5 scale that aimed to capture the participants' MBTI type, and two compulsory open-end questions. Demographic questions encompassed gender, age, whether or not English is their native language, and the highest level of education obtained (Figure 1). The gender question had four possible answers: female, male, other, prefer not to specify. Five age groups were offered to choose from: 18–25, 26–35, 36–45, 46–55, and over 55.

Figure 1: Demographic questions.

After answering demographic questions, participants were provided with four questions that aimed to capture their MBTI type, and were asked to provide an answer on a 1–5 point scale. Those four questions are the central contribution of this study.

Following the idea that aspects of leisure time represent the most natural version of personality, as it is directed by high degrees of intrinsic motivation (Štajner and Yenikent, 2021), the questions are focussed on typical leisure time activities: hobbies and vacations. This also gave us the opportunity to utilize the previously proposed open-end questions (Štajner and Yenikent, 2021) in the validation process (Section 5). In deciding the content of the questions for each individual dimension, we followed the main definitions provided by Briggs-Myers and Myers (1995). Although each MBTI dimension corresponds to multiple practical and behavioral characteristics, the core theoretical focus for every dimension is consistent.

The first question (for the E/I dimension) was designed with the idea of capturing whether the person prefers to be surrounded by people and social interactions, on one end of the scale (1 = extraverted), or to spend quiet and calm time by themselves, on the other end of the scale (5 = introverted). The second question (for the S/N dimension) aims to capture the characteristics of the tasks people would prefer to process, concrete or intuitive, by asking whether they prefer technical and hands-on hobbies (1 = sensing) or abstract and imaginative ones (5 = intuitive). The third MBTI dimension (T/F) is fundamentally about how people make their decisions, whether based on rational or emotional motives. As people do not engage in strict decision-making processes during their free time, which is ultimately based on their personal interests, the question measured the preference for rational (1 = thinking) or emotional (5 = feeling) reasoning for liking a certain hobby. The fourth question aimed to capture the preference for a spontaneous and flexible (1 = perceiving) or a well-planned (5 = judging) schedule on vacations.

We initially prepared two questions for each MBTI dimension and performed a pilot study with 30 participants to choose those questions (Figure 2) that better correspond to the MBTI types provided by the participants and the MBTI annotations by two annotators.

Figure 2: Questions for obtaining MBTI labels.

Finally, participants were asked to answer two open-end questions, which we previously proposed (Štajner and Yenikent, 2021) as the optimal questions for annotating MBTI types from texts:

• Describe which kind of vacations you typically enjoy and why.
• Describe what type of hobbies you enjoy and why.

The two questions were preceded by the following instructions: "The following questions aim to understand your life style preferences. While answering, please write down the first things that come to your mind without much contemplation." To be accepted, each answer needed to contain a minimum of 300 characters.

4 Challenges in Data Collection

Data was collected via the Amazon Mechanical Turk (AMT) platform. We prepared the questionnaire in Google Forms and provided the link to it in the HIT on the AMT platform. We experimented with various setups in the platform: different values for monetary compensation, allowing only those participants with high scores on previous tasks, and different times for validation of the answers and payment. The only variable that noticeably influenced the time needed for obtaining completed HITs was whether or not we restricted the participants according to their performance on previous HITs. Without any restrictions, we were