SiPOS: A Benchmark Dataset for Sindhi Part-of-Speech Tagging

Wazir Ali and Zenglin Xu
SMILE Lab, School of Computer Science and Engineering
University of Electronic Science and Technology of China, Chengdu 611731, China
{aliwazirjam,zenglin}@gmail.com

Jay Kumar
Data Mining Lab, School of Computer Science and Engineering
University of Electronic Science and Technology of China, Chengdu 611731, China

Abstract

In this paper, we introduce the SiPOS dataset for part-of-speech tagging in the low-resource Sindhi language, together with quality baselines. The dataset consists of more than 293K tokens annotated with sixteen universal part-of-speech categories. Two experienced native annotators annotated SiPOS using the Doccano text annotation tool, reaching an inter-annotator agreement of 0.872. We exploit the conditional random field, the popular bidirectional long short-term memory neural model, and a self-attention mechanism with various settings to evaluate the proposed dataset. Besides pre-trained GloVe and fastText representations, character-level representations are incorporated to extract character-level information using a bidirectional long short-term memory encoder. A high accuracy of 96.25% is achieved with the task-specific joint word-level and character-level representations. The SiPOS dataset is likely to be a significant resource for the low-resource Sindhi language.

1 Introduction

An annotated corpus is an essential resource for developing automatic natural language processing (NLP) systems (Ali et al., 2020). Such language resources (LRs) play a significant role in the digital survival of human languages (Jamro, 2017). Part-of-speech (POS) tagging, which assigns the appropriate in-context POS tag to each word, is a fundamental preprocessing task in various NLP applications (Mahar and Memon, 2010). One of the main challenges (Britvić, 2018) in POS tagging is ambiguity, since one word can take several possible POS labels. Another problem is the unspoken or complex POS of words. Neither problem is rare in natural languages. Moreover, there is a lack of benchmark labeled datasets for Sindhi POS tagging. To tackle these challenges, we propose a novel benchmark SiPOS tagset.

Sindhi is a morphologically rich and complex language (Rahman, 2010). It is also a low-resource language (Ali et al., 2019) that lacks the primary LRs required for mature computational processing. Sindhi is written in two well-known writing systems, Persian-Arabic and Devanagari, and more recently the Roman script (Sodhar et al., 2019) has been gaining popularity. Persian-Arabic is the standard script and the one most frequently used in literary work, online communication, and journalism. Sindhi POS tagging has previously been investigated in several scripts, including Persian-Arabic, Devanagari (Jamro, 2017), and Roman (Sodhar et al., 2019). Nevertheless, the low-resource Sindhi language still lacks a POS-labeled dataset for supervised text classification.

In this paper, we introduce a novel benchmark SiPOS dataset for the low-resource Sindhi language. Due to the scarcity of open-source POS-labeled data, two experienced native annotators performed the POS annotation of Sindhi text using the Doccano (Nakayama et al., 2018) text annotation tool. To the best of our knowledge, this is the first attempt to address Sindhi POS tagging at a large scale by proposing a new gold-standard SiPOS dataset¹ and exploiting a conditional random field (CRF), a bidirectional long short-term memory (BiLSTM) network, and self-attention for its evaluation. Our novel contributions are as follows:

- We release a novel open-source gold-standard SiPOS dataset for the low-resource Sindhi language. We manually tagged more than 293K tokens of the Sindhi news corpus using the Doccano text annotation tool.
- We compute the inter-annotator agreement and exploit CRF, BiLSTM, and self-attention models to evaluate the proposed dataset under different settings.

- Besides pre-trained GloVe and fastText word-level representations, task-specific character-level word representations are incorporated to extract character-level information using a BiLSTM encoder.

¹ The SiPOS dataset is publicly available at https://github.com/AliWazir/SiPOS-Dataset

2 Related Work

Labeling natural language text with POS tags can be a complicated task, requiring much effort even for trained annotators (Rane et al., 2020). A large number of LRs are publicly available for high-resource languages such as English (Marcus and Marcinkiewicz), Chinese, and Indian languages (Baskaran Sankaran and Subbarao, 2008; Khan et al., 2019), among others (Petrov et al., 2012). Unlike rich-resource languages such as English and Chinese, with their abundant publicly accessible LRs, Sindhi is relatively low-resource (Ali et al., 2019) and lacks a POS-tagged dataset that can be used to train a supervised or statistical algorithm. Previously, Mahar and Memon (2010) proposed a Sindhi POS-labeled dataset consisting of 33K tokens. Later, Dootio and Wagan (2019) published a new dataset containing a 6.8K lexicon, which is insufficient to train a robust supervised classification algorithm. More recently, Rahman et al. (2020) annotated 100K words by employing a multi-layer annotation model comprising different annotation layers such as POS, morphological features, dependency structure, and phrase structure; however, their dataset is not publicly available. Besides Sindhi Persian-Arabic, POS-tagged datasets in the Devanagari (Motlani et al., 2015) and Roman (Sodhar et al., 2019) scripts have also been introduced. The POS-tagged corpus of Sindhi-Devanagari consists of 44K tokens, while the Sindhi-Roman dataset (Sodhar et al., 2019) contains only 100 sentences. This review of existing work shows that the low-resource Sindhi language lacks a benchmark POS-labeled dataset for supervised text classification.

3 Corpus Acquisition and Annotation

In this section, we describe the adopted annotation methodology. We utilized the news corpus (Ali et al., 2019) of the popular and widely circulated Sindhi newspapers Kawish and Awami-Awaz (see Table 1). Two native graduate students of linguistics were engaged for the annotation and used the Doccano (Nakayama et al., 2018) text annotation tool to assign a POS label to each token. The annotation process is detailed below.

Table 1: Statistics of the news articles used for the annotation of the SiPOS tagset.

Resource      Articles   Sentences   Tokens
Kawish        563        3,769       158,145
Awami-Awaz    458        3,015       135,539
Total         1,021      6,784       293,684

3.1 Preprocessing

The Sindhi news corpus contains a certain amount of unwanted data (Ali et al., 2019). Filtering out such data and normalizing the text is therefore essential to obtain a more authentic vocabulary for the annotation project. Preprocessing consists of the following steps, sketched in code after this list:

- Removal of unwanted repeated punctuation marks from the start and end of sentences.
- Filtration of noisy data such as non-Sindhi words, special characters, HTML tags, emails, and URLs.
- Tokenization to normalize the text, and removal of duplicates and multiple white spaces.
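A minimal sketch of these filtering steps, assuming simple regex-based rules; the exact patterns, and the Unicode ranges used to detect non-Sindhi text, are illustrative assumptions, since the paper does not specify them:

```python
import re

# Illustrative patterns only; the paper does not give the authors' exact rules.
HTML_TAG = re.compile(r"<[^>]+>")
URL = re.compile(r"https?://\S+|www\.\S+")
EMAIL = re.compile(r"\S+@\S+\.\S+")
# Keep Arabic-script letters (covering Sindhi Persian-Arabic), digits,
# whitespace, and basic punctuation; treat everything else as noise.
NON_SINDHI = re.compile(r"[^\u0600-\u06FF\u0750-\u077F0-9\s.،؛؟]")
EDGE_PUNCT = re.compile(r"^[\s.،؛؟]+|[\s.،؛؟]+$")

def preprocess(sentences):
    """Clean raw sentences and return whitespace-tokenized, deduplicated output."""
    seen, cleaned = set(), []
    for sent in sentences:
        sent = HTML_TAG.sub(" ", sent)
        sent = URL.sub(" ", sent)
        sent = EMAIL.sub(" ", sent)
        sent = NON_SINDHI.sub(" ", sent)
        sent = EDGE_PUNCT.sub("", sent)            # trim edge punctuation
        sent = re.sub(r"\s+", " ", sent).strip()   # collapse multiple spaces
        if sent and sent not in seen:              # drop exact duplicates
            seen.add(sent)
            cleaned.append(sent.split())           # simple tokenization
    return cleaned
```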
Moreover, annotation requires human effort and careful assessment to ensure consistency in the labeled dataset. Sindhi Persian-Arabic is written from right to left (Jamro, 2017). An example Sindhi sentence with its language-specific and corresponding universal part-of-speech (UPOS) tags is given in Table 2. A Sindhi word comprises one or more clitics or segments (Narejo and Mahar, 2016), typically a stem to which a prefix and a suffix may be attached; tagging can therefore be performed for each clitic in sequence or for a word as a whole. For the annotation, we used Doccano (Nakayama et al., 2018), an open-source annotation platform for sequence labeling and machine translation tasks, to assign a POS label to each token. The annotators also consulted an online Sindhi thesaurus portal² in cases of ambiguity or confusion when deciding the POS label of a token. Moreover, the project supervisor worked with the annotators to monitor annotation quality, following established annotation guidelines (Dipper et al., 2004; Petrov et al., 2012).

² http://dic.sindhila.edu.pk/

Table 2: An example of a Sindhi sentence with its corresponding language-specific and universal part-of-speech tags (the sentence reads from right to left).

Sentence:   . يهآ ودنيو ويارڪ لمع ناس يتخس يت ايڊيم ڪنارٽڪيلا ۽ ٽنرپ
UPOS:       SYM AUX VERB VERB NOUN ADP NOUN ADP NOUN NOUN CONJ NOUN
Tag:        PUNCT AUX VB VB NN ADP NN ADP NN NN CONJ NN
Sindhi POS: يناشن يج ڪهيب نواعم لعف لعف لعف مسا رج فرح مسا رج فرح مسا مسا ولمج فرح مسا

3.2 Consistency Evaluation

To ensure annotation consistency, we measure the inter-annotator agreement, i.e., the degree to which the annotators agreed on the assigned tags. We chose Cohen's Kappa (Cohen, 1960), which measures the agreement between two annotators, and compute it over the POS tag pairs produced by our two annotators for the same tokens. The inter-annotator agreement comes out to 0.872 at the 95% confidence level. This Kappa value shows that the dataset is of acceptable quality.
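For reference, Cohen's Kappa is defined as κ = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement between the annotators and p_e is the agreement expected by chance. A minimal sketch of the computation using scikit-learn; the toy label lists are placeholders for the two annotators' full tag sequences, and the tool choice is ours, since the paper does not name one:

```python
from sklearn.metrics import cohen_kappa_score

# POS labels each annotator assigned to the same tokens
# (toy example; the real input is the full 293K-token corpus).
tags_annotator_a = ["NOUN", "VERB", "ADP", "NOUN", "AUX"]
tags_annotator_b = ["NOUN", "VERB", "ADP", "ADJ", "AUX"]

kappa = cohen_kappa_score(tags_annotator_a, tags_annotator_b)
print(f"Cohen's kappa: {kappa:.3f}")
```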
3.3 SiPOS Dataset

SiPOS has been annotated using the news corpus (Ali et al., 2019) of the Kawish and Awami-Awaz Sindhi newspapers. Sindhi grammar (Oad, 2012) gives the Sindhi POS categories of nouns, verbs, adjectives, adverbs, pronouns, prepositions, conjunctions, numerals, articles, and interjections. The dataset consists of more than 293K tokens annotated with sixteen Sindhi POS and UPOS categories, respectively. The complete statistics of the news corpus used for the annotation are given in Table 1, and the detailed label distribution of SiPOS is given in Table 3.

Table 3: Complete statistics of the SiPOS dataset with the number of tokens per POS label. The highest proportion among the POS labels is noun, followed by preposition and verb.

4 Evaluation Methods

We evaluate the consistency of the SiPOS dataset by computing the inter-annotator agreement using Cohen's Kappa (Cohen, 1960) coefficient. We evaluate the proposed SiPOS dataset by exploiting a CRF, a BiLSTM, and a BiLSTM network that integrates CRF and self-attention, yielding strong baselines. Moreover, pre-trained GloVe and fastText word representations, along with task-specific character-level and joint word-character representations, are incorporated to extract word-level and character-level information using the BiLSTM encoder.

4.1 Conditional Random Field

We initially evaluate the SiPOS dataset using a CRF (Lafferty et al., 2001), which is widely used in sequence classification tasks (Sutton et al., 2012). The CRF is useful for modeling the relationships between labels, jointly decoding the most suitable chain of labels for a given input sentence (Huang et al., 2015).
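A minimal sketch of such a CRF tagger using the sklearn-crfsuite package; the feature template and hyperparameters are illustrative assumptions, not the paper's reported configuration:

```python
import sklearn_crfsuite

def word2features(sent, i):
    """Simple per-token features; an illustrative template only."""
    w = sent[i]
    return {
        "word": w,
        "prefix2": w[:2],   # crude affix/clitic cues
        "suffix2": w[-2:],
        "prev": sent[i - 1] if i > 0 else "<BOS>",
        "next": sent[i + 1] if i < len(sent) - 1 else "<EOS>",
    }

def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]

def train_crf(train_sents, train_tags):
    """train_sents: tokenized sentences; train_tags: their tag sequences
    (loaded from the SiPOS files; loading code omitted)."""
    crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1,
                               max_iterations=100,
                               all_possible_transitions=True)
    crf.fit([sent2features(s) for s in train_sents], train_tags)
    return crf
```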
4.2 Representation Learning

Representation learning aims to capture useful semantic, syntactic, and morphological information (Santos and Zadrozny, 2014) for NLP tasks (Bojanowski et al., 2017). We use pre-trained GloVe (Pennington et al., 2014) and fastText (Bojanowski et al., 2017) word-level representations, character-level representations, and joint character-level and word-level (word-character) representations (Shao et al., 2017; Matteson et al., 2018) to extract word-level features. Pre-trained word representations enable neural models to exploit raw textual data far larger than the annotated data. We then jointly learn task-specific character-level word representations (Liu et al., 2018) using the BiLSTM network; these task-specific contextual representations encode POS-related knowledge.

4.2.1 GloVe

GloVe (Pennington et al., 2014) is a word representation model that combines global word-to-word co-occurrence statistics with a local context window. We obtain the pre-trained GloVe representations by training on a large corpus of more than 61 million words (Ali et al., 2019), filtering out Sindhi stop words (Ali et al., 2019) in the preprocessing step. We train GloVe with AdaGrad, choosing a context window of 5 and 300-dimensional word representations.

4.2.2 fastText

fastText (Bojanowski et al., 2017) is similar to Word2vec (Mikolov et al., 2013) but uses subword information in the prediction model to obtain word representations. We train fastText on the recently proposed unlabelled Sindhi corpus (Ali et al., 2019) of more than 61 million words. In training, we use the recommended sub-sampling (Bojanowski et al., 2017), negative sampling, minimum and maximum character n-gram lengths (Grave et al., 2018), minimum word count, and learning rate, with 300-dimensional representations and the default context window size.

4.2.3 Character-level Word Representations

Character-level representations have an advantage in handling the out-of-vocabulary problem because almost all character representations can be learned from even a small or moderate corpus (Jia and Ma, 2019). In other words, these representations are good at inferring unseen words and at sharing information about morpheme-level regularities. The BiLSTM network learns the character-level representations of words and combines them with the usual word representations to perform POS tagging. We employ the task-oriented strategy of Liu et al. (2018) for the character-level and joint word-character representations learned through the BiLSTM network (Shao et al., 2017; Matteson et al., 2018), which differ from pre-trained word representations. The BiLSTM is good at capturing prefixes and suffixes in the given input text (Zhang et al., 2018). It consists of interconnected forward-LSTM and backward-LSTM hidden layers, which efficiently encode the contextual information in both directions.

4.3 Neural POS Taggers

4.3.1 BiLSTM

The BiLSTM network (Schuster and Paliwal, 1997) has been broadly used in a variety of sequence labelling tasks (Huang et al., 2015; Ma and Hovy, 2016; Peters et al., 2017), including POS tagging (Kann et al., 2018). In this work, we evaluate the SiPOS dataset using the BiLSTM network. The model consists of a representation layer, a BiLSTM encoder, and a softmax output at each position in the final layer. The bidirectional layers extract character-level and word-level features, adopting random initialization to transform words into representations. The word-level pre-trained BiLSTM model is the same as the word-level BiLSTM but adopts GloVe and fastText representations.
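A minimal PyTorch sketch of this word-plus-character BiLSTM tagger; the dimensions, layer sizes, and initialization choices are illustrative assumptions rather than the paper's reported settings:

```python
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    """Word + character BiLSTM POS tagger (illustrative sketch)."""

    def __init__(self, word_vocab, char_vocab, n_tags,
                 word_dim=300, char_dim=30, char_hidden=25, hidden=200):
        super().__init__()
        # Word embeddings: randomly initialized here; the pre-trained variant
        # would load GloVe or fastText vectors into this table instead.
        self.word_emb = nn.Embedding(word_vocab, word_dim)
        self.char_emb = nn.Embedding(char_vocab, char_dim)
        # Character-level BiLSTM encoder (Section 4.2.3).
        self.char_lstm = nn.LSTM(char_dim, char_hidden,
                                 batch_first=True, bidirectional=True)
        # Word-level BiLSTM over the joint word-character representations.
        self.word_lstm = nn.LSTM(word_dim + 2 * char_hidden, hidden,
                                 batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_tags)  # per-position tag scores

    def forward(self, words, chars):
        # words: (batch, seq_len); chars: (batch, seq_len, max_word_len)
        B, S, L = chars.shape
        c = self.char_emb(chars).view(B * S, L, -1)
        _, (h, _) = self.char_lstm(c)             # final fwd/bwd hidden states
        char_repr = torch.cat([h[0], h[1]], dim=-1).view(B, S, -1)
        x = torch.cat([self.word_emb(words), char_repr], dim=-1)
        h_seq, _ = self.word_lstm(x)
        return self.out(h_seq)                    # logits; softmax via the loss
```

Training would minimize a per-token cross-entropy loss over these logits (which applies the softmax); the paper's stronger variants would add a CRF layer or self-attention on top of the BiLSTM outputs.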