SiPOS: A Benchmark Dataset for Sindhi Part-of-Speech Tagging

Wazir Ali and Zenglin Xu
SMILE Lab, School of Computer Science and Engineering
University of Electronic Science and Technology of China, Chengdu 611731, China
{aliwazirjam,zenglin}@gmail.com

Jay Kumar
Data Mining Lab, School of Computer Science and Engineering
University of Electronic Science and Technology of China, Chengdu 611731, China

Abstract

In this paper, we introduce the SiPOS dataset for part-of-speech tagging in the low-resource Sindhi language, together with quality baselines. The dataset consists of more than 293K tokens annotated with sixteen universal part-of-speech categories. Two experienced native annotators annotated SiPOS using the Doccano text annotation tool, reaching an inter-annotator agreement of 0.872. We exploit the conditional random field, the popular bidirectional long short-term memory neural model, and a self-attention mechanism with various settings to evaluate the proposed dataset. Besides pre-trained GloVe and fastText representations, character-level representations are incorporated to extract character-level information using a bidirectional long short-term memory encoder. A high accuracy of 96.25% is achieved with the task-specific joint word-level and character-level representations. The SiPOS dataset is likely to be a significant resource for the low-resource Sindhi language.

1 Introduction

An annotated corpus is an essential resource for developing automatic natural language processing (NLP) systems (Ali et al., 2020). Such language resources (LRs) play a significant role in the digital survival of human languages (Jamro, 2017). Part-of-speech (POS) tagging, which assigns the appropriate in-context POS tag to each word, is a fundamental preprocessing task in various NLP applications (Mahar and Memon, 2010). One of the main challenges (Britvić, 2018) in POS tagging is ambiguity, since one word can take several possible POS labels. Another problem is the unspoken or complex POS of words. Neither problem is rare in natural languages. Moreover, there is a lack of benchmark labeled datasets for Sindhi POS tagging. To tackle these challenges, we propose a novel benchmark SiPOS tagset.

Sindhi is a morphologically rich and complex language (Rahman, 2010). It is also a low-resource language (Ali et al., 2019) that lacks the primary LRs required for mature computational processing. Sindhi is written in two well-known writing systems, Persian-Arabic and Devanagari, and more recently the Roman script (Sodhar et al., 2019) has been gaining popularity. Persian-Arabic is the standard script and the one most frequently used in literary work, online communication, and journalism. Sindhi POS tagging has previously been investigated in several scripts, including Persian-Arabic, Devanagari (Jamro, 2017), and Roman (Sodhar et al., 2019). Nevertheless, the low-resource Sindhi language still lacks a POS-labeled dataset for supervised text classification.

In this paper, we introduce a novel benchmark SiPOS dataset for the low-resource Sindhi language. Due to the scarcity of open-source POS-labeled data, two experienced native annotators performed the POS annotation of Sindhi text using the Doccano (Nakayama et al., 2018) text annotation tool. To the best of our knowledge, this is the first attempt to address Sindhi POS tagging at a large scale by proposing a new gold-standard SiPOS dataset¹ and exploiting a conditional random field (CRF), a bidirectional long short-term memory (BiLSTM) network, and self-attention for its evaluation. Our novel contributions are as follows:

- We release a novel open-source gold-standard SiPOS dataset for the low-resource Sindhi language. We manually tagged more than 293K tokens of the Sindhi news corpus using the Doccano text annotation tool.
- We compute the inter-annotator agreement and exploit CRF, BiLSTM, and self-attention models to evaluate the proposed dataset under different settings.

- Besides pre-trained GloVe and fastText word-level representations, task-specific character-level word representations are incorporated to extract character-level information using a BiLSTM encoder.

¹ The SiPOS dataset is publicly available at https://github.com/AliWazir/SiPOS-Dataset

2 Related Work

Labeling natural language text with POS tags can be a complicated task, requiring much effort even for trained annotators (Rane et al., 2020). A large number of LRs are publicly available for high-resource languages such as English (Marcus and Marcinkiewicz), Chinese, and Indian languages (Baskaran Sankaran and Subbarao, 2008; Khan et al., 2019), among others (Petrov et al., 2012). Unlike rich-resource languages such as English and Chinese, with their abundant publicly accessible LRs, Sindhi is relatively low-resource (Ali et al., 2019) and lacks a POS-tagged dataset that can be used to train a supervised or statistical algorithm. Previously, Mahar and Memon (2010) proposed a Sindhi POS-labeled dataset consisting of 33K tokens. Later, Dootio and Wagan (2019) published a new dataset containing a 6.8K lexicon, which is insufficient to train a robust supervised classification algorithm. More recently, Rahman et al. (2020) annotated 100K words by employing a multi-layer annotation model comprising different annotation layers such as POS, morphological features, dependency structure, and phrase structure; however, their dataset is not publicly available. Besides Sindhi Persian-Arabic, POS-tagged datasets in the Devanagari (Motlani et al., 2015) and Roman (Sodhar et al., 2019) scripts have also been introduced. The POS-tagged corpus of Sindhi-Devanagari consists of 44K tokens, while the Sindhi-Roman dataset (Sodhar et al., 2019) contains only 100 sentences. This review of existing work shows that the low-resource Sindhi language lacks a benchmark POS-labeled dataset for supervised text classification.

3 Corpus Acquisition and Annotation

In this section, we describe the adopted annotation methodology. We utilized the news corpus (Ali et al., 2019) of the popular and widely circulated Sindhi newspapers Kawish and Awami-Awaz (see Table 1). Two native graduate students of linguistics were engaged for the annotation and used the Doccano (Nakayama et al., 2018) text annotation tool to assign a POS label to each token. The annotation process is detailed below.

Table 1: Statistics of the news articles used for the annotation of the SiPOS tagset.

Resource      Articles   Sentences   Tokens
Kawish        563        3,769       158,145
Awami-Awaz    458        3,015       135,539
Total         1,021      6,784       293,684

3.1 Preprocessing

The Sindhi news corpus contains a certain amount of unwanted data (Ali et al., 2019). Filtering out such data and normalizing the text is therefore essential to obtain a more authentic vocabulary for the annotation project. Preprocessing consists of the following steps, sketched in code after this list:

- Removal of unwanted repeated punctuation marks from the start and end of sentences.
- Filtration of noisy data such as non-Sindhi words, special characters, HTML tags, emails, and URLs.
- Tokenization to normalize the text, and removal of duplicates and multiple white spaces.
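A minimal sketch of these filtering steps, assuming simple regex-based rules; the exact patterns, and the Unicode ranges used to detect non-Sindhi text, are illustrative assumptions, since the paper does not specify them:

```python
import re

# Illustrative patterns only; the paper does not give the authors' exact rules.
HTML_TAG = re.compile(r"<[^>]+>")
URL = re.compile(r"https?://\S+|www\.\S+")
EMAIL = re.compile(r"\S+@\S+\.\S+")
# Keep Arabic-script letters (covering Sindhi Persian-Arabic), digits,
# whitespace, and basic punctuation; treat everything else as noise.
NON_SINDHI = re.compile(r"[^\u0600-\u06FF\u0750-\u077F0-9\s.،؛؟]")
EDGE_PUNCT = re.compile(r"^[\s.،؛؟]+|[\s.،؛؟]+$")

def preprocess(sentences):
    """Clean raw sentences and return whitespace-tokenized, deduplicated output."""
    seen, cleaned = set(), []
    for sent in sentences:
        sent = HTML_TAG.sub(" ", sent)
        sent = URL.sub(" ", sent)
        sent = EMAIL.sub(" ", sent)
        sent = NON_SINDHI.sub(" ", sent)
        sent = EDGE_PUNCT.sub("", sent)            # trim edge punctuation
        sent = re.sub(r"\s+", " ", sent).strip()   # collapse multiple spaces
        if sent and sent not in seen:              # drop exact duplicates
            seen.add(sent)
            cleaned.append(sent.split())           # simple tokenization
    return cleaned
```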
Moreover, annotation requires human effort and careful assessment to ensure consistency in the labeled dataset. Sindhi Persian-Arabic is written from right to left (Jamro, 2017). An example Sindhi sentence with its language-specific and corresponding universal part-of-speech (UPOS) tags is given in Table 2. A Sindhi word comprises one or more clitics or segments (Narejo and Mahar, 2016), typically a stem to which a prefix and a suffix may be attached; tagging can therefore be performed for each clitic in sequence or for a word as a whole. For the annotation, we used Doccano (Nakayama et al., 2018), an open-source annotation platform for sequence labeling and machine translation tasks, to assign a POS label to each token. The annotators also consulted an online Sindhi thesaurus portal² in cases of ambiguity or confusion when deciding the POS label of a token. Moreover, the project supervisor worked with the annotators to monitor annotation quality, following established annotation guidelines (Dipper et al., 2004; Petrov et al., 2012).

² http://dic.sindhila.edu.pk/

Table 2: An example of a Sindhi sentence with its corresponding language-specific and universal part-of-speech tags (the sentence reads from right to left).

Sentence:   . يهآ ودنيو ويارڪ لمع ناس يتخس يت ايڊيم ڪنارٽڪيلا ۽ ٽنرپ
UPOS:       SYM AUX VERB VERB NOUN ADP NOUN ADP NOUN NOUN CONJ NOUN
Tag:        PUNCT AUX VB VB NN ADP NN ADP NN NN CONJ NN
Sindhi POS: يناشن يج ڪهيب نواعم لعف لعف لعف مسا رج فرح مسا رج فرح مسا مسا ولمج فرح مسا

3.2 Consistency Evaluation

To ensure annotation consistency, we measure the inter-annotator agreement, i.e., the degree to which the annotators agreed on the assigned tags. We chose Cohen's Kappa (Cohen, 1960), which measures the agreement between two annotators, and compute it over the POS tag pairs produced by our two annotators for the same tokens. The inter-annotator agreement comes out to 0.872 at the 95% confidence level. This Kappa value shows that the dataset is of acceptable quality.
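For reference, Cohen's Kappa is defined as κ = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement between the annotators and p_e is the agreement expected by chance. A minimal sketch of the computation using scikit-learn; the toy label lists are placeholders for the two annotators' full tag sequences, and the tool choice is ours, since the paper does not name one:

```python
from sklearn.metrics import cohen_kappa_score

# POS labels each annotator assigned to the same tokens
# (toy example; the real input is the full 293K-token corpus).
tags_annotator_a = ["NOUN", "VERB", "ADP", "NOUN", "AUX"]
tags_annotator_b = ["NOUN", "VERB", "ADP", "ADJ", "AUX"]

kappa = cohen_kappa_score(tags_annotator_a, tags_annotator_b)
print(f"Cohen's kappa: {kappa:.3f}")
```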
3.3 SiPOS Dataset

SiPOS has been annotated using the news corpus (Ali et al., 2019) of the Kawish and Awami-Awaz Sindhi newspapers. Sindhi grammar (Oad, 2012) gives the Sindhi POS categories of nouns, verbs, adjectives, adverbs, pronouns, prepositions, conjunctions, numerals, articles, and interjections. The dataset consists of more than 293K tokens annotated with sixteen Sindhi POS and UPOS categories, respectively. The complete statistics of the news corpus used for the annotation are given in Table 1, and the detailed label distribution of SiPOS is given in Table 3.

Table 3: Complete statistics of the SiPOS dataset with the number of tokens per POS label. The highest proportion among the POS labels is noun, followed by preposition and verb.

4 Evaluation Methods

We evaluate the consistency of the SiPOS dataset by computing the inter-annotator agreement using Cohen's Kappa (Cohen, 1960) coefficient. We evaluate the proposed SiPOS dataset by exploiting a CRF, a BiLSTM, and a BiLSTM network that integrates CRF and self-attention, yielding strong baselines. Moreover, pre-trained GloVe and fastText word representations, along with task-specific character-level and joint word-character representations, are incorporated to extract word-level and character-level information using the BiLSTM encoder.

4.1 Conditional Random Field

We initially evaluate the SiPOS dataset using a CRF (Lafferty et al., 2001), which is widely used in sequence classification tasks (Sutton et al., 2012). The CRF is useful for modeling the relationships between labels, jointly decoding the most suitable chain of labels for a given input sentence (Huang et al., 2015).
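A minimal sketch of such a CRF tagger using the sklearn-crfsuite package; the feature template and hyperparameters are illustrative assumptions, not the paper's reported configuration:

```python
import sklearn_crfsuite

def word2features(sent, i):
    """Simple per-token features; an illustrative template only."""
    w = sent[i]
    return {
        "word": w,
        "prefix2": w[:2],   # crude affix/clitic cues
        "suffix2": w[-2:],
        "prev": sent[i - 1] if i > 0 else "<BOS>",
        "next": sent[i + 1] if i < len(sent) - 1 else "<EOS>",
    }

def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]

def train_crf(train_sents, train_tags):
    """train_sents: tokenized sentences; train_tags: their tag sequences
    (loaded from the SiPOS files; loading code omitted)."""
    crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1,
                               max_iterations=100,
                               all_possible_transitions=True)
    crf.fit([sent2features(s) for s in train_sents], train_tags)
    return crf
```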
4.2 Representation Learning

Representation learning aims to capture useful semantic, syntactic, and morphological information (Santos and Zadrozny, 2014) for NLP tasks (Bojanowski et al., 2017). We use pre-trained GloVe (Pennington et al., 2014) and fastText (Bojanowski et al., 2017) word-level representations, character-level representations, and joint character-level and word-level (word-character) representations (Shao et al., 2017; Matteson et al., 2018) to extract word-level features. Pre-trained word representations enable neural models to exploit raw textual data far larger than the annotated data. We then jointly learn task-specific character-level word representations (Liu et al., 2018) using the BiLSTM network; these task-specific contextual representations encode POS-related knowledge.

4.2.1 GloVe

GloVe (Pennington et al., 2014) is a word representation model that combines global word-to-word co-occurrence statistics with a local context window. We obtain the pre-trained GloVe representations by training on a large corpus of more than 61 million words (Ali et al., 2019), filtering out Sindhi stop words (Ali et al., 2019) in the preprocessing step. We train GloVe with AdaGrad, choosing a context window of 5 and 300-dimensional word representations.

4.2.2 fastText

fastText (Bojanowski et al., 2017) is similar to Word2vec (Mikolov et al., 2013) but uses subword information in the prediction model to obtain word representations. We train fastText on the recently proposed unlabelled Sindhi corpus (Ali et al., 2019) of more than 61 million words. In training, we use the recommended sub-sampling (Bojanowski et al., 2017), negative sampling, minimum and maximum character n-gram lengths (Grave et al., 2018), minimum word count, and learning rate, with 300-dimensional representations and the default context window size.

4.2.3 Character-level Word Representations

Character-level representations have an advantage in handling the out-of-vocabulary problem because almost all character representations can be learned from even a small or moderate corpus (Jia and Ma, 2019). In other words, these representations are good at inferring unseen words and at sharing information about morpheme-level regularities. The BiLSTM network learns the character-level representations of words and combines them with the usual word representations to perform POS tagging. We employ the task-oriented strategy of Liu et al. (2018) for the character-level and joint word-character representations learned through the BiLSTM network (Shao et al., 2017; Matteson et al., 2018), which differ from pre-trained word representations. The BiLSTM is good at capturing prefixes and suffixes in the given input text (Zhang et al., 2018). It consists of interconnected forward-LSTM and backward-LSTM hidden layers, which efficiently encode the contextual information in both directions.

4.3 Neural POS Taggers

4.3.1 BiLSTM

The BiLSTM network (Schuster and Paliwal, 1997) has been broadly used in a variety of sequence labelling tasks (Huang et al., 2015; Ma and Hovy, 2016; Peters et al., 2017), including POS tagging (Kann et al., 2018). In this work, we evaluate the SiPOS dataset using the BiLSTM network. The model consists of a representation layer, a BiLSTM encoder, and a softmax output at each position in the final layer. The bidirectional layers extract character-level and word-level features, adopting random initialization to transform words into representations. The word-level pre-trained BiLSTM model is the same as the word-level BiLSTM but adopts GloVe and fastText representations.
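A minimal PyTorch sketch of this word-plus-character BiLSTM tagger; the dimensions, layer sizes, and initialization choices are illustrative assumptions rather than the paper's reported settings:

```python
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    """Word + character BiLSTM POS tagger (illustrative sketch)."""

    def __init__(self, word_vocab, char_vocab, n_tags,
                 word_dim=300, char_dim=30, char_hidden=25, hidden=200):
        super().__init__()
        # Word embeddings: randomly initialized here; the pre-trained variant
        # would load GloVe or fastText vectors into this table instead.
        self.word_emb = nn.Embedding(word_vocab, word_dim)
        self.char_emb = nn.Embedding(char_vocab, char_dim)
        # Character-level BiLSTM encoder (Section 4.2.3).
        self.char_lstm = nn.LSTM(char_dim, char_hidden,
                                 batch_first=True, bidirectional=True)
        # Word-level BiLSTM over the joint word-character representations.
        self.word_lstm = nn.LSTM(word_dim + 2 * char_hidden, hidden,
                                 batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_tags)  # per-position tag scores

    def forward(self, words, chars):
        # words: (batch, seq_len); chars: (batch, seq_len, max_word_len)
        B, S, L = chars.shape
        c = self.char_emb(chars).view(B * S, L, -1)
        _, (h, _) = self.char_lstm(c)             # final fwd/bwd hidden states
        char_repr = torch.cat([h[0], h[1]], dim=-1).view(B, S, -1)
        x = torch.cat([self.word_emb(words), char_repr], dim=-1)
        h_seq, _ = self.word_lstm(x)
        return self.out(h_seq)                    # logits; softmax via the loss
```

Training would minimize a per-token cross-entropy loss over these logits (which applies the softmax); the paper's stronger variants would add a CRF layer or self-attention on top of the BiLSTM outputs.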