SiPOS: A Benchmark Dataset for Sindhi Part-of-Speech Tagging
Wazir Ali and Zenglin Xu
SMILE Lab, School of Computer Science and Engineering
University of Electronic Science and Technology of China, Chengdu 611731, China
{aliwazirjam,zenglin}@gmail.com
Jay Kumar
Data Mining Lab, School of Computer Science and Engineering
University of Electronic Science and Technology of China, Chengdu 611731, China
Abstract
In this paper, we introduce the SiPOS dataset for part-of-speech tagging in the low-resource Sindhi language with quality baselines. The dataset consists of more than 293K tokens annotated with sixteen universal part-of-speech categories. Two experienced native annotators annotated the SiPOS using the Doccano text annotation tool with an inter-annotator agreement of 0.872. We exploit the conditional random field, the popular bidirectional long-short-term memory neural model, and the self-attention mechanism with various settings to evaluate the proposed dataset. Besides pre-trained GloVe and fastText representations, character-level representations are incorporated to extract character-level information using the bidirectional long-short-term memory encoder. A high accuracy of 96.25% is achieved with the task-specific joint word-level and character-level representations. The SiPOS dataset is likely to be a significant resource for the low-resource Sindhi language.

1 Introduction
An annotated corpus is an essential resource for developing automatic natural language processing (NLP) systems (Ali et al., 2020). Such language resources (LRs) play a significant role in the digital survival of human languages (Jamro, 2017). Part-of-speech (POS) tagging is a fundamental preprocessing task in various NLP applications (Mahar and Memon, 2010), used to assign an appropriate in-context POS tag to each word. One of the main challenges (Britvić, 2018) in POS tagging is ambiguity, since one word can take several possible POS labels. Another problem is the unspoken or complex POS or words. Both of these problems are not rare in natural languages. Moreover, there is a lack of benchmark labeled datasets for Sindhi POS tagging. To tackle these challenges, we propose a novel benchmark SiPOS tagset.

Sindhi is a rich and complex morphological language (Rahman, 2010). It is a low-resource language (Ali et al., 2019) which lacks primary LRs for mature computational processing. Sindhi is written in two well-known writing systems, Persian-Arabic and Devanagari, and more recently Roman script (Sodhar et al., 2019) is also gaining popularity. However, Persian-Arabic is the standard script and is frequently used in literary work, online communication, and journalism. Sindhi POS tagging has been previously investigated in various scripts including Persian-Arabic, Devanagari (Jamro, 2017), and Roman (Sodhar et al., 2019). However, the low-resource Sindhi language lacks a POS labeled dataset for its supervised text classification.

In this paper, we introduce a novel benchmark SiPOS dataset for the low-resource Sindhi language. Due to the scarcity of open-source POS labeled data, two experienced native annotators performed the POS annotation of Sindhi text using the Doccano (Nakayama et al., 2018) text annotation tool. To the best of our knowledge, this is the first attempt to address Sindhi POS tagging at a large scale by proposing a new gold-standard SiPOS dataset (publicly available at https://github.com/AliWazir/SiPOS-Dataset) and exploiting the conditional random field (CRF), the bidirectional long-short-term memory (BiLSTM) network, and self-attention for its evaluation. Our novel contributions are as follows:

• We introduce a novel open-source gold-standard SiPOS dataset for the low-resource Sindhi language. We manually tagged more than 293k tokens of the Sindhi news corpus using the Doccano text annotation tool.

• We compute the inter-annotator agreement and exploit the machine learning models of CRF, BiLSTM, and self-attention to evaluate the proposed dataset with different settings.
• Besides pre-trained GloVe and fastText word-level representations, the task-specific character-level word representations are incorporated to extract character-level information using a BiLSTM encoder.

Proceedings of the Student Research Workshop associated with RANLP-2021, pages 22–30, held online, Sep 1–3, 2021.
https://doi.org/10.26615/issn.2603-2821.2021_004

Table 1: The statistics of news articles utilized for the annotation of SiPOS tagset.
Resource      Articles   Sentences   Tokens
Kawish        563        3,769       158,145
Awami-Awaz    458        3,015       135,539
Total         1,021      6,784       293,684
2 Related Work
The labeling of natural language text with POS tags can be a complicated task, requiring much effort, even for trained annotators (Rane et al., 2020). A large number of LRs are publicly available for high-resource languages such as English (Marcus and Marcinkiewicz), Chinese, Indian languages (Baskaran Sankaran and Subbarao, 2008; Khan et al., 2019) and others (Petrov et al., 2012). Unlike rich-resourced languages such as English and Chinese with abundant publicly accessible LRs, Sindhi is relatively low-resource (Ali et al., 2019) and lacks a POS tagged dataset that can be utilized to train a supervised or statistical algorithm. Previously, Mahar and Memon (2010) proposed a Sindhi POS labeled dataset consisting of 33k tokens. Later, Dootio and Wagan (2019) published a new dataset containing a 6.8K lexicon, which is insufficient to train a robust supervised classification algorithm. More recently, Rahman et al. (2020) annotated 100K words by employing a multi-layer annotation model, which comprises different annotation layers like POS, morphological features, dependency structure, and phrase structure. But their dataset is not publicly available. Besides Sindhi Persian-Arabic, POS tagged datasets in Devanagari (Motlani et al., 2015) and Roman (Sodhar et al., 2019) scripts have also been introduced. The POS tagged corpus of Sindhi-Devanagari consists of 44K tokens, while Sindhi-Roman (Sodhar et al., 2019) only contains 100 sentences. The review of existing work shows that the low-resource Sindhi language lacks a benchmark POS labeled dataset for its supervised text classification.

3 Corpus Acquisition and Annotation
In this section, we illustrate the adopted annotation methodology. We utilized the news corpus (Ali et al., 2019) of the popular and most circulated Sindhi newspapers Kawish and Awami-Awaz (see Table 1). Two native graduate students of linguistics were engaged for the annotation using the Doccano (Nakayama et al., 2018) text annotation tool to assign a POS label to each token. The detailed annotation process is described below.

3.1 Preprocessing
The Sindhi news corpus contains a certain amount of unwanted data (Ali et al., 2019). Thus, filtering out such data and normalizing the text is essential to obtain a more authentic vocabulary for the annotation project. The preprocessing consists of the following steps:

• Removal of unwanted multiple punctuation marks from the start and end of the sentences.

• Filtration of noisy data such as non-Sindhi words, special characters, HTML tags, emails, URLs, etc.

• Tokenization to normalize the text, removal of duplicates, and multiple white spaces.

Moreover, annotation requires human effort and careful assessment to keep the labeled dataset consistent. Sindhi Persian-Arabic is written in the right-to-left direction (Jamro, 2017). An example of a Sindhi sentence is given in Table 2 with language-specific and corresponding universal part-of-speech (UPOS) tags. A Sindhi word comprises one or more clitics or segments (Narejo and Mahar, 2016), typically a stem to which a prefix and suffix may be attached. Therefore, the tagging can be done for each clitic in sequence or for a word as a whole. For the annotation, we used Doccano (Nakayama et al., 2018) to assign a POS label to each token. Doccano is an open-source annotation platform for sequence labeling and machine translation tasks. We engaged two native graduate students of linguistics for the annotation. The annotators also used an online Sindhi Thesaurus portal (http://dic.sindhila.edu.pk/) in case of ambiguity or confusion while deciding a POS label for a token. Moreover, the project supervisor also worked with the annotators to monitor annotation quality by following the annotation guidelines (Dipper et al., 2004; Petrov et al., 2012).
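The three preprocessing steps of Section 3.1 can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the paper does not publish its exact rules, so the regular expressions, the treatment of "non-Sindhi words" as Latin-script words, the rule of keeping one closing punctuation mark, and the function names `clean_sentence`, `tokenize`, and `preprocess` are all hypothetical.

```python
import re
from typing import List

# Hypothetical patterns: the paper lists the steps but not its exact rules.
HTML_TAG = re.compile(r"<[^>]+>")
EMAIL = re.compile(r"\S+@\S+\.\S+")
URL = re.compile(r"(?:https?://|www\.)\S+")
LATIN_WORD = re.compile(r"[A-Za-z]+")  # assumption: "non-Sindhi" means Latin script
# Latin punctuation plus Arabic comma, semicolon, question mark, and full stop.
EDGE = "[.,;:!?\u060C\u061B\u061F\u06D4]"
LEADING = re.compile(rf"^(?:{EDGE}|\s)+")
TRAILING = re.compile(rf"({EDGE})(?:\s*{EDGE})+$")
MARK = re.compile(rf"({EDGE})")


def clean_sentence(sentence: str) -> str:
    """Filter noisy spans and normalize punctuation and white space."""
    text = HTML_TAG.sub(" ", sentence)
    text = EMAIL.sub(" ", text)
    text = URL.sub(" ", text)
    text = LATIN_WORD.sub(" ", text)
    text = " ".join(text.split())      # collapse multiple white spaces
    text = LEADING.sub("", text)       # drop punctuation at the start
    return TRAILING.sub(r"\1", text)   # keep one mark of a trailing run


def tokenize(sentence: str) -> List[str]:
    """Split on white space after separating punctuation marks."""
    return MARK.sub(r" \1 ", sentence).split()


def preprocess(sentences: List[str]) -> List[List[str]]:
    """Clean, de-duplicate, and tokenize sentences, preserving order."""
    seen = set()
    result = []
    for sentence in sentences:
        cleaned = clean_sentence(sentence)
        if not cleaned or cleaned in seen:
            continue                   # skip empty and duplicate sentences
        seen.add(cleaned)
        result.append(tokenize(cleaned))
    return result
```

Calling `preprocess` on a list of raw news sentences returns one token list per unique cleaned sentence, which is the form an annotation tool would need. A production pipeline would likely need script-aware rules (for example Unicode block checks) rather than the simple Latin-word filter assumed here.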
Table 2: An example of a Sindhi sentence with its corresponding language specific and universal part-of-speech tags. The Roman transliteration of each token is given for the ease of reading.
Sentence: پرنٽ ۽ اليڪٽرانڪ ميڊيا تي سختي سان عمل ڪرايو ويندو آهي .
Token        Sindhi POS        Tag      UPOS
پرنٽ         اسم               NN       NOUN
۽            حرف جملو          CONJ     CONJ
اليڪٽرانڪ    اسم               NN       NOUN
ميڊيا        اسم               NN       NOUN
تي           حرف جر            ADP      ADP
سختي         اسم               NN       NOUN
سان          حرف جر            ADP      ADP
عمل          اسم               NN       NOUN
ڪرايو        فعل               VB       VERB
ويندو        فعل               VB       VERB
آهي          فعل معاون         AUX      AUX
.            بيهڪ جي نشاني     PUNCT    SYM
3.2 Consistency Evaluation
To ensure annotation consistency, we measure the inter-annotator agreement, that is, how consistently the annotators agreed on the tags. To measure it, we chose Cohen's Kappa (Cohen, 1960), which measures the agreement between two annotators corrected for the agreement expected by chance. Since we have two annotators, we compute this measure for POS tag pairs that show agreement between two annotators, which leads to two results. The inter-annotator agreement comes out to be 0.872 with a confidence percentile of 95%. This Cohen's Kappa value shows that the dataset is of acceptable quality.

3.3 SiPOS Dataset
The SiPOS has been annotated using the news corpus (Ali et al., 2019) of the Kawish and Awami-Awaz Sindhi newspapers. Sindhi grammar (Oad, 2012) gives the Sindhi POS of nouns, verbs, adjectives, adverbs, pronouns, prepositions, conjunctions, numerals, articles, and interjections. The dataset consists of more than 293k tokens annotated with sixteen Sindhi POS and UPOS categories, respectively. The complete statistics of the news corpus utilized in the annotation are given in Table 1. The detailed label distribution in the SiPOS is given in Table 3.

4 Evaluation Methods
We evaluate the SiPOS for consistency by computing the inter-annotator agreement using Cohen's Kappa (Cohen, 1960) coefficient. We evaluate the proposed SiPOS dataset by exploiting CRF and BiLSTM, and by integrating CRF and self-attention in the BiLSTM network for strong baselines. Moreover, pre-trained GloVe and fastText word representations, task-specific character-level representations, and joint WordCharacter level representations are incorporated to extract word-level and character-level information using the BiLSTM encoder.

4.1 Conditional Random Field
We initially evaluate the SiPOS dataset using a CRF (Lafferty et al., 2001), widely used in sequence classification (Sutton et al., 2012) tasks. The CRF is useful for considering the relationship between labels and jointly decoding the most suitable chain of labels for a given input sentence (Huang et al., 2015).

4.2 Representation Learning
Representation learning aims to capture the useful semantic, syntactic, and morphological information (Santos and Zadrozny, 2014) in NLP tasks (Bojanowski et al., 2017). We use pre-trained GloVe (Pennington et al., 2014) and fastText (Bojanowski et al., 2017) word-level representations, character-level representations, as well as joint character-level and word-level WordCharacter representations (Shao et al., 2017; Matteson et al., 2018) to extract the word-level features. Pre-trained word representations enable neural models to exploit raw textual data that is larger than the annotated data. Then, we jointly learn the task-specific character-level word representations (Liu et al., 2018) using the BiLSTM network. The task-specific contextual representations include the POS-based knowledge.

4.2.1 GloVe
GloVe (Pennington et al., 2014) is a word representation model that relies on two methods: global word-to-word co-occurrence statistics and the local context window. We obtain the pre-trained GloVe representation by training on the large corpus of more than 61 million words (Ali et al., 2019). We train GloVe with AdaGrad, choosing a context window of 5 and 300-dimensional word representations. We filter out Sindhi stop words (Ali et al., 2019) in the preprocessing step.

4.2.2 fastText
fastText (Bojanowski et al., 2017) is similar to Word2vec (Mikolov et al., 2013). It uses subword
information in the prediction model to obtain word representations. We train fastText on the recently proposed unlabelled Sindhi corpus (Ali et al., 2019) of more than 61 million words. In training, we use the recommended sub-sampling (Bojanowski et al., 2017), negative sampling, the minimum and maximum length of character n-grams (Grave et al., 2018), minimum word count, learning rate, 300-dimensional representations, and the default context window size.

Table 3: Complete statistics of SiPOS dataset with the number of POS in each label. The highest proportion in the POS labels is noun, followed by preposition and verb.

4.2.3 Character-level Word Representations
The character-level representations have an advantage in handling the out-of-vocabulary problem because they can learn almost all character representations from even a small or moderate corpus (Jia and Ma, 2019). In other words, these representations are good at inferring unseen words and sharing information about morpheme-level regularities. The BiLSTM network learns the character-level representations of words and associates them with the usual word representations to perform POS tagging. We employ a task-oriented strategy (Liu et al., 2018) for character-level and joint WordCharacter level representations learned through the BiLSTM network (Shao et al., 2017; Matteson et al., 2018), which are different from pre-trained word representations. The BiLSTM is good at capturing prefixes and suffixes from the given input text (Zhang et al., 2018). It consists of interconnected forward LSTM (→) and backward LSTM (←) hidden layers, which efficiently encode the contextual information.

4.3 Neural POS Taggers

4.3.1 BiLSTM
The BiLSTM network (Schuster and Paliwal, 1997) has been broadly used in a variety of sequence labelling tasks (Huang et al., 2015; Ma and Hovy, 2016; Peters et al., 2017) including POS tagging (Kann et al., 2018). In this work, we evaluate the SiPOS dataset using the BiLSTM network. The model consists of a representations layer, a BiLSTM encoder, and a softmax for each position in the final layer. The bidirectional layers extract character-level and word-level features and then adopt a random initialization method to transform words into representations. The BiLSTM word-level (pre-train) model is the same as BiLSTM (word-level) but adopts GloVe and fastText for representations. Sim-