Source: www.lrec-conf.org
Grammar Extraction from Treebanks for Hindi and Telugu
Prasanth Kolachina, Sudheer Kolachina, Anil Kumar Singh, Samar Husain,
Viswanatha Naidu, Rajeev Sangal and Akshar Bharati
Language Technologies Research Centre,
IIIT-Hyderabad, India
{prasanth k, sudheer.kpg08, anil, vnaidu, samar}@research.iiit.ac.in, sangal@iiit.ac.in
Abstract
Grammars play an important role in many Natural Language Processing (NLP) applications. The traditional approach to creating grammars manually, besides being labor-intensive, has several limitations. With the availability of large scale syntactically annotated treebanks, it is now possible to automatically extract an approximate grammar of a language in any of the existing formalisms from a corresponding treebank. In this paper, we present a basic approach to extract grammars from dependency treebanks of two Indian languages, Hindi and Telugu. The process of grammar extraction requires a generalization mechanism. Towards this end, we explore an approach which relies on generalization of argument structure over the verbs based on their syntactic similarity. Such a generalization counters the effect of data sparseness in the treebanks. A grammar extracted using this system can not only expand already existing knowledge bases for NLP tasks such as parsing, but also aid in the creation of grammars for languages where none exist. Further, we show that the grammar extraction process can help in identifying annotation errors and thus aid in the task of the treebank validation.
1. Introduction

Large scale annotated resources such as syntactic treebanks, PropBank, FrameNet, VerbNet, etc. have been at the core of Natural Language Processing (NLP) research for quite some time. For a language like English for which these resources were first developed, they have proved to be indispensable in advancing the state of the art for a host of applications. Following the success of efforts like the Penn TreeBank (PTB) (Marcus et al., 1994) and the Prague Dependency Treebank (Hajicova, 1998), several attempts are underway to build such NLP resources for new languages. One such ongoing effort is to create a treebank for Hindi-Urdu (Bhatt et al., 2009; Palmer et al., 2009; Begum et al., 2008a). Begum et al. describe a dependency annotation scheme based on the Computational Paninian Grammar or CPG (Bharati et al., 1995). The treebank being developed using this annotation scheme currently contains around 2500 sentences. Despite its modest size, the Hindi treebank has helped improve considerably the accuracies for a variety of NLP applications, especially parsing (Bharati et al., 2008).

The role of grammars in the development of advanced NLP systems is well known. Traditionally, the task of creating a grammar for a language involved selecting a formalism and encoding the patterns in that language as rules, constraints etc. But with the availability of large scale syntactically annotated treebanks, it is now possible to automatically extract an approximate grammar of a language in any of the existing formalisms from a corresponding treebank, thus reducing human effort considerably. This method of extracting grammars from treebanks allows for creation and expansion of knowledge bases for parsing. Grammars extracted through this method can be used to evaluate the coverage of existing hand-crafted grammars. The extraction process itself can help detect annotation errors. Another major advantage of extracting grammars from a treebank as compared to the traditional approach of handcrafting grammars is the availability of statistical information in the form of weights associated with the primitive elements in the grammar (Xia, 2001).

One of the important issues with any kind of annotated corpora is data sparseness. Sparseness of annotated data has a detrimental effect on the performance of natural language processing applications trained over such corpora. In the case of syntactic annotation, information about the argument structure of the verb is crucial for applications such as parsing. For instance, there exist differences among individual verbs in the number of their annotated instances based on the frequency of their occurrence. The number of annotated instances greatly varies from verb to verb. In fact, the sparse data also poses a challenge for grammar extraction from treebanks. One of the ways to overcome this limitation of sparse data in syntactic treebanks is through generalization of the argument structure across different verbs. Furthermore, generalization based on clustering can lead to creation of verb classes based on the similarity of argument structure.

In this paper, we present a basic system to extract a dependency grammar in the CPG formalism from treebanks for two languages, Hindi and Telugu. Towards this end, we explore an approach which relies on generalization of argument structure over verbs based on the similarity of their syntactic contexts. A grammar extracted using this system can not only expand an already existing knowledge base for NLP tasks such as parsing, but also aid in the creation of a useful resource. Further, the grammar extraction process can help in identifying annotation errors and thus make the task of the treebank validation easier.

2. Goals of the paper

The main goals of this paper are as follows:

1. To present a system that extracts grammars in the CPG formalism from the Hindi and Telugu treebanks

2. To use the extracted grammar to improve the coverage of an existing hand-crafted grammar for Hindi, which
is being used for parsing (Bharati et al., 2009a)

3. To generalize verb argument structure information over the extracted verb frames to address sparsity in the annotated corpora

4. To aid in the validation of treebanks by detecting different types of annotation errors using the extracted grammars

3. Related Work

In this section we briefly survey some of the work on grammar extraction, generalization using syntactic similarity. We also mention a few details about both the Indian language treebanks that we used. Syntactic alternation can be an important criterion while generalizing verbs. We briefly discuss how syntactic alternation in Hindi differs from English.

3.1. Grammar Extraction

The role of grammars in NLP is more extensive than is generally supposed. Xia (2000) points out that the task of treebanking for a language bears much similarity to the task of manually crafting a grammar. The treebank of a language contains an implicit grammar for that language. Statistical NLP systems trained over a treebank make use of this grammar implicit in the treebank. This is why grammar driven approaches and data driven or statistical approaches are not necessarily mutually exclusive. It is well known that the traditional approach of manually crafting a high quality, large coverage grammar takes tremendous human effort to build and maintain. In addition, the traditional approach does not provide for flexibility, consistency and generalization. To address these limitations of the traditional approach to grammar development, Xia (2001) presents two alternative approaches that generate grammars automatically, one from descriptions (LexOrg) and the other from treebanks (LexTract).

The LexTract system extracts explicit grammars in the TAG formalism from a treebank. It is not, however, limited to the TAG formalism as it can also extract CFGs from a treebank. Large scale treebanks such as the English Penn Treebank (PTB) are not based on existing grammars. Instead, they were manually annotated following the annotation guidelines. Since the process of creating annotation guidelines is similar to the process of building a grammar by hand, it can be assumed that an implicit grammar, hidden in the annotation guidelines, generates the structures in the treebank. This implicit grammar can be called a treebank grammar. As suggested by Xia, the task of grammar extraction using LexTract can be seen as the task of converting this implicit treebank grammar to an explicit TAG grammar. LexTract builds an LTAG grammar in two stages. First, it converts the annotated phrase structure trees in the PTB into LTAG derived trees. In the second stage, it decomposes these derived trees into a set of elementary trees which form the basic units of an LTAG grammar. It also extracts derivation trees which provide information about the order of operations necessary to build the corresponding derived trees. In her work, Xia has demonstrated the process for treebanks of three languages: English, Chinese and Korean. She also showed that grammars extracted using LexTract have several applications. They can be used as stand-alone grammars for languages that do not have existing grammars. They can be used to enhance the coverage of already existing grammars. They can be used to compare grammars of different languages. The derivation trees extracted using LexTract can be used to train statistical parsers and taggers. LexTract can also help detect certain kinds of annotation errors and thereby, semi-automate the process of treebank validation. A major advantage of the LexTract approach to grammar development is that it can provide valuable statistical information in the form of weights associated with primitive elements.

The work we present in this paper is on the same lines as the LexTract approach to grammar development, but it is on a much smaller scale. It is meant to be the first step towards building a LexTract-like system for extracting CPG grammars for Indian languages. Since we worked with dependency treebanks of Hindi and Telugu, we chose a dependency grammar formalism known as Computational Paninian grammar (CPG). In fact, the annotation guidelines followed to annotate the treebank are based on this grammar (Bharati et al., 2009b). As such, the grammar extraction process is much more straightforward than the one in LexTract. In the next section, we give a brief outline of the CPG formalism where we define the basic terminology and briefly discuss the components of a CPG grammar.

3.2. Generalization Based on Syntactic Similarity

The problem of sparse data in Propbank has been previously addressed using syntactic similarity based generalization of semantic roles across verbs (Gordon and Swanson, 2007). We try to address the data sparseness problem by generalizing over argument structure across syntactically similar verbs to arrive at an automatic verb classification. Gordon and Swanson (2007) define syntactic similarity for phrase structure trees using the notion of a parse tree path (Gildea and Jurafsky, 2002). Gildea and Jurafsky define a parse tree path as 'the path from the target word through the parse tree to the constituent in question, represented as a string of parse tree non-terminals linked by symbols indicating upward and downward movement through the tree'. This parse tree path feature is used to represent the syntactic relationships between a predicate and its arguments in a parse tree. The syntactic context of a verb is extracted as the set of all possible parse tree paths from the parse trees of sentences containing that verb. The syntactic context of a verb is then converted into a feature vector representation. The syntactic similarity between two verbs is calculated using different distance measures such as Euclidean distance, Chi-square statistic, cosine similarity etc. In our work, we present an analogous measure of syntactic similarity for the dependency structures in the Indian Language (IL) Treebanks, which is described in section 5. We characterize the syntactic context of a verb using a karaka frame representation. The notion of karakas is explained in
the next section.

3.3. Syntactic Alternations in Hindi

Syntactic alternations of a verb have been claimed to reflect its underlying semantic properties. Levin's classification of English verbs (Levin, 1993) based on this assumption demonstrates how syntactic alternation behavior of a verb can be correlated to its semantic properties thereby leading to a semantic classification. There have also been several attempts at automatically identifying distinct clusters of verbs that behave similarly using clustering algorithms. These empirically-derived clusters were then compared against Levin's classification (Merlo and Stevenson, 2001).

The following are some linguistic aspects of verb alternation behavior that we encountered in Hindi:

• In Hindi, the inchoative-transitive alternation pattern cannot be considered an alternation of the same verb stem. The verb stems in such constructions, although morphologically related, are mostly distinct. This is illustrated in the examples below:

  Inchoative:

    darawAzA KulA
    door-3PSg-Nom open
    'The door opened.'

  Transitive:

    Atifa-ne darawAzA KolA
    Atif-3PSg-Erg door-3PSg open
    'Atif opened the door.'

• Similarly, the diathesis alternation pattern discussed by Levin is not exhibited by Hindi verbs.

• Since Hindi is a morphologically rich, free word order language, the alternations are not with respect to the position of the constituent as is the case in English. In Hindi, alternations are with respect to the case-endings (or the post-positions) of the nouns, which are called vibhaktis in CPG.

• Post-position or vibhakti alternation is determined by the form that the verb stem takes in a particular construction. In other words, the arguments of a verb are realized using different case-endings or vibhaktis based on the tense, aspect and modality (TAM) features of the verb. This is illustrated in the examples below:

    abhaya rotI KatA hE
    Abhay-Nom-3PSgM bread eat-pres.simp.-3PSgM
    'Abhay eats bread.'

    abhaya-ne rotI KAyI
    Abhay-Erg bread-3PSgF eat-past.simp.-3PSgF
    'Abhay ate bread.'

    abhaya-ne rotI-ko KAyA
    Abhay-Erg bread-Acc eat-past.simp.-default
    'Abhay ate bread.'

  In the above sentences, the nominal vibhaktis (case-endings or post-positions) change according to the TAM and agreement features of the verb. This co-variation of vibhaktis with the verb's inflectional features is true not only of finite verb forms but also of non-finite verb forms. All this information is exploited in the CPG formalism in a systematic way, as discussed in the next section.

3.4. Indian Language Treebanks

In this sub-section, we give a very brief overview of the treebanks used in our work. We worked with treebanks of two Indian languages, Hindi and Telugu. The treebanks for Hindi and Telugu contain 2403 and 1226 sentences respectively. The development of these treebanks is an ongoing effort. The Hindi treebank is part of a multi-level resource development project (Bhatt et al., 2009). Some of the salient features of the annotation process employed in the development of these treebanks are as follows:

• The syntactic structure of sentences is based on the dependency representation scheme.

• Dependency relations in the Hindi treebank are annotated on top of a manually POS-tagged and chunked corpus. In the Telugu treebank, the POS-tagging and chunking were not performed manually.

• Dependency relations are defined between chunk heads.

• The dependency tagset used to annotate dependency relations is based on the CPG formalism which we discuss in section 4.

4. Computational Paninian Grammar

In this section, we give a brief overview of the Computational Paninian Grammar (CPG) formalism. We only outline details relevant to our goal of grammar extraction. See Bharati et al. (1995) for a detailed discussion of the CPG formalism and the Paninian theory on which it is based. In subsection 4.1, we introduce the basic terminology necessary for an overview of this formalism.

4.1. Terminology

• The notion of karaka relations is central to Paninian Grammar. Karaka relations are syntactico-semantic relations between the verbs and other related constituents in a sentence. Each of the participants in an activity denoted by a verbal root is assigned a distinct karaka. There are six different types of karaka relations in the Paninian grammar as listed below:

  1. k1: karta, participant central to the action denoted by the verb
  2. k2: karma, participant central to the result of the action denoted by the verb
  3. k3: karana, instrument essential for the action to take place
  4. k4: sampradana, beneficiary/recipient of the action
  5. k5: apadana, participant which remains stationary (or is the reference point) in an action involving separation/movement
  6. k7: adhikarana, real or conceptual space/time [1]

  For example, in the following sentence:

    samIrA-ne abhaya-ko phUla diyA
    Samira-Erg Abhay-Dat flower-Acc give.past.3PSgM
    'Samira gave a flower to Abhay.'

  Samira is the karta (k1), the flower is the karma (k2) and Abhay is the sampradana (k4). Similarly, in the following example:

    Atifa ne kueM se pAnI nikAlA
    Atif-Erg well-Abl water-Acc draw.3PSgM
    'Atif drew water from the well.'

  Atif is the karta (k1), well is the apadana (k5) and water is the karma (k2).

  In addition to these karaka relations, there are some additional relations in the Paninian scheme such as tadarthya (or purpose) [2].

• The notion of vibhakti relates to the notion of local word groups based on case ending, preposition and post-position markers. For a nominal word group, vibhakti is the post-position (also known as parsarg) occurring after the noun. Similarly, in the case of verbal word group, a head verb may be followed by auxiliary verbs which may remain as separate words or may combine with the head verb. This information following the head verb (in other words, verb stem) is collectively called the vibhakti of the verb. The vibhakti of a verb contains information about the tense, aspect and modality (TAM) and also Agreement, which are features assigned to the verb in a syntactic construction. Therefore, it can also be referred to as the TAM marker of the verb.

  In the previous example sentence, the nouns 'Atifa' and 'kuAM' have the vibhaktis '-ne' and '-se' respectively. The vibhakti of the verb 'nikAla' is 'yA' which is also its TAM label.

  Nominal vibhaktis have also been found to be important syntactic cues for identification of semantic role in the CPG scheme (Bharati et al., 2008).

[1] In the tagset used, k7p represents spatial location, k7t represents temporal location and k7/k7v represents conceptual location.
[2] The complete tagset can be found at http://ltrc.iiit.ac.in/MachineTrans/research/tb/dep-tagset.pdf

4.2. Components of CPG: Demand Frames and Transformation Frames

A key aspect of Paninian grammar (CPG) is that the verb group containing a finite verb is the most important word group (equivalent to the notion of a 'head') of a sentence. For other word groups in the sentence dependent on this head, the vibhakti information of the word group is used to map it to an appropriate karaka relation. This karaka-vibhakti mapping is dependent on the main verb and its TAM label. This mapping is represented by two templates: default karaka chart (also known as basic demand frame) and karaka chart transformation (also known as transformation frame). The default demand frame defines the mapping for a verb or a class of verbs with respect to a basic reference TAM label. It specifies the karaka relations selected by the verb along with the vibhaktis allowed by the basic TAM label. The basic reference TAM label in CPG is chosen to be 'tA hE' which is equivalent to Present Indefinite/Simple Present. For any other TAM label of that verb or verb class, a transformation rule is defined that can be applied to the default demand frame to obtain the appropriate karaka-vibhakti mapping for that TAM combination.

Figure 1: Basic demand frame for the verb 'de' (to give)
Figure 2: 'yA' transformation frame for transitive verb

The transformation rules can affect the default demand frames in three ways, each defined as an operation in CPG:

1. Insert: A new karaka relation is inserted into the demand frame along with its vibhakti mapping

2. Delete: An existing karaka relation is deleted from the default demand frame

3. Update: A karaka-vibhakti mapping entry in the default demand frame is updated by modifying the vibhakti information according to the new TAM label
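
The three operations above can be made concrete with a small sketch in Python. It is an illustration only and is not taken from the system described in this paper. The representation of a frame as a mapping from karaka labels to sets of vibhaktis, the names `Op`, `apply_transformation`, `DE_FRAME` and `YA_RULE`, the use of "0" for a bare noun without a post-position, and the specific frame entries are all assumptions made for this example. The entries are loosely modelled on the example sentences in sections 3.3 and 4.1, not copied from Figures 1 and 2.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable

# Assumed representation: a demand frame maps a karaka label (k1, k2, ...)
# to the set of vibhaktis allowed for it. "0" marks a bare noun with no
# post-position (a convention chosen for this sketch).
Frame = Dict[str, FrozenSet[str]]


@dataclass(frozen=True)
class Op:
    """One transformation operation: insert, delete or update."""
    kind: str
    karaka: str
    vibhaktis: FrozenSet[str] = frozenset()


def apply_transformation(frame: Frame, ops: Iterable[Op]) -> Frame:
    """Return a new frame obtained by applying ops to a default frame."""
    result = dict(frame)  # copy, so the default frame is never mutated
    for op in ops:
        present = op.karaka in result
        if op.kind == "insert" and not present:
            result[op.karaka] = op.vibhaktis   # new karaka with its vibhaktis
        elif op.kind == "delete" and present:
            del result[op.karaka]              # drop an existing karaka
        elif op.kind == "update" and present:
            result[op.karaka] = op.vibhaktis   # replace the vibhakti set
        else:
            raise ValueError(f"cannot apply {op.kind} to {op.karaka}")
    return result


# Hypothetical default frame for 'de' (to give) under the basic TAM 'tA hE'.
DE_FRAME: Frame = {
    "k1": frozenset({"0"}),
    "k2": frozenset({"0", "ko"}),
    "k4": frozenset({"ko"}),
}

# Hypothetical 'yA' rule: the karta takes 'ne', as in 'samIrA-ne ... diyA'.
YA_RULE = [Op("update", "k1", frozenset({"ne"}))]

ya_frame = apply_transformation(DE_FRAME, YA_RULE)
print(sorted(ya_frame["k1"]))  # ['ne']
```

Returning a fresh frame rather than editing the default in place mirrors the description above, where the default demand frame stays fixed for the reference TAM label and every other TAM label is derived from it by a rule.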