ISSN(Online): 2320-9801
ISSN (Print): 2320-9798
International Journal of Innovative Research in Computer
and Communication Engineering
(An ISO 3297: 2007 Certified Organization)
Vol. 3, Special Issue 7, October 2015
Rule Based Morphological Analyzer for
Malayalam Nouns: Computational Analysis of
Malayalam Linguistics
Jancy Joseph, Dr. Babu Anto
School of Information Science and Technology, Mangattuparamba Campus, Kannur University, Kerala, India
ABSTRACT: Morphological analysis deals with the study of the internal structure of the words of a language based on their
grammatical category. The morphological analyzer described here is developed for plural markers, case markers, postpositions
and clitic (gathi) markers of Malayalam nouns. This work focuses on segmenting a morphologically inflected word
into its root word and its associated morphological components, along with the features specifying the morphological
structure. The words output by the system are categorized into different classes of noun, and the system is implemented using
the Malayalam Unicode standard. A Malayalam morph analyzer would help in automatic spelling and grammar checking,
natural language understanding, machine translation, speech recognition, speech synthesis, part-of-speech tagging, and
parsing applications. General users can also get in-depth information about Malayalam nouns from the
software.
KEYWORDS: Morphological Analysis, Malayalam Noun Morphology, Root word, Suffix
I. INTRODUCTION
The field of Natural language processing (NLP) is primarily concerned with getting computers to perform useful and
interesting tasks with natural languages. NLP is a field of computer science, integrating artificial intelligence
and linguistics, which is concerned with the interactions between computers and human (natural) languages. Many
challenges in NLP involve natural language understanding, that is, enabling computers to derive meaning from human
or natural language input, and others involve natural language generation. NLP is the process of computer analysis in
which input provided in a natural language is analyzed and converted into an output in a useful form of representation.
The field of NLP is secondarily concerned with helping us come to a better understanding of natural language.
II. MORPHOLOGY
Morphology is the study of the structure and formation of words. It deals with the ways that words are built up from
smaller meaningful units called morphemes. Morphemes can usefully be divided into two classes: stems and affixes.
Fig.1 Stem And Affixes
Copyright to IJIRCCE www.ijircce.com 67
III. MORPHOLOGICAL ANALYSIS
This phase separates words into individual morphemes and identifies the class of each morpheme. The difficulty of this
task depends greatly on the complexity of the morphology of the language being considered. In languages
like Malayalam, each dictionary entry has thousands of possible word forms, as Malayalam is one of the most highly
agglutinative Indian languages. Building a morphological analyzer for Malayalam is therefore a complex task. Generally, rule
based approaches are used for morphological analysis; they rely on a set of rules and a dictionary that contains
root words and morphemes. In a rule based approach, a word is given as input to the morphological
analyzer, and if the corresponding morpheme or root word is missing from the dictionary, the rule based system may
fail. Here, each rule depends on the previous rule, so if one rule fails, it affects the entire set of rules that follow.
IV. MORPHOLOGICAL ANALYZER
A Morphological Analyzer splits a word into its constituent morphemes. This Morphological Analyzer for Malayalam
words can be further used in a machine translation system. Malayalam is an agglutinative language. A Morphological
Analyzer helps to identify the inflection of a word and segments the word into stem and affixes. These affixes can carry
gender (feminine, masculine or neuter), person (1st, 2nd or 3rd), or number (singular or plural) information in the case
of nouns, and tense, aspect and modality in the case of verbs. A morphological analyzer for nouns returns the
root of the word along with its gender, number and case. For a verb it returns the root form along with its tense,
modality, and aspect.
Eg: മരങ്ങൾ = മരം (N) + കൾ (PL)
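The kind of output described above can be pictured as a small record holding the root, its category tag, and the stripped suffixes. The following is only a minimal sketch: the class name `Analysis`, its fields, and the `render` format are hypothetical and are not taken from the system described in this paper.

```python
from dataclasses import dataclass, field


@dataclass
class Analysis:
    """Hypothetical container for one analysis result (not the authors' format)."""
    root: str
    root_tag: str
    # (suffix form, suffix tag) pairs, in the order they follow the root
    suffixes: list = field(default_factory=list)

    def render(self) -> str:
        # Join root and suffixes as root(TAG)+suffix(TAG)+...
        parts = [f"{self.root}({self.root_tag})"]
        parts += [f"{form}({tag})" for form, tag in self.suffixes]
        return "+".join(parts)


# The plural noun from the example: tree (N) + plural marker (PL)
example = Analysis(root="മരം", root_tag="N", suffixes=[("കൾ", "PL")])
print(example.render())
```

Keeping the suffixes as an ordered list of pairs makes it straightforward to attach further markers (case, clitic) after the plural marker.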
V. CHALLENGES IN BUILDING MORPH ANALYZER
Many changes take place at the boundaries of morphs and words. Identifying the rules that govern these morpho-
phonemic changes is a challenge because dissimilar changes take place in similar contexts. In such cases it is necessary
to look into the morphological as well as phonological factors which make such changes.
VI. VARIATIONS OF MORPHOLOGY
Morphological variations of a word occur due to inflection, derivation, cliticization, word compounding, etc.
Word morphology can be further divided into three broad classes.
a. Inflectional Morphology
Inflection is the process of changing the form of a word so that it expresses information such as number,
person, case, gender, tense, mood and aspect, but the syntactic category of the word remains unchanged.
Inflectional morphology is concerned with the combination of stems and affixes where the resulting word has the
same word class as the original and it serves a grammatical/semantic purpose that is different from the
original, but is nevertheless transparently related to the original.
Eg: കുട്ടി/കുട്ടികൾ
b. Derivational Morphology
Derivation changes the syntactic category of a word. Derivational morphology includes irregular
changes of meaning and changes of word class. Eg: ഭംഗി/ഭംഗിയുള്ള
c. Cliticization
A clitic is an element that behaves partly like an affix and partly like a word. Clitics are quite complicated in
that they are also part of word formation.
Eg: Cat/Cat’s
VII. MORPHOTACTICS
Morphotactics is the order in which the morphemes are arranged. It is also a kind of restriction on
morphemes. The order in which the morphemes appear in a word must be described by any computational model of the
morphology. It is a fact of any language that one can usually stack up morphemes in some orders but not in others.
There are many conditions on morpheme ordering that may follow directly from constraints such as
affixation, infixation, prefixation and suffixation (Rajeev, 2008).
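As a rough illustration of how a computational model might encode an ordering restriction, the sketch below checks a sequence of suffix tags against a fixed slot order. The slot order (plural, then case, then clitic), the tag names, and the function name are all assumptions made for this example; they are not a claim about Malayalam grammar or about the system in this paper.

```python
# Hypothetical slot order, assumed only for this illustration.
SLOT_RANK = {"PL": 0, "CASE": 1, "CLITIC": 2}


def follows_slot_order(tags):
    """Return True if each tag belongs to a strictly later slot than the one before it."""
    ranks = [SLOT_RANK[tag] for tag in tags]
    return all(earlier < later for earlier, later in zip(ranks, ranks[1:]))


print(follows_slot_order(["PL", "CASE"]))   # plural before case: accepted
print(follows_slot_order(["CASE", "PL"]))   # case before plural: rejected
```

A language with freer suffix ordering would need a looser check than this strict ranking, for example a set of permitted tag pairs.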
VIII. NEED AND SIGNIFICANCE OF MORPHOLOGICAL ANALYSIS
A Morphological Analyzer can either be an important stand-alone component of many applications, such as spelling
correction, information retrieval and machine translation, or simply be a link in a chain of further linguistic analysis.
Any Natural Language Processing (NLP) application for any language starts with the development of Morphological
Analyzer or Word Analyzer, which analyzes the inflected word and provides information such as root word or stem and
its constituent morphemes with which the original word was constructed. Building morph analyzers for highly
inflectional languages is difficult but crucial for applications such as Machine Translation (MT) and Dialog Based
Natural Language Understanding Systems.
The development of NLP applications is challenging because computers traditionally require humans to
communicate to them in a programming language that is precise, unambiguous and highly structured or, perhaps
through a limited number of clearly-enunciated voice commands. Human speech, however, is not always precise, it is
often ambiguous and the linguistic structure can depend on many complex variables, including slang, regional dialects
and social context.
A morphological analyzer or generator supplies information concerning the morphosyntactic properties of the
words it analyses or constructs. The design and implementation of a morphological analyzer and generator for
Malayalam is a promising area of research for various applications in NLP.
IX. OBJECTIVES OF THE STUDY
The central goal and core of the study is to develop a Morphological Analyzer (MA) for Malayalam Nouns
using Suffix-Stripping Method in a Rule-Cum-Dictionary Based Approach by integrating different modules and
various computational linguistic tools.
X. REVIEW OF RELATED LITERATURE
In A Suffix Stripping based Morph Analyzer for Malayalam Language, Rajeev R. et al. (2008) adopted the
suffix stripping method, which is very close to the affix stripping method in its approach. In a
recognizer, a transducer takes a word as input, accepts it if the given word is in the language and rejects it if it
is not. Two-level morphology represents a simple splitting of the morphemes from the word so as to get the
root/stem. Sumam Mary Idikkula (2007) and team developed a morphological processor for the Malayalam language
with two main components: a morphological generator and a
morphological analyzer. Jisha P. Jayan et al. (2006) developed a Morphological Analyzer and Morphological Generator
as part of their thesis on Malayalam-Tamil Machine Translation using a bilingual dictionary. They make use of the suffix
stripping method for morphological analysis. Jisha P. J. et al. (2009) described the two common approaches
to morphological analysis, the paradigm approach and the suffix stripping approach, in their study titled
Morphological Analyzer for Malayalam - A Comparison of Different Approaches. The study also discussed a comparison with the
hybrid approach. Vinod P M et al. (2001) make use of Lttoolbox for morphological analysis, generation, lexical
processing, etc. The program compares each inflected form. A morphological analyzer for Malayalam using machine
learning was developed by V.P Abeera,S et al. (2009). Morphological analysis of Malayalam verbs using a hybrid
approach (paradigm and suffix stripping method) has been attempted for the morphological generation of verbs. Saranya S
K (2008) developed a Morphological Analyzer for Malayalam Verbs using a hybrid approach of the paradigm method and
the suffix stripping method. Nimal J Valath et al. (2012) developed a morphological analyzer for nouns and verbs using a
combined approach of the paradigm and suffix stripping methods. They first transliterated Malayalam to English, which
helps to find the occurrence of affixes easily.
XI. MALAYALAM NOUN MORPHOLOGY
A noun is a word that functions as the name of some specific thing or set of things, such as living creatures,
objects, places, actions, qualities, states of existence, or ideas. Linguistically, a noun is a member of a large, open part
of speech whose members can occur as the main word in the subject of a clause, the object of a verb, or the object of
a preposition. The nouns in Malayalam are marked for number and gender.
The present work has considered four main categories of inflections for nouns. They are: Gender, Plural
(Number) Markers, Case Markers and Clitics (gathi/postpositions). In Malayalam grammar, sandhi
rules are classified into four types based on whether a word ends with a vowel (swaram) or a consonant (vyanjanam): സ്വരസന്ധി,
സ്വരവ്യഞ്ജനസന്ധി, വ്യഞ്ജനസ്വരസന്ധി and വ്യഞ്ജനസന്ധി.
Sandhi can also be categorized into four types on the basis of the changes occurring. They are ലോപസന്ധി,
ആഗമസന്ധി, ദ്വിത്വസന്ധി and ആദേശസന്ധി.
XII. CATEGORIZATION OF NOUNS
A noun is a word that can be used to refer to a person, place, thing, quality, or action. It is the word class that can
serve as the subject or object of a verb, the object of a preposition, or in postposition. Nouns can be classified as
Concrete Noun, Quality Noun, Verbal Noun, and Pronoun. Concrete nouns are further classified as Proper Noun, Common
Noun, Material Noun, and Collective Noun.
XIII. METHODOLOGY AND IMPLEMENTATION
The Malayalam morph analyzer for nouns is built using the suffix stripping method with the reverse application of
Sandhi Rules. This rule based system uses a predefined dictionary of root words developed for the purpose.
When a word is input, the system checks the dictionary to identify the stem word and returns the noun as output.
If the word doesn’t match any stem word in the dictionary, the system checks for suffixes and strips them off, providing the
stem word and suffixes as output. The novelty of this work lies in the medium of input and output, which is Malayalam Lipi,
without any transliteration and retransliteration process.
Since Malayalam is an agglutinative language, a noun can have a number of inflections formed by adding different
suffixes to it. There is no hard and fast morphotactic rule or order in adding suffixes to words in Malayalam. The major
challenge associated with the suffix stripping method while using Malayalam Lipi is replacing the
Varnam (syllables) added as a result of applying Sandhi Rules.
A hand-built dictionary is used in this work. The words in the dictionary were derived by performing a corpus
analysis of a few Malayalam books. All the unique words in this corpus, including different categories of nouns, were
included in the dictionary.
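Collecting the unique words of a corpus can be sketched in a few lines. This is only an illustration under stated assumptions: the sample text, the function name `unique_words`, and the decision to treat any run of characters from the Malayalam Unicode block (U+0D00 to U+0D7F, plus the zero-width joiners that can occur inside words) as a word are choices made for this example. The dictionary in this work was hand-built, and deciding which of these words are nouns remains a manual step.

```python
import re

# Runs of characters from the Malayalam block, plus ZWNJ/ZWJ (U+200C, U+200D),
# are treated as words; everything else (spaces, punctuation) is a separator.
MALAYALAM_TOKEN = re.compile(r"[\u0D00-\u0D7F\u200C\u200D]+")


def unique_words(text):
    """Return the set of distinct Malayalam-script tokens found in the text."""
    return set(MALAYALAM_TOKEN.findall(text))


# Tiny stand-in for a corpus: "tree", "child", and "tree" again
sample = "മരം കുട്ടി, മരം."
words = unique_words(sample)
print(sorted(words))
```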
The Malayalam Unicode UTF-8 standard is used in this work. It is thus possible to store the Malayalam noun corpus in its
own Lipi and to use Malayalam Lipi within the code too. This work deals with seven classes of noun
(Qualitative, Verbal, Proper, Common, Material, Collective and Pronoun), in addition to Demonstrative and
Interrogative Pronouns.
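To show what working directly in Malayalam Lipi looks like in code, the sketch below stores a word as UTF-8, reads it back, and normalizes it. It assumes Python, which this paper does not name as its implementation language. The normalization step is a suggestion for this example only: a few Malayalam vowel signs can be written either as one code point or as two, and suffix matching is more reliable when both the dictionary and the input use the same form.

```python
import unicodedata

word = "മരം"                      # three code points: MA, RA, ANUSVARA
encoded = word.encode("utf-8")    # each Malayalam code point takes 3 bytes in UTF-8
decoded = encoded.decode("utf-8")

# NFC leaves this word unchanged, but it composes split vowel signs:
# U+0D46 + U+0D3E becomes the single vowel sign O, U+0D4A.
normalized = unicodedata.normalize("NFC", decoded)
composed = unicodedata.normalize("NFC", "\u0D15\u0D46\u0D3E")

print(len(word), len(encoded))    # code points vs. bytes
print(normalized.endswith("ം"))   # suffix tests work on code points
```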
XIV. ALGORITHMS
1. Input the word to be analyzed.
2. Check whether the current word is found in the Root Dictionary.
3. If the word is found in the dictionary, go to step 8; otherwise continue with step 4.
4. Look for a known suffix at the right-hand end of the word.
5. If a suffix is present, remove it, reversing any Sandhi change at the boundary, and take the remaining
string as the current word.
6. Classify the removed suffix into its correct class.
7. Go to step 2, repeating the process until the root/stem word is found in the dictionary.
8. Output the root/stem word together with the classified suffixes.
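The steps above can be sketched as a short loop. This is a minimal illustration, not the authors' implementation: the two-entry root dictionary, the three suffix rules, the tag names, and the function name `analyze` are all assumptions chosen to keep the example small. Each rule pairs a surface ending with the text to restore on the stem, which stands in for the reverse application of a Sandhi rule.

```python
# Toy root dictionary (assumed for this sketch): root -> category tag
ROOTS = {"മരം": "N", "കുട്ടി": "N"}

# Toy rules (assumed): (surface ending, text restored on the stem,
#                      citation form of the suffix, suffix tag)
SUFFIX_RULES = [
    ("ങ്ങൾ", "ം", "കൾ", "PL"),   # plural after a final anusvara
    ("കൾ", "", "കൾ", "PL"),      # plain plural marker
    ("ളെ", "ൾ", "എ", "ACC"),     # accusative after the plural marker
]


def analyze(word, max_steps=10):
    """Strip suffixes from the right until the remainder is a known root.

    Returns (root, root_tag, suffixes) or None if no analysis is found.
    The first matching rule is taken greedily, with no backtracking.
    """
    current, stripped = word, []
    for _ in range(max_steps):
        if current in ROOTS:                          # dictionary lookup
            return current, ROOTS[current], list(reversed(stripped))
        for ending, restore, form, tag in SUFFIX_RULES:
            if current.endswith(ending) and len(current) > len(ending):
                # remove the suffix and undo the boundary change
                current = current[:-len(ending)] + restore
                stripped.append((form, tag))          # classify the suffix
                break
        else:
            return None                               # no suffix matched
    return None


print(analyze("മരങ്ങൾ"))
print(analyze("കുട്ടികളെ"))
```

Because suffixes are removed from the right, they are collected in reverse and flipped before being returned. A fuller system would likely need backtracking when more than one rule matches the same ending.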