Persian Morphology
Karine Megerdoomian
Computing Research Laboratory
New Mexico State University
Introduction
Persian morphology is an affixal system consisting mainly of suffixes and
a few prefixes. The nominal paradigm consists of a relatively small number
of affixes. The verbal inflectional system is quite regular and can be
obtained by the combination of prefixes, stems, inflections and
auxiliaries.
The following is a brief description of Persian morphology as well as the
morphological grammar used in the Shiraz project. Since we are only
dealing with written text, the difficulties encountered in the analysis of
the written language are briefly discussed. The following sections
describe the nominal and verbal morphology of Persian. The last section
presents examples of the rules used in the morphological analyzer. The
transliteration used is described in the Appendix.
Ambiguities in Written Text
Certain ambiguities arise in a computational analysis of Persian text
since the same surface form can represent different morphemes. In
addition, short vowels are not marked in written text, which results
in different possibilities of analysis. For instance, the word mrdm
could be analyzed, among other possibilities, either as the noun
mardom (people) or as the past tense of the verb mordan (to
die): mordam (I died). Furthermore, certain
affixes always appear bound whereas others can also appear as free
morphemes. The morphological analyzer is able to recognize all
the possible surface forms of the affixes; it also uses the
information available from the parts of speech that the morpheme
appears on in order to disambiguate.
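The effect of unmarked short vowels can be illustrated with a minimal sketch. It assumes, for illustration only, that the short vowels a, e and o are simply dropped in writing; the tiny lexicon and the function names are hypothetical and are not part of the Shiraz analyzer.

```python
# Assumption for this sketch: the short vowels a, e, o are unwritten,
# while the long vowel A and all consonants are kept.
SHORT_VOWELS = set("aeo")


def written_form(vocalized: str) -> str:
    """Drop the assumed short vowels to approximate the written string."""
    return "".join(ch for ch in vocalized if ch not in SHORT_VOWELS)


# Hypothetical mini-lexicon: vocalized form -> gloss (both from the text).
READINGS = {
    "mardom": "people (noun)",
    "mordam": "I died (past tense of mordan)",
}


def candidate_readings(written: str) -> list[str]:
    """Return every vocalized form whose written form matches the input."""
    return sorted(v for v in READINGS if written_form(v) == written)


print(candidate_readings("mrdm"))  # both mardom and mordam are candidates
```

Both entries collapse to the same written string, so part-of-speech information from the context is needed to choose between them.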
Nominal Morphology
There are no case forms and no gender distinctions in Persian. Person,
number and sometimes animacy, however, are distinguished. Although
there is no overt definite marker, a suffix is used on nouns
and adjectives to indicate indefiniteness. The enclitic suffix which
links nominal elements to a relative clause has the same surface form
as the indefinite. There exist several morphemes to mark plurality,
some of which are borrowings from Arabic. There are also some plural
forms in Persian that follow the Arabic template morphology (also
known as "broken" plurals) as shown below.
ketAb --> kotob (books)
faghir --> fogharA (poor [people])
But the rules for forming these plurals are not used productively in
Persian. These loan words are listed in the lexicon and need not
undergo morphological analysis.
The elements within a Noun Phrase are linked by the enclitic particle
called ezafe. This morpheme is usually an unwritten vowel, but it
could also have an orthographic realization in certain phonological
environments. The role of the ezafe is to mark nominal
determination and it indicates nothing as to the nature of the semantic
relation between the linked elements. In most cases, this relation can be
translated as a genitive structure. Examples of this construction are
given below:
sedA-ye pA-ye man
sound-ez foot-ez my
`(the) sound of my footsteps'
ru-ye miz
on-ez table
`on the table'
Adjectives follow the same morphological patterns as nouns. They can
also appear with comparative and superlative morphemes. Certain
adverbs, mainly manner adverbs, can behave like adjectives and can
appear with all the adjectival affixes. There are three types of
ordinal constructions in Persian, which are formed by attaching their
respective morphemes to the cardinal number.
Personal pronouns can appear either as free forms or as
clitics. Although the cliticized pronouns share the same surface
forms, their function depends on the part of speech or syntactic
context of the element they attach to. On the last element
of a Noun Phrase, the clitic is interpreted as a possessive pronoun,
as in ketAb-at [book + 2sg] (your book). Attached to transitive verbs
and prepositions, the clitic is the accusative form of the personal
pronoun, as in did-am-at [see(past) + 1sg infl. + 2sg] (I saw you).
The clitic may appear on adverbials, numerical expressions and
interrogative elements with a partitive meaning, as in vasat-ash
[middle + 3sg] (in the middle of it). On intransitive verbs, it can
be used as the subject clitic. It is also used in impersonal verbal
constructions. Most of these usages, however, are limited to
colloquial speech; apart from the possessive clitics, they are
rarely used in written text.
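The dependence of the clitic's reading on its host can be sketched as a simple lookup. The category labels below are illustrative names chosen for this example, not the tag set of the Shiraz analyzer, and impersonal constructions are left out.

```python
# Host category -> reading of the pronominal clitic, following the
# description above. The category names are hypothetical labels.
CLITIC_READING = {
    "noun_phrase": "possessive",      # ketAb-at (your book)
    "transitive_verb": "accusative",  # did-am-at (I saw you)
    "preposition": "accusative",
    "adverbial": "partitive",         # vasat-ash (in the middle of it)
    "numeral": "partitive",
    "interrogative": "partitive",
    "intransitive_verb": "subject",
}


def clitic_reading(host_category: str) -> str:
    """Return the reading of a clitic on the given host, or 'unknown'."""
    return CLITIC_READING.get(host_category, "unknown")


print(clitic_reading("noun_phrase"))
print(clitic_reading("transitive_verb"))
```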
The present indicative of the verb budan (to be) has a series of
enclitic forms which can attach to the elements within a Noun
Phrase. This morpheme is a verbal element but it can attach to nouns,
adjectives and classifiers. The morphological analyzer needs to recognize this
copula morpheme and separate it into a distinct lexical structure.
There exist other lexical elements, such as the preposition be, the
postposition rA, or the relativizer ke, that usually appear as
separate words in written text, but which can also be found as
attached morphemes.
Verbal Morphology
Inflectional Paradigm
The inflectional system for the Persian verbs consists of simple forms
and compound forms; the latter are forms that require an auxiliary
verb. The simple forms are divided into two groups according to the
stem they use in their formation: the tenses that use the Present Stem
and those formed on the Past (or Aorist) Stem. The Present Stem needs to be
specified in the lexicon since it cannot be derived, while the Past
Stem is easily derivable from the infinitival form of the verb. The
citation form for the verb is the infinitive.
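The asymmetry between the two stems can be sketched as follows. The sketch assumes that the Past Stem is the infinitive without its final -an, working on vocalized transliterations; the Present Stem table is a hypothetical lexicon fragment and the function name is illustrative.

```python
def past_stem(infinitive: str) -> str:
    """Derive the Past Stem by removing the infinitival ending -an."""
    if not infinitive.endswith("an"):
        raise ValueError(f"not an infinitive in -an: {infinitive!r}")
    return infinitive[:-2]


# Present Stems cannot be derived, so they are listed per verb.
# Hypothetical lexicon fragment for two verbs mentioned in the text.
PRESENT_STEMS = {
    "kardan": "kon",
    "zadan": "zan",
}

for verb in ("kardan", "zadan", "mordan"):
    print(verb, past_stem(verb), PRESENT_STEMS.get(verb, "(not listed)"))

# Past Stem + 1sg inflection gives the form cited earlier: mordam (I died).
print(past_stem("mordan") + "am")
```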
In addition to the verb stems, the following elements also participate
in the formation of the verbal inflectional system in Persian:
- Prefixes: the imperfective prefix my and the morpheme b or by,
  which characterizes the subjunctive and the imperative. Negation is
  marked by the n or ny prefix.
- Personal Inflections: present, past and imperative personal
  inflections are used in conjugating the Persian verb. All verb forms
  are marked for person and number.
- Suffixes: the suffix ande marks the present participle ending and
  e (written h) is used to form the past participle.
- Causation morpheme: causatives are obtained by adding the affix An
  or Ani to the end of the Present Stem of the verb. Personal
  inflections and suffixes can then be attached to the Causative
  Present Stem to derive all verbal forms for the causative
  construction.
- Auxiliaries: Persian conjugation uses a number of auxiliaries in the
  compound forms. The enclitic form of the auxiliary budan (be) is the
  one used in the formation of the perfect forms of all verbs. The
  verb khAstan (want) is used as an auxiliary in forming the future
  tenses. The auxiliary shodan (become) forms the passive
  constructions.
The complete inflectional system can be obtained by the various
combinations of these elements.
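How simple forms arise from these elements can be sketched as plain concatenation of written strings. The prefixes my and n are the ones listed above; the written stem kn (for the Present Stem of kardan), the 1sg ending m, and the ordering of negation before the imperfective prefix are assumptions made for this illustration. The ny and b/by variants are not handled.

```python
def build_form(stem: str, ending: str,
               imperfective: bool = False, negated: bool = False) -> str:
    """Concatenate optional prefixes, a stem and a personal inflection."""
    prefix = ""
    if negated:
        prefix += "n"   # negation prefix (the ny variant is ignored here)
    if imperfective:
        prefix += "my"  # imperfective prefix
    return prefix + stem + ending


# Assumed written forms: stem "kn", 1sg ending "m".
print(build_form("kn", "m", imperfective=True))
print(build_form("kn", "m", imperfective=True, negated=True))
```

An analyzer runs this process in reverse, stripping recognized prefixes and inflections to recover a stem that can be looked up in the lexicon.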
Light Verbs
Most verbal constructions in Persian are formed using a light verb
such as kardan (do, make), dAdan (give), zadan (hit,
strike). The number of verbs that can be used as light verbs is
limited, but these constructions are extremely productive in
Persian. These structures consist of a preverbal element, which could
be a noun, adjective or preposition, followed by a light verb, which
has partly or completely lost its original meaning. Since these Light Verb
or Compound Verb constructions are noncompositional in meaning, they
are included in the dictionary as compounds.
Verbal inflection can only appear on the light verb itself, but bound
morphemes can be attached to the preverbal element as well as the
light verb. These inflectional morphemes are analyzed in the morphological
component.
Morphological Grammar
The linguistic information associated with the morphemes is described
using a unification-based morphological formalism. The morphological
rule describes the concatenation of stems and morphemes (using regular
expressions) and the combination of morphological features of words
and morphemes (using feature structures and unification). Stems and
their features are stored in the lexicon as feature structures. A
morphological rule associates a surface form, representing a sequence of
morphemes, to a linguistic structure, and describes how the features
of the stem and the morpheme are combined.
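The combination of features by unification can be sketched with nested dictionaries standing in for feature structures. This is a generic textbook-style unifier written for illustration, not the formalism used in the Shiraz project; all names are hypothetical.

```python
class UnificationFailure(Exception):
    """Raised when two feature structures carry conflicting values."""


def unify(a, b):
    """Unify two feature structures given as nested dicts or atomic values."""
    if isinstance(a, dict) and isinstance(b, dict):
        result = dict(a)
        for key, value in b.items():
            # Shared features must unify recursively; new ones are added.
            result[key] = unify(result[key], value) if key in result else value
        return result
    if a == b:
        return a
    raise UnificationFailure(f"{a!r} conflicts with {b!r}")


stem = {"lex": {"pos": "Noun"}}
plural_morpheme = {"lex": {"pos": "Noun"}, "infl": {"number": "Plural"}}
print(unify(stem, plural_morpheme))

try:
    unify({"lex": {"pos": "Verb"}}, plural_morpheme)
except UnificationFailure as err:
    print("rejected:", err)
```

A noun stem unifies with the plural morpheme and gains the number feature, whereas a verb stem is rejected because the pos values conflict.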
As an example, consider the Plural rule for Persian given below
(string variables are prefixed with the dollar sign, regular
expressions are enclosed between angle brackets):
Plural = <
<$stem "hA">
[form: [orth.exp: "$stem$",
morph: [
lex: [pos: Noun],
infl: [number: Plural]]]]
>;
The regular expression in angle brackets describes the surface form of
the morpheme (the suffix hA in this example). The feature
structure on the next line gives a partial description of the
entry. form.orth.exp is the orthographic form (or citation
form) of the entry as it is input in the lexicon. The lexical
information available in the dictionary is presented under
morph.lex and the inflectional information is given under the
path morph.infl. The feature structure unifies the given part
of speech (pos) with the morpheme information. In this
specific example, the morphological rule marks the
number feature as Plural if the part of
speech attached to the lexical entry is a Noun.
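The behaviour of the Plural rule can be approximated in Python as a rough sketch. The lexicon below is hypothetical (ktAb is assumed to be the written form of ketAb, and kn a written verb stem), and the function only mimics the rule's effect rather than the project's rule formalism.

```python
import re

# Hypothetical lexicon: written stem -> lexical features.
LEXICON = {
    "ktAb": {"pos": "Noun"},
    "kn": {"pos": "Verb"},
}

# Surface pattern of the rule: a stem followed by the suffix hA.
PLURAL = re.compile(r"^(?P<stem>.+)hA$")


def analyze_plural(word: str):
    """Return a feature structure for stem + hA, or None if the rule fails."""
    match = PLURAL.match(word)
    if match is None:
        return None
    stem = match.group("stem")
    entry = LEXICON.get(stem)
    if entry is None or entry["pos"] != "Noun":
        return None  # the pos constraint of the rule is not satisfied
    return {"form": {"orth": stem,
                     "morph": {"lex": {"pos": "Noun"},
                               "infl": {"number": "Plural"}}}}


print(analyze_plural("ktAbhA"))
print(analyze_plural("knhA"))  # None: the stem is not a Noun
```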
It is also possible to account for the morphotactics of the
inflections (i.e., the relative order of the morphemes). For instance,
the indefinite marker in Persian can follow the plural morpheme but
the reverse is not true. This rule can be written in the following manner:
Indefinite = <
< <<$base \ Vowel> "yy">
[form.morph: [
lex.pos: Noun,
infl.indefinite: True]] > |
< $base
[form.morph: [
lex.pos: Noun,
infl.indefinite: False]] >
>;
The plural rule can be used within the Indefinite rule in order to
account for more complex morphological phenomena. The string analyzed
by the Plural rule is bound to the variable base. This variable can
thus be used in the Indefinite rule for checking, for instance, the
character that it ends in. In other words, after the plural morpheme
hA has been detected on the word, the Indefinite rule
applies. The first alternative checks whether the surface form
bound to base ends in a vowel; this is true here, since hA
ends with the vowel "A". The feature structure that follows requires
this entry to be a Noun.
The successful application of this rule adds (unifies) the corresponding
structure to the output feature structure. So, in this example, if the
suffix yy has been recognized following the plural morpheme hA,
the indefinite feature in the structure is marked
True; otherwise it is marked False.
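The ordering constraint between the two suffixes can be sketched by stripping the indefinite marker before the plural one. The vowel set, the noun list and the written form ktAb are assumptions for this illustration, and only the vowel-final alternative of the Indefinite rule is modelled.

```python
VOWELS = ("A", "u")  # assumed written vowel characters


def split_plural(word: str):
    """Return (stem, True) if the word ends in hA, else (word, False)."""
    if len(word) > 2 and word.endswith("hA"):
        return word[:-2], True
    return word, False


def analyze(word: str, nouns: set):
    """Analyze stem (+ hA) (+ yy); return None if no noun stem is found."""
    # The indefinite yy is the outermost suffix, so it is stripped first,
    # and only when the remaining base ends in a vowel.
    if word.endswith("yy") and word[:-2].endswith(VOWELS):
        base, indefinite = word[:-2], True
    else:
        base, indefinite = word, False
    stem, plural = split_plural(base)
    if stem not in nouns:
        return None
    return {"stem": stem,
            "number": "Plural" if plural else None,
            "indefinite": indefinite}


NOUNS = {"ktAb"}
print(analyze("ktAbhAyy", NOUNS))  # plural followed by indefinite
print(analyze("ktAbhA", NOUNS))    # plural only
print(analyze("ktAbyyhA", NOUNS))  # None: the reverse order is rejected
```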