This report describes some aspects of the architecture and gives an outline of the modules involved in the translation from Persian to English. The system is written entirely in C++ and can be used on both Unix machines and PCs. We have also applied Meat to translation from Korean, Japanese, Spanish, Russian, Serbo-Croatian, and Turkish.
Shiraz uses several types of edges to distinguish between different types and levels of description. Thus, the chart is not restricted to a single purpose (say, syntactic parsing or generation); it stores all hypotheses on all levels. Internally, so-called tags mark which module each edge belongs to. In fact, the chart used for Shiraz is a weaker version of the layered chart used in [Amtrup:97], in that it supports neither hypergraphs nor the distribution of modules for parallel processing.
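The idea of a single chart shared by all levels can be illustrated with a minimal sketch. This is not the Shiraz implementation (which is written in C++); the class names `Edge` and `Chart` and the method `with_tag` are hypothetical, and a plain string stands in for the feature structure on each edge. The tag names TOKEN and MATOKEN are the ones that appear in the application definition file.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Edge:
    start: int   # vertex where the edge begins
    end: int     # vertex where the edge ends
    tag: str     # marks the module or level that produced the edge
    label: str   # stand-in for the feature structure annotation


@dataclass
class Chart:
    edges: list = field(default_factory=list)

    def add(self, start, end, tag, label):
        """Store a new hypothesis; edges of all levels share one chart."""
        edge = Edge(start, end, tag, label)
        self.edges.append(edge)
        return edge

    def with_tag(self, tag):
        """Return the edges that belong to one level of description."""
        return [e for e in self.edges if e.tag == tag]


chart = Chart()
chart.add(0, 1, "TOKEN", "first surface word")
chart.add(1, 2, "TOKEN", "second surface word")
chart.add(0, 1, "MATOKEN", "analysis of first word")

token_edges = chart.with_tag("TOKEN")
analysis_edges = chart.with_tag("MATOKEN")
```

Because every module reads and writes the same structure, a later component can select exactly the edges it needs by tag without knowing which module produced them.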
Edges in the chart are annotated with complex typed feature structures following [Carpenter:92]. Different types of feature structures can be used to encode different aspects of linguistic knowledge conveniently. We use an efficient implementation based on a vector-oriented representation for feature structures. Figure 1 shows an image of a chart and some of its edges. In the lower part of the image, part of a feature structure is shown.
Figure 1: A Chart and some edges
An application definition file defines

Components and the application definition
The Shiraz system is designed to fulfill different functions within a
natural language processing scenario. Two main requirements have to be
met:
The approach we chose in order to realize a configurable, flexible
system is a combination of extreme modularization and user-defined
applications. Shiraz currently consists of 27 different modules. The
user is able to compose a sequence of modules in order to build a
complete application. At runtime, the system interprets the
application definition and executes the modules needed.
A small excerpt from the Persian application definition file is shown
in Figure 2. It exemplifies the composition of
modules to form a complete application, as well as the definition of
parameters, variables, and the incorporation of command-line parameters.
For the Persian case, we also added a PostTokenizer. The
task of this component is to postprocess the Tokenizer output with
respect to some peculiarities of Persian. In particular, detached
affixes are reattached to their kernels.
// Variable definitions
$RES=/home/mcm2/meat/per
// Global parameters
tangoModule = $(RES)/shiraz.mod
// An application
application lookup = Tokenizer($File=$1):PostTokenizer:MorphAnalyzer:
    DictionaryLookup:DictionaryCompoundLookup:ChartViewer
// Sample module definitions
module Tokenizer {
    class = Tokenizer
    inputFile = /home/mcm/$File
    encoding = UTF8
}
module MorphAnalyzer {
    class = MorphAnalyzer
    grammar = $(RES)/GenMorph.samba
    rule = Morphology
    type = chart
    sourceTag = TOKEN
    targetTag = MATOKEN
}
Figure 2: An excerpt from the Persian application definition file
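How an application line might be interpreted at runtime can be sketched as follows. This is an illustration only: the actual syntax rules of the definition file are not documented here, so the parser below assumes that module references are separated by colons and that optional arguments appear in parentheses as comma-separated `key=value` pairs. The functions `parse_application`, `run_application`, and `make_stub` are hypothetical names, and the stub modules merely record that they were called.

```python
import re


def parse_application(definition):
    """Split 'Name(key=value):Name:...' into (module, params) pairs."""
    steps = []
    for part in definition.split(":"):
        part = part.strip()
        if not part:
            continue
        match = re.fullmatch(r"(\w+)(?:\((.*)\))?", part)
        if match is None:
            raise ValueError(f"cannot parse module reference: {part!r}")
        name, raw_args = match.groups()
        params = {}
        if raw_args:
            for assignment in raw_args.split(","):
                key, _, value = assignment.partition("=")
                params[key.strip()] = value.strip()
        steps.append((name, params))
    return steps


def run_application(steps, registry, data):
    """Execute the modules one after the other on shared data."""
    for name, params in steps:
        data = registry[name](data, params)
    return data


def make_stub(name):
    """Hypothetical module that only records its own execution."""
    def module(data, params):
        return data + [name]
    return module


definition = """Tokenizer($File=$1):PostTokenizer:MorphAnalyzer:
DictionaryLookup:DictionaryCompoundLookup:ChartViewer"""

steps = parse_application(definition)
registry = {name: make_stub(name) for name, _ in steps}
executed = run_application(steps, registry, [])
```

In the real system the shared data would be the chart, and substitution of variables such as `$File` and `$1` would happen before the modules run; both are left out of this sketch.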
Components of the Shiraz system
In this section, we give a short overview of the main components that are
involved in constructing an English translation from a Persian
document. Using the mechanism just mentioned, an application is
defined as a sequence of modules which are executed one after the
other. The results of each component are gathered in the central chart
and can be used by any other component. The translation process can be
divided into five major steps:
Preparing the input text
The first step in preparing the input text for a translation is
performed by a Tokenizer, which reads an input file and splits
it into separate items such as words, punctuation, numbers
etc. The input file is usually not in ASCII format, so a code
conversion from its encoding to Unicode has to be performed. The
tokenizer is a generic Unicode tokenizer; it is not specialized for
any language.
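A language-independent tokenizer of this kind can be approximated by classifying characters according to their Unicode general category. The following is a minimal sketch under that assumption, not the Shiraz tokenizer itself; the function name `tokenize` and the three token kinds are hypothetical. Letters and combining marks form words, digits form numbers, and every other non-space character is treated as a single punctuation token.

```python
import unicodedata


def tokenize(raw_bytes, encoding="utf-8"):
    """Decode the input to Unicode and split it into (kind, text) tokens."""
    text = raw_bytes.decode(encoding)
    tokens = []
    current = ""
    current_kind = None
    for ch in text:
        category = unicodedata.category(ch)
        if ch.isspace():
            kind = None
        elif category.startswith("L") or category.startswith("M"):
            kind = "word"
        elif category.startswith("N"):
            kind = "number"
        else:
            kind = "punct"
        # Close the running token when the kind changes; punctuation
        # characters are always emitted one by one.
        if kind != current_kind or kind == "punct":
            if current:
                tokens.append((current_kind, current))
            current = ""
        if kind is not None:
            current += ch
        current_kind = kind
    if current:
        tokens.append((current_kind, current))
    return tokens


sample = "\u06a9\u062a\u0627\u0628 \u06f1\u06f2.".encode("utf-8")
tokens = tokenize(sample)
```

The sample consists of four Persian letters, a space, two Persian digits, and a full stop, so it yields one word, one number, and one punctuation token. Language-specific repairs, such as reattaching detached affixes, are left to a later component.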
Morphological analysis and dictionary lookup
Then, in order to be able to perform dictionary lookup, the inflected
surface words need to be processed by a Morphological Analyzer. We
describe the morphological properties of words in a formalism called
Samba [Zajac:98], which combines finite-state transducers with feature
structures. Figure 3 shows a simple
rule that describes the suffix which marks the causative form of
Persian verbs. For more information see
the web report on
Persian Morphology.
The dictionary itself is based on citation forms. It contains
approx. 50000 entries. Dictionary Lookup takes the citation
forms generated by the morphological analyzer and uses them to access
lemma definitions in the lexicon. The inflectional information gained
by morphology is then unified with the dictionary entry, yielding a
rich description of the input word. For a more detailed description of
the structure of the lexicon, see the web report on
the Shiraz dictionary.
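The unification step can be illustrated with a much simplified sketch that uses nested Python dictionaries in place of typed feature structures. It ignores types and reentrancy, both of which the real formalism supports, and all feature names and the sample entry are hypothetical.

```python
class UnificationFailure(Exception):
    """Raised when two descriptions carry conflicting values."""


def unify(a, b):
    """Merge two nested dictionaries; atomic values must be equal."""
    if isinstance(a, dict) and isinstance(b, dict):
        result = dict(a)
        for key, value in b.items():
            result[key] = unify(result[key], value) if key in result else value
        return result
    if a == b:
        return a
    raise UnificationFailure(f"{a!r} conflicts with {b!r}")


def lookup(analysis, dictionary):
    """Access the entry by citation form and unify it with the analysis."""
    entry = dictionary.get(analysis["citation"])
    if entry is None:
        return None
    return unify(entry, analysis)


# Hypothetical lexicon entry and morphological analysis.
dictionary = {
    "zadan": {"citation": "zadan", "pos": "verb", "gloss": "hit"},
}
analysis = {"citation": "zadan", "infl": {"tense": "past", "person": 1}}

described = lookup(analysis, dictionary)
```

The result carries both the lexical information from the dictionary and the inflectional information from morphology. If the two sources disagreed on a feature, unification would fail and the hypothesis would be discarded.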
Compounding is taken care of in the Compound lookup
component. Here, we are not looking for individual words in the
dictionary, but rather take any sequence of words to find
compounds. The compound lookup procedure is based both on citation
forms and surface forms, since some compound parts are not words in
their own right. We do not record the internal structure of compounds in
the dictionary, but since Persian is a head-final language, we assume
that the last element in a compound carries the most important
inflectional information. The compound inherits this inflectional
information, if possible.
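The compound lookup described above might be sketched as follows. The function `find_compounds`, the limit `max_len`, and the token layout are assumptions made for illustration, and the transliterated sample compound is only an example. Each candidate word sequence is tried first with citation forms (falling back to the surface form where a part has none) and then with surface forms only; a match inherits the inflection of its last element.

```python
def find_compounds(tokens, compound_dict, max_len=4):
    """Look up every word sequence of length 2..max_len as a compound."""
    found = []
    for start in range(len(tokens)):
        last = min(len(tokens), start + max_len)
        for end in range(start + 2, last + 1):
            span = tokens[start:end]
            by_citation = tuple(t.get("citation", t["surface"]) for t in span)
            by_surface = tuple(t["surface"] for t in span)
            for key in (by_citation, by_surface):
                if key in compound_dict:
                    found.append({
                        "span": (start, end),
                        "entry": compound_dict[key],
                        # The last element is assumed to carry the
                        # inflection that the compound inherits.
                        "infl": span[-1].get("infl", {}),
                    })
                    break
    return found


# Hypothetical compound dictionary keyed by sequences of forms.
compound_dict = {("harf", "zadan"): "speak"}
tokens = [
    {"surface": "harf", "citation": "harf"},
    {"surface": "zadam", "citation": "zadan",
     "infl": {"tense": "past", "person": 1}},
]

compounds = find_compounds(tokens, compound_dict)
```

In a chart-based system each match would be added as a new edge spanning the matched words, so later components can choose between the compound reading and the individual words.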
CausativePastStem < GeneralRule;
CausativePastStem =
  < RegularPresentStem
    <"|n" "d">
    [form.morph.infl: per.Form.VerbalInflection[
        causative: True]]
  >;
Figure 3: A morphological rule for the causative suffix of Persian verbs
Syntactic Parsing
The parser employed in Shiraz is a unification-based, bidirectional
chart parser. Figure 4 shows a simple syntax
rule for the composition of complex noun phrases. The rules are phrase
structure rules, and consist of a left-hand side, which describes the
constituent being formed, and a right-hand side, which describes which
subconstituents are used for the construction. Feature structures on
both sides make it possible to formulate restrictions and to build up
structure. The rules can be parametrized to allow for certain special
situations. First, they can be marked as non-recursive, in which case
they are not used to propose new categories more than once at the same
position. Second, they can be marked to perform dictionary lookup. In
this case, the left-hand side is considered to refer to a
dictionary entry and it is only constructed if there is an entry in
the dictionary which matches the citation form built.
In the Shiraz system, we use three incarnations of the parser to
perform different tasks. You can think of this as having a grammar
with different levels, each of which is applied in sequence. These
incarnations are:
complexNP = per.Rule.Rule[
  lhs: per.Rule.NounPhrase[
    head: #np1,
    possessor: #np2],
  rhs: <:
    #np1= per.Rule.NounPhraseZero[
      boundary: per.Type.FalseOrUndefined]
    #np2= per.Rule.NounPhrase[
      head: per.Entry.Entry[form.morph.lex.pos: per.Type.NounHeads]]
  :>
];
Figure 4: A syntax rule for the composition of complex noun phrases
Transfer
The Transfer component is used to transform Persian syntactic
structures to their English counterparts. Currently, we are only
performing lexical transfer, i.e. the Persian morphological
information is mapped to English inflectional features. Like all
components within the Shiraz system, transfer is based on the chart
notion. Incorporating syntactic transfer will allow partial
translations to be reused within larger constructs (cf. [Amtrup:95]).
Syntactic generation currently uses a simple method of linearization of English words. There is no complex mechanism to generate surface strings from syntactic descriptions. A sample rule for the generator is shown in Figure 5.
The rule demonstrates the three elements present in a generation
rule:
Apart from constructing surface strings from syntactic
descriptions, a morphological generation procedure is performed during
this phase. Thus, English words are generated with correct inflection.
The surface generation, finally, chooses the best path through the
graph of generated English surface fragments and issues these as
output. In the future, we plan to use an English language model to
choose among the many possible surface strings. The string which is
ranked best by the model will be issued.
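Choosing a path through the graph of surface fragments can be sketched as a shortest-path computation over a chart whose vertices are numbered from left to right. How Shiraz actually ranks paths is not specified here, so the numeric cost attached to each fragment is purely an assumption, as are the function name `best_path` and the sample fragments.

```python
def best_path(edges, goal):
    """Return (cost, fragments) of the cheapest path from vertex 0 to goal.

    Each edge is (start, end, text, cost) with start < end, so handling
    edges in order of their start vertex visits them in topological order.
    """
    best = {0: (0.0, [])}
    for start, end, text, cost in sorted(edges, key=lambda e: e[0]):
        if start not in best:
            continue  # start vertex is not reachable from vertex 0
        candidate = (best[start][0] + cost, best[start][1] + [text])
        if end not in best or candidate[0] < best[end][0]:
            best[end] = candidate
    return best.get(goal)


# Hypothetical fragments; lower cost means a preferred fragment.
edges = [
    (0, 1, "the", 1.0),
    (1, 2, "book", 1.0),
    (0, 2, "the book", 1.5),
    (2, 3, "is new", 1.0),
]

cost, fragments = best_path(edges, goal=3)
output = " ".join(fragments)
```

With these costs the path through the larger fragment wins over the word-by-word path. A language model, as planned for the future, would replace the fixed costs with scores computed over the resulting strings.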
Apart from a major renovation (the system was written in a short
period of time, which led to some suboptimal solutions and left almost
no time for optimization), the main components that could be added are
a model of syntactic-semantic transfer and a more elaborate syntactic
generation.
np1 = [
  structure: per.Rule.NounPhrase[
    head: #1= Top,
    relClause: #2= Top],
  order: <: #1 #2 :>,
  trigger: "relClause"
];
Figure 5: A sample rule for the generator
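The linearization performed by a generation rule like np1 can be sketched as follows. The interpretation used here is an assumption rather than documented behaviour: the rule is taken to apply when the feature named by `trigger` is present in the structure, and `order` is taken to list the features in the sequence in which their surface strings are emitted. The dictionary representation, the function `linearize`, and the fallback for structures without a matching rule are all hypothetical.

```python
def linearize(structure, rules):
    """Produce a list of surface fragments for a nested structure."""
    if isinstance(structure, str):
        return [structure]
    for rule in rules:
        # Assumed reading: the rule fires if its trigger feature exists.
        if rule["trigger"] in structure:
            words = []
            for feature in rule["order"]:
                if feature in structure:
                    words.extend(linearize(structure[feature], rules))
            return words
    # Hypothetical fallback: emit the sub-structures in stored order.
    words = []
    for value in structure.values():
        words.extend(linearize(value, rules))
    return words


# Simplified counterpart of the np1 rule: head before relative clause.
rules = [{"order": ["head", "relClause"], "trigger": "relClause"}]

# The stored order is deliberately reversed; the rule decides the output.
noun_phrase = {"relClause": "that I read", "head": "the book"}
fragments = linearize(noun_phrase, rules)
```

In this sketch the word order of the output depends only on the rule, not on how the structure happens to be stored.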
System Statistics
The system is written entirely in C++ (with the exception of a small
Java applet used to render Persian script for the glosser). It
consists of approx. 27000 lines of code. It can be run on both Unix
platforms (using the GNU compiler) and PCs running Windows NT (using
Visual C++). Translating a sentence of medium length and complexity
(i.e., ambiguity) takes between 8 and 15 seconds.
Conclusion
Shiraz is a machine translation system for translating written Persian
text into English. It is based on two main architectural foundations:
the use of a chart throughout the system, which allows an integrated
view on results created on all levels of linguistic description, and
the use of a complex typed feature structure formalism, which unifies
the view on the descriptions themselves.
References
Amtrup, Jan W., 1995
Amtrup, Jan W., 1997
Carpenter, Bob, 1992
Kay, Martin, 1980
Zajac, Remi, 1998