Persian-English Tagged Test Corpus


Bilingual Corpus

The parallel test corpus consists of 3,000 Persian sentences with corresponding English translations. The sentences are representative of contemporary journalistic prose; they were all collected from the on-line Iranian newspaper Hamshahri. All sentences were manually translated at CRL.


Tagging

The Persian sentences have been manually tagged for part of speech (POS) and bracketed for noun phrases and prepositional phrases. The English sentences have been tagged automatically for POS and phrases using UPenn's SuperTagger.
The example below shows a tagged Persian sentence and its corresponding English translation:

  • [ brrsy h|y<n> |yn<det> frv^sg|hh|<n> ]np hm<av> n^s|n mydhd<lv> kh<con>
    [ |nglys<pn> ]np bh sr@t<av> tbdyl<lv1> [ bh<pre> J|m@h |y<n> ]pp my^svd<lv> [ kh<rel> hrgz<av> nmy xv|bd<v> ]np

  • StudyNNA_NXN ofINB_nxPnx theseDTB_Dnx storesNNSA_NXN showsVBZA_nx0Vnx1 thatINB_Dnx EnglandNNPA_NXN isVBZB_Vvx increasinglyRBB_ARBvx beingVBGB_Vvx transformedVBNA_nx0Vnx1 intoINB_vxPnx aDTB_Dnx societyNNA_NXN whichWDTB_COMPs neverRBB_ARBvx sleepsVBZA_nx0Vnx1 ..B_sPU
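
The two notations in the example can be read mechanically. The following is a minimal Python sketch, not the tooling used at CRL. The names parse_persian, split_fused, Word and Phrase are hypothetical. It assumes that an untagged Persian token belongs with the next tagged token (as in "n^s|n mydhd<lv>"), that brackets do not nest, and that each English token is a word followed directly by a Penn Treebank POS tag and a supertag beginning with "A_" or "B_". These assumptions are inferred from the single example above, not from a format specification.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional

# Sample sentence copied from the example above (its two lines joined).
PERSIAN_SAMPLE = (
    "[ brrsy h|y<n> |yn<det> frv^sg|hh|<n> ]np hm<av> n^s|n mydhd<lv> kh<con> "
    "[ |nglys<pn> ]np bh sr@t<av> tbdyl<lv1> [ bh<pre> J|m@h |y<n> ]pp "
    "my^svd<lv> [ kh<rel> hrgz<av> nmy xv|bd<v> ]np"
)

TAGGED = re.compile(r"^(?P<form>.+?)<(?P<tag>[^<>]+)>$")  # e.g. |yn<det>
CLOSER = re.compile(r"^\](?P<label>\w+)$")                 # e.g. ]np


@dataclass
class Word:
    form: str            # transliterated form; may span several tokens
    tag: Optional[str]   # POS tag such as "n", "det", "lv"; None if untagged


@dataclass
class Phrase:
    label: str                                  # "np" or "pp" in the sample
    words: List[Word] = field(default_factory=list)


def parse_persian(line):
    """Parse one tagged sentence into a flat list of Word and Phrase items."""
    top, inside, pending = [], None, []

    def target():
        # Words go into the open bracket if there is one, else the top level.
        return top if inside is None else inside

    def flush():
        # Untagged tokens left at a bracket boundary become a tagless Word.
        if pending:
            target().append(Word(" ".join(pending), None))
            pending.clear()

    for tok in line.split():
        closer = CLOSER.match(tok)
        if tok == "[":
            flush()
            if inside is not None:
                raise ValueError("nested brackets are not handled here")
            inside = []
        elif closer:
            flush()
            if inside is None:
                raise ValueError("closing bracket without an opening one")
            top.append(Phrase(closer.group("label"), inside))
            inside = None
        else:
            tagged = TAGGED.match(tok)
            if tagged is None:
                pending.append(tok)  # assumed to join the next tagged token
            else:
                form = " ".join(pending + [tagged.group("form")])
                pending.clear()
                target().append(Word(form, tagged.group("tag")))
    flush()
    if inside is not None:
        raise ValueError("unclosed bracket")
    return top


# Penn Treebank POS tags; longer tags precede their prefixes (NNS before NN).
PTB = (r"NNPS|NNP|NNS|NN|VBZ|VBG|VBN|VBD|VBP|VB|PRP\$|PRP|WDT|WP\$|WP|WRB|"
       r"RBR|RBS|RB|JJR|JJS|JJ|PDT|POS|SYM|DT|IN|CC|CD|EX|FW|LS|MD|RP|TO|UH|"
       r"\.|,|:")
FUSED = re.compile(rf"^(?P<word>.+?)(?P<pos>{PTB})(?P<supertag>[AB]_\w+)$")


def split_fused(token):
    """Split e.g. 'storesNNSA_NXN' into ('stores', 'NNS', 'A_NXN')."""
    m = FUSED.match(token)
    if m is None:
        raise ValueError(f"cannot split {token!r}")
    return m.group("word"), m.group("pos"), m.group("supertag")
```

Under these assumptions, parse_persian(PERSIAN_SAMPLE) yields ten top-level items, four of which are phrases labelled np, np, pp and np, and split_fused("EnglandNNPA_NXN") separates the token into "England", "NNP" and "A_NXN". The splitter is a heuristic: a word that happens to end in a tag-like string could be split in the wrong place.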

Evaluations

The bilingual tagged corpus was used in the Shiraz project to test the output of the morphological analyzer, the dictionary entries and the syntactic parser.
For testing the output of morphological analysis and dictionary lookup, the Glass-Box Evaluation system compared the system's results with the hand-produced annotations. The results were grouped as:
The results of the Glass-Box Evaluation component were used to correct mistakes in the dictionary, the stemmer or the morphological component.
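
The comparison step can be illustrated with a short Python sketch. It is not the Glass-Box Evaluation system itself: the function name glass_box_compare, the three outcome labels (match, mismatch, missing) and the system output shown are all hypothetical, and the labels are not the grouping used in the Shiraz project. The gold pairs are taken from the tagged example sentence.

```python
from collections import Counter


def glass_box_compare(gold, system):
    """Compare hand-produced tags with system tags.

    gold:   list of (form, tag) pairs from the manual annotation
    system: dict mapping a form to the tag the system assigned
            (a form absent from the dict counts as "missing")
    Returns a Counter of outcomes and a list of (form, gold_tag, system_tag)
    entries that need review.
    """
    counts = Counter()
    to_review = []
    for form, gold_tag in gold:
        system_tag = system.get(form)
        if system_tag is None:
            outcome = "missing"      # no dictionary entry or analysis found
        elif system_tag == gold_tag:
            outcome = "match"
        else:
            outcome = "mismatch"     # an analysis exists but disagrees
        counts[outcome] += 1
        if outcome != "match":
            to_review.append((form, gold_tag, system_tag))
    return counts, to_review


# Gold tags from the example sentence; the system output is made up.
gold = [("|yn", "det"), ("hm", "av"), ("kh", "con"), ("|nglys", "pn")]
system = {"|yn": "det", "hm": "av", "kh": "rel"}

counts, to_review = glass_box_compare(gold, system)
```

With this invented input, two forms match, "kh" is reported as a mismatch (con versus rel) and "|nglys" as missing. A review list of this kind is what would point an editor at the dictionary entry or morphological rule to fix.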

