Consortium for Lexical Research
Newsletter 12

June 23, 1994


From the Computing Research Laboratory
New Mexico State University

Edited by: Katherine Mitchell and Jim Cowie

Contributions and inquiries to:
	lexical@nmsu.edu OR lexical@nmsu.bitnet

FTP address for accessing materials:
	clr.nmsu.edu [128.123.1.12].

Introduction

The Consortium for Lexical Research is designed to serve as a repository for software and resources of importance to the computational linguistics and natural language processing research community. CLR's objective is to help alleviate the repeated re-creation of basic software tools, and to assist in making essential data sources more generally available. For more information on the Consortium, please ftp to the site (address above) to obtain a copy of the CLR catalog. In the directory CLR/ it is available in a plain ascii file as catalog or in a postscript version, catalog.ps. Any questions about the archives or on becoming a member of CLR can be directed to lexical@nmsu.edu. This newsletter is also distributed from the ftp site. The directory is CLR/newsletter and the files to get are news12.txt, or news12.ps for a postscript version.

*************************************************

    Contents

  1. Using Anonymous FTP

  2. CLR Mosaic Site: View The CLR Catalog On-Line

  3. Linguistic Analysis Tools for the Next Platform: SYNTACTICA and INTEX

  4. Spanish - English Parallel Text from PAHO

  5. Italian Wordform List (IWL)

  6. Electronic Dictionaries and Wordlists: Tagged Format Collins

  7. Recent Acquisitions: A Quick List
*************************************************

Using Anonymous FTP

Materials stored in the CLR are constantly being updated and new acquisitions are available. If you are interested in learning what these items are, you are welcome to ftp the CLR catalog. Anonymous ftp allows non-members access to the catalogs and some unrestricted data files. Here are the steps for using Anonymous FTP. It is recommended that you get the file README.clr.site for an introduction to using the archives. Members of CLR use a login name and a special password that they are assigned. Members can access certain directories that non-members are unable to use.
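For readers who script their retrievals, an anonymous FTP session can be sketched with Python's standard ftplib module. This is only an illustration: the host name is the one given in the header of this newsletter, but the function names, the placeholder e-mail address, and the assumption that README.clr.site sits in the login directory are ours, not part of the CLR instructions.

```python
from ftplib import FTP

CLR_HOST = "clr.nmsu.edu"  # host named in the newsletter header


def retr_command(remote_path):
    """Build the FTP RETR command for one remote file."""
    return "RETR " + remote_path


def fetch_anonymous(remote_path, local_path, host=CLR_HOST,
                    email="user@example.org"):
    """Download one file over anonymous FTP.

    By convention the login name is 'anonymous' and the password is
    the caller's e-mail address (a placeholder here).
    """
    with FTP(host) as ftp:
        ftp.login(user="anonymous", passwd=email)
        with open(local_path, "wb") as out:
            ftp.retrbinary(retr_command(remote_path), out.write)


# Hypothetical usage (the remote location of the file is an assumption):
# fetch_anonymous("README.clr.site", "README.clr.site")
```

Members would pass their assigned login name and password to ftp.login in place of the anonymous credentials.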



CLR Mosaic Site: View The CLR Catalog On-Line


CLR now offers a Mosaic WWW site. The site address for the Home Page is:
	http://crl.nmsu.edu/home.html

The Home Page has links for downloading a copy of the CLR catalog, or you can view it on-line. When you are viewing the catalog, it is possible to select an item held at the ftp site and actually download that item to your machine. For security reasons, this option is not available for those resources held in members-only.

The Home Page also has the most recent newsletters on-line, and the beginnings of a collection of "sample" files for lexicons and dictionaries. Right now, sample files for the Harper Collins bilingual German and bilingual Spanish dictionaries show what these electronic dictionaries look like in their raw typesetter's form. Very soon examples of the Collins tagged format will be available. In a section called "Other On-line Lexical Resources" CLR has begun to build a selection of ftp, gopher, and WWW sites which are of interest to lexical researchers.



Linguistic Analysis Tools for the Next Platform: SYNTACTICA and INTEX


Two text analysis software packages developed for the NextStep operating system are described below. Although Next hardware is no longer marketed, the OS has been ported to both the Sun and PC platforms.

SYNTACTICA

SYNTACTICA is a software application tool designed to be used in introductory linguistics classes, or classes with a syntax component. The program provides a simple graphical interface for creating grammars, for viewing the structures they assign to natural language sentences, and for transforming those structures by movement, deletion, copying, etc.

In SYNTACTICA, grammars consist of a set of context-free phrase structure rules and a lexicon. Sets of phrase structure rules are created in a Rule Window, using a rule template. The screen shot on the next page shows a Rule Window (the window says "Rules" above and you can see the rules which have been entered: S --> NP VP, NP --> Det N, NP --> N, etc., etc.) The rule template allows the user to specify familiar category information for PS-rules and which nodes in a rule are heads. This choice determines the path by which features are passed in a phrase-marker.

Lexicons are created in a Lexicon Window, according to a template for lexical items. The screen shot shows the Lexicon Window and the lexical items: John, Mary, cake, a, baked, for. The lexicon window is displaying information about the lexical item bake. The template permits the user to enter basic information about words, including category, feature and subcategorization information, and whether the lexical item is audible or phonetically null.

Once a rule-set and a lexicon are created, they can be used to generate phrase-markers for sentences. Rules and lexicon are first loaded into the Tree Viewer Window (see screenshot). The user enters a sentence and presses the Build Tree button, and SYNTACTICA generates a phrase-marker using the grammar that has been loaded. When more than one structure is possible, SYNTACTICA computes all phrase markers and displays the range in the Parse field. By clicking on Parse 0 or Parse 1, the user views the alternative structures. Various operations can be performed on the phrase markers: right and left adjunction, substitution, deletion, copying and indexing of constituents.
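The idea of building every phrase-marker a grammar allows can be sketched in a few lines of Python. This is not SYNTACTICA's code (which runs on a Prolog engine); it is a minimal top-down enumerator. The rules S --> NP VP, NP --> Det N and NP --> N and the six lexical items come from the description above; the remaining rules and all category assignments are our assumptions, chosen so that one sentence receives two structures.

```python
# Phrase structure rules. The VP, PP and "Det N PP" rules are assumed.
RULES = {
    "S": [["NP", "VP"]],
    "NP": [["Det", "N"], ["N"], ["Det", "N", "PP"]],
    "VP": [["V", "NP"], ["V", "NP", "PP"]],
    "PP": [["P", "NP"]],
}

# Lexicon: the words are from the newsletter, the categories are assumed.
LEXICON = {"John": "N", "Mary": "N", "cake": "N",
           "a": "Det", "baked": "V", "for": "P"}


def expand(cat, words, i):
    """Yield (tree, next_index) for each way cat can span words[i:...]."""
    if i < len(words) and LEXICON.get(words[i]) == cat:
        yield (cat, words[i]), i + 1
    for rhs in RULES.get(cat, []):
        yield from expand_seq(rhs, words, i, (cat,))


def expand_seq(rhs, words, i, partial):
    """Expand the symbols of one rule body left to right."""
    if not rhs:
        yield partial, i
        return
    for child, j in expand(rhs[0], words, i):
        yield from expand_seq(rhs[1:], words, j, partial + (child,))


def all_parses(sentence):
    """Return every tree rooted in S that covers the whole sentence."""
    words = sentence.split()
    return [tree for tree, end in expand("S", words, 0) if end == len(words)]
```

With this toy grammar, "John baked a cake for Mary" has two structures, one attaching the PP inside the object NP and one attaching it to the VP, which corresponds to the Parse 0 / Parse 1 choice described above. The sketch avoids left-recursive rules, which a naive top-down method cannot handle.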

SYNTACTICA has extensive on-line help which can also be printed as a reference manual. An accompanying program, SEMANTICA, will be available in the early part of next year. SYNTACTICA and SEMANTICA were developed by Dr. Richard Larson and the SUNY-Stony Brook Semantics Lab under a grant from the National Science Foundation. SYNTACTICA runs under the NextStep operating system and uses an underlying Prolog engine, XSB, developed at the Dept. of Computer Science, SUNY-SB.

Directory: members-only/tools/ling-analysis/syntax/

INTEX

INTEX is a corpus processor. It includes large-coverage dictionaries and several grammars which can be represented by graphs, and it allows the user to build her/his own dictionary or grammar. At this time English, French and German dictionaries are included with INTEX.

INTEX automatically identifies words and morpho-syntactic patterns in large texts. The user can:

- Build a lexicon of words from the text. Terms may be simple words (e.g.: table), compounds (e.g.: word processor), or expressions (e.g.: to kick the bucket).

- Locate in the text all occurrences of a given word, even if it is inflected, or a given category, such as all feminine plural adjectives, or a morpho-syntactic pattern.

- Apply grammars, represented by graphs, to the text. It is possible to build indexes or concordances for all occurrences of the previous patterns.

- Use local grammars to resolve word ambiguities in the text, or to detect errors or deviant sequences. The user can create her/his own grammar or edit one of the built-in grammars.

A user begins work by uploading a corpus and selecting the language. INTEX counts the number of tokens, number of different tokens, and sorts them by frequency. The user then selects linguistic tools to parse the text. The tools are either dictionaries or finite state transducers (FSTs). INTEX is based on 2 large-coverage dictionaries. The DELAF dictionary contains 700,000 simple words; each entry is accompanied by canonical form, part of speech, and inflectional information. The DELACF dictionary contains over 100,000 compound terms, mostly nouns. The FSTs are entered into INTEX either by editing regular expressions, or by drawing recursive graphs. Basically, the "input" part of a FST is used to identify occurrences of words in texts; the "output" part is used to associate each identified occurrence with information. By applying dictionaries and FSTs to a text, the user builds a lexicon of that text.
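The first step described above (counting tokens, counting different tokens, and sorting by frequency) can be illustrated with a short Python sketch. The tokenizer here is a deliberately crude assumption (lower-cased runs of word characters) and is not how INTEX itself segments text.

```python
import re
from collections import Counter


def token_stats(text):
    """Return (token count, distinct token count, tokens by frequency)."""
    # Assumed tokenization: lower-case the text, keep runs of word characters.
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter(tokens)
    # most_common() sorts by descending frequency.
    return len(tokens), len(counts), counts.most_common()
```

For the text "the cat saw the other cat and the dog" this gives 9 tokens, 6 different tokens, and a frequency list headed by "the" with 3 occurrences.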

From the above lexicon, the user can locate morpho-syntactic patterns in the corpus, and index or build a concordance of these patterns. A pattern could be a syntactic pattern represented by a regular expression, such as: (+). This pattern matches any sequence of words beginning with a form of the verb etre, followed by an adverb, followed by a determiner and then a noun. More generally, the user may apply to the text grammars expressed by recursive graphs. Graphs typically represent pieces of a large coverage grammar of the language. The graphs in INTEX are easily edited. Standard operations such as union, intersection, differences, etc., help to build an easy to maintain system of hundreds of elementary graphs. Some graphs have been constructed and are included.
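To make the idea of locating every occurrence of a word "even if it is inflected" concrete, here is a small Python sketch of a dictionary-driven concordance. The toy dictionary below is invented for illustration; it only imitates the kind of information a DELAF entry carries (inflected form, canonical form, part of speech) and is not the DELAF file format.

```python
# Hypothetical entries: inflected form -> (canonical form, part of speech).
TOY_DICTIONARY = {
    "table": ("table", "N"), "tables": ("table", "N"),
    "kick": ("kick", "V"), "kicks": ("kick", "V"), "kicked": ("kick", "V"),
}


def concordance(tokens, lemma, dictionary, width=2):
    """List (left context, match, right context) for each form of lemma."""
    lines = []
    for i, tok in enumerate(tokens):
        entry = dictionary.get(tok.lower())
        if entry and entry[0] == lemma:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines
```

Searching the tokens of "He kicked the table and she kicks the tables" for the canonical form "kick" returns both "kicked" and "kicks", each with up to two words of context on either side.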

INTEX was developed by Professor Max Silberztein, at the Laboratoire D'Automatique Documentaire et Linguistique, Universite Paris 7. Licensing through CLR.



Spanish-English Parallel Text from PAHO


The Pan American Health Organization, Conferences and General Services Division, has given permission to CLR to distribute about 200 documents of parallel Spanish and English text. The text was translated by the PAHO translation staff using their SPANAM machine translation system with post-editing by human translators. Most text was originally in Spanish, but this is not consistently the case. The documents are memos, letters, reports, conference proceedings, etc., on a wide variety of topics in the domains of Public Health and Latin America. There are about 180 pairs of text, 360 individual files, which amount to approximately 8 Mb of data. The Spanish documents do contain the Spanish character encoding. Other formatting commands, such as tabs, centering, bold, etc., have been removed.

Special thanks to Dr. Marjorie Leon for her assistance in making these texts available to the NLP research community.

Directory: members-only/lexica/PAHO/



Italian Wordform List (IWL)


A list of Italian wordforms, with about 30,000 entries, has been included in the CLR archives. This lexicon was created by Professor Rodolfo Delmonte, at the Istituto di Linguistica e Didattica delle Lingue, Universita degli Studi, in Venice, Italy. The Italian Wordform List (IWL) was derived from a 500,000 word corpus. An attempt was made to use broad-based texts and cull a vocabulary which represents the most frequently used Italian words in written text today. The original corpus was composed of: the novel "La Coscienza di Zeno", by Italo Svevo; magazines with a popular focus; newspapers; monthly magazines on science and computing; and political documents. The corpus was automatically tagged with a part of speech tagger called IMMORTALE, which will also be available through CLR. Only a few corrections were made to the automatic output, and these were limited to a small subset of "hard to tag" terms.

The wordform list is "encumbered" and there is a fee of $300.00 for academic research use and $500.00 for corporate research use of this item. The license agreement form and explanation of payment of the fees can be found in:

Directory: members-only/lexica/IWL/

An example randomly excerpted from the wordform list is shown below. The tagset list is provided in the README file which is also in the above listed directory. The format of the list is: word, tags, frequency, and a letter which designates which type of corpus it was found in. The files are ASCII, tab delimited, and available as DOS zipped, Mac sea.hqx, and Unix compressed format.
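Reading a file in the four-field, tab-delimited layout just described might look like the following Python sketch. The field names are ours, the tags field is kept as an opaque string because the tagset is defined only in the README, and the line used in the usage note is invented rather than taken from the IWL.

```python
from typing import NamedTuple


class WordformEntry(NamedTuple):
    word: str
    tags: str          # left unparsed; see the README for the tagset
    frequency: int
    corpus_type: str   # single letter naming the corpus type


def parse_iwl_line(line):
    """Split one tab-delimited line into its four assumed fields."""
    word, tags, freq, corpus = line.rstrip("\n").split("\t")
    return WordformEntry(word, tags, int(freq), corpus)
```

For instance, parse_iwl_line("casa\tTAG\t12\tA\n") yields an entry whose word is "casa" and whose frequency is the integer 12; "TAG" and "A" stand in for real tag and corpus codes.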





Electronic Dictionaries and Wordlists: Tagged Format Collins


Harper Collins College Bilingual Dictionaries in Tagged Format

Harper Collins Publishers allows the Collins College Edition Bilingual dictionaries to be made available for research use through CLR. The dictionaries in this series are smaller than the large bilingual editions which are also available electronically (see Newsletter 11). The College Bilinguals contain approximately 80,000 references, 40,000 per side. Collins charges 1000 pounds sterling for academic research use and requires a signed license agreement. Additional information and application forms can be obtained from CLR. The languages available are:

  1. English - French / French - English
  2. English - Italian / Italian - English
  3. English - German / German - English
  4. English - Spanish / Spanish - English
  5. English - Portuguese / Portuguese - English
  6. French - German / German - French
  7. French - Italian / Italian - French
  8. French - Spanish / Spanish - French

The above dictionaries are available in a tagged format. The tagging is similar to SGML, but is proprietary to Collins. The tagging system is not designed for linguistic analysis, but rather for the offset printing of the paper dictionaries. It does, however, lend itself to work in Computational Linguistics. Sample files excerpted from 3 of the dictionaries, and their accompanying documents explaining the tags that were used, are housed at the CLR ftp site. Read the file in Info:/ called COLLINS.college and look in the directory:

members-only/lexica/COLLINS.DICT.samples/college_bilingual/

Below is a brief excerpt from the French - Italian bilingual sample file, followed by a portion of the file which explains the tags. These are brief examples; please get the complete sample files from the CLR site.

***********************************************************
(COMMON)
*
(HWME) eau-de-vie
(PRON) odvi
(MAIN) eau
(MNHN) 1
(HWIF) ~x-~-~
(IFGR) pl
(POSP) nf
(TRAN) acquavite $
(TGGR) f
************************************************************
(COMMON)
*
(HWME) echauffer
(PRON) e$ofe
(POSP) vt
(LBIN) aussi fig
(TRAN) scaldare
*
(RFVB) s'echauffer
(POSP) vr
(LBSF) SPORT
(TRAN) riscaldarsi
*
(LBIN) dans la discussion
(TRAN) scaldarsi
(TRAN) accalorarsi
***********************************************************
(COMMON)
*
(HWME) ecrire
(PRON) ekRiR
(POSP) vt, vi
(TRAN) scrivere
*
(RFVB) s'ecrire
(POSP) vr
(LBIN) reciproque
(TRAN) scriversi
*
(PHRS) ca s'ecrit comment?
(TRAN) come si scrive?
*
(BFORMAT)
*
(PHRS) ~ a qn (que)
(TRAN) scrivere a qn (di)
***********************************************************
(COMMON)
*
(HWME) ecrit
(HWAD) e
(PRON) ekRi, it
(MAIN) ecrire
*
(BFORMAT)
*
(POSP) pp
(XROF) ecrire
*
(COMMON)
*
(POSP) adj
(HWXT) bien ~
(TRAN) ben scritto*
(TRSB) a
*
(PHRS) mal ~
(MAIN) ecrire
(TRAN) scritto* male
(TRSB) a
*
(POSP) nm
(TRAN) scritto
*
(PHRS) par ~
(MAIN) ecrire
(TRAN) per iscritto
*******************

Excerpt from the documentation from Collins describing the tags shown above (excerpt - not the entire file):

B FORMAT FRENCH-ITALIAN DICTIONARY 08-07-93

(HWME) Main entry headword positioned full out in HEADWORD BOLD.

(HWAD) Headword add on ending to be positioned after the headword or alternative form, separated by a comma and a character space. To be set in HEADWORD BOLD.

(PRON) Phonetics surrounded by square brackets, following the headword string, (HW..), to which they belong preceded by a character space.

(MAIN) Main entry for grouping purposes NOT FOR OUTPUT. The contents of this tag should not be output.

(MNHN) Main entry homonym number NOT FOR OUTPUT. The contents of this tag should not be output.

(BFORMAT) This tag is used for grouping purposes and is NOT FOR OUTPUT.

(COMMON) This tag is used for grouping purposes and is NOT FOR OUTPUT.

(POSP) Part of speech marker should be output in ITALIC, generally preceded and followed by a character space. Where there is more than one occurrence of this tag in succession, the intervening punctuation should be a comma.

(TRAN) Translation to be output in ROMAN. Will normally be preceded by a character space and followed by either a comma, if the following tag, excluding any (TRAD)/(TRSB)/(TRCF), (TGGR) or (TL..) tags, is (TRAN)/(TREQ)/(TRGL), a semi-colon, if an indicator (LB..) tag follows or full stop if it is the end of the entry.

(LBIN) Indicator - general to be output in SLOPED ROMAN within round brackets, preceded and followed by a character space, unless following a (T...) or (X...) tag where it would be preceded by a semi-colon and character space.

(RFVB) Reflexive verb to be output in SECONDARY BOLD and separated from the preceding item by a semi-colon and a character space, except when the preceding item is (POSP), in which case it should be separated by a comma and a character space.

(PHRS) Phrase to be output in SECONDARY BOLD and separated from the preceding item by a semi-colon and a character space.
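Because each line of the tagged files is a tag in parentheses followed by its content, the excerpt above is easy to load into a program. The Python sketch below is one possible reading, not a Collins tool: it assumes that a line holding a single asterisk separates groups within an entry and that a longer run of asterisks separates entries, as in the excerpt printed in this newsletter. The sample lines are copied from that excerpt.

```python
import re

TAG_LINE = re.compile(r"^\((\w+)\)\s*(.*)$")

SAMPLE = """\
(COMMON)
*
(HWME) eau-de-vie
(PRON) odvi
(POSP) nf
(TRAN) acquavite $
************
(COMMON)
*
(HWME) echauffer
(POSP) vt
(TRAN) scaldare
*
(RFVB) s'echauffer
(POSP) vr
(TRAN) riscaldarsi
"""


def parse_entries(text):
    """Return entries; each entry is a list of groups of (tag, value)."""
    entries, entry, group = [], [], []
    for raw in text.splitlines():
        line = raw.strip()
        if set(line) == {"*"}:           # separator made only of asterisks
            if group:
                entry.append(group)
                group = []
            if len(line) > 1 and entry:  # long run: assumed entry boundary
                entries.append(entry)
                entry = []
            continue
        match = TAG_LINE.match(line)
        if match:
            group.append((match.group(1), match.group(2)))
    if group:
        entry.append(group)
    if entry:
        entries.append(entry)
    return entries


def headword(entry):
    """Return the content of the first (HWME) tag, or None."""
    for grp in entry:
        for tag, value in grp:
            if tag == "HWME":
                return value
    return None
```

Applied to SAMPLE, parse_entries returns two entries whose headwords are "eau-de-vie" and "echauffer"; the second entry has three groups, the last of which begins with the reflexive verb tag (RFVB).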

Other Collins Dictionaries in Tagged Format

The Collins COBUILD English Language Dictionary, which we described in the last newsletter, is also available in a tagged format. COBUILD was developed from the Collins Bank of English, and has over 70,000 references. Currently, the Collins English Dictionary (CED), Third Edition, is only available in typesetter's format as described in our last newsletter. However, it will be out in a tagged format sometime later this summer. The price for both dictionaries will be the same, 2000 pounds sterling for academic research use.

Longman's Dictionary Error Guide

Dr. Robert Krovetz at the University of Massachusetts, Dept. of Computer Science, has prepared a very useful guide to the errors found in the machine-readable version of the Longman Dictionary of Contemporary English (LDOCE). The guide refers to the first edition of the dictionary (1978), and in particular the "lisp" version. The guide has a section on translation errors, those errors which resulted from the conversion of the original tape into Lisp s-expressions. Another section has listings of errors found in particular fields: part-of-speech, subject codes, selectional restrictions, definitions, and run-ons. The pronunciation field and the grammar codes are not examined, nor are errors identified with the box codes on morphology. Dr. Krovetz will be releasing a revised edition of the guide shortly which will also be available through CLR. More Info: info/LDOCE.guide

Ftp Directory: CLR/resources/

Gaelic Dictionary

An electronic version of Alexander MacBain's "An Etymological Dictionary of the Gaelic Language" is available thanks to Kevin Donnelly, who typed it in. The file is in ASCII, so all the typographic representations are in ASCII. The dictionary includes some grammatical information, and complete definitions.

Ftp Directory: CLR/multiling/gaelic/

German Military Terms Dictionary

The ARI German->English On-Line Military Dictionary for MS Windows was developed by Jonathan Kaplan, the Army Research Institute, Alexandria, Virginia. This is a German-English glossary in either of two electronic formats: a MS Windows Write word processor document or a plain ascii file. When used under Windows as a Write application, paging and searching functions are available. The glossary has approximately 6,200 German words. The terminology is largely military terms although a basic German vocabulary is also included. Each entry is one line of text, beginning with the head word followed by a simple part of speech, then a brief definition or equivalent in English. Standard German orthography, i.e., the umlaut and "es-tset" (ß), is retained throughout. A special thank you to Dr. Melissa Holland for making ARI resources available.

Ftp Directory: members-only/lexica/German.dict.mil/

Wordlists in CLR

This is just a reminder that CLR houses a variety of wordlists, and lists of proper names, countries, currencies, etc.

Countries and Currencies: there is a collection of files listing country names, currencies, and country and currency codes. They are in the directory: members-only/lexica/country.currency.codes. Included are the SWIFT currency codes, and a list of singular and plural currency names extracted from the CIA World FactBook. There are files of ISO and FIPS country listings, with codes. These files were provided to CLR by the Department of Defense.

The Gazetteer: The Tipster Project Gazetteer is a compilation of gazetteer information from a variety of sources, primarily the CIA's RWDB2, the US Geological Survey, the CIA World Factbook, and the Board of Geographic Names namelists. Version 4.0 of the gazetteer has over 240,000 place names. It is in: CLR/members-only/lexica/wordlists/gazetteer/.

Proper Names - People and Corporations: the personal names lists were produced from various sources, including the combined student directories of Cornell, UNC Chapel Hill and NMSU, for the Tipster Project. There are lists of first names, last names, and a short list of titles. These are in members-only/lexica/wordlists. The subdirectory corporations/ contains a file with over 50,000 corporations listed with their countries of origin, along with a glossary of corporate abbreviations and designators. The corporation names were compiled by Eric Iverson at New Mexico State University.

German Wordlist: This is a German wordlist for ispell made available by Geoff Kuenning. It is more helpful than most wordlists because it is split into a number of separate files, including abbreviations and acronyms, geographic names and names, technical terms, adjectives, conjugated verbs, compound words, and then the file "all other words". It is in the directory: CLR/lexica/wordlists/german.dict/

Wordlists from the Fifth Message Understanding Conference: Files gathered for MUC-5 were later released to all members of CLR. Files include a nationalities list for 216 countries, with both noun and adjective form of the nationality. Also there are two files on organization names, one listing UN organizations and the other 187 international organizations. BBN Corporation deposited files of corporation names in Japanese and English, and Japanese human names and place names. These files are in the directory: members-only/lexica/MUC5.wordlists/.



Recent Acquisitions: A Quick List


Fonts ISO - LATIN 8859

Ftp Directory: multiling/fonts.iso.latin/

Set of fonts for ISO - LATIN 8859; 1 through 9, plus cyrillic, greek, and hebrew. It looks like these fonts will cover all of the Baltic and Eastern European languages in addition to the Western European ones.

Geta_Run

Ftp Directory: members-only/tools/text-analysis/Geta_Run/

Geta_Run is an experimental multilingual system for text understanding which was developed by Professor Rodolfo Delmonte at the Universita degli Studi di Venezia, Istituto di Linguistica e Didattica delle Lingue. Geta_Run represents a linguistically based approach to text understanding that addresses the need to restrict access to extralinguistic knowledge of the world by contextual reasoning; i.e. reasoning from linguistically available cues. It is intended to show how linguistic knowledge can be put to use and external knowledge of the world accessed "only when needed", parsimoniously and independently, by the system. At present Italian, English and German are implemented and all three languages have limited but updatable lexicons. The parsing system is based on the LFG theoretical framework. Basic grammatical representation modules are the lexicons, and C-structure and F-structure, which are internally represented as graphs. The parser is a DCG which exploits the properties of Prolog in its general parsing strategy. Geta_Run is written in Prolog for the Macintosh platform, and has accompanying documentation. Please be sure to sign the license agreement if you wish to experiment with this software. More Info: info/GETA.RUN.

ETL Parser

Ftp Directory: members-only/tools/ling-analysis/syntax/ETL/

The ETL parser is a parsing engine for augmented context-free grammar (CFG). It is a completely parallel parsing system which uses the Earley algorithm and is designed for languages such as Japanese which are not word delimited. The ETL parser treats each word entry as a grammar rule, and each character used in words is written in a dictionary. A sample grammar and dictionary for the analysis of Japanese verbal phrases are distributed with the parsing engine. The ETL parser has 2 versions: one for 1-byte coded character strings, and the other for 2-byte coded which can handle sentences written in Kanji or Kana. The software was developed by Dr. Hitoshi Isahara, at the Electrotechnical Laboratory (ETL) of the Agency of Industrial Science and Technology, Ministry of International Trade and Industry in Tsukuba, Japan. A license agreement for the use of the software is housed in the directory listed above, and must be signed and returned to ETL. A paper describing the techniques used by the ETL parser is also included. More Info: info/ETL.parser
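For readers unfamiliar with the Earley algorithm, the Python sketch below shows a bare recognizer that works character by character, so that word entries can be written as grammar rules over characters and the input needs no word delimiters. It is not the ETL parser: it is sequential rather than parallel, has no augmentations, does not support empty rules, and uses an invented English toy grammar in place of the distributed Japanese one.

```python
# Toy grammar: each word is a rule whose body is a sequence of characters.
GRAMMAR = {
    "S": [("NP", "VP")],
    "NP": [("Det", "N")],
    "Det": [tuple("the")],
    "N": [tuple("cat"), tuple("cats")],
    "VP": [tuple("sleeps"), tuple("sleep")],
}


def earley_recognize(grammar, start, text):
    """Return True if text can be derived from start (no empty rules)."""
    # An item is (lhs, rhs, dot position, origin position).
    chart = [set() for _ in range(len(text) + 1)]
    for rhs in grammar[start]:
        chart[0].add((start, rhs, 0, 0))
    for i in range(len(text) + 1):
        agenda = list(chart[i])
        while agenda:
            lhs, rhs, dot, origin = agenda.pop()
            if dot < len(rhs):
                sym = rhs[dot]
                if sym in grammar:
                    # Predict: start every rule for the expected symbol.
                    for prod in grammar[sym]:
                        new = (sym, prod, 0, i)
                        if new not in chart[i]:
                            chart[i].add(new)
                            agenda.append(new)
                elif i < len(text) and text[i] == sym:
                    # Scan: the next character matches the terminal.
                    chart[i + 1].add((lhs, rhs, dot + 1, origin))
            else:
                # Complete: advance items that were waiting for lhs.
                for plhs, prhs, pdot, porigin in list(chart[origin]):
                    if pdot < len(prhs) and prhs[pdot] == lhs:
                        new = (plhs, prhs, pdot + 1, porigin)
                        if new not in chart[i]:
                            chart[i].add(new)
                            agenda.append(new)
    return any(lhs == start and dot == len(rhs) and origin == 0
               for lhs, rhs, dot, origin in chart[len(text)])
```

The undelimited string "thecatsleeps" is accepted even though "cats" is also a word in the grammar: the chart keeps both segmentations alive until the remaining characters decide between them.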



CLR Membership and New Members


The members-only area of the CLR archives is rapidly increasing its volume with valuable materials and software which are available only to members of the Consortium. If your interests lie in lexical research, computational linguistics, or natural language processing, CLR encourages your organization to become a member. Membership not only provides your organization with resources, but allows this ftp site and its services to be maintained and to grow.

Welcome to new CLR member organizations and their contact staff:

Dr. Christian Boitet, of the Groupe d'Etude pour la Traduction Automatique, Institut IMAG, Domaine Universitaire, Grenoble, France.

Dr. Enrique Daltabuit, Director, and Randall Sharp, Computational Linguist, of the Direccion General de Servicios de Computo Academico, Universidad Nacional Autonoma de Mexico, Mexico.

Dr. Rodolfo Delmonte, of the Istituto di Linguistica e Didattica delle Lingue, Universita degli Studi di Venezia, Venice, Italy.

Dr. Abolfazl Fatholahzadeh and Dr. Claude Lhermitte, SUPELEC, Ecole Superieure d'Electricite, Metz, France.

Dr. Michael Hess of the Fachbereich Informatik, Institut fur Computerlinguistik, Universitat Koblenz-Landau, Koblenz, Germany.

Dr. Hwee Tou Ng and Dr. Khee Yin How of the Defense Science Organization, Computer Research Division, Republic of Singapore.

Dr. James Pustejovsky, of the Computer Science Department, Brandeis University, Waltham, Massachusetts.

Dr. Penelope Sibun of Fuji Xerox Company, Ltd., Palo Alto, California.

Dr. Stan Szpakowicz, of the Knowledge Acquisition Laboratory, Department of Computer Science, University of Ottawa, Ottawa, Ontario, Canada.

Dr. Yorick Wilks, of the Institute of Language, Speech and Hearing, the University of Sheffield, Sheffield, England.

Dr. Dekai Wu, of the Computer Science Department, Hong Kong University of Science and Technology, Hong Kong.