Work in computational linguistics has reached the point where the performance of many natural language processing systems is limited by a lexical bottleneck. That is, such systems could handle much more text and produce much more impressive application results were it not for the fact that their lexicons are too small.
The Association for Computational Linguistics proposed that a Consortium for Lexical Research (CLR) be established, with funding from ARPA. The CLR was set up in July of 1991 and situated at the Computing Research Laboratory, New Mexico State University, Las Cruces, New Mexico, USA.
Any individual or organization wishing to make initial contact with the CLR and review its procedures, holdings, agreements, etc. should send email to the address above (or use the mailing address or fax number).
The objective of the Consortium for Lexical Research is to act as a clearing house, in the US and internationally, for lexical data and software. It shares lexical data and tools used to perform research on machine-readable dictionaries and lexicons, and communicates the results of that research, thus accelerating the scale and speed of the development of natural language understanding programs via standard lexicons and software.
A basic premise of the proposal for cooperation on lexical research is that the research must be precompetitive. That is, the CLR does not have as its goal the creation of commercial products. The goal of precompetitive research is to augment understanding of what lexicons contain and, specifically, to build computational lexicons having those contents. Members of the Consortium contribute to a repository and withdraw resources from it in order to perform their research. There is no requirement that withdrawals be compensated by contributions in kind. Members are charged an annual fee to help support the cost of running the CLR.
The task of the CLR is primarily to facilitate research, making available to the whole natural language processing community certain resources now held only by a few groups that have special relationships with companies or dictionary publishers. There is also an underlying theoretical assumption or hope: that the contents of major lexicons are very similar, and that some neutral, or polytheoretic, form of the information they contain can be at least a research goal, and would be a great boon if it could be achieved. The CLR, as far as is practically possible, accepts contributions from any source, regardless of theoretical orientation, and makes them available as widely as possible for research.
The CLR set up publicity networks to attract interested donors of materials and members. From there the Consortium defined agreements for donors and members, established a fee structure, and set up computer networking facilities to carry out donations and withdrawals of materials.
A major activity of the CLR is to negotiate agreements with providers on terms that are reassuring and advantageous to both suppliers and researchers. Major funders of work in this area in the United States have indicated interest in making participation in the CLR a condition for financial support of research.
The Computing Research Laboratory (CRL) has a range of machines appropriate for advanced computing on dictionaries (including the construction of large-scale matrices): ARPA-supported access to a Connection Machine, a Sequent Symmetry, and an IBM-ACE parallel machine, as well as a network of UNIX workstations. The Consortium has access to an appropriate range of large-scale storage machines, and capacities for accepting and providing materials by network, tape and CD.
The CLR archives include two main areas: a public area and one for members only. These are the repository for such lexical items as
Repository management involves cataloging and storing material in disparate formats, and providing for their retransmission (with conversion, where appropriate tools exist). In addition, a library of documentation describing the repository's contents and containing research papers resulting from projects that use the material is maintained. A brief description of the services provided is as follows
Progress during the year has been achieved in the three areas that correspond to these data bases
Building the contact base has entailed publicizing the Consortium and its purposes to the research community. A list of addresses was compiled. Printed and email announcements were composed and distributed, with an email address for responses. The announcement was posted in relevant newsletters and journals. To date there have been three large-scale mailouts.
Response has been enthusiastic and continuous. Conference presentations and personal contacts concerning the Consortium have included ACH/ALLC and others, in the United States, Europe, Japan, and elsewhere. Particular attention has been directed to reaching core researchers, building the current mailing list to over 500.
The mail directed to lexical@crl.nmsu.edu or to the Consortium staff has been answered individually, with queries about what people are interested in and what they might like to contribute to the archives. The responses indicate that there is a great variety of lexically related software needed and available.
Setting up procedures for receiving materials and for their legal protection has led to the drafting of membership and provider agreements. The agreements have been finalized and legally approved at NMSU, and memberships have been accepted (see below). The major problem, which has required an enormous amount of negotiation with major publishers, has been creating a general form of provider agreement that captures the interests of the major dictionary publishers in a general way, rather than one tailored to each. As reported below, there has been substantial progress.
Facilities for receiving and providing archival materials have also been set up. The directories and file transfer procedures are in place. Besides online access and deposit of materials, transfer by tape and diskette has been anticipated, all in a number of formats. Heavy security has been set up for heavily encumbered materials.
Software for handling and classifying correspondence has been written; it will permit cross-classification and sorting of member entries to match user needs with user offerings. Written in-house reports of CLR operations have been made regularly. Other software, which will enable handling of materials in varied scripts, is also under development, so that materials with a variety of orthographies can be transmitted. (Scripts include Japanese, Chinese, Korean, Cyrillic, and some Indic scripts, among others. This is an item of current interest in the lexical research community.)
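The matching of user needs with user offerings mentioned above can be illustrated with a small sketch. It is not CLR's in-house software: the function name `match_needs_to_offers`, the entry layout (name, needs, offers), and the category labels are all hypothetical, chosen only to show one plausible way of cross-classifying member entries.

```python
from collections import defaultdict


def match_needs_to_offers(entries):
    """Pair each correspondent's needs with others who offer the same category.

    `entries` is an iterable of (name, needs, offers) tuples, where `needs`
    and `offers` are sets of category labels (hypothetical layout).
    Returns a sorted list of (needer, category, offerer) tuples.
    """
    # Index: category -> names of correspondents offering it
    offerers = defaultdict(set)
    for name, _needs, offers in entries:
        for category in offers:
            offerers[category].add(name)

    matches = []
    for name, needs, _offers in entries:
        for category in needs:
            for other in offerers.get(category, ()):
                if other != name:  # do not match a correspondent to itself
                    matches.append((name, category, other))
    return sorted(matches)


# Illustrative entries only; these are not actual CLR correspondents.
entries = [
    ("A", {"tagger"}, {"thesaurus"}),
    ("B", {"thesaurus"}, {"tagger"}),
    ("C", {"tagger"}, set()),
]
matches = match_needs_to_offers(entries)
```

Indexing the offers by category first keeps the lookup for each need simple, and sorting the result gives a stable listing that can be read by correspondent name.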
One of the goals of the Consortium is to make electronic versions of dictionaries and thesauri available within the research community, and discussions and visits have continued with Oxford University Press, Collins Publishers, and Longman Group Limited. In brief, CLR now has arrangements with Harper-Collins which facilitate the purchase of their machine-readable dictionaries by members. The Consortium expects to reach that stage soon with Longmans and Oxford. All this has been slower than hoped. As these negotiations are in their final stages, CLR is turning to the major US publishers.
Materials in the CLR archives are secured by a "protection in depth" scheme. On the most public level, freely distributable materials are available via anonymous ftp. CLR maintains only a log of recent contacts for these materials. For the protection of lightly encumbered materials, CLR provides members of the Consortium with individual ftp accounts by which they can access archived material. These materials are kept separate from the publicly accessible materials and are protected by standard ftp accounting and permission software. These accounts are only valid for ftp transfer, and their passwords are changed regularly by the Consortium.
On the highest level of security, members who have received permission from the supplier of heavily encumbered material are given a special temporary ftp account which allows them access to encrypted versions of the heavily encumbered material. To obtain the material illegally, not only would the normal file permissions scheme have to be subverted, but a highly secure cryptographic system would also have to be defeated. As an additional security measure, the files are periodically re-encrypted with freshly generated random passwords. At no time is an unencrypted version of the material stored in the ftp-accessible archives.
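The three protection levels described above can be summarized in a short sketch. It is illustrative only and does not describe CLR's actual software: the names `Tier`, `User`, `can_access`, and `fresh_password` are hypothetical. It assumes only what the text states, namely anonymous access for unencumbered material, membership for lightly encumbered material, and membership plus supplier permission for heavily encumbered material, with a freshly generated random password for each re-encryption.

```python
import secrets
from dataclasses import dataclass, field
from enum import Enum


class Tier(Enum):
    UNENCUMBERED = 1        # freely distributable, anonymous ftp
    LIGHTLY_ENCUMBERED = 2  # members only, individual ftp accounts
    HEAVILY_ENCUMBERED = 3  # supplier permission required, stored encrypted


@dataclass
class User:
    is_member: bool = False
    # Heavily encumbered packages for which the supplier has granted permission
    supplier_permissions: set = field(default_factory=set)


def can_access(user: User, package: str, tier: Tier) -> bool:
    """Return True if the user may retrieve the package at the given tier."""
    if tier is Tier.UNENCUMBERED:
        return True
    if tier is Tier.LIGHTLY_ENCUMBERED:
        return user.is_member
    # Heavily encumbered: membership and per-package supplier permission
    return user.is_member and package in user.supplier_permissions


def fresh_password(nbytes: int = 16) -> str:
    """Generate a new random password for a periodic re-encryption pass."""
    return secrets.token_hex(nbytes)
```

The sketch covers only the access decision and password generation. The encryption itself is deliberately left out, since the text does not say which cryptographic system was used.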
CLR is now the center of distribution of the data required by participants in the Fifth Message Understanding Conference (MUC-5). These research groups are all working on Information Extraction (IE) from actual texts. The performance of their systems will be evaluated in August 1993. CLR's facilities provide a secure and carefully monitored means of distributing the large volumes of data, such as gazetteers, rules, training texts, etc., required to build the IE systems.
The Consortium for Lexical Research currently has in its public archives contributions of 145 different packages for lexical use. Of these 145, eighty are restricted to MUC-5, eleven are restricted to members-only (lightly encumbered) access, and the remainder are available to anyone (unencumbered). CLR has placed in the heavily encumbered category of the archive a recent version of the Alvey tools. JUMAN, the segmenter and part-of-speech tagger for Japanese, and the Xerox part-of-speech tagger have both been placed in the lightly encumbered category, for members-only access. The materials in the MUC-5 area include a database and tools for assessing the message understanding software designed and used by MUC members.
Contributions include thesauri (Roget's Thesaurus and WordNet), dictionaries (Collins English-Spanish Bilingual Dictionary and Longman's Dictionary of Contemporary English), wordlists (Gazetteer, Proper Names, and the Standard Industrial Classification Manual), technical reports (all software and lexica include technical reports relevant to the materials being investigated), morphological analyzers (ENGLEX), lexical parsers (SGML parser for text processing), typesetting software (Indian, Arabic, Korean, Vietnamese, Japanese, and Chinese fonts), dictionary interface tools (BYU's Morphogen), text analyzers (Interlinear text processor from SIL), and a phonological programming language.
Approximately 50 other contributions are in various stages of negotiations at this time, including the text of German, Greek, and Latin Vulgate Bibles, the American Heritage Dictionary, POPX, a Russian-English dictionary of political terms, and a large Chinese-English dictionary.
The CLR currently has 48 member organizations, which include 22 domestic universities, 4 foreign universities, 8 government agencies and 18 commercial companies (including Apple, Microsoft, Xerox). MUC-5 participants have all joined the CLR, providing valuable software and materials. Currently 4 more memberships are pending signature by NMSU and 8 organizations have indicated that they would like to join, but CLR has not yet received membership agreements from them.
CLR's major recent events have been two workshops. The first was a CLR workshop in January of 1992, which brought together publishers, researchers, funders and users of lexical materials for three days in New Mexico, with the support of the ACL and the NSF. The major issues of lexicon re-usability and the problems of copyright and ownership of materials were discussed extensively, and full notes of the discussions and presentations at the workshop will be distributed as a technical memorandum. The second workshop, entitled U.S./European Cooperation, took place in January 1993. It was sponsored jointly by the NSF and the European Commission to discuss international cooperation in lexical computation. Twenty-five researchers participated in the workshop. The following technical reports are available from the address at the top of this document