This is the CLR catalog: a Table of Contents is followed by the catalog itself, which gives short, paragraph-length descriptions of the materials available in the CLR archives.
When you are in the catalog you may select an item and:
Below is the Table of Contents listing the CLR materials alphabetically. Each is linked to its section of the catalog. You may use the Table of Contents or go directly to the Catalog.
Ftp Directory An ASCII text file containing a very comprehensive list of acronyms, with over 3300 entries. A wide variety of domains are covered, including business, science, medicine, government, and more. More info: here. A brief sample:
  NAS    National Academy of Sciences
  NAS    National Advanced Systems
  NASA   National (US) Aeronautics and Space Administration [Space]
  NASDA  NAtional (Japan) Space Development Agency [Space]
  NASM   National (US) Air and Space Museum [Space]
  NASP   National (US) AeroSpace Plane [Space]
  NATO   North Atlantic Treaty Organization
Ftp Directory Afgrep is a variant of the mgrep algorithm from the agrep package. It provides a high speed multi-string search of a file in the same manner as the fgrep program. It also has fewer arbitrary limitations. More Info: here.
Ftp Directory Agrep is a tool for high speed text searching, allowing for errors. Agrep is similar to fgrep, egrep, and grep, but is more general and usually much faster. The three most significant features of agrep are: 1. The ability to search for approximate patterns. 2. Agrep is record oriented, not just line oriented. 3. Multiple patterns can be specified with logical operators (AND and OR) for queries. More Info: here.
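To illustrate what searching for approximate patterns means in the Agrep entry above, here is a minimal sketch in Python. It is not agrep's own algorithm or interface: it uses a plain dynamic-programming edit-distance scan, and the function names `approx_find` and `approx_grep` are hypothetical, chosen only for this example.

```python
def approx_find(pattern, text, max_errors):
    """Return True if some substring of text is within max_errors
    edits (insertions, deletions, substitutions) of pattern."""
    # prev[i] = best edit distance between pattern[:i] and any
    # substring of the text ending at the previous position.
    prev = list(range(len(pattern) + 1))
    if prev[-1] <= max_errors:
        return True
    for ch in text:
        curr = [0]  # a match may start anywhere, so row 0 costs nothing
        for i, pch in enumerate(pattern, start=1):
            cost = 0 if pch == ch else 1
            curr.append(min(prev[i - 1] + cost,  # match or substitution
                            prev[i] + 1,         # extra character in the text
                            curr[i - 1] + 1))    # pattern character missing
        if curr[-1] <= max_errors:
            return True
        prev = curr
    return False


def approx_grep(pattern, lines, max_errors=1):
    """Keep the lines that contain an approximate match (hypothetical helper)."""
    return [line for line in lines if approx_find(pattern, line, max_errors)]


if __name__ == "__main__":
    for line in approx_grep("acronym", ["a list of acronims", "no match here"]):
        print(line)
```

With `max_errors=0` the sketch behaves like an exact substring search; raising the limit admits misspellings such as "acronims" for "acronym".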
Ftp Directory ArabTeX is a package extending the capabilities of TeX/LaTeX to generate Arabic writing from an ASCII transliteration for texts in several languages using the Arabic script. It consists of a TeX macro package and an Arabic font in several sizes, presently only available in the Naskhi style. ArabTeX will run with Plain TeX and also with LaTeX; other additions to TeX have not been tried. ArabTeX is primarily intended for generating Arabic writing, but the scientific transcription can also be generated easily. For other languages using the Arabic script, limited support is available. This package also has the option of typesetting fully vocalized text (with vowel diacritics as in the Qur'an) and/or transcriptions as well. More Info: here.
Ftp Directory ARCSGML is a set of tools for setting up and working with text that is tagged with your own specialized tags in SGML format. The tags permit you to label text structures (such as Part-of-speech, syntactic, morphological, semantic, or discourse structures). The parser is for selectively pulling out corresponding tagged pieces of text. The ARCSGML toolkit is for use in developing conforming SGML parsers, systems, and applications. A validator (for checking your tagged text) is supplied. It supports the standard SGML reference concrete syntax (beginning from 1983) in all features except LINK, CONCUR, and SUBDOC (although some hooks are in place to get you started on these). [The package was originally written to validate the 1983 working draft of the SGML standard, and was subsequently maintained to track the standard through its final phases of development, culminating in the amendment.] Executable sourcecode programs for versions for PC and Unix C (MSDOS REXX, MSDOS C, and Unix C) are provided. More Info: here.
Ftp Directory ATeX is a simple extension to LaTeX allowing the user to typeset, edit and print documents in Arabic, or in a combination of English and Arabic. Simple updates to one of the files will allow one to get ATeX to do Farsi, Urdu, etc., since the extra-Arabic characters for these are present in the font used. Also supplied with this package is the MSDOS executable for using it on PCs. More Info: here.
Ftp Directory The Attribute Value Parser provides a general tool for investigating unification-based theories of grammar, runs on Apple Macintosh computers, and was developed by Mark Johnson. It works with a user-defined grammar, specified in a file or constructed using the editor included, and constructs parse trees and feature structures from input sentences. Clicking on the nodes in the parse tree causes their associated feature structures to be displayed. There are two versions of the parser, corresponding to the two versions of Apple's CommonLisp environment that were used to create them. The 1.32 version was created with MACL version 1.32, and the 2.0p2 version was created with MCL 2.0p2. More Info: here.
Ftp Directory Bamboo Helper is a shareware program by Carlos McEvilly that transliterates Chinese text files into Pinyin, Wade-Giles, Yale or Zhuyinfuhao formats. The program segments Chinese text to add breaks between words, identifies the correct pronunciation for characters with more than one pronunciation, and contains a dictionary. Intended for students of Chinese, the program outputs vocabulary lists and flash cards. Bamboo Helper extracts Chinese text strings and ASCII text strings from binary files. With no editing functions, it is best used as a supplement to a Chinese display system word processing environment. More Info:/BAMBOO.
Ftp Directory CGParser is an implementation of a linear parser of Conceptual Graphs as described in John Sowa's book _Conceptual Graphs_. It was written using the YACC general-purpose language utility. Some simple and more interesting examples are provided for testing purposes. More Info: here.
Ftp Directory This package, developed by Kosta Kostis, generates a converter based on character encoding description files, one for the source encoding and one for the destination encoding. The description files are pure ASCII files using ISO 10646 names. The package includes 74 character encoding description files covering character encodings from: Adobe, Apple, Atari, DEC, EBCDIC, HP, ISO 646*, ISO 8859*, IBM Codepages, Microsoft Windows Codepages, NeXT. This is an extremely useful package for researchers in multilingual text processing. There is also a subdirectory called "extras" with additional files of interest to people working with ISO character sets. More Info: here.
Ftp Directory This pronunciation dictionary was generated at Carnegie Mellon University for the purpose of tuning and developing speech understanding systems. The dictionary contains approximately 100k words and their transcriptions. Several independent sources were used in its development, such as the UCLA Shoup dictionary, a subset of the Dragon dictionary, and various other dictionaries that were hand-built, synthesizer-generated, or generated with Orator and Mitalk. Robert Weide and Peter Jansen from CMU developed this dictionary for any purpose, commercial or otherwise. Versions 0.1 and 0.2 are both in this directory. More Info: here.
Ftp Directory COGNATE is the implementation of a prototype algorithm for identifying related words across languages. Given the same list of words in two different languages, COGNATE will determine which words are likely to be regularly derivable from each other, and which are not. COGNATE is only available as an MSDOS executable and comes with Dutch, English, and German word lists. More Info: here.
Ftp Directory This "dictionary" is a set of Prolog facts derived from the first published edition of the Collins English Dictionary. It was originally created by Dr. Ed Fox and Dr. Robert Vance at Virginia Tech for the CODER lexicon Project. The factbase consists of 20 files, one for each relation identified in the structure of the Collins Dictionary. Each relation file consists of ground facts in Edinburgh standard Prolog syntax, one fact per line. The headword relation file has no accompanying data. In the rest of the files, the facts are of the form: name (where name identifies the relation); descriptor (where descriptor specifies both the associated entry and the depth within it at which the fact is bound); and data (data represents the data stored for that information). More info: here.
Ftp Directory The Collins English Dictionary, Third Edition, published in 1991, contains 180,000 references, 190,000 numbered definitions, 14,000 new or updated entries from the last edition, and 16,000 biographical and geographic entries. Harper Collins makes the CED3 available to researchers for a fee; please contact CLR for more information. This directory contains an electronic sample which shows what the ascii text typesetters tape format is like.
Ftp Directory Ftp Directory Harper Collins Publishers have a line of monolingual and bilingual machine readable dictionaries available through CLR. For complete information on product and pricing, please contact CLR. The files in this directory are electronic samples from two of the large bilingual dictionaries: the large German - English and the large Spanish - English. The large German - English dictionary contains over 280,000 references and 460,000 translations. The large Spanish - English contains over 230,000 references and 440,000 translations. The dictionaries are available on tape, in typesetters coded format. The purpose of the samples is to demonstrate the types of linguistic data available, and the format of the tapes. More Info: here.
Ftp Directory Harper Collins also makes available its medium-sized bilingual dictionaries, the Collins College Edition Bilingual Series. These have approximately 80,000 references, 40,000 per side. They come in a 'tagged' format, with tagging somewhat similar to SGML. The tagging is done for printing purposes, not for linguistic analysis, but it does lend itself to computational linguistic work. Some samples excerpted from the dictionaries, which show the tagging output, along with documents explaining the tags used, are available in the directory listed above. More Info: here.
Ftp Directory Conc produces concordances of texts. A concordance consists of a list of the words in the text with a short section of the context that precedes and follows each word. Conc also produces an index, consisting of a list of the distinct words in the text, each with the number of times it occurs and a list of the places where it occurs. Conc displays the original text, the concordance, and the index each in its own window. Clicking on a word in any one of the three windows causes the other two windows to display the entries for the same word. More Info: here.
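The two structures that the Conc entry describes, an index of distinct words with their positions and a concordance showing each occurrence in context, can be sketched as follows. This is an illustration only, not Conc's implementation; the function names, the word pattern, and the `width` parameter are assumptions made for the example.

```python
import re
from collections import defaultdict


def build_index(text):
    """Return the word list and an index mapping each distinct
    (lowercased) word to the positions where it occurs."""
    words = re.findall(r"[A-Za-z']+", text)
    index = defaultdict(list)
    for pos, word in enumerate(words):
        index[word.lower()].append(pos)
    return words, dict(index)


def concordance(words, index, keyword, width=2):
    """List each occurrence of keyword with `width` words of context
    on either side; the keyword itself is shown in brackets."""
    lines = []
    for pos in index.get(keyword.lower(), []):
        left = " ".join(words[max(0, pos - width):pos])
        right = " ".join(words[pos + 1:pos + 1 + width])
        lines.append(f"{left} [{words[pos]}] {right}".strip())
    return lines


if __name__ == "__main__":
    words, index = build_index("The cat saw the dog and the bird")
    print(len(index["the"]), "occurrences of 'the'")
    for line in concordance(words, index, "the", width=1):
        print(line)
```

The number of occurrences of a word is simply the length of its position list, which matches the index described in the entry (a count plus a list of places).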
Ftp Directory This is a collection of text file lists of countries, country codes, currencies, and currency codes. Included are the SWIFT currency codes with files arranged alphabetically by country, and alphabetically by currency code. There is also a file of single and plural currency names which was extracted from the CIA World Factbook. In addition there are files of ISO and FIPS country listings with codes. This collection was provided by the Department of Defense in connection with the MUC-5 conference. More Info: here.
Ftp Directory Collection of hand-tagged English and Japanese texts and an accompanying evaluation program for comparison of machine tagging. Texts were hand tagged for human names and organization names. Scoring software allows evaluation of machine proper name tagging of the same texts. The scoring report gives a value for recall and precision. More Info: here.
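As a rough illustration of the recall and precision values such a scoring report gives, the sketch below compares machine tags against hand tags. It assumes each tag is a `(start, end, type)` tuple and counts only exact matches; the distributed scoring software may represent tags differently and may use other matching rules, so treat the names and the tag format here as hypothetical.

```python
def score_tags(gold, predicted):
    """Return (recall, precision) for predicted tags against hand tags.

    Both arguments are collections of (start, end, type) tuples;
    a predicted tag counts as correct only if it matches exactly.
    """
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)
    recall = correct / len(gold) if gold else 0.0
    precision = correct / len(predicted) if predicted else 0.0
    return recall, precision


if __name__ == "__main__":
    hand = [(0, 2, "PERSON"), (5, 7, "ORG"), (9, 10, "PERSON")]
    machine = [(0, 2, "PERSON"), (5, 7, "PERSON")]
    recall, precision = score_tags(hand, machine)
    print(f"recall={recall:.2f} precision={precision:.2f}")
```

Recall is the share of hand-tagged names that the machine found; precision is the share of machine tags that were right.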
Ftp Directory Collection of tools to segment text and count the pieces. Programs are included that segment English text into sentences, words and word-level n-grams. In addition programs to count strings are provided, as well as programs which can statistically analyse the results of these counts. These utilities can be used with other natural languages if a word extraction program is available for use as a pre-processor. These programs were used to produce the data described in the paper ``Accurate methods for the statistics of surprise and coincidence'' which appeared in the March 1993 issue of Computational Linguistics. All programs are written in C and have been run under SunOS 4.1.1, but should be portable to other environments which support long integers as default and large arrays. More Info: here.
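The segment-and-count pipeline described in the entry above (sentences, then words, then word-level n-grams, then counts) can be sketched in a few lines. The distributed programs are written in C; this Python version is only an illustration, and its naive sentence and word rules and its function names are assumptions, not the package's own.

```python
import re
from collections import Counter


def split_sentences(text):
    """Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def split_words(sentence):
    """Lowercase the sentence and keep runs of letters, digits and apostrophes."""
    return re.findall(r"[a-z0-9']+", sentence.lower())


def word_ngrams(words, n):
    """All runs of n consecutive words, as tuples."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def count_ngrams(text, n):
    """Count word n-grams without letting them cross sentence boundaries."""
    counts = Counter()
    for sentence in split_sentences(text):
        counts.update(word_ngrams(split_words(sentence), n))
    return counts


if __name__ == "__main__":
    for ngram, freq in count_ngrams("The dog barks. The dog sleeps!", 2).most_common():
        print(freq, " ".join(ngram))
```

For another language, only `split_words` would need replacing, which mirrors the entry's remark about using a word extraction program as a pre-processor.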
Ftp Directory Ftp Directory These word lists were produced from various sources including the combined student directories of Cornell, UNC Chapel Hill, and NMSU, for the TIPSTER project. Lists of first names, last names, and personal titles are provided. In the corporations directory is a list of over 50,000 corporations and their countries of origin, along with a glossary of corporate abbreviations and designators. More Info: here.
Ftp Directory Ftp Directory DIMAP-2 is a set of PC dictionary creation and maintenance utilities that are modeled loosely and more flexibly on the utilities developed by Tom Ahlswede and described at ACL 85. DIMAP-2 also comes with a linked machine-readable dictionary (Merriam-Webster Concise Electronic Dictionary, with 80,000 entries). The design of the software is intended to facilitate the development of NLP lexicons in any formalism (following Allen, HPSG, lexical conceptual structures, ECD; using Lisp, Prolog, ASCII), which then belong to the developer. The software itself is obtainable for single users at $125 and for academic institutions at $500. Commercial copies are available for $2,400 for one copy and $6,000 for a site license. A demonstration version is also available. The software is currently being ported to the UNIX environment. An MSDOS executable demo is provided and some of the DIMAP-2 features are restricted for the demo. More info: here. Dimap-2 is also available for the Sun-4. More Info: here.
Ftp Directory EDICTJ is a small public-domain Japanese/English Dictionary (dual-language glossary) in machine-readable form authored by Jim Breen. Having the neat form of a dual-language, dual-script glossary, it can readily be used in any number of applications. [It was initially intended for use with MOKE (Mark's Own Kanji Editor) and related software such as JDIC.] More Info: here.
Ftp Directory Englex is a basic lexicon for morphological analysis of English text. It uses the standard orthography for English. It is intended for use with PC-KIMMO (or programs that use the PC-KIMMO parser, such as KTEXT). With such software and Englex, you can produce sets of records of the morphological constituents in English texts. Practical applications include morphological preprocessing of text for a syntactic parser and producing morphologically tagged text. Englex can also be used to explore English morphological structure. More Info: here.
Ftp Directory The Eng-Chi Dictionary, Version 1.0, by John Rittinghouse allows users to quickly find Chinese definitions for English entries. It allows on-line lookup of English words, with display in Chinese characters, Pinyin, Wade-Giles, and Yale romanization. The full dictionary consists of 107,750 entries. This is an MSDOS or Windows program. More info: here.
Ftp Directory The ETL parser is a parsing engine for augmented context-free grammar (CFG). It is a completely parallel parsing system which uses the Earley algorithm and is designed for languages such as Japanese which are not word delimited. The ETL parser treats each word entry as a grammar rule, and each character used in words is written in a dictionary. A sample grammar and dictionary for the analysis of Japanese verbal phrases are distributed with the parsing engine. The ETL parser can parse sentences written in EUC or Mule internal code. The software was developed by the Electrotechnical Laboratory (ETL) of the Agency of Industrial Science and Technology, Ministry of International Trade and Industry in Tsukuba, Japan. A license agreement for the use of the software is housed in the directory listed above, and must be signed and returned to ETL. A paper describing the techniques used by the ETL parser is also included. More Info: here.
Ftp Directory FLEX, Fast Lexical Analyzer Generator, was developed by Vern Paxson at the Lawrence Berkeley Laboratory in Berkeley, CA. Flex is a tool for generating programs which recognize lexical patterns in text. Flex reads the given input files for a description of a scanner to generate. The description is in the form of pairs of regular expressions and C code; these are called rules. Flex generates as output a C source file, lex.yy.c, which defines a routine yylex(). This file is compiled and linked with the library to produce an executable. When the executable is run, it analyzes its input for occurrences of the regular expressions. Whenever it finds one, it executes the corresponding C code. More info: here.
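The idea of rules as pairs of a regular expression and an action, as in the FLEX entry above, can be mimicked in a few lines of Python. This is only an analogy: Flex itself generates a C scanner from rules whose actions are C code, and the `make_scanner` helper below is hypothetical. The sketch does follow Flex's convention of preferring the longest match and, on a tie, the rule listed first.

```python
import re


def make_scanner(rules):
    """Build a scanner from (regular expression, action) pairs.

    Each action is called with the matched text; a return value
    of None means the match is discarded.
    """
    compiled = [(re.compile(pattern), action) for pattern, action in rules]

    def scan(text):
        results = []
        pos = 0
        while pos < len(text):
            best = None
            for regex, action in compiled:
                m = regex.match(text, pos)
                # Longest match wins; the earlier rule wins a tie.
                if m and m.end() > pos and (best is None or m.end() > best[0].end()):
                    best = (m, action)
            if best is None:
                pos += 1  # no rule matched: skip one character
                continue
            m, action = best
            out = action(m.group())
            if out is not None:
                results.append(out)
            pos = m.end()
        return results

    return scan


if __name__ == "__main__":
    scan = make_scanner([
        (r"[0-9]+", lambda s: ("NUMBER", int(s))),
        (r"[A-Za-z_][A-Za-z0-9_]*", lambda s: ("IDENT", s)),
        (r"\s+", lambda s: None),  # ignore whitespace
    ])
    print(scan("x1 = 42"))
```

One difference worth noting: a Flex-generated scanner by default copies unmatched input to its output, whereas this sketch silently skips it.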
Ftp Directory Fonol is a programming language for writing out and applying TG-style phonological rules (modelled on Chomsky and Halle 'Sound Pattern of English' and Schane 'Generative Phonology') to see their effect. It also incorporates the input and output filters (conditions) which came into common use about the same time. It is intended to aid students of phonology to grasp the ideas behind phonological rules and to help phonologists manage large complex bodies of rules in the theory of their choice. (Notation style modified for writing in IBM PCs.) More Info: here.
Ftp Directory Set of fonts for ISO - LATIN 8859 1 through 9. Also, cyrillic, greek, and hebrew. It looks like these fonts will cover all of the Baltic and Eastern European languages.
Ftp Directory French Plus! (authored by Gene Hayworth) is a tutorial and testing program, divided into three sections: Vocabulary Review, Vocabulary Exercises, and Verb Conjugation Exercises. The demo version, available for evaluation purposes only, contains approximately 35 words; the full version includes a combination of more than 800 nouns, adjectives, and commonly used verbs, with their conjugations in four tenses. Accent marks have not been included in order to make this program compatible with as many systems as possible. The program will run from either a floppy disk or a hard drive when installed in a single directory. More Info: here.
Ftp Directory FUF 5.2 and SURGE 1.2 were developed by Michael Elhadad, currently at Ben Gurion University of the Negev. FUF is an extended implementation of the formalism of functional unification grammars (FUG's) introduced by Martin Kay, specialized to the task of natural language generation. SURGE is a large syntactic realization grammar of English, written in FUF. SURGE is developed to serve as a "black box" syntactic generation component in a larger generation system that encapsulates a rich knowledge of English syntax. SURGE can also be used as a platform for exploration of grammar writing with a generation perspective. More info: here.
Ftp Directory This dictionary is an electronic version of Alexander McBain's "An Etymological Dictionary of the Gaelic Language", from 1911. It was typed in by Kevin Donnelly. The file is in ASCII, so all the typographic representations are in ASCII. For an excerpt, see the More Info file. More Info: here.
Ftp Directory The TIPSTER Gazetteer is a compilation and reformulation of gazetteer information from a number of sources, primarily CIA's RWDB2, the US Geological Survey, the CIA World Factbook (version 11), and the Board of Geographic Names namelists. Coverage varies drastically, based on the degree of completion for the region or attribute. Version 4.0 has over 240,000 place names. Depending on the source, multiple entries may exist for the same geographic entity, under various spellings. More Info: here.
Ftp Directory The ARI German-English On-Line Military Dictionary for MS Windows was developed by Jonathan Kaplan, the Army Research Institute, Alexandria, Virginia. This is a German-English glossary in either of two electronic formats: a MS Windows Write word processor document or a plain ASCII file. When used under Windows as a Write application, paging and searching functions are available. The glossary has approximately 6,200 German words. The terminology is largely military terms although a basic German vocabulary is also included. Each entry is one line of text, beginning with the head word followed by a simple part of speech, then a brief definition or equivalent in English. Standard German orthography, i.e., the umlaut and "es-tset" (ß), is retained throughout. A special thank you to Dr. Melissa Holland for making ARI resources available. More Info: here.
Ftp Directory This is a German dictionary for ispell, originally created by Martin Schulz and made available by Geoff Kuenning. It contains a number of separate wordlists, including abbreviations and acronyms, geographic names and names, technical terms, and adjectives, conjugated verbs, compound words, and "all other words". More info: here.
Ftp Directory This German Stemmer, designed by Daniel Stieger from Institut fuer Informationssysteme, Zuerich, is available in Modula-2. Its design is based on the "Porter Algorithm". It includes only one semester's work and is, therefore, unfinished. The program uses an automatically generated dictionary of 215,000 German words for the decomposition task. The dictionary is available along with the software. A report written in German is available in hard copy. More Info: here.
Ftp Directory Geta_Run is an experimental multilingual system for text understanding which was developed by Professor Rodolfo Delmonte at the Universita Degli Studi di Venezia, Instituto di Linguistica e Didattica delle Lingue. Geta_Run represents a linguistically based approach to text understanding that addresses the need to restrict access to extralinguistic knowledge of the world by contextual reasoning, i.e., reasoning from linguistically available cues. It is intended to show how linguistic knowledge can be put to use and external knowledge of the world accessed "only when needed", parsimoniously and independently, by the system. At present Italian, English and German are implemented, and all three languages have limited but updatable lexicons. The parsing system is based on the LFG theoretical framework. Basic grammatical representation modules are the lexicons, and C-structure and F-structure, which are internally represented as graphs. The parser is a DCG which exploits the properties of Prolog in its general parsing strategy. Geta_Run is written in Prolog for the Macintosh platform, and has accompanying documentation.
Ftp Directory MSDOS executables and data for the Grammar Transformation System. GRAMTSY interprets the more or less linguistically familiar notation of transformational grammars and applies the grammars, so that linguists may analyze and interpret the implications of intricate TG-rule or TG-grammar networks. More Info: here.
Ftp Directory "Greek" is a (di)troff filter that takes Greek text written with Latin characters using the typewriter letter correspondence, and converts it into the corresponding character sequences for the Greek letters. "Greek" follows the ``monotoniko'' (single-accent) system. Text may have intermixed Greek and Roman characters. More Info: here.
Ftp Directory These are the half-Uncial (Irish typeface) fonts pre-generated at 300dpi for use with (La)TeX. More Info: here.
Ftp Directory This is a list of homophones in "General American English", based on the book HANDBOOK OF HOMOPHONES by William Cameron Townsend, 1975. The list contains words that sound the same (or very nearly the same) but are spelled differently. It occasionally includes spelling variants of the same word when there is another word in the same entry; the only difference between "homophones" and "spelling variant" is whether or not the words are lexically "the same". The list also contains a few common proper names. This list of homophones was provided by Evan Antworth from the Summer Institute of Linguistics. More Info: here.
Ftp Directory This is the "hum" concordance and textual analysis package done by Bill Tuthill when he was at Berkeley (1981). A package of programs for literary and linguistic computing, emphasizing the preparation of concordances and supporting documents. Both keyword in context and keyword and line generators are provided, as well as exclusion routines, a reverse concordance module, formatting programs, a dictionary maker, and lemmatization facilities. There are also word, character, and digraph frequency counting programs, word length tabulation routines, a cross reference generator, and other related utilities. The programs are written in the C programming language. More Info: here.
Ftp Directory Interbas is a natural language front end for relational databases in DBase compatible format. The program was a finalist in the "Software in Europe" contest at CeBIT 93 in Hanover, Germany. This demo contains both an English and a Russian version of the InterBase system; additionally there are several demo linguistic processors for various applied software systems. More Info: here.
Ftp Directory
The IT ('eye-tee' for Interlinear Text) software is a set of tools
from SIL [in executable MSDOS binary code] that are for developing a
corpus of annotated interlinear text -- for what linguists, literary
scholars, translators, and anthropologists call the "glossing" of text.
Primary among these tools is `itp'. The interlinear text file produced
using itp is a clean ASCII file which is accessible by other text
processing software for purposes such as concordancing, indexing, or
display formatting. In addition to itp, the IT package includes a
collection of other software tools which support the conversion of
conventional texts to interlinear text format and which support the
maintenance of the auxiliary lexical database files. IT views text as
a sequence of text units, each of which contains a text line plus a
multidimensional set of annotations entered according to a model
provided by the analyst. In addition to word and morpheme level
annotations, the IT system supports freeform annotations of the whole
text unit, such as translations. More Info: here.
Ftp Directory Italian Plus! (authored by Gene Hayworth) is a tutorial and testing program, divided into three sections: Vocabulary Review, Vocabulary Exercises, and Verb Conjugation Exercises. The demo version, available for evaluation purposes only, contains approximately 35 words; the full version includes a combination of more than 800 nouns, adjectives, and commonly used verbs, with their conjugations in four tenses. Accent marks have not been included in order to make this program compatible with as many systems as possible. The program will run from either a floppy disk or a hard drive when installed in a single directory. More Info: here.
Ftp Directory The IWL is a lexicon of Italian part of speech tagged words, created by Professor Rodolfo Delmonte at the Universita degli Studi di Venezia, Instituto di Linguistica e Didattica delle Lingue, at the Laboratorio di Linguistica Computazionale. The IWL was derived from a 500,000 word corpus. Broad based texts were used, and a vocabulary was culled which represents the most frequently used Italian words in written text today. The lexicon is encumbered, and there is a fee of $300.00 for academic use and $500.00 for corporate research use of this item. The license agreement, a sample excerpted from the list, and a readme file that lists the tags being used can all be found in the directory listed above. More Info: here.
Ftp Directory This is a package for printing Indian language script text. It only does the transliteration mapping: each letter in an Indian language is assigned an English equivalent, but the actual printout is in an Indian language script. Font support is available for two versions of the Devanagari script (one of which was developed by Frans Velthuis), as well as for Tamil, Telugu and Bengali. The preferred input interface is TeX, though a dumb textual interface is available for the PostScript Devanagari font. More Info: here.
Ftp Directory This is a Japanese/German dictionary entered by Helmut Goldenstein from the book "Langenscheidts Lehrbuch und Lexikon der Jap. Schrft", author Wolfgang Hadamitzky. It contains 11627 Japanese words and 22000 German translations. This dictionary may not be available after December of 1992, so please make SURE you read the documentation file that comes with it. More Info: here.
Ftp Directory This is a Japanese morphological dictionary with a search program included for accessing it. The documentation is all in Japanese. More Info: here.
Ftp Directory These vocabulary lists were copied from the Vocabulary Summary at the back of Mangajin magazine by Lars Huttar. The words listed have three fields; the Kanji written form, the Hiragana pronunciation, and the English definition. There are 22 lists of about 100 words each. The lists were originally created to be used with vocabulary drilling software. More Info: here.
Ftp Directory Jkwic is a simple Key Word In Context program that has the capability of working with EUC Japanese. It allows both simple keyword searches and limited regular expression specification of the keyword. More Info: here.
Ftp Directory Juman is a program which segments Japanese into words and tags these words with parts of speech. It was produced at Kyoto University and then heavily modified by researchers at MCC. The tables used for tagging are generated by a Prolog program, but the program which actually does the tokenizing and tagging is written in C, so users do not need a working Prolog implementation if they just want to use Juman. More Info: here.
Ftp Directory Juman-1.0 is the newest version of the segmenter-tagger for Japanese. It includes the JUMAN-MCC version. In addition, it provides a 120K dictionary. Other minor changes have been made so that the software is more convenient to use. More Info: here
Ftp Directory JWP is a Japanese word processor for MS Windows, release V1.1, by Stephen Chung. It supports all Windows printers with fully scalable fonts (i.e., TrueType). The program comes complete with source code, and the ability to use the Japanese - English freeware dictionary EDICT. Among its features are dynamic and user-defined kana->kanji conversion, kanji information such as stroke count and bushu, entry of JIS characters through JIS code or a table, and most of the standard Windows word processor features. More Info: here.
Ftp Directory These are frequency-sorted lists of kanji appearing in a sample of about 30,000 articles from the USENET group "fj". One file is for all kanji, and lists frequency rank, JIS code in hexadecimal, the kanji, the number of times it appeared in the sample, and the percentage of the kanji covered by this and all previously listed kanji. The second file sorts frequency of kanji compounds and lists the top 1000 most frequent kanji strings and the top 500 2-character kanji compounds. The software to process text and extract kanji frequencies is included. These files were created by Tim Burness, and the work was supported by TWICS Co., Ltd. More Info:/KANJI.freq.
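The columns of the first file described above (rank, kanji, count, and cumulative percentage covered) can be reproduced with a short sketch. This is not the software included in the directory. It makes two simplifying assumptions that should be read as such: kanji are detected by membership in the Unicode CJK Unified Ideographs block, and the JIS code column is left out.

```python
from collections import Counter


def is_kanji(ch):
    """Assumption: treat the CJK Unified Ideographs block as 'kanji'."""
    return "\u4e00" <= ch <= "\u9fff"


def kanji_frequency_table(text):
    """Return rows of (rank, kanji, count, cumulative percent),
    sorted from most to least frequent."""
    counts = Counter(ch for ch in text if is_kanji(ch))
    total = sum(counts.values())
    rows = []
    running = 0
    for rank, (kanji, n) in enumerate(counts.most_common(), start=1):
        running += n
        rows.append((rank, kanji, n, 100.0 * running / total))
    return rows


if __name__ == "__main__":
    for rank, kanji, n, coverage in kanji_frequency_table("日本日本語です"):
        print(rank, kanji, n, f"{coverage:.1f}%")
```

The last column answers the practical question behind such lists: how many of the most frequent kanji one must know to cover a given share of the kanji in running text.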
Ftp Directory KGEN is an auxiliary program for PC-KIMMO. PC-KIMMO is a program for doing computational phonology and morphology. KGEN is typically used to build morphological parsers for natural language processing systems. The KGEN program which this document describes will be of very little use to you without the PC-KIMMO program and book. The PC-KIMMO software is available for MS-DOS (IBM PCs and compatibles), Macintosh, and UNIX. The book is "PC-KIMMO: a two-level processor for morphological analysis" by Evan L. Antworth, published by the Summer Institute of Linguistics (1990). More Info: here. The book (including software) is available from: International Academic Bookstore, 7500 W. Camp Wisdom Road, Dallas TX, 75236 U.S.A. (phone 214/709-2404, fax 214/709-2433).
Ftp Directory KIT-FAST is an experimental German -> English machine translation system. It was developed in Quintus Prolog 3.1.4, under SunOS. The MT system is made available together with linguistic data for German and English (grammars and lexicons). The Prolog source code is included. There is an on-line documentation system; there is also an installation and users manual in German. Within KIT-FAST there is a knowledge representation system called BACK, and tools for linguistic data development called CPSG tools and CPSG parser (for CPS grammars). KIT-FAST was developed by the Technical University of Berlin, Department of Software and Theoretical Computer Science, Berlin, Germany, under Professor Wilhelm Weisweber. More Info: here.
Ftp Directory This is a Japanese parser which detects either a dependency structure or a case structure. For the dependency structure, a set of heuristic rules is used to detect the unique structure of a sentence. For the case structure, the analysis is performed on a sample set of sentences found in the case-frame dictionary. The current case-frame dictionary contains fewer than 1,000 verbs. This parser makes use of gmake, gcc, and juman. The authors of KN-parser are Profs. Sadao Kurohashi and Makoto Nagao, Kyoto University. More Info: here.
Ftp Directory 21word is a Korean word processor running on IBM PC clones. According to the original poster, it supports VGA, SVGA (Trident and ET3000? only) and Hercules mono graphics. Warning: never specify as the swap directory a directory that already contains files, because the program wipes out everything in the directory specified as the swap directory. It should be comparable to commercially available word processors. More Info: here.
Ftp Directory KTEXT is a text processing program that uses the PC-KIMMO parser (see info about PC-KIMMO). KTEXT reads a text from a disk file, parses each word, and writes the results to a new disk file. This new file is in the form of a structured text file where each word of the original text is represented as a database record composed of several fields. More Info: here.
Ftp Directory The Linguistic DataBase (LDB) is a database program created by the TOSCA corpus linguistics group at Nijmegen University for the storage and exploration of syntactically analysed texts. It features a tree viewer and an extensive query language. It was designed on the basis of the Nijmegen 130,000 word English corpus, also available. This MSDOS LDB demo program demonstrates some of the features of the Linguistic DataBase program. More Info: here.
Ftp Directory This guide was prepared by Dr. Bob Krovetz at the University of Massachusetts, Dept. of Computer Science. This is a guide to the errors which Dr. Krovetz found in the machine-readable version of the Longman Dictionary of Contemporary English (LDOCE). The guide refers to the first edition of the dictionary, and in particular the "lisp" version. The guide has a section on translation errors, errors which resulted from the conversion of the original tape into Lisp s-expressions. Another section has listings of errors in particular fields: part-of-speech, subject codes, selectional restrictions, definitions, and run-ons. The pronunciation field and the grammar codes are not examined, nor are errors identified with the box codes on morphology. More Info: here.
Ftp Directory The LHIP parser (Left-Head corner Island Parser) was developed by Afzal Ballim, at ISSCO, the University of Geneva. LHIP is a system for incremental grammar development using an extended DCG formalism. The system uses a robust island-based parsing method controlled by user-defined performance thresholds which allows it to analyse what it can from the input, thus presenting the grammar developer with results at an early stage. The rules themselves are an extended version of the DCG rules, allowing optional constituents, negation, disjunction, the specification of adjacency, and the ability to mark multiple heads in a rule body. The latest version is 1.1. The LHIP system requires an Edinburgh style Prolog. More Info: here.
Ftp Directory This directory contains a system for parsing English. The parser is based on Link Grammar, a context-free formalism for the description of natural language also designed by LINK's authors (Daniel Sleator, Carnegie Mellon University, and Davy Temperley, Columbia University). It is a lexical system, where each word has a combinatorial formula representing all the ways in which that word can be correctly used in a sentence. A sentence is grammatical if links can be drawn above the words in such a way that (1) each word's combinatorial requirements are satisfied, (2) the links do not cross, and (3) the graph of links and words is connected. The system consists of a parser which reads in a link grammar (words and their corresponding formulas) and parses sentences according to the given grammar. This system also includes a Link Grammar for English. This grammar has roughly 700 definitions and 25000 words, and captures many phenomena of English grammar, such as noun-verb agreement, questions, imperatives, complex and irregular verbs, different types of nouns, past or present participles in noun phrases, commas, a variety of adjective types, prepositions, adverbs, relative clauses, possessives, etc. More Info: here.
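Conditions (2) and (3) above can be checked mechanically once a set of links is given. The following Python sketch is an illustration only and is not taken from the LINK distribution: the function names are hypothetical, words are identified by their position in the sentence, and condition (1), which depends on each word's formula in the grammar, is not checked.

```python
def links_cross(a, b):
    """True if two links, drawn as arcs above the words, would cross."""
    (i, j), (k, l) = sorted(a), sorted(b)
    # Arcs cross when exactly one endpoint of one lies strictly inside the other.
    return i < k < j < l or k < i < l < j


def is_planar(links):
    """Condition (2): no pair of links crosses."""
    return not any(
        links_cross(a, b)
        for n, a in enumerate(links)
        for b in links[n + 1:]
    )


def is_connected(num_words, links):
    """Condition (3): every word is reachable from word 0 through links."""
    if num_words == 0:
        return True
    adjacency = {w: set() for w in range(num_words)}
    for i, j in links:
        adjacency[i].add(j)
        adjacency[j].add(i)
    seen, stack = {0}, [0]
    while stack:
        word = stack.pop()
        for neighbour in adjacency[word] - seen:
            seen.add(neighbour)
            stack.append(neighbour)
    return len(seen) == num_words


if __name__ == "__main__":
    # "the cat chased a mouse": word positions 0..4, links chosen by hand.
    links = [(0, 1), (1, 2), (2, 4), (3, 4)]
    print(is_planar(links), is_connected(5, links))
```

A real parser searches for a set of links satisfying all three conditions at once; this sketch only verifies a candidate linkage after the fact.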
Ftp Directory LQ-TEXT searches previously indexed text for phrases. A browser and a program to generate keyword-in-context style lists are also included. The software is primarily designed for Unix systems. The necessary indexing program (lqaddfile) is enclosed. Indexes are usually smaller than the data, and sometimes half its size. There is a browser (lqtext) for System V, and a shell script (lq) for any Unix system. There is also a program (lqkwik) that turns the output of lqphrase or "lqword -l" into a keyword-in-context style list. More Info: here.
Male and Female Names List Ftp Directory A file of almost 3000 male names and 4967 female names compiled by Mark Kantrowitz. Copyright is contained in the names.README.Z file. More Info: here.
Ftp Directory MacLex is a program for field linguists that manages lexicon/dictionary files of a specified format. It supports editing, find/change, user-defined sorting order, and reversals. It is written and supported by Bruce Waters of SIL (Summer Institute of Linguistics). More Info: here.
Ftp Directory This is a dictionary system suitable for use in a natural language parsing system. In addition to the basic dictionary, a morphological analysis system can be used to cope with inflections and derivations of words in the lexicon. This system allows the user to write dictionaries and analyzers in the language of their choice, but the system includes an example of British English. This system was designed by Graeme Ritchie, Steve Pulman, Graham Russell and Alan Black. More Info: here.
Ftp Directory MORFOGEN is a morphological rule compiler and dictionary interface tool which consists of a finite state compiler (that converts inflectional and derivational paradigms into a finite state machine) and a recognizer, which accepts inflected forms as input and returns base forms (constrained by inflection class information in the lexicon) as well as any morphemes that matched during analysis. MORFOGEN can handle concatenative as well as non-concatenative morphology, and can be customized, for use on languages of inflecting as well as agglutinating types. The demo program provides executables for Sun4 OS 4.1.1. More Info: here.
Ftp Directory This second version of the MRC Psycholinguistic Database is a computer usable dictionary (MRD) created from a large online database originally used in psycholinguistic research. The text comes complete with a suite of UNIX/C retrieval tools, but can be processed on any machine under any operating system. The file contains 150837 words and provides information about 26 different linguistic properties, although information about every property is not available for every word. No semantic information is included. Linguistic properties include: number of letters, phonemes, and syllables; measures of frequency; pronunciation; measures of familiarity; part of speech; etc. The original compilation research was conducted by Professor Max Coltheart under a Medical Research Council grant, and the resulting dictionary was documented and corrected by Mike Wilson. More Info: here.
Ftp Directory MTran is an MS project by Douglas Witmer (University of Texas at Arlington) that analyzes 5 different linguistic theories with regard to machine translation. The MTran program was written to implement a proposed fixed-text machine translation technique, arguing two primary points: much of machine translation work can be done by a person fluent in only one language, and analysis of text only needs to be done once during the translation of text into multiple languages. MTran is written in the ICON language, and ICON interpreters for 8088 and 80386 architectures are provided. More Info: here.
Ftp Directory These files were gathered for the Fifth Message Understanding Conference participants, and later released to all CLR members. Files include a nationalities file of 216 countries with noun and adjective forms of the nationalities, and two files on organization names, one listing UN organizations and the other listing 187 international organizations. The BBN corporation provided corporation names in English and Japanese, Japanese human names and place names, and a lexicon of Japanese words in the business domain. More Info: here.
Ftp Directory MY Russian translation program translates in a fully automatic mode or a semi-automatic mode which allows interactive correction for new words. It allows updating of the dictionaries; "teaching" the system new words and context sensitive translation; and editing of text while in the process of translating. The program was originally designed for college students. The My Russian translation program was written by Yuri Yulaev. More Info: here.
Ftp Directory NJStar is a Japanese word processor for PC's with a Wordperfect-like feel. It supports the input, display and printing of Japanese characters; JIS, EUC, and NEC-JIS. NJStar has pull-down menus, cut and paste, macros, built-in printing and drivers, multiple file editing, and configurable keys. Version 3.j is still offered as shareware, but please honor the copyright and pay for a registered version. NJStar was created by Hongbo Ni, and comes complete with extras like fonts and dictionaries. More Info: here.
Ftp Directory The NIST NCSL OSE SGML package is an SGML parser and validation suite for text into which tags in SGML format have been inserted. Currently no documentation other than comments in the code is available. The C source code is supplied in this package, which is designed primarily for Unix systems. More Info: here.
Ftp Directory Oed2/Ox2 is a package of utilities used to manage network access to the Oxford English Dictionary at the Waterloo Center for OED research for online lookup of words, definitions, examples, or other patterns. [Oed2/Ox2 is a front-end to the Pat pattern searching applied to the Oxford English Dictionary Version 2. Version 2 of the OED comprises the merged Version 1 dictionary and the Supplement.] Pat combines very fast search capabilities over a very large text file with an awkward user interface. The oed2/ox2 program knows about the structure of the dictionary file. At the request of Oxford University Press, this program should be installed as ox2 (not oed2) at non-UIUC sites. Because oed2/ox2 is a network resource, the oed2/ox2 program can be compiled to use a remote Pat server. In addition, a modified telnet server is supplied for remote network access. More Info: here
Ftp Directory The On-line Dictionary of Computing is a dictionary of programming languages, architectures, operating systems, networking, theory, mathematics, telecoms, acronyms, jargon, projects, history, in fact anything to do with computing. It was compiled by Denis Howe, from the Theory and Formal Methods section of the Department of Computing at Imperial College of Science, Technology and Medicine, London. More Info: here.
Oxford Advanced Learners Dictionary of Current English Ftp Directory The OALDCE is a well known dictionary of over 35,000 headwords. This version was prepared and documented by Roger Mitton of Birkbeck College, University of London, and is often referred to as the "computer-usable" version. The dictionary contains no definitions; the spelling, pronunciation, and syntactic information of the original are used. In addition to the headwords and subentries of the original, this version is extended to include about 2,500 proper names, and a section of over 68,000 derived inflected forms. More Info: here.
Pan American Health Organization Ftp Directory The Pan American Health Organization (PAHO), Conferences and General Services Division, has kindly allowed this group of sample parallel texts to be released for nlp research purposes. There are 180 pairs of text, 360 individual files, which amount to about 8 Mb of data. The documents cover the general domains of Public Health and Latin America, but vary greatly in content and in length. Some are short memos or letters, most are longer reports and conference proceedings. The Spanish documents do contain the Spanish character encoding. Other formatting commands, such as tabs, centering, italicizing, etc. have been removed. Special thanks to Dr. Marjorie Leon for her assistance in making these texts available.
Ftp Directory PC-KIMMO is a new implementation for microcomputers of a program dubbed KIMMO after its inventor Kimmo Koskenniemi (see Koskenniemi 1983). It is of interest to computational linguists, descriptive linguists, and those developing natural language processing systems. The program is designed to generate (produce) and/or recognize (parse) words using a two-level model of word structure, in which a word is represented as a correspondence between its lexical-level form (components) and its surface-level form (the way it is written). More Info: here
Ftp Directory The Pleuk grammar, published by the University of Edinburgh's Centre for Cognitive Science, is a shell for grammar development. Many different grammatical formalisms can be embedded within it, including Cfg, HPSG-PL, Mike, SLE, and Term. Sample grammars are provided for these formalisms. Pleuk is being made available by its authors in the hope that it will provide a set of facilities for the production of new grammar formalisms. Pleuk provides both an operating system in which to develop grammars as well as a uniform environment in which grammar writers and testers can situate implementations of a variety of grammar formalisms. The system offers a way of describing the modules of a grammar formalism and defining operations over files and the objects described in them, such as compilation, editing and display. Pleuk uses a standard Prolog environment and takes advantage of functionalities like high quality graphs or menu-based input when these are available. More Info: here.
Ftp Directory PM-TeX provides a simple macro-based approach to typeset Chinese, Japanese, and Korean text using either LaTeX or TeX. It comes with programs that generate MetaFont fonts for these scripts from existing bitmap fonts. A number of conversion programs to convert the Chinese, Japanese, or Korean text into a form that PM-TeX can use are included. PM-TeX works with almost any (La)TeX system on almost any platform. More Info: here.
Ftp Directory These are some of the libraries and programs from the DEC-10 Prolog Library as well as some other interesting Prolog programs. The list is long, so please consult the info page. More Info: here.
Ftp Directory This is a program (designed by Archibald Michiels and Jacques Noel, Universite de Liege) that reads the Collins CoBuild English Language Dictionary (Harper-Collins) and renders it so that its information is more accessible. The information that is kept includes: the lexeme with the reading number, the headword with morphological variants, grammar information, the definition, and examples. The result is a list of awk records, each separated by a blank line. More Info: here.
Ftp Directory This is a list of around 114,000 words taken from the 170 million word Cobuild "Bank of English" Corpus. The list was created from words which appear predominantly in uppercase, and are therefore good candidates for proper nouns. The file has three columns: the word, its frequency with initial capital letter, and its frequency with lowercase initial letter. The files were prepared by Jem Clear, and are freely available except for commercial uses. More Info: here.
Ftp Directory This thesaurus is an electronic version of the edition of Roget's classic Thesaurus published in 1911 by the Crowell company. The large number of English words is subdivided by sense into a series of large antonym groups and into subgroups, indexed by subjects whose tags appear in the outline. (Alphabetical indexing is not provided as it remains under copyright; most of the original printer's effects such as italics for borrowings appear.) More Info: here.
Ftp Directory Rook is a system for authoring descriptive grammars in HyperCard for the Macintosh. It is a tool for interactively and incrementally developing a grammar description based on an interlinear text corpus (as produced by SIL's IT program). The resulting on-line descriptive grammar exploits the capacity of the computer to provide instant access to cross-referenced topics, text examples, explanations of morpheme glosses, and so on. It was designed by J. Randolph Valentine. More Info: here.
Ftp Directory Russian English On-Line Dictionary is a memory-resident program for the MS-DOS operating system developed by Leon Ungier. Version 1.25 is freely available. You cannot add items to the main dictionary, but a Personal Dictionary Manager allows the creation of your own dictionary files. More Info: here.
Ftp Directory SAX (Sequential Analyzer for syntaX and semantics) is a syntactic analyzer based on logic programming. SAX employs a bottom-up and breadth-first parsing algorithm. The SAX grammar rules are basically written in Definite Clause Grammar (DCG). The SAX grammar rules are translated into a parsing program written in Prolog. SAX is implemented in SICStus Prolog Ver 0.7. Included with this system is a Japanese grammar and some sample Japanese data. More Info: here.
Ftp Directory These files were made available by Rebecca Bruce, at New Mexico State University, from work done by herself and Dr. Jan Wiebe. The data file is composed of sentences containing the noun "interest" or "interests" that were automatically extracted from the Penn Treebank Wall Street Journal corpus. The file includes the part-of-speech tags and phrase bracketing provided in the original corpus. Each sentence in the data file contains one sense-tagged occurrence of the word "interest" (or "interests"). The sense tags correspond to the six non-idiomatic noun senses of "interest" defined in the first edition of Longman's Dictionary of Contemporary English. In total, there are 2,369 sentences. More Info: here.
Ftp Directory The sgml2latex system is a set of SGML document type definitions for the LaTeX document styles (articles, books, reports, letters, slides), for BibTeX bibliographies and for Unix manual pages, a set of programs for doing the translation from SGML to LaTeX or troff/nroff, and a program for extracting source code from documentation, providing a simple "literate programming" facility. To use the 'qwertz' documentation system, the 'sgmls' SGML parser is required (also available from the CLR). More Info: here
Ftp Directory Sgmls is an SGML parser derived from the ARCSGML parser materials which were written by Charles Goldfarb. It works on Unix, MS-DOS and VAX/VMS. It should be straightforward to port to most systems that provide ANSI C and use an ASCII-based character set. It outputs a simple, easily parsed, line oriented, ASCII representation of an SGML document's Element Structure Information Set (see pp 588-593 of ``The SGML Handbook''). It is intended to be used as the front end for structure-controlled SGML applications. For compatibility with the Amsterdam SGML Parser (ASP), there is also a filter that translates the output of sgmls using an ASP replacement file. More Info: here.
Ftp Directory This is a Spanish-English translation system demo written by John L. Beaven, at the University of Edinburgh. Other contributors are Guy Barry, Robin Cooper, Mark Johnson, and Chris Mellish. The system exploits recent advances in lexicalist unification-based grammar theories. The system provides greater modularity of the monolingual components. The approach is demonstrated by presenting very different Unification Categorial Grammars for small fragments of English and Spanish; the grammars contain linguistically interesting phenomena such as word order variation and clitic placement. The monolingual grammars are put into correspondence by means of a bilingual lexicon. More Info: here.
Ftp Directory SHOEBOX is a database management program, designed expressly to meet the needs of the field linguist. Using SHOEBOX, the linguist can easily enter, edit, and analyze lexical, textual, anthropological and other types of data in multiple datafiles. For example, with SHOEBOX the linguist can: maintain a simple dictionary, or a more complex lexicon; interlinearize text, where new words are automatically entered into the dictionary; do grammatical filing and analysis of text data; enter and file cultural notes; and maintain nonlinguistic types of databases, such as address lists or library catalogs. More Info: here.
Ftp Directory The SICM is a manual which defines the classification of economic activities for the production of Federal economic statistics. Economic activities are classed under 99 major headings, each with a potential 99 subheadings. Thus manufacture of bulletproof vests is in class 3842 -- Orthopedic, Prosthetic, and Surgical Appliances and Supplies which is in group 384 -- Surgical, Medical, and Dental Instruments and Supplies which, in turn, is in major group 38 -- Measuring, Analyzing, and Controlling Instruments; Photography, Medical and Optical Goods; Watches and Clocks. The manual is used to provide classifications for a variety of computer applications. For example companies maintaining mailing lists may classify the organizations on the list by SIC code and use this to target or specialize the type of mail sent to the organization. The manual is available in machine readable form and is an interesting lexical resource in its own right. More Info: here.
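In the bulletproof-vest example above, the class, group, and major group are nested prefixes of the same four-digit code (3842, 384, 38). A minimal Python sketch of that decomposition, and of the mailing-list use the entry mentions, is given below; the function names and the sample organizations are hypothetical and are not part of the manual.

```python
from collections import defaultdict


def sic_levels(code):
    """Split a four-digit SIC code into its three nested levels."""
    code = str(code)
    if len(code) != 4 or not code.isdigit():
        raise ValueError("expected a four-digit SIC code, got %r" % code)
    return {
        "major_group": code[:2],  # e.g. 38
        "group": code[:3],        # e.g. 384
        "class": code,            # e.g. 3842
    }


def by_major_group(organizations):
    """Group (name, SIC code) pairs by major group, e.g. to target mailings."""
    grouped = defaultdict(list)
    for name, code in organizations:
        grouped[sic_levels(code)["major_group"]].append(name)
    return dict(grouped)


if __name__ == "__main__":
    print(sic_levels("3842"))
    # Hypothetical mailing list entries, used only for illustration.
    print(by_major_group([("Vest Maker Inc.", "3842"), ("Clock Works", "3873")]))
```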
Ftp Directory The SUSANNE Corpus comprises an approximately 128000-word subset of the Brown Corpus of American English, annotated in accordance with the SUSANNE scheme. The SUSANNE scheme attempts to provide a method of representing all aspects of English grammar which are sufficiently definite to be susceptible of formal annotation, with the categories and boundaries between categories specified in sufficient detail that, ideally, two analysts independently annotating the same text and referring to the same scheme must produce the same structural analysis. More Info: here.
Ftp Directory Names are a key entry point for researching and indexing historical information. However, the format and spelling of personal names vary greatly from one institution to the next, reflecting traditional differences in practice. If historical information is to be compiled or shared, matching different versions of personal names is a necessity. The computer program SYNONAME automatically matches many possible forms of a single personal name by using an ordered sequence of twelve algorithms for pattern matching that include both character- and word-matching techniques. The matched pairs of names are considered to be "candidate matches" until confirmed by a human name-authority editor. Run against a merged file of artists' names from museum collections data, the program performed with an accuracy rate of 97.4% and an optimum efficiency rate of 90.8%. Accuracy can increase to nearly 99% at the expense of some efficiency. The concepts behind the algorithms and their implementation may be useful to others merging data in different contexts. More Info: here.
Ftp Directory SYNTACTICA is a software application tool designed for use in introductory syntax classes, or introductory linguistics classes with a syntax component. SYNTACTICA presents a simple graphical interface for creating grammars and for viewing and transforming the structures that they assign to natural language sentences. Using SYNTACTICA, it is possible to construct a grammar consisting of a set of context-free phrase structure rules and (typically) a lexicon. This grammar is loaded into a TreeViewer window, which generates phrase-markers for input sentences on the basis of the grammar that has been loaded. SYNTACTICA is a production of the SUNY-Stony Brook Semantics Lab, and was developed by Richard Larson under a grant from The National Science Foundation. It runs on the NextStep operating system. More Info: here.
Ftp Directory TACT is an interactive full-text retrieval system for MS-DOS with a number of analytical tools. Like others of its kind, TACT retrieves segments of text according to specified word forms. In addition, it can find words or character-strings that match criteria the user specifies. TACT generates simple graphs to show the distribution of forms throughout an entire text, or within various structural divisions determined by the user. TACT also allows retrieval by metatextual `categories'. TACT was designed by John Bradley and Lidio Presutti at the University of Toronto. More Info: here.
TAGGER.v1.12 Ftp Directory This is a part-of-speech tagger designed by Eric Brill at MIT. The tagger can be trained to tag, or an already trained tagger for English can be used. The trainer uses a two-stage process: in the first stage the tagger learns rules for tagging unknown words; in the second, the rules learned involve the use of contextual cues to improve tagging accuracy. The trained tagger also uses a two-stage process. It assigns most likely tags to every word in isolation. In the second stage, contextual transformations are used to improve accuracy. More Info: here. This tagger has been updated to version 1.12 and is found in the same ftp directory as is its original version.
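The two-stage tagging process described above can be sketched in a few lines of Python. This is a toy illustration and not the tagger's own code: the lexicon, the two rules, the rule format (change a tag when the previous word has a given tag), and the default tag for unknown words are all assumptions made for the example.

```python
# Hypothetical most-likely-tag lexicon (stage one).
MOST_LIKELY = {"the": "DT", "can": "MD", "rusted": "VBD", "to": "TO", "run": "NN"}

# Hypothetical contextual rules (stage two): (from_tag, to_tag, previous_tag).
RULES = [
    ("MD", "NN", "DT"),  # a modal right after a determiner is retagged as a noun
    ("NN", "VB", "TO"),  # a noun right after "to" is retagged as a verb
]


def tag(words, lexicon=MOST_LIKELY, rules=RULES, default="NN"):
    """Tag words in isolation, then apply contextual transformations in order."""
    # Stage one: each word gets its most likely tag, ignoring context.
    tags = [lexicon.get(word.lower(), default) for word in words]
    # Stage two: each rule is applied across the whole sentence in turn.
    for from_tag, to_tag, previous_tag in rules:
        for i in range(1, len(tags)):
            if tags[i] == from_tag and tags[i - 1] == previous_tag:
                tags[i] = to_tag
    return list(zip(words, tags))


if __name__ == "__main__":
    print(tag(["The", "can", "rusted"]))
    print(tag(["to", "run"]))
```

In a trained tagger both the lexicon and the ordered rule list would be learned from a tagged corpus rather than written by hand.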
Ftp Directory PCTAMIL is a transliteration tool cum previewer which transliterates Tamil documents written in Roman text into Tamil script; it is compatible with LaTeX. Included is a font driver that produces the bilingual ASCII setup for both English and Tamil fonts. PULAVAN - Tamil Pundit - is a set of programs for learning Tamil and includes a program for verb conjugations in Tamil. TAGTAMIL is a part of speech tagger; it is a morphological processor which can output the root form of an input word and provide suitable tags for affixes. The dictionary has a lexicon of 1000 words at this time. Programs developed by Vasu Renganathan at the University of Washington. More info: here.
Ftp Directory TIMIT is a database of 6,100 English words with their most likely pronunciation. Each entry is made up of two lines: the first line has the word number followed by the spelling of the word, and the second contains the transcription of the word using the set of 61 TIMIT phones. This word list was provided by Chuck Wooters from ICSI at Berkeley. More Info: here.
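The two-line entry layout described above can be read with a short Python sketch. The exact delimiters in the distributed file are not stated here, so whitespace-separated fields are an assumption, the function name is hypothetical, and the sample lines are invented for illustration rather than copied from the word list.

```python
def parse_entries(lines):
    """Pair up lines: 'number spelling' followed by a line of phones."""
    # Drop blank lines so that entries can be paired by position.
    lines = [line.strip() for line in lines if line.strip()]
    entries = []
    for header, phones in zip(lines[0::2], lines[1::2]):
        number, spelling = header.split(maxsplit=1)
        entries.append(
            {"number": int(number), "word": spelling, "phones": phones.split()}
        )
    return entries


if __name__ == "__main__":
    # Invented sample in the assumed layout.
    sample = ["1 she", "sh iy", "2 had", "hv ae dcl d"]
    for entry in parse_entries(sample):
        print(entry)
```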
Ftp Directory This is a two-level phonological analyzer based on the system described in S.G. Pulman and M.R. Hepple, "A Feature Based Formalism for Two Level Phonology: a description and implementation." More Info: here.
Ftp Directory These texts were kindly given to CLR by Dr. Kemal Oflazer of Bilkent University; some of them will be available in the European Corpus Initiative project's CD. One file contains a news feed from the Anatolian News Agency from September of 1992. The other file has miscellaneous pieces from popular publications. More Info: here.
Ftp Directory UBS is a formal language which allows users to specify HPSG grammars. It is an extension of SEPIA Prolog that was developed to accommodate those aspects of the grammar formalism HPSG which cannot be implemented in regular Prolog in a straightforward manner. UBS is able to process typed feature structures over type hierarchy trees, unification, disjunction, negation, functional dependent values (relations) and sets (unification of sets). The parser/generator includes as data the HPSG grammar for English. UBS is by Frieder Stolzenburg from the Universitaet Koblenz-Landau. More info: here.
Ftp Directory lexica/UMLS/samples The Unified Medical Language System project is sponsored by the National Library of Medicine at the Department of Health and Human Services. UMLS has developed two machine readable "Knowledge Sources" of medical lexicons: a Metathesaurus and a Semantic Network. The CLR archive does not house the data and software (currently over 500 Mb of information); rather, this directory contains descriptive fact sheets on the project, sample files, complete copies of the program documentation, and instructions for the license agreement letter. The entire database is still available for no charge at this time. The info file contains a few citations by linguists who have worked with the data. More Info: here.
Ftp Directory The morphology package contains a large morphological database, an X-window based maintenance program, and C and Lisp hooks for interfacing the database to other software programs. The database itself consists of approximately 316,000 inflected items, along with their root forms and inflectional information (such as case, number, mode). There are 13 parts of speech - Noun, Proper Noun, Pronoun, Verb, Verb Particle, Adverb, Adjective, Preposition, Complementizer, Determiner, Conjunction, Interjection, and Noun/Verb Contraction. Nouns and Verbs are the largest categories, with approximately 213,000 and 46,500 inflected forms, respectively. The access time for a given inflected entry is .6 msec. The maintenance program runs under the X-window interface, and allows the user to customize the database to their needs. An inexperienced user can easily add, delete, or modify entries in the existing database, and a person passingly familiar with X windows and C array structures can customize the package for a different language, building their own database with different parts of speech and/or inflectional information. There are C and Lisp functions that provide hooks to allow developers to incorporate the database into existing research projects. The entire package requires about 25-30 M of space. More Info: here.
Ftp Directory Verbalist is a program to demonstrate English verb forms, written by John and Muriel Higgins. It is an MS DOS, or MS Windows, application which conjugates English verbs. It contains a dictionary with all the irregular verbs in English, and a sampling (300) of the other verbs. The dictionary is extensible and the verb forms covered are extensive. More Info:/VERBALIST
Ftp Directory These software packages are produced by the Vietnamese Professionals Society. VPSedit is a Vietnamese text editor for the PC, and VPSwin is the Windows version. VPSwin has a spell checker and a hoi/nga lookup table. There are also three font files included; VPSfont1 is a True Type font set for Windows with 8 True Type fonts. More Info: here.
Ftp Directory This directory includes a variety of tools and information for processing Vietnamese text. The tools are fonts, a text converter, and a text printer. A proposal for the Vietnamese Standard for Information Interchange is also included. More Info: here.
Ftp Directory This program is authored and copyright protected by Ari Hovila and Jari Perkimki, University of Vaasa. WLIST is a language-independent word length and word frequency counter, a statistical tool for any language user. The program can recognize all words in an ASCII file as well as count their occurrences. WLIST counts the lengths of all unique words as well as the average lengths of all words. Sorting is determined by the alphabet of the language you are working in. Sample sorting files are included for Finnish, Swedish, Norwegian, Danish, English, French and German. The user can create their own as needed. WLIST was compiled to run under DOS. More Info: here.
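The kind of counting and alphabet-driven sorting described above can be sketched in Python as follows. This is not WLIST itself: the function name is hypothetical, the alphabet string stands in for one of the sample sorting files whose actual format is not described here, and the tokenizer (runs of letters, lowercased) is an assumption.

```python
import re
from collections import Counter

# Assumed Finnish/Swedish collation order: å, ä, ö follow z.
FINNISH_ORDER = "abcdefghijklmnopqrstuvwxyzåäö"


def word_stats(text, alphabet=FINNISH_ORDER):
    """Count word frequencies and lengths, sorting by a user-supplied alphabet."""
    # Assumption: a word is a maximal run of letters; case is ignored.
    words = re.findall(r"[^\W\d_]+", text.lower())
    frequency = Counter(words)
    rank = {ch: i for i, ch in enumerate(alphabet)}

    def sort_key(word):
        # Characters missing from the alphabet sort after all known letters.
        return [rank.get(ch, len(alphabet)) for ch in word]

    unique_sorted = sorted(frequency, key=sort_key)
    return {
        "frequency": dict(frequency),
        "sorted_words": unique_sorted,
        "lengths": {word: len(word) for word in unique_sorted},
        "average_length": sum(map(len, words)) / len(words) if words else 0.0,
    }


if __name__ == "__main__":
    stats = word_stats("ääni åbo talo ääni")
    print(stats["sorted_words"], stats["average_length"])
```

Sorting by code point would place "ääni" before "åbo"; the explicit alphabet is what makes the order follow the language being processed.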
Ftp Directory These word lists are copies of a number of word lists that are freely available from a number of sites in Europe and North America. The origins of some of them are currently unknown, but are being checked. Current languages are English, Dutch, English (shorter list), German, Norwegian, Italian, and Swedish. Several word lists containing names are also available. More Info: here.
Ftp Directory The Oxford Text Archive word lists cover the following languages and categories: Australian; Chinese (only a list of the HanYu PinYin); Computer (various material, including common passwords, domains, etc.); Danish; Dutch; Finnish; French; German; Italian; Japanese (list of words in Romaji; see the edictj reference here); Literature (including various authors and genres); Movies & TV (including Monty Python and Star Trek word lists); Names (includes names in a number of languages and others); Norwegian; Place Names (including colleges, world factbook, zip codes, etc.); Random (includes various random sorts of word lists); Religion (includes Qur'an and King James Bible word lists); Science (includes asteroids and biology lists); Spanish; Swedish; Yiddish. More Info: here.
Ftp Directory A typical language survey may involve activities like determining linguistic relationships through the comparison of word lists, testing dialect intelligibility by playing back tape-recorded texts, and studying sociolinguistic aspects of language use and language attitudes in multilingual situations. WORDSURV is designed to aid the first of these areas: the collection and analysis of word lists. It functions in three main areas: (1) data entry and maintenance, (2) data analysis, and (3) data output. WORDSURV also supports specialized kinds of analysis, including lexicostatistics, phonostatistics, and comparative reconstruction. More Info: here.
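As a rough illustration of the lexicostatistic comparison mentioned for WORDSURV, the sketch below computes the percentage of shared glosses that two word lists assign to the same cognate set. It is not WORDSURV's algorithm or data format; the function name `cognate_percentage` and the gloss-to-label dictionaries are hypothetical.

```python
def cognate_percentage(list_a, list_b):
    """Percentage of glosses present in both lists whose cognate-set labels agree.

    Each list is a dict mapping a gloss (e.g. "hand") to a label naming the
    cognate set the analyst assigned to that word. The representation is assumed.
    """
    shared = [gloss for gloss in list_a if gloss in list_b]
    if not shared:
        return 0.0
    matches = sum(1 for gloss in shared if list_a[gloss] == list_b[gloss])
    return 100.0 * matches / len(shared)


# Hypothetical data for two dialects.
dialect_a = {"hand": "A", "water": "A", "stone": "B"}
dialect_b = {"hand": "A", "water": "B", "stone": "B", "fire": "A"}
score = cognate_percentage(dialect_a, dialect_b)
```

Glosses missing from either list are ignored here, which is one possible design choice; a survey tool could equally count them as non-matches.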
Ftp Directory This part-of-speech tagger, designed by Doug Cutting and Jan Pedersen at Xerox, is written in ANSI Common Lisp. It was developed in Franz Allegro Common Lisp version 4.1 on SunOS 4.x and in Macintosh Common Lisp 2.0p2. The distribution includes source code, a tokenizer for plain ASCII English, an English lexicon induced from the Brown corpus, a table mapping word suffixes to likely ambiguity classes, and an HMM trained on the odd-numbered sentences in the Brown corpus. More Info: here.
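To show how a lexicon and a suffix table of the kind listed above might be used together, here is a minimal Python sketch (Python rather than the tagger's own Common Lisp). The table contents, the default class, the longest-suffix-first rule, and the name `guess_ambiguity_class` are all assumptions for illustration and are not taken from the Xerox distribution.

```python
# Hypothetical suffix table using Brown-style tags; not the tagger's actual data.
SUFFIX_CLASSES = {
    "ing": frozenset({"VBG", "NN", "JJ"}),
    "ed": frozenset({"VBD", "VBN", "JJ"}),
    "ly": frozenset({"RB", "JJ"}),
    "s": frozenset({"NNS", "VBZ"}),
}

# Assumed fallback when neither the lexicon nor any suffix applies.
DEFAULT_CLASS = frozenset({"NN", "VB", "JJ"})


def guess_ambiguity_class(word, lexicon, suffixes=SUFFIX_CLASSES):
    """Return the set of candidate tags for a word.

    Known words come from the lexicon; unknown words fall back to the
    longest matching suffix, then to a default class.
    """
    word = word.lower()
    if word in lexicon:
        return lexicon[word]
    # Try proper suffixes from longest to shortest.
    for start in range(1, len(word)):
        suffix = word[start:]
        if suffix in suffixes:
            return suffixes[suffix]
    return DEFAULT_CLASS
```

An HMM tagger would then choose one tag from each word's ambiguity class by considering the surrounding sequence; that step is not shown here.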
Ftp Directory This is a Serbo-Croatian corpus of approximately 700,000 words. The texts are taken from modern Yugoslav fiction, and all Serbo-Croatian-speaking areas (Serbia, Croatia, Montenegro, and Bosnia-Hercegovina) are represented. The texts are in ASCII format and use the Latin alphabet. A list of the ASCII values of the special Serbo-Croatian characters is provided in info file 0092. More Info: here.