URSA

Unicode Retrieval System Architecture

Overview

URSA is the Computing Research Lab's Tipster Phase III research and development effort to make text processing and information retrieval language-transparent.

URSA combines Unicode display technology developed at CRL with translingual information retrieval, multilingual collection visualization, and document management, with special emphasis on design principles that have been validated by studying analysts in real-world scenarios.

Tipster Phase III

Tipster is DARPA's design and research effort to unify the detection and language processing capabilities of a diverse range of research entities into a single, plug-and-play architecture. The Tipster Document Manager is the central feature of the Tipster design. In Tipster, there are collections, documents, attributes and annotations. Collections can contain documents and collection attributes, while documents can have annotations and attributes. Other Tipster features include support for detection and information extraction technologies.

Because the data in a Tipster document is simply a byte stream, a document can contain Unicode text, video, audio, or any other imaginable data type with equal facility. It is up to the application that makes use of the documents to interpret the data correctly.
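The relationships described above can be sketched in a few lines of Python. This is only an illustration of the collection/document/attribute/annotation model, not the Tipster API: the class names, the field names, the `encoding` attribute, and the use of byte offsets for annotation spans are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Annotation:
    """A typed span over a document's byte stream (hypothetical layout)."""
    type: str
    start: int  # byte offset where the span begins (assumed convention)
    end: int    # byte offset one past the end of the span
    attributes: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Document:
    """A document is an opaque byte stream plus attributes and annotations."""
    doc_id: str
    content: bytes
    attributes: Dict[str, Any] = field(default_factory=dict)
    annotations: List[Annotation] = field(default_factory=list)

    def text(self) -> str:
        # Interpreting the bytes is the application's job. Here we assume
        # the application stored the encoding name as a document attribute.
        return self.content.decode(self.attributes.get("encoding", "utf-8"))


@dataclass
class Collection:
    """A collection holds documents and its own collection-level attributes."""
    name: str
    attributes: Dict[str, Any] = field(default_factory=dict)
    documents: Dict[str, Document] = field(default_factory=dict)

    def add(self, doc: Document) -> None:
        self.documents[doc.doc_id] = doc


# Usage: store Spanish text as raw UTF-8 bytes and annotate the whole word.
doc = Document("d1", "información".encode("utf-8"),
               {"encoding": "utf-8", "language": "es"})
doc.annotations.append(Annotation("token", 0, len(doc.content)))

news = Collection("news", {"source": "example"})
news.add(doc)
```

Because the content is kept as bytes, the same `Document` could just as well hold audio or video. Only `text()` assumes that the bytes are encoded text.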

Unicode and URSA

URSA combines the latest advances in information retrieval with the coherent Unicode text model to make language-transparent IR a reality. Ongoing development is focusing on integrating Unicode detection technology with the Tipster architecture. The model we are developing utilizes annotations on the documents to describe the text for indexing. External document annotators can produce segmentations for Oriental languages, or stemmed word markup for Western languages, which are then interpreted by the URSA engine and indexed for later retrieval. The URSA engine will be fully conversant in Tipster detection needs, including complex query expressions and natural language queries.
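To illustrate the annotation-driven model described above, the following minimal sketch builds an inverted index from annotations rather than from the raw text. It is not the URSA engine: `TermAnnotation`, `InvertedIndex`, the use of character offsets, and the convention that an empty `term` means "index the surface text" are all hypothetical choices made for the example. The annotations themselves stand in for the output of external segmenters or stemmers.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Set


class TermAnnotation(NamedTuple):
    start: int  # character offset where the indexed unit begins
    end: int    # character offset one past its last character
    term: str   # normalized form (e.g. a stem); "" means use the surface text


def index_terms(text: str, annotations: List[TermAnnotation]) -> List[str]:
    """Return the terms to index, as dictated by the annotations."""
    return [a.term or text[a.start:a.end] for a in annotations]


class InvertedIndex:
    """Maps each term to the set of documents that contain it."""

    def __init__(self) -> None:
        self.postings: Dict[str, Set[str]] = defaultdict(set)

    def add(self, doc_id: str, text: str,
            annotations: List[TermAnnotation]) -> None:
        for term in index_terms(text, annotations):
            self.postings[term].add(doc_id)

    def search(self, *terms: str) -> Set[str]:
        """Return the documents that contain every query term (boolean AND)."""
        if not terms:
            return set()
        return set.intersection(*(self.postings.get(t, set()) for t in terms))


index = InvertedIndex()

# An English document annotated with stems by a (hypothetical) stemmer.
index.add("en1", "retrieving documents",
          [TermAnnotation(0, 10, "retriev"), TermAnnotation(11, 20, "document")])

# A Chinese document annotated with word boundaries by a (hypothetical) segmenter.
index.add("zh1", "信息检索",
          [TermAnnotation(0, 2, ""), TermAnnotation(2, 4, "")])

chinese_hits = index.search("检索")
english_hits = index.search("retriev", "document")
```

The indexer never inspects the language of the text. All language-specific knowledge lives in the annotators, which is the property that makes this arrangement language-transparent.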

Papers and Presentations

Check out some of our papers, or download presentation slides, on cross-language text retrieval, URSA, and related efforts:

  • [postscript] [pdf] Tipster III Kickoff Meeting Presentation (October 1996, Columbia, MD)
  • [postscript] [pdf] Tipster III 6-month Meeting Presentation (May 1997, Columbia, MD)
  • [postscript] [pdf] Tipster III 12-month Meeting Presentation (October 1997, La Jolla, CA)
  • [postscript] [pdf] TREC 5 Paper on Cross-Language Text Retrieval (TREC 5, Gaithersburg, MD, November 1996)
  • [postscript] [pdf] TREC 5 Slides on Cross-Language Text Retrieval (TREC 5, Gaithersburg, MD, November 1996)
  • [postscript] [pdf] Cross-Language Text Retrieval using Evolutionary Optimization (EP95 in San Diego)
  • [postscript] [pdf] A Follow-up Paper on Cross-Language Retrieval Using Evolutionary Optimization (EP96 in San Diego)
  • [postscript] [pdf] AAAI Workshop on Cross-Language Text Retrieval Paper (Stanford University, March 1997)
  • [postscript] [pdf] Paper presented at SIGIR96 Workshop on Cross-linguistic Information Retrieval (ETH, Zurich 1996)
  • [postscript] [pdf] TREC 4 Paper on Cross-Linguistic Text Retrieval (TREC 4, Gaithersburg, MD, November 1995)
  • [postscript] [pdf] TREC 6 Paper on Cross-Language Text Retrieval (TREC 6, Gaithersburg, MD, November 1997)
  • [postscript] [pdf] SIGIR 97 Paper on Implementing Large-Scale Cross-Language Text Retrieval Systems (SIGIR97, Philadelphia, PA, August, 1997)
  • [postscript] [pdf] SIGIR 97 Workshop Paper on Monolingual, Multilingual and Crosslingual Information Retrieval using network models (SIGIR97, Philadelphia, PA, August, 1997)
  • [postscript] plus a figure [postscript] OR [pdf] plus a figure [pdf] Early Tech Report on iterative least-squares fitting of language translation models.
  • [pdf] Slides from Internal CRL Seminar, containing an overview of text retrieval and some notes on QUILT
  • [power point] [pdf] Unicode Conference Presentation (with notes). 14th International Unicode Conference, Boston, MA, March 1999.
  • [postscript] [pdf] Extended tech report on using URSA libraries and tools. You can download the J24 development archive described in the paper if you know the secret password. Warning: it is around 10 MB compressed and 40 MB uncompressed.
  • [zip] [html] PowerPoint presentation (May 1999). Reviews work on Cross-Language Text Retrieval and Interactive IR, and how we have combined these results in the design of an interactive CLTR interface, KEIZAI.
  • [doc] [pdf] Keizai: An Interactive Cross-Language Text Retrieval System
    Paper for the Workshop on Machine Translation for Cross Language Information Retrieval, held in conjunction with Machine Translation Summit VII, September 13-17, 1999, Singapore.
  • [doc] [pdf] Improving Cross-Language Text Retrieval with Human Interactions
    Paper presented at the Hawaii International Conference on System Sciences (HICSS-33), January 4-7, 2000.

Demo
The MUNDIAL system was an early proof of concept for Translingual or Cross-Language Text Retrieval. The new MUNDIAL system is substantially more robust and now provides an interface to more search engines. MUNDIAL allows searches of documents on the World Wide Web using English language queries that are translated into ten other languages.

Click here to try the MUNDIAL demo.

This version of MUNDIAL can translate documents returned by the search engine (Spanish to English only). Due to browser security limitations, however, it cannot translate every document.
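As a rough illustration of the query-translation step, the sketch below expands an English query into several target languages using small bilingual word lists. This is not MUNDIAL's actual method, which is not described on this page: the `LEXICONS` table, its entries, and both function names are hypothetical, and simple word-by-word dictionary lookup is used here only as a plainly named stand-in technique.

```python
from typing import Dict, List

# Hypothetical toy lexicons: English word -> candidate translations.
# A real system would use far larger resources; these entries are examples only.
LEXICONS: Dict[str, Dict[str, List[str]]] = {
    "es": {"economy": ["economía"], "bank": ["banco", "orilla"]},
    "fr": {"economy": ["économie"], "bank": ["banque", "rive"]},
}


def translate_query(query: str, target: str) -> List[str]:
    """Translate an English query word by word into the target language.

    Ambiguous words contribute every candidate translation. Words missing
    from the lexicon (names, acronyms) pass through in lowercased form.
    """
    lexicon = LEXICONS[target]
    translated: List[str] = []
    for word in query.lower().split():
        translated.extend(lexicon.get(word, [word]))
    return translated


def translate_to_all(query: str) -> Dict[str, List[str]]:
    """Produce one translated term list per available target language."""
    return {lang: translate_query(query, lang) for lang in LEXICONS}


queries = translate_to_all("bank economy")
```

Keeping every candidate translation of an ambiguous word, such as "bank", favors recall over precision. An interactive interface could instead let the user choose among the candidates.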

Click here to try J24. The URSA retrieval engine is now being used in the J24 retrieval system for evaluating document visualization in interactive retrieval. J24 presents thumbnail views of documents to show keyterm distributions.

Click here to try Arctos. The URSA retrieval engine is now being used in conjunction with URSA's cross-language technology for the interactive creation of queries, multilingual document retrieval and translation. The queries can be sent to WWW search engines or to J24 for interactive document visualization and translation.

Click here to try Keizai. This is the next iteration of a Web-based interface for cross-language text retrieval. It helps you construct text queries in French, Italian, German, Spanish, Chinese, Japanese and Korean. Named entity extraction in these languages and translation capabilities are used to present the text results.

Contacts
For more information on the URSA project, Unicode detection and translingual information retrieval, or user-centered detection system design, please contact principal investigators Bill Ogden or Mark Davis.