New OED Software

The main components of the text management software developed as part of the New OED project are the mature products of years of research [publications]. In recognition of the significance of this work, NSERC included a description of the project in its 1991 publication Great Canadian Success Stories. While the research continues at the University of Waterloo, the tangible results are further developed under licence by Open Text Corporation, a spin-off company located in Waterloo; commercial applications promise to bring the software to many new users.

Database Design

Early in the project, researchers at Waterloo recognized that existing database models were unsuitable for the efficient manipulation of the structures found in reference texts. Accordingly, we developed GOEDEL, a programming language that supports models based on grammars.

GOEDEL's first commercial application has been in the New Shorter Oxford English Dictionary project at Oxford [Blake, TOIS, July 1992]. By automating the initial drafting of entries from information in the OED2 database, the software has dramatically increased productivity on that project.

Ongoing research is examining how relational database management systems, and especially SQL, can be extended to manage structured text effectively.

Text Transduction

Those who process text with computers repeatedly face the problem of transforming their data. For example, in the New OED project, it was impractical to enter text in a form immediately suitable for editing.

At the University of Waterloo, researchers devised a set of tools to replace conventional methods, which were typically expensive, time-consuming and prone to error. The software is based on INR, a program developed by J.H.Johnson to convert rational expressions to finite automata, and gasim, an interpreter for finite automata developed by F.W.Tompa. By allowing programmers to convert text according to grammars rather than individualized processing instructions, this system has simplified the restructuring of data for the New OED project and has found several other applications at Oxford, Waterloo, and elsewhere. A commercial version was made available from Open Text under the product name TTK.
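
The idea of converting text with a finite automaton rather than hand-written processing instructions can be illustrated with a very small sketch in Python. It does not reproduce INR or gasim, whose input formats are not described here; the transition table, the state names "plain" and "bold", the `*` marker and the function name `transduce` are all assumptions made for illustration.

```python
# Hypothetical two-state transducer: "*" toggles bold in the input,
# and is rewritten as an explicit <b> or </b> tag in the output.
# Keys are (state, input character); values are (next state, output string).
TRANSITIONS = {
    ("plain", "*"): ("bold", "<b>"),
    ("bold", "*"): ("plain", "</b>"),
}


def transduce(text, transitions=TRANSITIONS, start="plain", accepting=("plain",)):
    """Run the transducer over text and return the rewritten string.

    Characters with no entry in the table are copied through unchanged
    and leave the state as it is.
    """
    state = start
    out = []
    for ch in text:
        if (state, ch) in transitions:
            state, emitted = transitions[(state, ch)]
        else:
            emitted = ch
        out.append(emitted)
    # Ending outside an accepting state means the input was malformed
    # (here: a bold marker that was opened but never closed).
    if state not in accepting:
        raise ValueError("input ended in state %r" % state)
    return "".join(out)


print(transduce("a *bold* word"))
```

Changing the conversion means editing the table, not the loop, which is the practical appeal of describing a transformation declaratively.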

Text Searching

The size of the OED2, about 60 million running words, makes it difficult to search by traditional means. Therefore, one of the most important pieces of Waterloo software is a text-search system, PAT, developed by G.H.Gonnet and T.W.Snider. PAT locates words, prefixes and phrases with equal speed and provides facilities to restrict searches to arbitrary regions of text. In fact, PAT can search the entire text of the OED2 in less than three seconds.
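
One way an index can treat words, prefixes and phrases alike is to sort the positions at which words begin by the text that follows each position; every query then becomes a binary search for suffixes that start with the query string. The Python sketch below illustrates that idea only. The text above does not describe PAT's internal data structures, so this should not be read as PAT's implementation, and the names `build_index`, `search` and the `region` parameter are hypothetical.

```python
def build_index(text):
    """Return word-start offsets, sorted by the suffix beginning at each one.

    A toy construction: sorting on full suffixes is far too slow and
    memory-hungry for a dictionary-sized text.
    """
    starts = [i for i in range(len(text))
              if text[i].isalnum() and (i == 0 or not text[i - 1].isalnum())]
    return sorted(starts, key=lambda i: text[i:])


def _bound(text, index, query, upper):
    """Binary search for the first (or one past the last) matching suffix."""
    lo, hi = 0, len(index)
    n = len(query)
    while lo < hi:
        mid = (lo + hi) // 2
        piece = text[index[mid]:index[mid] + n]
        if piece < query or (upper and piece == query):
            lo = mid + 1
        else:
            hi = mid
    return lo


def search(text, index, query, region=None):
    """Return sorted offsets where a word-aligned match of query begins.

    region, if given, is a (start, end) pair of offsets; only matches
    that begin inside it are returned.
    """
    first = _bound(text, index, query, upper=False)
    last = _bound(text, index, query, upper=True)
    hits = sorted(index[first:last])
    if region is not None:
        start, end = region
        hits = [h for h in hits if start <= h < end]
    return hits


text = "the cat sat on the mat; the catalogue sat there"
index = build_index(text)
print(search(text, index, "cat"))       # word and prefix matches
print(search(text, index, "the cat"))   # phrase matches
```

Because a word, a prefix and a phrase are all just strings matched against the start of a suffix, each query costs the same two binary searches, regardless of its kind.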

PAT has proven invaluable at both Waterloo and Oxford on such important texts as the OED2 and the Bible, as well as bibliographical data and federal legislation. PAT was marketed by Open Text under the name TextSearch and formed the engine behind the Open Text Web Index. It was used worldwide in humanities research and commercial applications.

Text Display

At Waterloo, researchers have combined traditional typography with leading-edge "windowing" technology in LECTOR, a tool for displaying text on the screen in formats tailored to a user's needs. LECTOR was designed by D.R.Raymond to provide customizable style sheets for displaying arbitrarily tagged text and to support interactivity by letting users select text locations with the mouse. The software was available from Open Text under the name TextView.
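
The notion of a style sheet for arbitrarily tagged text can be sketched in a few lines of Python. This is not LECTOR's style-sheet format, which is not described here; the tag names (`hw`, `pos`, `def`), the markers chosen for display, and the function name `render` are assumptions made for illustration.

```python
import re

# Hypothetical style sheet: each tag name maps to the strings shown
# in place of its opening and closing tags.
STYLE = {
    "hw": ("**", "**"),   # headword
    "pos": ("_", "_"),    # part of speech
}

TAG = re.compile(r"<(/?)(\w+)>")


def render(tagged, style):
    """Replace each tag with its display markers; unknown tags vanish."""
    def replace(match):
        closing, name = match.group(1), match.group(2)
        before, after = style.get(name, ("", ""))
        return after if closing else before
    return TAG.sub(replace, tagged)


entry = "<hw>cat</hw> <pos>n.</pos> <def>a small feline</def>"
print(render(entry, STYLE))
```

Displaying the same entry differently only requires passing a different style dictionary; the tagged text itself is untouched.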

With LECTOR and PAT windows together on one screen, researchers can quickly find and display passages from any text, including various dictionaries and other resources.