Structured Text Database Systems proposal

Structured Text Database Systems

Frank Tompa

(Citations such as [J2], [C4], [T1], and [P7] refer to my own publications, which were listed as entries on an accompanying Personal Data Form, not included here.)

Motivation and Objectives

In text-dominated database applications the content is primarily text rather than numbers, but the fundamental principles of data organization remain. Our research follows the "database approach," specifically those aspects that address data structuring together with language and algorithm support (as opposed to the aspects concerned with "database operating systems").

One long-term objective is to improve document storage and management by adapting sound database practices to this domain of information and by designing and prototyping suitable document database management systems. The second, complementary objective is to create systems that manage information fragments that are assembled in diverse ways to meet various applications’ needs by providing virtual documents.

In the short term, aspects of these objectives will be addressed through three themes:

development of data models for structured text,
design and implementation of text transformation systems for loading structured text databases and for supporting views, and
design and implementation of improved algorithms and storage structures for structured text searching and browsing.

Progress to Date

Because the fundamental assumptions about the nature of text data and its uses distinguish text databases from conventional ones [B1], we introduced the notion of a grammar-defined model of data [Gonnet87]. The model uses grammars as schemas and "parsed strings" (p-strings) as instances. Operators on these instances have been defined, resulting in a "p-string algebra" that can be used for data manipulation and view definition. An interpreter for the model’s algebra was implemented and successfully applied to solve the problem of creating an initial draft of the Shorter OED from the text of the OED [J3]. A second implementation of the algebra’s operators was completed as part of a type library for C (including also sets, lists, and tables). Surprisingly, our implementations support highly structured text, but they have not embodied grammars and, in fact, provide no data definition language. On the other hand, SGML defines only uninterpreted grammars and therefore also does not meet major conditions of a data model [J1]. As a result, there remains the need to develop a text data model that effectively utilizes grammars.

A major problem in text manipulation is in the area of transduction: the conversion of text from one form to another. This is useful both for converting from data capture form to some stored form and in mapping text from internal form to the form requested by an application. Early on in the OED project, we addressed the problem of converting entered text to a more convenient tagged form [Kazman86]. The tools used were based on interpreting finite state machines, and we continued to explore this approach [Clark91]. Since the original transduction of the OED, we have successfully used the tools for converting several other texts: converting typesetting tapes for the Oxford Advanced Learner’s Dictionary and the Australian National Dictionary into forms based on descriptive tagging. Our experience has indicated that the specification of transductions is extremely difficult in the presence of high variability in the input form (as is common when data capture is not tightly constrained).

The second aspect of text transduction is in support of database views. Here the problem is not the specification of mappings to the view, since the text is typically well-structured after it is stored in the database, but rather it is in specifying update semantics. We had addressed this problem in the relational context from two sides: updating through views [Medeiros86] and updating of views [Blakeley86, Tompa88]. Because text is typically updated via a text editor that is outside the control of a database system, conventional solutions to managing updates cannot be applied. We therefore designed an alternative approach that allows a view of some text to be checked out of the system and updated, then later checked back into the system which will then apply view update semantics to effect the changes [R2]. We must now implement the algorithms in a text management system to develop them further and prove their worth.

We also investigated systems to support hypertext browsing. We started by designing effective videotex databases [Tompa81, Schabas83] and examining the implications for hypertext if it were applied to the OED [Raymond88]. Based on this experience, we concluded that hypertexts should not depend on statically-defined pages and links only, but instead must support views. We developed a hypertext database model along these lines [Tompa89] and later implemented an even more dynamic hypertext facility [C3] that resolves hypertext links by using full-text search [J2] to match source "keys." I continue to be interested in the problems of efficient, flexible text retrieval.

Brief Literature Review

As well as publications in conventional database areas, potentially relevant research is reported within the realms of digital libraries, electronic publishing, hypertext, information retrieval, and (office) information systems. Several researchers have defined alternative algebras capable of manipulating highly structured text without dependence on grammars [Clarke95, Consens95, Navarro95]. Others have examined grammar-based text data models [Böhm94, Gyssens89, Macleod91]. Instead of relying on explicit grammars to serve as constraints within a schema, still others have used more traditional database modelling mechanisms to manage text [Christophides94, Sacks-Davis95]. Because grammars for text can serve roles similar to that of database schemas, I will focus research on grammar-based approaches.

Researchers have attempted text conversion using tools other than finite state transducers, including attribute grammars and conventional compiler technology [Mamrak89], natural language parsers and logic grammars interpreted by Prolog [Neff88], and syntax-directed transformations, where each transformation involves two grammars and there is an algorithm that transforms a parse tree for the input grammar into a parse tree for the output grammar [Cordy91, Furuta88, Kuikka94, Pratt71]. Frameworks in which to couch these approaches are derivable from conventional database conversion techniques (e.g., [Fry81]). Our goal is to find efficient methods to carry out the transduction across very large text databases as well as on large user-specified subsets of the databases.

If the conversion process is not specified (at least in part) by the output grammar, a grammar for the resulting text must be produced. If parse trees are directly manipulated to effect a transformation, the output grammar may be determined from the input grammar and from the operations. If this is not possible, or if the text is initially loaded in tagged form but without an accompanying grammar, then a suitable grammar must be found. Work on grammar inference has been the subject of learning theory for several decades [Vidal94], with primary applications in pattern recognition. Typical assumptions adopted to make the problem tractable do not apply to deriving useful grammars to constrain text databases. In a recent paper [Ahonen94], automata that describe a text collection are generalized, converted to regular expressions, and then formed into a grammar. An alternative promising approach is detailed below.

Proposed Research Program

I propose to investigate how to exploit grammars in text databases. This will involve further work in relating SGML’s document type definitions to database definition languages, exploring the applicability of grammars as models for information in text, and using grammars to describe text database views, queries and updates. As examples of this research I present two specific problems.

Creating grammars to match text

In our work to date, database instances have not been constrained by grammars, since the data was obtained from existing texts created according to unenforced guidelines. It is our belief that such texts will continue to be important, and that recognizing structural components will remain a problem. Thus there remain both the need to analyze text created under loosely constrained conditions (in order to identify, and perhaps tag, grammatically structured units) and the need to develop concise grammars that best describe such texts.

In all transduction attempts so far, the construction of grammars that completely describe the variety of input found on typesetting tapes has been extremely difficult. I propose to investigate convenient methods for describing the restructuring of text more generally and to carry out the transformations efficiently. Our experiences to date show that a modular approach to describing transformations will be most effective. I plan to explore techniques to describe a sequence of transformations to be applied to the text (or to designated parts of the text): each transformation can be written and tested as a database query, and productive transformations can be maintained as view definitions. In this way, powerful database query facilities can be applied to discover structural properties of the data. Thereafter I will investigate methods to apply resulting sequences efficiently, exploring how to generate finite state transducers that effect a sequence of transformations described by a nest of views.

As a related problem, tagged texts will exist without grammars or with grammars that are so loose that they do not describe the instances very closely. Thus to support grammar-defined databases, we have begun to develop algorithms to infer meaningful grammars from SGML-like tagged instances. Rather than working at the level of automata, which produce regular expressions that are typically hard to understand, we refine the right sides of grammatical productions directly. The inference engine implemented is based on the equivalence of productions (such as S ::= ABC|AC == S ::= AB?C) and a few user-supplied heuristics (such as "replace S ::= ABBC by S ::= AB⁺C"). In preliminary results, we generated a grammar for the OED in which the number of alternatives to describe the uppermost structure of an entry in the OED was reduced from an unmanageable 1105 to 137, which is well within the limits of study by lexicographers. We intend to evaluate the effectiveness of this approach and of various heuristics for diverse texts.

Grammars as constraints

One important use of grammars is to constrain the interpretation of updates to documents through structured text views. Extending work begun in Finland [Nikunen90], we have begun to develop a generalized approach to updates through views particularly suited to text [R2].

In defining a view, we split the information in a database into three parts: the view, its complement [Bancilhon81], and a connector between them. Based on this separation we build a complete architecture which incorporates grammar-based text transformations. We further propose typical algorithmic approaches suitable for resolving ambiguous updates (e.g., to choose an interpretation that minimizes the number of deletions). Although the architecture was motivated by textual databases, the solution is applicable to relational or object oriented databases as well as to the management of updates for federated databases. Further investigation is required to establish the appropriateness of this approach for managing typical update scenarios arising in practice.

Significance of the Research

The research proposed here addresses text storage, retrieval, and manipulation. The most obvious applications arise within publishing (creation, editing, and distribution of documents — using paper or computer screens as access and delivery media), but applications are ubiquitous in business (involving reports, memoranda, internal messages, correspondence, operations and procedures manuals, advertisement, and so forth) and in software development and maintenance (involving specifications, source code, documentation, bug reports, and so forth). The rapidly growing reliance on an "information highway" for in-house as well as public dissemination of information, whether or not dominated by the World Wide Web, emphasizes the importance of this research.

We have found it extremely effective to approach text database research by seeking real applications’ needs. For example, we worked closely with lexicographers and humanist scholars to understand requirements for the OED [I3], we worked closely with software developers to understand requirements for maintaining software documents [I2], and we are working closely with industrial partners to understand commercial applications for SGML and for text management via the World Wide Web [I1]. The technology transfer demonstrated by the partnerships with the Oxford University Press and with the Canadian Strategic Software Consortium (CSSC) indicates the value of our research.

Scope of effort

I expect the research described in this proposal to be addressed over a five-year period. I intend to be assisted by two additional masters and two additional doctoral students working in suitable areas detailed in the proposal (e.g., grammar generation and update through views).

References

[Ahonen94] H Ahonen, H. Mannila, and E. Nikunen, "Forming grammars for structured diocuments: an application of grammatical inference," Proc. 2nd Int. Colloq. on Grammatical Inference and Applications (IGCI-94), Spain, Lecture Notes in Computer Science #862 (R. C. Carrusco and J. Oncina, eds.) Springer Verlag (Sept. 1994) 153-167.

[Bancilhon81] F. Bancilhon and N. Spyratos, "Update Semantics of Relational Views," ACM Trans. on Database Systems, 6, 4 (1981) 557-575.

[Böhm94] K. Böhm, K. Aberer, and E. Neuhold, "Administering structured documents in digital libraries," Digital Libraries (DL '94), Newark, Lecture Notes in Computer Science #916 (N. Adam, B. Bhargava, and Y. Yesha, eds.) Springer Verlag (1994), 91-118.

[Christophides94] V. Christophides, S. Abitoul, S. Cluet, and M. Scholl, "From structured documents to novel query facilities," ACM Sigmod, 1994, 313-324.

[Consens95] M. P. Consens and T. Milo, "Algebras for querying text regions," ACM PODS, 1995, 11-22.

[Clark91] D. R. Clark, "Finite state transduction tools," Tech. Rept. OED-91-03, UW Centre for the New OED and Text Research, 1991.

[Clarke95] C. L. A. Clarke, G. V. Cormack, and F. J. Burkowski, "An algebra for structured text search and a framework for its implementation," The Computer Journal 38, 1 (1995) 43-56.

[Cordy91] J. R. Cordy, C. D. Halpern-Hamu, and E. Promislow, "TXL: a rapid prototyping system for programming language dialects," Computer Languages 16,1 (1991) 97-107.

[Fry81] J.P. Fry, et al., "Conversion technology, an assessment," Data Base 12,4 and \fI13,1 (1981) 39-61.

[Furuta88] R. Furuta and P. D. Stotts, "Specifying structured document transformations," \fIEP88, Document Manipulation and Typography, Proc. Int. Conf. on Electronic Publ. (J.C. van Vliet, ed.), Cambridge University Press (1988), 109-120.

[Gonnet87] G. H. Gonnet and F. W. Tompa, "Mind your grammar: a new approach to modelling text," Proc. Very Large Data Bases (VLDB 87), 13 (1987), 143-153.

[Gyssens89] M. J. Gyssens, J. Paradaens, and D. Van Gucht, "A grammar-based approach towards unifying hierarchical data models," ACM Sigmod, 1989, 263-272.

[Kazman86] F. N. Kazman, Structuring the Text of the Oxford English Dictionary through Finite State Transduction, M.Math. thesis, Dept. of Computer Science, Univ. of Waterloo, 1986.

[Kuikka94] E. Kuikka and M. Penttonen, "Transformation of structured documents with the use of grammars," Electronic Publishing — Origination, Dissemination, and Design 6, 3 (1994) 373-383.

[Macleod91] I. A. Macleod, "A query language for retrieving information from hierarchic text structures," The Computer Journa, 34, 2 (1991) 197-208.

[Mamrak89] S.A. Mamrak, M.J.Kaelbling, C.K.Nicholas, and M. Share, "Chameleon: a System for Solving the Data Translation Problem," IEEE Trans. on Soft. Eng. 15,9 (September 1989) 1090-1108.

[Medeiros86] C. M. B. Medeiros and F. W. Tompa, "Understanding the implications of view update policies," Algorithmica (1986) 337-360.

[Navarro95] G. Navarro and R. Baeza-Yates, "A language for queries on structure and contents of textual databases," Proc. ACM Sigir '95, 18, Seattle (July 1995) 93-101.

[Neff88] M.S. Neff, R.J. Byrd, and O.A. Rizk, "Creating and Querying Lexical Databases", Proc. of the 2nd Conf. on Applied Natural Language Processing, Assoc. for Comp. Ling., Feb. 9-12, 1988.

[Nikunen90] E. Nikunen, Views in structured text databases, University of Helsinki, Department of Computer Science, C-1990-60 (1990).

[Pratt71] T. W. Pratt, "Pair-grammars, graph languages, and string-to-graph translations," J. of Computer and System Sciences 5 (Dec. 1971), 560-595.

[Raymond88] D. R. Raymond and F. W. Tompa, "Hypertext and the Oxford English Dictionary," Comm. ACM 31, 7 (July 1988) 871-879.

[Sacks-Davis95] R.Sacks-Davis, A.Kent, K.Ramamohanarao, J.Thom, and J.Zobel, "Atlas: a nested relational database system for text applications," IEEE Trans. on Knowledge and Data Eng. 7,3 (1995) 454-470.

[Schabas83] A. H. Schabas and F. W. Tompa, "Trees and forests: user reactions to two page access structures," Videotex 83, New York (June 1983) 197-209.

[Tompa81] F. W. Tompa, J. Gescei, and G. v. Bochmann, "Data structuring facilities for interactive videotex systems," IEEE Computer 14, 8 (Aug. 1981) 72-81.

[Tompa88] F. W. Tompa and J. A. Blakeley, "Maintaining materialized views without accessing base data," Information Systems 13, 4 (1988) 393-406.

[Tompa89] F. W. Tompa, "A data model for flexible hypertext database systems," ACM Trans. on Information Systems 7, 1 (January 1989) 85-100.

[Vidal94] E. Vidal, "Grammatical inference: an introductory survey," Proc. 2nd Int. Colloq. on Grammatical Inference and Applications (IGCI-94), Spain, Lecture Notes in Computer Science #862 (R. C. Carrusco and J. Oncina, eds.) Springer Verlag (Sept. 1994) 1-4.