Flexible text visualization with applications to CASE

Frank Tompa and Darrell Raymond

Summary

The creation and maintenance of large software systems has long been recognized as a major expense and bottleneck in business development. Software engineering practices rely extensively on large text resources in the form of software specifications, source code, and documentation. We will examine the utility of managing these resources as components of large text databases so that they can be queried and manipulated in support of software design, production, and maintenance efforts.

In order to apply text database technology to support software engineering, several advances need to be made to accommodate text characteristics and uses commonly found in programming environments. Of particular interest are the intricate, interwoven aspects of symbol specification, declaration, and use. For example, the role of a variable or function might be described in a design document, its declaration found in a separately compiled header file, and its uses made explicit throughout a code block or embedded in macro invocations. As a further complication, conditional compilation might inhibit an engineer's understanding of the text representing a program.

Within the area of text-dominated databases, we have identified the following specific research problems to be addressed:

We will extend the text-dominated database software we have developed in response to the task of computerizing the Oxford English Dictionary and other literary and reference texts. Prototypes will be made available at the IBM Canada Laboratory to be evaluated by practicing software engineers. Since the approaches will not be tied to any specific CASE technologies, the results will be of benefit to all software developers.

Milestones

Milestone

Definition

Date

0

Delivery of current display software to IBM Labs

Feb. 1991

1

Trial: initial coding from specifications

Apr. 1991

2

Trial: detecting inter-module dependency cycles

July 1991

3

Trial: understanding an individual module's code

Dec. 1991

4

Trial: recoding from modified specifications

May 1992

5

Specification and implementation of mechanism(s) for displaying text that incorporates dynamically defined regions

Aug. 1992

6

Specification and implementation of mechanism(s) for displaying text with heterogeneous levels of detail

Dec. 1992

7

Specification and implementation of mechanism(s) for displaying and navigating compound documents

Mar. 1993

8

Trial: recoding from modified specifications

May 1993

9

Final report

Dec. 1993

Outline of Proposed Research

Objective

Text-dominated database management is a new field of study that has evolved from research in conventional databases, information retrieval, and document management. The goal of the proposed research is to investigate aspects of text database design, creation, manipulation, and querying that can support the production and maintenance of large software systems. Thus the major contributions of the research are expected to be in the area of text-dominated databases, with computer-assisted software engineering (CASE) serving as the primary motivational force.

Within the area of text-dominated databases, we have identified the following specific research problems to be addressed:

    1. the design and support of mechanisms for displaying text that incorporates dynamically defined regions, and
    2. support for multiple, simultaneous visualizations of compound documents, including support for intra-document and inter-document cross-reference resolution

These will be elaborated below, with reference to applications drawn from CASE.

Nature and Potential Value of the Proposed Research

Our work with text-dominated databases arose from the requirements of computerizing the Oxford English Dictionary [Weiner85, Berg88, Murphy89]. Having realized that conventional database technology could not be applied directly to text [Tompa89b, Tompa91], we developed new approaches and algorithms for text search [Gonnet91], text display [Raymond90], text transduction [Kazman86], and text database manipulation [Gonnet87], among others [Tompa89c]. Apart from the research results and the direct benefits to the Oxford University Press, the software is now commercially available through a newly-formed Ontario company, Open Text Systems, which distributes the software to academic, government and various business organizations.

A critical component to our success was the close cooperation with lexicographers at the Oxford University Press, who served as potential users and evaluators of the practical aspects of our research. For example, their needs for automatically extracting from the OED a pre-draft for the Shorter Oxford English Dictionary resulted in extensive development and testing of the Goedel language and implementation [Blake91].

We intend to repeat the success through a similar approach. Software engineering practice relies extensively on large text resources in the form of software specifications, source code and documentation; effective management of these resources in support of CASE will be used to direct and test our approaches to text-dominated database management.

Software engineers at the IBM Canada Laboratory, having reviewed the research and tools developed to date, have agreed to work with us to extend the research by serving as potential users and evaluators of text tools that support their work on software design, production, and maintenance. We will work with researchers at IBM to identify the most pressing needs and to formulate initial approaches to solutions. As we solve text database problems, prototypical software will be made available at the IBM Lab for evaluation by practicing software engineers. The feedback will be used to complement theoretic results in order to evaluate and reformulate research directions.

Just as our prior research proved to be applicable far beyond the needs of lexicographers, we expect that further results in text-dominated databases will apply to areas beyond software engineering. However, even without such extended applications, should the management of text resources prove to have potential value in creating and maintaining software at IBM, the results will be of immediate benefit to them and to other software developers in Canada. This alone will provide economic benefits by helping Canadian software developers to become more productive.

Detailed project description

Computer-aided software engineering is a highly diverse subject, ranging over program verification, software testing, development in-the-large, programming language design, construction of reusable components, and development methodologies. For our purposes, however, there are two basic CASE philosophies:

The prototypical example of the first philosophy is the development of new programming languages. The view here is that certain programming problems are endemic to current notations, and thus we can make significant changes only by employing a new type of notation. Fortran, the first high-level language, was developed with the twin goals of efficient code production and the provision of a notation more familiar to engineers. Block structured, strongly-typed languages such as Pascal were intended to resolve the problem of tangled control flow and provide explicit data abstraction [Jensen79]. Functional languages were designed to avoid the von Neumann bottleneck and make programs more amenable to verification [Backus78]. Logic programming languages were intended to provide descriptive rather than programmatic specification [Clocksin81]. Most recently, object-oriented languages attempt to provide modularity by focusing attention on reusable components and the addition of functionality to data abstractions [McGregor90].

The second basic CASE philosophy starts from the idea that new languages will not help us address the maintenance of the extensive base of existing software. Thus, the focus of the second philosophy has been on the design of tools to enhance or improve the development and maintenance of existing code. Structure-based editors, for instance, use knowledge about the language to increase programming productivity and reduce syntax errors [Teitelbaum81]. When such editors are coupled with relational database technology, code becomes a type of database that can be queried about use of variables or other characteristics [Horwitz85]. Software that extracts implicit information about data and control flow can aid the comprehension of unstructured programs [Green84]. This type of information can also be made explicit through the use of typography and other document layout principles for enhancing the readability of text [Baecker86]. Since documentation and code are usually consulted simultaneously, merging the two into a single document can result in a more useful and more easily maintained artefact [Knuth83].

The reliance on text is independent of particular methodologies and indicates the potential value of improved text tools to the success of system and application development and maintenance.

Text display from descriptive markup

Software engineering is based on the principle that software development involves communication among humans at least as much as human-computer communication. Thus specifications are developed to record system requirements, system design, and software functions [Parnas90]; and coding practice emphasizes readability-in-the-small (through mnemonic identifiers, code layout, inclusion of commentary, etc.) and readability-in-the-large (through modularity). To aid such communication, a display tool should produce not only legible output, but more importantly high quality output; it should be robust under scrolling, browsing, and updating; it should produce displays rapidly; and it should provide simple specification of displays.

In previous work we have dealt extensively with descriptive mark-up (such as SGMLs tags [Goldfarb90]). As well as designing tag sets suitable for representing specific texts, we have been involved in standardization efforts (including the Text Encoding Initiative [Sperberg-McQueen90]). We have also built a novel interactive text formatter, known as Lector [Raymond90]. Tags encountered in the text serve as triggers for activating or altering the display of the text according to a user-defined, but non-procedural, specification. Through menu options, users of Lector can browse through the text and choose alternative views, which have typically been designed to suppress and/or highlight certain text components.

Despite its success, Lector is limited: it has a restricted notion of "tag" and a limited layout capability. We wish to explore the extent to which we can extend the functionality of a tool such as Lector while, at the same time, not losing its simplicity and speed.

Lector has two naturally separable phases: a transducer phase and a display phase. In the transducer phase, the source text is tokenized and the tokens are mapped to procedurally marked-up intermediate text. In the display phase the intermediate text is displayed, using the appropriate device driver, on a specific device. It is well-accepted in the electronic publishing community that device-independence is worth the extra effort. The approach raises three problems: we need to identify what tags are (and thus how we are going to generate the tokenizer); we need to design the intermediate procedural mark-up language, powerful enough to support innovative visualization of text and its structure; and we need a mechanism to specify which particular actions in the language to associate with specific tags.

We have discovered that we need to be able to trigger display activity not only statically by previously introduced tags, as is done so far, but also dynamically, by, for example, asking that all uses of some program variable be highlighted. The first and most elegant solution is to embed new tags dynamically in the text and to introduce new specifications dynamically. The second approach is to change the tokenizer dynamically to recognize the queried objects as tags, and the third approach is to use pointers into the text with temporary or default display specifications. The research/development problem here is: Which of these is best?

Specifying which actions to associate with specific tags in a specific view is similar to formulating a request to transform the document. We will determine how to increase the flexibility of display specification techniques, for example, to include more localized controls for the amount of expansion for any piece of text (including provisions for generalizations of "fish-eye" views to support documents described by "nested includes," such as ones using #include in C. Finally we will increase the flexibility of the display system to provide for interactive creation and modification of display specifications and bindings (e.g., by manipulating a sample text).

Prototype solutions to these problems will be evaluated through several experiments designed by Arthur Ryman, at the IBM Lab [Ryman90]:

Compound documents

Document dispersion occurs whenever we have multiple versions of documents, multilingual documents, modular decomposition of documents, or security control over parts of a document. From this grows the need to visualize compound documents, or "hypertexts" [Conklin87, Tompa89a].

Given a compound document, we can either hide the fragments by displaying them in a linearized view, or we can reveal them through a piece-wise linear view in multiple panes. In either approach the fragments of an assembled document can be arranged in any order before being subjected to the type of view mechanism described above. The goal is to design and implement a visual query facility with adequate expressive power to permit users to explore a complex document using both structural and textual criteria, and to take advantage of Lector for sophisticated display of the results of a query. We will investigate this problem generally, as well as focusing on multilingual documents, such as a program, together with its specification in a formal language and its natural-language documentation.

Unlike traditional hypertext research, we believe that interrelationships among units of text cannot be defined independently of their uses. For example, from one line in the source code for a function, a software engineer might want to link to the corresponding specification, to the target of a call, to the declaration of a variable, to the definition of a macro, to the body of an included file, and so on. Furthermore, the software engineer might wish to link to similar fragments of code, where the basis of similarity can be shared use of a function, variable, macro, etc. Thus, browsing from segment to segment requires dynamic link definition and resolution [Raymond88].

We will investigate how text-based links can be introduced to the system, including how the required mappings can be specified (e.g., mapping from a user-selected point in code to the string denoting a function call coded at that point, to the string denoting a corresponding function declaration, to the point in the text where the matching function declaration appears). In addition, we will investigate the application of conventional relational technology to record and maintain such links when they can be pre-computed.

The experiments designed by Ryman will used to evaluate these aspects as well:

Conclusions

The major direction of the first year is to modify existing tools in an attempt to meet the demands of uses in software engineering. Thereafter, shortcomings will be addressed, likely requiring significant redesign in our approach to text display.

The final products of this research will be

References

[Backus78] J. Backus, "Can programming be liberated from the von Neumann style? a functional style and its algebra of programs," Comm. ACM 21,8 (Aug. 1978) 613-641.

[Baecker86] R. Baecker and A. Marcus, "Design principles for the enhanced presentation of computer program source text," Proceedings CHI '86 Conference on Human Factors in Computing Systems, Boston, MA (April 13-17, 1986) 51-58.

[Berg88] D.L. Berg, G.H. Gonnet, and F.W. Tompa, "The new Oxford English Dictionary project at the University of Waterloo," A. Zampolli (ed.), 1988.

[Blake91] G.E. Blake and T. Bray, "Shortening the OED: experience with a grammar-defined database," Tech. Rept., Waterloo Centre for the New OED and Text Research, in preparation.

[Clocksin81] W.F. Clocksin and C.S. Mellish, Programming in Prolog, Springer-Verlag, New York, 1981.

[Conklin87] J. Conklin, "Hypertext: an introduction and survey," IEEE Computer 20,9 (Sept. 1987) 17-41.

[Goldfarb90] C. F. Goldfarb, The SGML Handbook, Oxford University Press, Oxford, 1990.

[Gonnet87] G.H. Gonnet and F.W. Tompa, "Mind your grammar: a new approach to modelling text" Very Large Data Bases (VLDB), Vol. 13 (September 1987), 339-346.

[Gonnet91] G.H. Gonnet, R.A. Baeza-Yates, and T. Snider, "Lexicographical indices for text: inverted files vs. Pat trees," Tech. Rept. OED-91-01, Waterloo Centre for the New OED and Text Research.

[Green84] T.R.G. Green and A.J. Cornah, "The programmer's torch," Interact '84  First IFIP Conference on Human-Computer Interaction, London, England, (September 4-7, 1984) 119-124.

[Horwitz85] S. Horwitz and T. Teitelbaum, "Relations and attributes: a symbiotic basis for editing environments," SIGPLAN '85, Seattle, WA, (June 25-28, 1985) 93-106.

[Jensen79] K. Jensen and N. Wirth, Pascal User Manual and Report (2nd edn.), Springer-Verlag, New York, N.Y.,1979.

[Kazman86] F.N. Kazman, "Structuring the text of the Oxford English Dictionary through finite state transduction," MMath. Thesis, Univ. of Waterloo, 1986.

[Knuth83] D. E. Knuth, "Literate programming," Tech. Rept. STAN-CS-82-981, Dept. of Computer Science, Stanford University (September 1983).

[McGregor90] J. D. McGregor and T. Korson (eds.), "Special Issue: Object-Oriented Design," Comm. ACM 33,9 (September 1990).

[Murphy89] C. Murphy, "Caught in the web of bytes," The Atlantic 263,2 (Feb. 1989) 68-70.

[Parnas90] D. L. Parnas, G. J. K. Asmis, and J. Madey, "Assessment of safety-critical software," Tech. Rept. 90-295, Dept. of Computing and Information Science, Queen's Univ. (December 1990).

[Raymond88] D.R. Raymond and F.W.Tompa, "Hypertext and the Oxford English Dictionary" Comm. ACM Vol. 31, No. 7 (July 1988), 871-879. (Also appeared as electronic publications for Hypercard, Hyperties, and Guide hypertext systems.)

[Raymond89] D.R. Raymond, A.J. Cañas, F.W. Tompa, F. Safayeni, "Measuring the effectiveness of personal database structures," Int. J. Man-Machine Studies 31, (Sept. 1989) 237-256.

[Raymond90] D.R. Raymond, "Lector  an interactive formatter for tagged text," Tech. Rept. OED-90-02, Waterloo Centre for the New OED and Text Research.

[Ryman90] A. G. Ryman, "Lector experiments," unpublished document, IBM Canada Laboratory, Nov. 29, 1990.

[Sperberg-McQueen90] C. M. Sperberg-McQueen and L. Burnard (eds.), "Guidelines for the encoding and exchange of machine-readable texts, version 1.0" Tech. Rept. TEI-P1, Text Encoding Initiative (July 15, 1990).

[Teitelbaum81] T. Teitelbaum and T. Reps, "The Cornell Program Synthesizer: a syntax-directed programming environment," Comm. ACM 24,9 (September 1981) 563-573.

[Tompa89a] F.W.Tompa, "A data model for flexible hypertext database systems," Trans. Office Information Systems 7,1 (January 1989), 85-100.

[Tompa89b] F.W. Tompa, "What is (tagged) text?" Dictionaries in the Electronic Age, Proc. 5th Conf. of Univ. of Waterloo Centre for the New OED (September 1989) 81-93.

[Tompa89c] F.W. Tompa, G.H. Gonnet, and T. Bray, "CRD-0039063: Text-dominated databases  final report," Unpublished report to NSERC, Dept. of Computer Science, Univ. Waterloo, December 1989.

[Tompa91] F.W. Tompa and D.R.Raymond, "Database design for a dynamic dictionary," Research in Humanities Computing, accepted to appear 1991, 324-336.

[Weiner85] E.S.C. Weiner, "Computerizing the Oxford English Dictionary," Scholarly Publishing 16.3 (1985) 239-253.