Daniel M. Berry
Cheriton School of Computer Science
University of Waterloo
Waterloo, ON, Canada
This talk notes the advanced state of the natural language (NL) processing art and considers four broad categories of tools for processing NL requirements documents. These tools are used in a variety of scenarios. The strength of a tool for a NL processing task is traditionally measured by its recall, precision, and their simple harmonic mean, the F-measure.
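As a minimal sketch of the three measures just named, the following computes precision, recall, and the unweighted F-measure for a tool's output against a known-correct answer set. The function name `precision_recall_f1` and the requirement identifiers are hypothetical, chosen only for illustration.

```python
def precision_recall_f1(retrieved, relevant):
    """Return (precision, recall, F1) for a tool's output.

    retrieved: items the tool reported (hypothetical example data below)
    relevant:  items a perfect analysis would have reported
    """
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)

    # Precision: fraction of reported items that are correct.
    precision = true_positives / len(retrieved) if retrieved else 0.0
    # Recall: fraction of correct items that were reported.
    recall = true_positives / len(relevant) if relevant else 0.0

    # F1: harmonic mean of precision and recall (0 if both are 0).
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical data: requirement IDs that truly contain a defect,
# and the IDs that an imagined tool flagged.
relevant = {"R1", "R4", "R7", "R9"}
retrieved = {"R1", "R2", "R4", "R5", "R7"}

precision, recall, f1 = precision_recall_f1(retrieved, relevant)
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```

Here the imagined tool finds three of the four true items, so one defect would still have to be found by hand.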
A hairy requirements or software engineering task involving NL documents is one that is not inherently difficult for NL-understanding humans on a small scale but becomes unmanageable on a large scale, such as occurs in industrial software development projects. A hairy task demands tool assistance. Because humans need far more help in carrying out a hairy task completely than they do in making the local yes-or-no decisions, a tool for a hairy task should have as close to 100% recall as possible, even at the expense of low precision. A tool that falls short of 100% recall may even be useless, e.g., when the software involved has high-dependability requirements, because to find the missing information, a human has to do the entire task manually anyway. Any such tool based on NL processing techniques inherently fails to achieve 100% recall, because even the best parsers are no more than 91% correct. Therefore, a tool for a hairy task that is to achieve 100% recall needs to be based on something other than traditional NLP. Perhaps a dumb, clerical tool doing an identifiable part of such a task may be better than an intelligent tool trying but failing in unidentifiable ways to do the entire task.
The reality is that a tool's achieving exactly 100% recall, which may be impossible anyway, may not be necessary. It suffices for a human working with the tool on a task to achieve better recall than a human working on the task entirely manually.
This talk describes research whose goal is to discover and test a variety of non-traditional approaches to building tools for hairy tasks to see which, if any, allows a human working with the tool to achieve better recall than a human working entirely manually. Among the early results are (1) some advice about the correct balance between recall and precision and the resulting weighted F-measure to use to evaluate tools for hairy tasks, and (2) the introduction of a new measure, summarization.
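To illustrate how a weighted F-measure can change which tool looks better, here is a small sketch using the standard F-beta formula, in which beta > 1 weights recall more heavily than precision. The value beta = 5 and the two tools' scores are assumptions made up for this example; the abstract does not state which weighting the research recommends.

```python
def f_beta(precision, recall, beta=1.0):
    """Standard weighted F-measure.

    beta > 1 favors recall, beta < 1 favors precision,
    and beta == 1 gives the ordinary harmonic mean (F1).
    """
    denominator = beta**2 * precision + recall
    if denominator == 0:
        return 0.0
    return (1 + beta**2) * precision * recall / denominator


# Hypothetical tools (numbers invented for illustration only):
# tool A reports nearly everything: full recall, poor precision.
# tool B is selective: good precision, but misses 40% of the items.
tool_a = {"precision": 0.2, "recall": 1.0}
tool_b = {"precision": 0.9, "recall": 0.6}

f1_a = f_beta(tool_a["precision"], tool_a["recall"], beta=1)
f1_b = f_beta(tool_b["precision"], tool_b["recall"], beta=1)

# beta = 5 is an assumed weighting, not a value taken from the talk.
f5_a = f_beta(tool_a["precision"], tool_a["recall"], beta=5)
f5_b = f_beta(tool_b["precision"], tool_b["recall"], beta=5)

print(f"F1: A={f1_a:.3f} B={f1_b:.3f}")
print(f"F5: A={f5_a:.3f} B={f5_b:.3f}")
```

Under the unweighted measure the selective tool B scores higher, while under the recall-weighted measure the high-recall tool A does, which is the kind of reversal that matters when missed items are costly.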
Joint work with Ricardo Gacitua, Pete Sawyer, and Sri Fatimah Tjong