Please note: This PhD defence will take place in DC 2310.
Rafael F. Toledo, PhD candidate
David R. Cheriton School of Computer Science
Supervisor: Professor Jo Atlee
Software engineers dedicate significant time and effort to debugging, analyzing, and understanding large, complex software. Such systems can comprise millions of lines of code that implement the program behaviour. When working on such maintenance tasks, the engineer needs to examine the code involved to understand exactly how the program’s behaviour is implemented before they can make any changes or fixes. Depending on the complexity of the program behaviour, the engineer may have to navigate dozens of lines of code scattered across multiple files to comprehend a single instance of the analysis results. During this code navigation, they pose program-comprehension questions that guide the building of a mental model of the program’s behaviour. It is well known that answering such questions can be time-consuming, error-prone, and cognitively demanding. These risks and demands increase with the complexity of the software under study, for example, when the software is a software product line (SPL), that is, a family of related software product variants (e.g., different models of cellphones or vehicles sold by the same company).
Many of the above complexities can be addressed by working with a model of code because models are abstractions that are generally smaller, simpler, and more amenable to automated analyses. A software fact-based model is a collection of program facts that reflect the properties and behaviour of a software system. Program facts include source code entities (e.g., variables, functions), their attributes (e.g., names, source file), and their relationships (e.g., function calls, class inheritance). Program facts can be automatically extracted from source code with an enhanced parser, and the facts can be linked together into a fact-based model of the software system.
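As a rough illustration of this idea, a factbase can be sketched as a set of typed entities with attributes plus labelled relationships between them. This is a minimal sketch, not the extractor or schema used in the thesis: the `Entity` and `Factbase` classes, the `calls` label, and all function and file names are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Entity:
    """A source-code entity with its attributes (hypothetical schema)."""
    id: str
    kind: str          # e.g. "function" or "variable"
    name: str
    source_file: str


@dataclass
class Factbase:
    """A tiny in-memory fact-based model: entities plus labelled relationships."""
    entities: dict = field(default_factory=dict)
    relations: set = field(default_factory=set)   # (source id, label, target id)

    def add_entity(self, entity):
        self.entities[entity.id] = entity

    def add_relation(self, src, label, dst):
        self.relations.add((src, label, dst))

    def callers_of(self, function_id):
        """Answer 'which functions call this one?' from the facts alone."""
        return sorted(src for (src, label, dst) in self.relations
                      if label == "calls" and dst == function_id)


# Facts that an extractor might emit for a three-function program (made up).
fb = Factbase()
fb.add_entity(Entity("f1", "function", "main", "main.c"))
fb.add_entity(Entity("f2", "function", "init", "init.c"))
fb.add_entity(Entity("f3", "function", "log", "util.c"))
fb.add_relation("f1", "calls", "f2")
fb.add_relation("f1", "calls", "f3")
fb.add_relation("f2", "calls", "f3")

print(fb.callers_of("f3"))  # ['f1', 'f2']
```

The query is answered without opening any source file, which is the sense in which a model can be smaller and easier to analyze than the code it describes.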
The resulting collection of software facts represents the system’s properties and behaviour as a graph model that can be managed and queried using graph database technologies. Graph database systems and their native features enable efficient storage, querying, and visualization of the software fact-based model. Software queries and analyses can be expressed using the database’s query language. However, writing common queries from scratch can be repetitive and time-consuming, and, for large and complex queries, it can be error-prone. This thesis investigates whether fact-based software modelling and analysis can improve program comprehension of software systems, including variable systems.
This thesis makes three contributions: (1) identifying the program-comprehension questions that software fact-based models can support, (2) designing a query interface that facilitates posing program-comprehension questions and supports incremental exploration of query results, and (3) developing an efficient visual encoding of the results of queries on an SPL model.
First, we evaluated how well fact-based models can answer program-comprehension questions. Previous studies categorized program-comprehension questions, but primarily focused on code-based questions rather than model-based questions. We performed a literature review to identify program-comprehension questions that can be posed to fact-based models. We correlated engineers’ information needs with the information that fact-based models supply through a comprehensive analysis of previous works on program-comprehension questions and graph visualization. Finally, we demonstrated that 38 program-comprehension questions could be answered by a fact-based model by expressing them as Cypher queries over a Neo4j factbase.
Second, we studied how to improve the engineer’s experience in understanding program facts through program-comprehension query templates and follow-up queries. We extended Neo4j Browser to support initial program-comprehension queries and follow-up queries over fact-based model elements, giving users greater control and precision in their exploration of the model. We conducted a user study comparing our enhanced Neo4j Browser with a standard code editor; the study showed significant gains in users’ efficiency and reduced mental effort during program-comprehension tasks.
Finally, we studied how to improve an engineer’s comprehension of variable results from a fact-based analysis of an SPL. Analyzing an SPL model produces variable results, where each result may apply to some product variants and not others (e.g., if the analysis refers to feature-specific code). Variable analysis results are typically represented by annotating each result with a presence condition (PC), where the PC is a propositional formula that represents the product(s) for which the result holds. Thus, interpreting the variable analysis results of an SPL model involves determining the program variant (or group of variants) to which specific results apply, which can be error-prone and cognitively demanding. We developed ^Neo4j Browser, a modified version of Neo4j that provides features for filtering analysis results based on the feature configuration of SPL variants and highlighting the results associated with each filter. ^Neo4j Browser helps users to interpret variable results faster, more accurately, and with less mental effort.
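To make the notion of a presence condition concrete, the sketch below treats each PC as a predicate over a feature configuration and keeps only the results whose PC holds for a chosen variant. It is an illustrative assumption, not the tool's implementation: the feature names (`GPS`, `LOW_POWER`), the result labels, and the helper functions are all hypothetical.

```python
# Each analysis result carries a presence condition (PC): a propositional
# formula over features, modelled here as a predicate on the set of enabled
# features. Feature names and results are hypothetical.

def feature(name):
    """PC that holds when the named feature is enabled."""
    return lambda config: name in config


def pc_and(*pcs):
    """Conjunction of presence conditions."""
    return lambda config: all(pc(config) for pc in pcs)


def pc_not(pc):
    """Negation of a presence condition."""
    return lambda config: not pc(config)


results = [
    ("main calls init", lambda config: True),  # holds in every variant
    ("init calls gps_start", feature("GPS")),
    ("gps_start calls log",
     pc_and(feature("GPS"), pc_not(feature("LOW_POWER")))),
]


def filter_results(results, config):
    """Keep only the results whose PC holds for the given feature configuration."""
    return [label for (label, pc) in results if pc(config)]


print(filter_results(results, {"GPS"}))
print(filter_results(results, {"GPS", "LOW_POWER"}))
```

With only `GPS` enabled, all three results apply; enabling `LOW_POWER` as well drops the third. Working this out by hand for every result is the kind of manual reasoning that filtering by feature configuration is meant to spare the user.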