Master’s Thesis Presentation • Software Engineering — On the Effectiveness of Incremental Fact Extraction and Analysis | Cheriton School of Computer Science

Please note: This master’s thesis presentation will be given online.

Davood Anbarnam, Master’s candidate
David R. Cheriton School of Computer Science

Today’s software projects can be huge. They often consist of millions of lines of code, have multiple teams working on them and are constantly evolving. It is no surprise then that developers sometimes seek the help of advanced diagnostic tooling, such as static analysis tools, to aid the development process, with many modern Integrated Development Environments (IDEs) such as Eclipse and Visual Studio providing such functionality out-of-the-box. Fact Extraction is one such static analysis technique that extracts a base model of the underlying software containing properties of the system entities (e.g., variables, functions, files/classes) and their relationships, and stores them in the form of a database of facts (factbase). This base model can then be queried and analysed by developers to reveal higher-level design information, such as dataflow between various modules of a software system.

Currently, approaches to building system models scale fairly well to large single systems; factbases can be created using time and resources comparable to that of the compilation process. However, software systems evolve over time, and these analyses need to be redone as the source code changes. While incremental compilation techniques have the potential to greatly reduce the time taken to rebuild the systems themselves, as yet there has been little research into tools that support incremental analysis of changing artifacts to produce revised factbases. This thesis proposes an extraction and analysis framework that is more amenable to creating models from changing artifacts: an Incremental Extraction and Analysis Framework. In particular, we focus on how changes at the file level — i.e., modifications to source files as well as the addition and removal of source files — can impact a previously extracted model and analysis.

We evaluate our proposed framework by performing a case study on a build of the Linux Kernel. First, we compare two approaches to extracting different versions of the build: one that uses a traditional approach and one that uses our proposed incremental extraction techniques. Then, we compare two approaches to analysing two different, extracted versions of the build: one that uses an incremental approach and one that does not. We found that significant performance improvements can be made in re-extracting a base model when using an incremental approach, with extraction times being reduced by at least 50%, while re-analysing a base model using an incremental approach grows linearly in terms of the number affected facts in the best case and follows an S-shaped growth in the worst case. We found that the cause for the observed exponential growth could be traced to a tractable subset of facts, rather than being the result of a gradual increase of an analysis’ search space.