Please note: This PhD defence will take place online. Also note the atypical start time — 9:00 p.m.
Yongqiang Tian, PhD candidate
David R. Cheriton School of Computer Science
Supervisor: Professor Chengnian Sun
Deep Learning (DL) applications are widely deployed in diverse areas such as image classification, natural language processing, and autonomous driving. Although these applications achieve outstanding performance on metrics such as accuracy, developers have raised strong concerns about their reliability, since the logic of a DL application is a black box to humans. Specifically, DL applications learn their logic during stochastic training and encode it in the high-dimensional weights of DL models. Unlike the source code of conventional software, such weights are infeasible for humans to directly interpret, examine, and validate. As a result, reliability issues in DL applications are hard to detect and may cause catastrophic accidents in safety-critical missions. Therefore, it is critical to adequately assess the reliability of DL applications.
This thesis aims to help software developers assess the reliability of DL applications from the following three perspectives.
The first study proposes object-relevancy, a property that reliable DL-based image classifiers should comply with: the classification result should be based on features relevant to the target object in a given image, rather than on irrelevant features such as the background. This study further proposes an automatic approach, based on two metamorphic relations, to assess whether this property is violated in image classification results. The evaluation shows that the proposed approach can effectively detect unreliable inferences that violate the object-relevancy property, with an average precision of 64.1% and 96.4% for the two relations, respectively. A subsequent empirical study reveals that such unreliable inferences are prevalent in the real world and that existing training strategies cannot effectively mitigate this issue.
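The abstract does not spell out the two metamorphic relations, but the idea of checking object-relevancy can be illustrated with a minimal sketch of one plausible background-perturbation check. This is an assumption-laden illustration, not the thesis's actual approach: the classify callable and the object_mask segmentation are hypothetical inputs assumed to be available.

```python
import numpy as np

def violates_object_relevancy(classify, image, object_mask, rng=None):
    """Sketch of one plausible metamorphic check (hypothetical, not the
    thesis's actual relations): replacing everything outside the object
    with random noise should not change the predicted label.

    classify    -- callable mapping an HxWx3 uint8 array to a class label
    image       -- HxWx3 uint8 array
    object_mask -- HxW boolean array, True where the target object is
    """
    rng = rng or np.random.default_rng(0)
    original_label = classify(image)

    # Follow-up input: keep the object pixels, randomize the background.
    noise = rng.integers(0, 256, size=image.shape, dtype=np.uint8)
    mutated = np.where(object_mask[..., None], image, noise)
    mutated_label = classify(mutated)

    # If the label flips although the object itself is untouched, the
    # prediction likely relied on irrelevant (background) features.
    return original_label != mutated_label
```

Under this relation, an inference that changes its label when only the background changes would be flagged as unreliable.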
The second study concentrates on reliability issues induced by DL model compression. Model compression can significantly reduce the size of Deep Neural Network (DNN) models and thus facilitates the deployment of sophisticated, sizable DNN models. However, the prediction results of a compressed model may deviate from those of its original model, making the deployed DL application unreliable. To help developers thoroughly assess the impact of model compression, it is essential to test these models and find any deviated behaviors before dissemination. This study proposes DFLARE, a novel search-based, black-box testing technique. The evaluation shows that DFLARE consistently outperforms the baseline in both efficacy and efficiency. More importantly, the triggering inputs found by DFLARE can be used to repair up to 48.48% of deviated behaviors.
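The core task, finding an input on which the original and compressed models disagree, can be sketched with a naive random mutation loop. This is only a simplified stand-in for a search-based technique; DFLARE's actual search strategy and fitness guidance are not shown here, and original_predict, compressed_predict, and the mutation parameters are hypothetical.

```python
import numpy as np

def find_triggering_input(original_predict, compressed_predict, seed,
                          budget=1000, step=4, rng=None):
    """Naive sketch (not DFLARE itself): randomly mutate a seed input until
    the original and compressed models disagree, or the budget runs out.

    original_predict / compressed_predict -- callables mapping an input
        array to a class label (thin wrappers around the two models)
    seed -- starting input as a uint8 numpy array
    """
    rng = rng or np.random.default_rng(0)
    current = seed.copy()
    for _ in range(budget):
        # Apply a small random perturbation and clip to the valid range.
        delta = rng.integers(-step, step + 1, size=current.shape)
        candidate = np.clip(current.astype(int) + delta, 0, 255).astype(np.uint8)
        if original_predict(candidate) != compressed_predict(candidate):
            return candidate      # triggering input exposing a deviated behavior
        current = candidate       # continue searching from the mutated input
    return None                   # no deviation found within the budget
```

A guided search replaces the blind acceptance of every mutation with a fitness function that steers mutations toward disagreement, which is what makes such techniques efficient in practice.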
The third study reveals the unreliable assessment of DL-based Program Generators (DLGs) in compiler testing. To test compilers effectively, DLGs have been proposed to automatically generate massive numbers of test programs. However, after a thorough analysis of the characteristics of DLGs, this study finds that the assessment of these DLGs is unfair and unreliable, since the chosen baselines, i.e., Language-Specific Program Generators (LSGs), differ from DLGs in many aspects. This study therefore proposes Kitten, a simple, fair, non-DL-based baseline for DLGs. The experiments show that DLGs cannot compete with even such a simple baseline, and that the claimed advantages of DLGs are likely due to the biased selection of baselines. Specifically, in 72 hours of testing on GCC, Kitten triggers 1,750 hang bugs and 34 distinct crashes, while the state-of-the-art DLG triggers only 3 hang bugs and 1 distinct crash. Moreover, the code coverage achieved by Kitten is at least twice that achieved by the state-of-the-art DLG.
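The hang and crash counts above come from feeding generated programs to the compiler and watching for timeouts and internal failures. A minimal harness of that kind can be sketched as follows; the generate_program callable is a hypothetical placeholder for a generator such as Kitten, and the optimization levels and timeout are illustrative choices, not the thesis's experimental setup.

```python
import os
import subprocess
import tempfile

def test_compiler_once(generate_program, compiler="gcc", timeout=30):
    """Compile one generated program at two optimization levels and report
    a hang (timeout) or a crash (compiler killed or internal compiler error).

    generate_program -- callable returning C source code as a string
                        (stands in for a program generator such as Kitten)
    """
    source = generate_program()
    with tempfile.TemporaryDirectory() as tmp:
        src_path = os.path.join(tmp, "test.c")
        with open(src_path, "w") as f:
            f.write(source)
        for opt in ("-O0", "-O3"):
            cmd = [compiler, opt, "-c", src_path, "-o", os.devnull]
            try:
                result = subprocess.run(cmd, capture_output=True, timeout=timeout)
            except subprocess.TimeoutExpired:
                return ("hang", opt)    # compiler did not finish in time
            if result.returncode < 0 or b"internal compiler error" in result.stderr:
                return ("crash", opt)   # killed by a signal or reported an ICE
    return ("ok", None)
```

Running such a harness in a loop for a fixed time budget, with crash deduplication, is the usual way hang and distinct-crash counts like those reported above are obtained.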