Please note: This master’s research paper presentation will take place in DC 3102.
Yuchen Pan, Master’s candidate
David R. Cheriton School of Computer Science
Supervisor: Professor Meng Xu
Machine Learning (ML)-based vulnerability detection has become increasingly important as software systems grow in complexity. However, existing function-level approaches are often hindered by the substantial noise present in publicly available datasets, which arises from automated data collection methods. This paper addresses this challenge by proposing the Uniform Positive Loss Adjustment (UPLA) method, which adjusts the loss for positively labeled data during training to mitigate the influence of mislabeled samples. Additionally, we explore Per-CWE training, where separate models are trained for distinct categories of vulnerabilities based on the Common Weakness Enumeration (CWE) system.
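As a rough illustration of the idea (and not the exact formulation used in the paper), the sketch below assumes the adjustment amounts to uniformly re-weighting the loss contribution of positively labeled samples in a binary cross-entropy objective; the function name and the pos_scale hyperparameter are hypothetical.

    import torch
    import torch.nn.functional as F

    def positive_adjusted_bce(logits, labels, pos_scale=0.7):
        """Binary cross-entropy in which the loss of positively labeled
        samples is uniformly scaled, reflecting the assumption that some
        positive (vulnerable) labels are noisy. pos_scale is a hypothetical
        hyperparameter in (0, 1]."""
        per_sample = F.binary_cross_entropy_with_logits(
            logits, labels.float(), reduction="none"
        )
        # Scale the loss of samples labeled vulnerable (label == 1);
        # negatively labeled samples keep their full loss.
        scale = torch.where(labels == 1,
                            torch.full_like(per_sample, pos_scale),
                            torch.ones_like(per_sample))
        return (scale * per_sample).mean()

    # Example usage with hypothetical shapes:
    logits = torch.randn(8)             # model outputs for 8 functions
    labels = torch.randint(0, 2, (8,))  # 1 = labeled vulnerable
    loss = positive_adjusted_bce(logits, labels, pos_scale=0.7)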
We evaluate the effectiveness of UPLA and Per-CWE training under various data compositions on the BigVul dataset. Results show that UPLA consistently improves performance metrics such as the F1 score and the Matthews correlation coefficient (MCC) compared to baseline methods. Per-CWE training does not outperform general-purpose models in our experiments, and we observe that its performance deteriorates when data is scarce. Moreover, we emphasize the importance of including the post-fix versions of functions modified in vulnerability-fix commits (F2 functions) in datasets to avoid overestimating model performance. These findings provide insights into mitigating data quality issues and improving the training of machine learning models for function-level vulnerability detection.