Please note: This master’s thesis presentation will take place in DC 2314 and online.
Yelizaveta Brus, Master’s candidate
David R. Cheriton School of Computer Science
Supervisor: Professor Shane McIntosh
Continuous Integration (CI) is a critical component of modern software development, automating the process of downloading, compiling, and testing patch sets. Unsurprisingly, CI builds periodically fail due to non-deterministic (a.k.a. “flaky”) behavior. Since a patch set may not be the cause of a flaky failure, developers may issue a “recheck” command asking the CI to re-execute the build for that patch set. While necessary in some cases, prior work shows that rechecks are often issued carelessly, leading to substantial waste: in the OpenStack community alone, an estimated 187.4 compute years have been wasted on rechecks that resulted in repeated failures. As software development scales, reducing wasteful rechecks is essential for improving resource utilization, reducing operational costs, and enhancing developer productivity.
To mitigate unnecessary rechecks, I fit and analyze statistical models that discriminate between recheck requests where a failing outcome will (a) change to passing (i.e., successful rechecks) or (b) continue to fail (i.e., failed rechecks). My empirical study is based on 314,947 recheck requests collected from OpenStack over a 10-year period. I extract and analyze features related to bot behavior, job outcomes, user activity, patch characteristics, and the timing of rechecks. Using logistic regression with restricted cubic splines, I model nonlinear relationships and evaluate predictive performance using AUROC, AUPRC, and Brier score. My model achieves an AUROC of 0.736, outperforming baseline approaches by 23.6 percentage points (or 47.2%), while maintaining strong calibration with a Brier score of 0.191. The model also produces an AUPRC of 0.604, exceeding the baseline performance by 73%. If the requests that my model identifies as failed rechecks were skipped, 86.49% of wasted recheck requests could have been avoided, saving substantial CI resources.
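For readers unfamiliar with the evaluation metrics, a minimal sketch may help. The snippet below (toy predictions invented for illustration, not the thesis data or code) computes AUROC as the probability that a randomly chosen positive is ranked above a randomly chosen negative, and the Brier score as the mean squared error of the predicted probabilities:

```python
def auroc(y_true, y_score):
    """AUROC: probability that a random positive example is scored
    above a random negative example (ties count as 0.5)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def brier(y_true, y_score):
    """Brier score: mean squared error of predicted probabilities
    against the 0/1 outcomes (lower is better calibrated)."""
    return sum((s - y) ** 2 for y, s in zip(y_true, y_score)) / len(y_true)

# Hypothetical recheck outcomes: 1 = successful recheck, 0 = repeated failure.
y = [1, 0, 1, 1, 0, 0]
p = [0.8, 0.3, 0.6, 0.7, 0.4, 0.2]
print(auroc(y, p))            # 1.0 (every positive outranks every negative)
print(round(brier(y, p), 4))  # 0.0967
```

A random classifier scores an AUROC of 0.5 on balanced data, which is why a model AUROC of 0.736 represents a meaningful lift over naive baselines.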
My analysis reveals that the historical success rates of jobs, bots, and users are the strongest predictors of recheck outcomes. Feature importance analysis shows that the “jobs success ratio” and “bots success ratio” together account for 50% of the explanatory power, indicating that certain jobs and bots are more prone to triggering unnecessary rechecks. In contrast, static patch characteristics, such as the number of modified lines or affected files, contribute minimally to predictive performance. I also investigate the impact of different time windows for feature computation, finding that model performance remains stable regardless of the amount of historical data used. AUROC varies by only 0.59% between a one-day and an all-time window, indicating that both short-term and long-term data can be effectively used to predict recheck outcomes.
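A windowed success-ratio feature of this kind can be sketched as follows. This is a simplified illustration with a hypothetical build-log schema and job name, not the thesis pipeline; the fallback prior of 0.5 for keys with no history is also an assumption made for the example:

```python
from datetime import datetime, timedelta

def success_ratio(history, key, now, window_days=None):
    """Fraction of past outcomes for `key` (e.g., a job or bot name)
    that passed, optionally restricted to a trailing time window.
    Returns an uninformative prior of 0.5 when there is no history."""
    cutoff = now - timedelta(days=window_days) if window_days else None
    past = [ok for t, k, ok in history
            if k == key and t < now and (cutoff is None or t >= cutoff)]
    return sum(past) / len(past) if past else 0.5

# Hypothetical build log: (timestamp, job name, passed?)
log = [
    (datetime(2024, 1, 1), "tempest-full", True),
    (datetime(2024, 1, 2), "tempest-full", False),
    (datetime(2024, 1, 8), "tempest-full", True),
]
now = datetime(2024, 1, 9)
print(success_ratio(log, "tempest-full", now))                 # all-time: 2/3
print(success_ratio(log, "tempest-full", now, window_days=1))  # one-day: 1.0
```

The `window_days` parameter corresponds to the time windows studied above: the finding that AUROC barely changes between a one-day and an all-time window suggests either choice of cutoff yields similarly informative features.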
Guided by this analysis, I suggest practical steps to improve CI efficiency. Since authors have little control over the historical behavior features that drive the model's predictions, its feedback may be disheartening for individual contributors who wish to avoid issuing useless rechecks. Instead of applying model insights at the patch-set level, my findings indicate that misbehaving bots could be throttled to limit excessive rechecks, unreliable jobs could have their voting power revoked, and automated feedback mechanisms could be introduced to discourage wasteful recheck requests. Additionally, organizations could integrate model-driven insights into CI dashboards, allowing teams to monitor inefficiencies and take corrective action in real time. By focusing on process-level improvements rather than individual developer actions, CI workflows can be optimized to reduce resource waste, accelerate build turnaround times, and enhance the overall developer experience.
To attend this master’s thesis presentation in person, please go to DC 2314. You can also attend virtually on Zoom.