Towards Improving Statistical Modelling of Software Engineering Data: Think Locally, Act Globally
Authors -
Nicolas Bettenburg,
Meiyappan Nagappan, and
Ahmed E. Hassan
Venue -
Empirical Software Engineering, Accepted November 27, 2013
Related Tags -
Abstract -
Much research in software engineering (SE) is focused on modeling
data collected from software repositories. Insights gained over the last decade
suggest that such datasets contain a high degree of variability.
Such variability has a detrimental effect on model quality, as suggested by
recent research. In this paper, we propose to split the data into smaller homogeneous
subsets and learn sets of individual statistical models, one for each
subset, as a way around the high variability in such data. Our case study on
a variety of SE datasets demonstrates that such local models can significantly
outperform traditional models with respect to model fit and predictive performance.
However, we find that analysts need to be aware of potential pitfalls
when building local models. First, the choice of clustering algorithm and its
parameters can have a substantial impact on model quality. Second, the data
being modeled needs to have enough variability to take full advantage of local
modeling. For example, our case study on social data shows no advantage
of local over global modeling, as clustering fails to derive appropriate subsets.
Lastly, the interpretation of local models can become very complex when there
is a large number of variables or data subsets. Overall, we find that a hybrid
approach between local and traditional global modeling, such as Multivariate
Adaptive Regression Splines (MARS), combines the best of both worlds.
MARS models are non-parametric and thus do not require prior calibration of
parameters, are easily interpretable by analysts, and outperform both local
and traditional models out of the box in four out of five datasets in our case
study.
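The contrast between global and local modeling described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the synthetic data, the one-dimensional k-means, and the helper names (`fit_line`, `sse`, `kmeans_1d`) are all assumptions made for illustration. It fits one linear model to all of the data, then one model per cluster, and compares the fit.

```python
from statistics import mean

def fit_line(points):
    """Ordinary least squares for y = a + b*x on a list of (x, y) pairs."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    slope = sxy / sxx
    return my - slope * mx, slope  # (intercept, slope)

def sse(points, model):
    """Sum of squared errors of a fitted (intercept, slope) model."""
    a, b = model
    return sum((y - (a + b * x)) ** 2 for x, y in points)

def kmeans_1d(points, k=2, iters=20):
    """Tiny k-means on the x value only (k >= 2); returns k clusters."""
    xs = sorted(x for x, _ in points)
    # Spread the initial centers evenly across the sorted x values.
    centers = [xs[i * (len(xs) - 1) // (k - 1)] for i in range(k)]
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: abs(p[0] - centers[c]))
            clusters[nearest].append(p)
        centers = [mean([x for x, _ in c]) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return clusters

def wobble(x):
    """Small deterministic perturbation standing in for noise."""
    return 0.5 if x % 2 else -0.5

# Hypothetical data with two regimes that follow different linear trends.
data = [(x, 1 + 2 * x + wobble(x)) for x in range(0, 6)]
data += [(x, 40 - 3 * x + wobble(x)) for x in range(10, 16)]

# Global model: a single regression over the whole dataset.
global_sse = sse(data, fit_line(data))

# Local models: cluster first, then fit one regression per cluster.
clusters = kmeans_1d(data, k=2)
local_sse = sum(sse(c, fit_line(c)) for c in clusters)

print(f"global SSE: {global_sse:.2f}, local SSE: {local_sse:.2f}")
```

On this toy data the two clusters coincide with the two regimes, so the local models fit far better than the global one. As the abstract cautions, that advantage depends on the clustering step finding meaningful subsets, which is not guaranteed on real data.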
Preprint -
PDF
BibTex -
@article{Bettenburg2013,
author = {Bettenburg, Nicolas and Nagappan, Meiyappan and Hassan, Ahmed E.},
keyword = {Defect Prediction},
title = {Towards Improving Statistical Modelling of Software Engineering Data: Think Locally, Act Globally},
journal = {Empirical Software Engineering},
note = {Accepted November 27, 2013}
}