Towards Improving Statistical Modelling of Software Engineering Data: Think Locally, Act Globally

Authors -

Nicolas Bettenburg, Meiyappan Nagappan and Ahmed E. Hassan

Venue -

Empirical Software Engineering, Accepted November 27 2013

Abstract -

Much research in software engineering (SE) is focused on modeling data collected from software repositories. Insights gained over the last decade suggest that such datasets contain a high amount of variability in the data. Such variability has a detrimental effect on model quality, as suggested by recent research. In this paper, we propose to split the data into smaller homogeneous subsets and learn sets of individual statistical models, one for each subset, as a way around the high variability in such data. Our case study on a variety of SE datasets demonstrates that such local models can significantly outperform traditional models with respect to model fit and predictive performance. However, we find that analysts need to be aware of potential pitfalls when building local models. First, the choice of clustering algorithm and its parameters can have a substantial impact on model quality. Second, the data being modeled needs to have enough variability to take full advantage of local modeling. For example, our case study on social data shows no advantage of local over global modeling, as clustering fails to derive appropriate subsets. Lastly, the interpretation of local models can become very complex when there is a large number of variables or data subsets. Overall, we find that a hybrid approach between local and traditional global modeling, such as Multivariate Adaptive Regression Splines (MARS), combines the best of both worlds. MARS models are non-parametric and thus do not require prior calibration of parameters, are easily interpretable by analysts, and outperform both local and traditional models out of the box in four out of five datasets in our case study.
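
The contrast between global and local modeling described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the abstract does not name a clustering algorithm or model family, so a one-dimensional two-cluster k-means and ordinary least squares stand in for them, the data is synthetic, and the helper names (`fit_line`, `sse`, `two_means`, `assign`) are hypothetical.

```python
from statistics import mean

def fit_line(points):
    """Ordinary least squares for y = a + b*x over (x, y) pairs."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in points) / sxx if sxx else 0.0
    return my - b * mx, b  # (intercept, slope)

def sse(points, model):
    """Sum of squared residuals of a fitted line on the given points."""
    a, b = model
    return sum((y - (a + b * x)) ** 2 for x, y in points)

def assign(x, centers):
    """Index of the nearer of the two cluster centers."""
    return 0 if abs(x - centers[0]) <= abs(x - centers[1]) else 1

def two_means(points, iterations=20):
    """1-D k-means with k=2 on x (a stand-in clustering step)."""
    xs = [x for x, _ in points]
    centers = [min(xs), max(xs)]
    for _ in range(iterations):
        groups = [[], []]
        for x in xs:
            groups[assign(x, centers)].append(x)
        # Keep the old center if a cluster happens to be empty.
        centers = [mean(g) if g else c for g, c in zip(groups, centers)]
    return centers

# Synthetic data with two regimes that a single line cannot describe.
data = [(x, 2.0 * x) for x in range(0, 10)]
data += [(x, 100.0 - 3.0 * x) for x in range(20, 30)]

# Global model: one regression over all observations.
global_error = sse(data, fit_line(data))

# Local models: cluster first, then fit one regression per subset.
centers = two_means(data)
subsets = [[p for p in data if assign(p[0], centers) == k] for k in (0, 1)]
local_models = [fit_line(s) for s in subsets]
local_error = sum(sse(s, m) for s, m in zip(subsets, local_models))

print("global SSE:", global_error)
print("local SSE:", local_error)
```

On data like this, where the subsets follow clearly different trends, the per-subset fits leave far smaller residuals than the single global fit. The sketch also hints at the first pitfall named in the abstract: the result depends on the clustering step finding suitable subsets.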

BibTeX -

 author = {Bettenburg, Nicolas and Nagappan, Meiyappan and Hassan, Ahmed E.},
 keyword = {Defect Prediction},
 title = {Towards Improving Statistical Modelling of Software Engineering Data: Think Locally, Act Globally},
 type = {journal},
 venue = {Empirical Software Engineering, Accepted November 27 2013}