Empirical Analysis of Data Sampling-Based Ensemble Methods in Software Defect Prediction

Balogun, A.O. and Odejide, B.J. and Bajeh, A.O. and Alanamu, Z.O. and Usman-Hamza, F.E. and Adeleke, H.O. and Mabayoje, M.A. and Yusuff, S.R. (2022) Empirical Analysis of Data Sampling-Based Ensemble Methods in Software Defect Prediction. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 13381 . pp. 363-379.

Full text not available from this repository.
Official URL: https://www.scopus.com/inward/record.uri?eid=2-s2....

Abstract

This research work investigates the deployment of data sampling and ensemble techniques in alleviating the class imbalance problem in software defect prediction (SDP). Specifically, the effect of data sampling techniques on the performance of ensemble methods is investigated. The experiments were conducted using software defect datasets from the NASA software archives. Five data sampling methods (the over-sampling techniques SMOTE, ADASYN, and ROS, and the under-sampling techniques RUS and NearMiss) were combined with bagging and boosting ensemble methods based on Naïve Bayes (NB) and Decision Tree (DT) classifiers. Predictive performances of the developed models were assessed based on the area under the curve (AUC) and Matthews correlation coefficient (MCC) values. From the experimental findings, it was observed that the implementation of data sampling methods further enhanced the predictive performances of the experimented ensemble methods. Specifically, BoostedDT on the ROS-balanced datasets recorded the highest average AUC (0.995) and MCC (0.918) values. Aside from the NearMiss method, which worked best with the Bagging ensemble method, the other studied data sampling methods worked well with the Boosting ensemble technique. Also, some of the developed models, particularly BoostedDT, showed better prediction performance than existing SDP models. As a result, combining data sampling techniques with ensemble methods may not only improve SDP model prediction performance but also provide a plausible solution to the latent class imbalance issue in SDP processes. © 2022, The Author(s), under exclusive license to Springer Nature Switzerland AG.
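
To make two of the ingredients named in the abstract concrete, the following is a minimal sketch of random over-sampling (ROS) and of the MCC metric, written with the Python standard library only. It is not the authors' implementation, and the abstract does not say which tooling they used; the function names `random_oversample` and `mcc`, the `seed` and `positive` parameters, and the toy data are illustrative assumptions.

```python
import random
from collections import Counter
from math import sqrt


def random_oversample(X, y, seed=0):
    """Illustrative ROS: duplicate randomly chosen rows of each smaller
    class until every class has as many rows as the largest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, count in counts.items():
        # Rows of this class in the original (un-resampled) data.
        pool = [row for row, lab in zip(X, y) if lab == label]
        for _ in range(target - count):
            X_out.append(rng.choice(pool))
            y_out.append(label)
    return X_out, y_out


def mcc(y_true, y_pred, positive=1):
    """Matthews correlation coefficient for binary labels.
    Returns 0.0 when the denominator is zero (a common convention)."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


if __name__ == "__main__":
    # Toy imbalanced data: 8 non-defective (0) and 2 defective (1) modules.
    X = [[i] for i in range(10)]
    y = [0] * 8 + [1] * 2
    X_bal, y_bal = random_oversample(X, y)
    print(Counter(y_bal))
    print(mcc([1, 1, 0, 0], [1, 1, 0, 0]))
```

In an actual experiment, resampling of this kind would normally be applied to the training split only, so that duplicated rows do not leak into the evaluation data; whether and how the paper handles this is not stated in the abstract.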

Item Type: Article
Impact Factor: cited By 1
Uncontrolled Keywords: Barium compounds; Decision trees; Defects; Forecasting, Boosting ensembles; Class imbalance; Data sampling; Ensemble methods; Ensemble techniques; Near-misses; Predictive performance; Sampling method; Sampling technique; Software defect prediction, NASA
Depositing User: Mr Ahmad Suhairi Mohamed Lazim
Date Deposited: 12 Sep 2022 08:18
Last Modified: 12 Sep 2022 08:18
URI: http://scholars.utp.edu.my/id/eprint/33731
