Abstract: The software engineering discipline contains several prediction approaches such as test effort prediction, correction cost prediction, fault prediction, reusability prediction, security prediction, effort prediction, and quality prediction. However, most of these prediction approaches are still in a preliminary phase, and more research should be conducted to reach robust models. Software fault prediction is the most popular research area among these prediction approaches, and recently several research centers started new projects in this area. In this study, we investigated 90 software fault prediction papers published between 1990 and 2009 and categorized them according to publication year. This paper surveys the software engineering literature on software fault prediction, and both machine learning-based and statistics-based approaches are included in this survey.
Papers explained in this article reflect the outline of what was published so far, but naturally this is not a complete review of all the papers published so far. This paper will help researchers to investigate the previous studies from metrics, methods, datasets, performance evaluation metrics, and experimental results perspectives in an easy and effective manner. Furthermore, current trends are introduced and discussed.
Abstract: Noise detection for software measurement datasets is a topic of growing interest. The presence of class and attribute noise in software measurement datasets degrades the performance of machine learning-based classifiers, and the identification of these noisy modules improves the overall performance. In this study, we propose a noise detection algorithm based on software metrics threshold values. The threshold values are obtained from the Receiver Operating Characteristic (ROC) analysis. This paper focuses on case studies of five public NASA datasets and details the construction of Naive Bayes-based software fault prediction models both before and after applying the proposed noise detection algorithm. Experimental results show that this noise detection approach is very effective for detecting the class noise and that the performance of fault predictors using a Naive Bayes algorithm with a logNum filter improves if the class labels of identified noisy modules are corrected.
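The threshold-based noise detection idea described above can be illustrated with a minimal sketch. The abstract states only that thresholds come from ROC analysis; the specific cut-off rule used here (maximizing TPR minus FPR, i.e. Youden's J) and the flagging rule (a label that contradicts every metric threshold) are assumptions made for illustration, and the function names `roc_threshold` and `flag_noisy` are hypothetical.

```python
from typing import List, Sequence


def roc_threshold(values: Sequence[float], labels: Sequence[int]) -> float:
    """Pick a cut-off for one metric by scanning ROC operating points.

    Assumption: the chosen point maximizes TPR - FPR (Youden's J).
    A module counts as predicted faulty when its value >= cut-off.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one faulty and one non-faulty module")
    best_cut, best_j = None, float("-inf")
    for cut in sorted(set(values)):
        tp = sum(1 for v, y in zip(values, labels) if v >= cut and y == 1)
        fp = sum(1 for v, y in zip(values, labels) if v >= cut and y == 0)
        j = tp / positives - fp / negatives
        if j > best_j:  # strict comparison keeps the lowest cut-off on ties
            best_cut, best_j = cut, j
    return best_cut


def flag_noisy(
    modules: Sequence[Sequence[float]],
    labels: Sequence[int],
    thresholds: Sequence[float],
) -> List[int]:
    """Return indices of modules whose label contradicts every threshold.

    Assumed rule: a module labeled non-faulty (0) is suspicious when all
    metrics reach their thresholds; a module labeled faulty (1) is
    suspicious when no metric reaches its threshold.
    """
    noisy = []
    for i, (row, y) in enumerate(zip(modules, labels)):
        above = [v >= t for v, t in zip(row, thresholds)]
        if (y == 0 and all(above)) or (y == 1 and not any(above)):
            noisy.append(i)
    return noisy


# Illustrative data only: one metric column, then a two-metric dataset.
loc = [10, 20, 30, 100, 120, 150]
faulty = [0, 0, 0, 1, 1, 1]
cut = roc_threshold(loc, faulty)

suspects = flag_noisy(
    [[10, 1], [150, 9], [120, 8], [12, 2]],
    [0, 0, 1, 1],
    [100, 5],
)
```

The flagged indices would then be candidates for label correction before training a classifier such as Naive Bayes.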
Abstract: Despite the amount of effort software engineers have been putting into developing fault prediction models, software fault prediction still poses great challenges. Research using machine learning and statistical techniques has been ongoing for 15 years, yet there has been no breakthrough. Unfortunately, none of these prediction models has achieved widespread applicability in the software industry due to a lack of software tools to automate the prediction process. Historical project data that include software faults, together with a robust software fault prediction tool, can enable quality managers to focus on fault-prone modules and thus improve the testing process. We developed an Eclipse-based software fault prediction tool for Java programs to simplify the fault prediction process. We also integrated a machine learning algorithm called Naive Bayes into the plug-in because of its proven high performance for this problem. This article presents a practical view of the software fault prediction problem and shows how we combined software metrics with software fault data to apply the Naive Bayes technique inside an open source platform.
Abstract: Software quality assessment models are quantitative analytical models that are more reliable compared to qualitative models based on personal judgment. These assessment models are classified into two groups: generalized and product-specific models. Measurement-driven predictive models, a subgroup of product-specific models, assume that there is a predictive relationship between software measurements and quality. In recent years, greater attention in quality assessment models has been devoted to measurement-driven predictive models and the field of software fault prediction modeling has become established within the product-specific model category. Most of the software fault prediction studies focused on developing fault predictors by using previous fault data. However, there are cases when previous fault data are not available. In this study, we propose a novel software fault prediction approach that can be used in the absence of fault data. This fully automated technique does not require an expert during the prediction process and it does not require identifying the number of clusters before the clustering phase, as required by the K-means clustering method. Software metrics thresholds are used to remove the need for an expert. Our technique first applies the X-means clustering method to cluster modules and identifies the best cluster number. After this step, the mean vector of each cluster is checked against the metrics thresholds vector. A cluster is predicted as fault-prone if at least one metric of the mean vector is higher than the threshold value of that metric. Three datasets, collected from a Turkish white-goods manufacturer developing embedded controller software, have been used during experimental studies. Experiments revealed that unsupervised software fault prediction can be automated fully and effective results can be achieved by using the X-means clustering method and software metrics thresholds.
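The cluster-labeling rule stated above (a cluster is fault-prone if at least one metric of its mean vector exceeds that metric's threshold) can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: cluster assignments are taken as given (in the study they come from X-means), the helper names `mean_vector` and `label_clusters` are hypothetical, and the threshold values in the example are made up.

```python
from typing import Dict, List, Sequence


def mean_vector(rows: Sequence[Sequence[float]]) -> List[float]:
    """Column-wise mean of the metric rows belonging to one cluster."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def label_clusters(
    modules: Sequence[Sequence[float]],
    assignments: Sequence[int],
    thresholds: Sequence[float],
) -> Dict[int, bool]:
    """Map each cluster id to True (fault-prone) or False.

    A cluster is fault-prone when at least one component of its mean
    vector is strictly higher than the corresponding metric threshold.
    """
    clusters: Dict[int, List[Sequence[float]]] = {}
    for row, cluster_id in zip(modules, assignments):
        clusters.setdefault(cluster_id, []).append(row)
    return {
        cluster_id: any(m > t for m, t in zip(mean_vector(rows), thresholds))
        for cluster_id, rows in clusters.items()
    }


# Illustrative data: two metrics per module, two clusters.
modules = [[10, 2], [20, 4], [200, 3], [300, 5]]
assignments = [0, 0, 1, 1]   # assumed output of a clustering step
thresholds = [65, 10]        # hypothetical threshold vector
labels = label_clusters(modules, assignments, thresholds)
```

Every module inherits the label of its cluster, so no fault data and no expert judgment are needed at prediction time.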
Abstract: Organizations generally lose their domain experience when key developers leave an organization that lacks a powerful and effective infrastructure to collect, package, validate, and spread experience. In a recent project aimed at building a general-purpose embedded application development platform, we developed an Eclipse-based IDE to accelerate our embedded development process, codify our Linux embedded software development knowledge on one extensible platform, and standardize tools, scripts, and libraries within our organization. This paper presents the approach that we used to collect domain-specific experience, the component-based layered architecture of the Eclipse-based platform, and our experiences with Eclipse.
Abstract: Software metrics and fault data belonging to a previous software version are used to build the software fault prediction model for the next release of the software. Until now, different classification algorithms have been used to build this kind of model. However, there are cases when previous fault data are not present, and hence supervised learning approaches cannot be applied. In this study, we propose a fully automated technique that does not require an expert during the prediction process. In addition, it does not require identifying the number of clusters before the clustering phase, as the K-means clustering method does. Software metrics thresholds are used to remove the need for an expert.
Our technique first applies the X-means clustering method to cluster modules and identifies the best cluster number. After this step, the mean vector of each cluster is checked against the metrics thresholds vector. A cluster is predicted as fault-prone if at least one metric of the mean vector is higher than the threshold value of that metric. In addition to the X-means clustering-based method, we experimented with a pure metrics thresholds method, fuzzy clustering, and K-means clustering-based methods. Experiments reveal that unsupervised software fault prediction can be fully automated and effective results can be produced using X-means clustering with software metrics thresholds. Three datasets, collected from a Turkish white-goods manufacturer developing embedded controller software, have been used for the validation.
Abstract: Detection of outliers in software measurement datasets is a critical issue that affects the performance of software fault prediction models built on these datasets. Two necessary components of fault prediction models, software metrics and fault data, are collected from software projects developed with the object-oriented programming paradigm. We propose an outlier detection algorithm based on thresholds for these kinds of metrics. We used the Random Forests machine learning classifier on two software measurement datasets collected from the jEdit open-source text editor project, and experiments revealed that our outlier detection approach improves the performance of fault predictors based on the Random Forests classifier.
Abstract: Since today’s software is more complex than ever, software quality should be managed with an engineering approach called Software Quality Engineering (SQE). Even though there are several quality assurance techniques inside SQE, software testing is still the most dominant quality assurance activity in the software sector. Unmanned aerial vehicles (UAV), inter-continental ballistic missiles (ICBM), and combat robots that deal with roadside bombs are some examples of complex systems that must be reliable, secure, available, and safe. The Software Engineering Institute (SEI) published a report in 2006 and proposed a research agenda for the U.S. Department of Defense about ultra-large-scale systems, which will likely have billions of lines of code within 30-50 years. “Adaptable and Predictable System Quality” is a research area proposed by SEI for future ultra-large-scale systems to maintain quality under attacks and failures.
Complexity is one of the most important internal quality factors located under the software quality iceberg, and avoiding unnecessary complexity during software development makes systems more secure. Current quality prediction models apply several complexity metrics to predict fault-prone modules, and these models can be adapted to identify vulnerability-prone or attack-prone components by using several security metrics together with complexity metrics. The software engineering discipline contains several prediction approaches such as test effort prediction, correction cost prediction, defect prediction, reusability prediction, and quality prediction. We propose a software life cycle, called the prediction-centric software life cycle, that includes some of these prediction approaches, and we believe this life cycle will improve quality and make software quality predictable. We suggest that early prediction of software faults with quality prediction models, early identification of vulnerability-prone components with security prediction models, and a prediction-centric life cycle are strong elements for systems of the future.
Abstract: Software product line engineering is a growing recent paradigm for developing similar products using reusable core assets such as architecture and test cases. The general aim is to enhance quality and decrease development costs. Current software product line engineering frameworks apply only a few quality assurance activities, but today’s single-system engineering has many more quality assurance activities that can be adapted to software product lines. In this study, a software fault prediction sub-process is integrated into the software product line engineering framework and the key activities are defined. This approach will improve quality and enhance the testing process for software product lines.
Abstract: The features of real-time dependable systems are availability, reliability, safety, and security. In the near future, real-time systems will be able to adapt themselves according to specific requirements, and a real-time dependability assessment technique will be able to classify modules as faulty or fault-free. Software fault prediction models help us develop dependable software, and they are commonly applied prior to system testing. In this study, we examine Chidamber-Kemerer (CK) metrics and some method-level metrics for our model, which is based on the Artificial Immune Recognition System (AIRS) algorithm. The dataset is a part of the NASA Metrics Data Program, and the class-level metrics are from the PROMISE repository. Instead of validating individual metrics, our mission is to improve the prediction performance of our model. The experiments indicate that the combination of CK and lines of code metrics provides the best prediction results for our fault prediction model. The results of this study suggest that class-level data should be used rather than method-level data to construct relatively better fault prediction models. Furthermore, this model can constitute a part of a real-time dependability assessment technique in the future.
Abstract: Software testing is a time-consuming and expensive process. Software fault prediction models are used to identify fault-prone classes automatically before system testing. These models can reduce testing duration, project risks, and resource and infrastructure costs. In this study, we propose a novel fault prediction model to improve the testing process. Chidamber-Kemerer object-oriented metrics and method-level metrics such as Halstead and McCabe are used as independent metrics in our Artificial Immune Recognition System (AIRS) based model. According to this study, a class-level metrics-based model that applies the AIRS algorithm can be used successfully for fault prediction, and its performance is higher than that of a J48-based approach. A fault prediction tool that uses this model can be easily integrated into the testing process.
Abstract: Predicting fault-prone modules in software development projects enables companies to build highly reliable systems and minimizes the budget, personnel, and resources that must be allocated to achieve this goal. Researchers have investigated various statistical techniques and machine learning algorithms, but most of them applied their models to different datasets that are not public or used different criteria to decide the best predictor model. Artificial Immune Recognition System (AIRS) is a supervised learning algorithm that was proposed in 2001 for classification problems, and its performance on the UCI datasets (University of California machine learning repository) is remarkable. In this paper, we propose a novel software defect prediction model that applies AIRS along with the Correlation-Based Feature Selection (CFS) technique. To evaluate the performance of the proposed model, we apply it to five public NASA defect datasets and compute G-mean 1, G-mean 2, and F-measure values to discuss the effectiveness of the model. Experimental results show that AIRS has great potential for software defect prediction and that AIRS along with the CFS technique provides relatively better prediction for large-scale projects that consist of many modules.
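The three evaluation measures named above can be computed from a confusion matrix, as the minimal sketch below shows. It assumes the definitions commonly used in the defect prediction literature, which the abstract itself does not spell out: G-mean 1 as the geometric mean of recall and precision, G-mean 2 as the geometric mean of recall and specificity, and F-measure as the harmonic mean of precision and recall. The function name `defect_metrics` and the counts in the example are hypothetical.

```python
import math
from typing import Dict


def _ratio(numerator: float, denominator: float) -> float:
    """Safe division that returns 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def defect_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Compute G-mean 1, G-mean 2 and F-measure from confusion counts.

    tp: faulty modules predicted faulty
    fp: non-faulty modules predicted faulty
    tn: non-faulty modules predicted non-faulty
    fn: faulty modules predicted non-faulty
    """
    recall = _ratio(tp, tp + fn)        # probability of detection
    precision = _ratio(tp, tp + fp)
    specificity = _ratio(tn, tn + fp)
    return {
        "g_mean1": math.sqrt(recall * precision),
        "g_mean2": math.sqrt(recall * specificity),
        "f_measure": _ratio(2 * precision * recall, precision + recall),
    }


# Illustrative counts only, not results from any study.
scores = defect_metrics(tp=40, fp=10, tn=90, fn=10)
```

Because defect datasets are usually imbalanced, measures of this kind are more informative than plain accuracy, which a predictor can inflate by labeling every module non-faulty.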