“Dynamic Selection of Fitness Function in Genetic Algorithm for Feature Selection in Software Defect Prediction”

himanshu
7 min read · Nov 27, 2019


Let’s start from the basics.

The cost of correcting an error increases as it proceeds through the software development life-cycle, so it is important to detect errors as early as possible. Automating the task of predicting whether a class in object-oriented software is defective can make the process of software testing more accurate, reliable and convenient. In previous work, object-oriented metrics describing the structure of classes have been employed to predict the defectiveness of a class. In this article, I describe a novel algorithm for software defect prediction and explain how genetic algorithms fit into the picture.

Before diving into the architecture and the genetic-algorithm part, let's first get familiar with the data-set and the problem statement.

The data-set contains data on twenty-four different software systems from the PROMISE repository. These are open-source software systems implemented in Java. Each data-set contains twenty object-oriented metrics as independent variables and the number of defects as the dependent variable. Since this problem is concerned only with the presence or absence of a defect in a class, the dependent variable has been converted to Boolean values: True and False.

Data Preprocessing

Because the data I used was imbalanced, with more negative samples than positive ones, I first had to rectify this problem.

There are systematic algorithms that can be used to generate synthetic samples. The most popular of these is SMOTE, the Synthetic Minority Over-sampling Technique.

As its name suggests, SMOTE is an oversampling method. It works by creating synthetic samples from the minority class instead of simply duplicating existing ones. The algorithm selects two or more similar instances (using a distance measure) and perturbs an instance one attribute at a time, by a random amount within the difference to its neighbouring instances. As this article is not mainly concerned with SMOTE, we will not delve deeper into it; you can read more about SMOTE here.
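As a quick illustration, here is a minimal sketch of the balancing step using the imbalanced-learn library; the file name and the `defects` column name are hypothetical placeholders for the PROMISE data described above.

```python
import pandas as pd
from imblearn.over_sampling import SMOTE

# Hypothetical PROMISE data-set: 20 OO metrics plus a Boolean `defects` column.
df = pd.read_csv("promise_dataset.csv")
X, y = df.drop(columns=["defects"]), df["defects"]

# SMOTE interpolates between a minority-class sample and its k nearest
# minority-class neighbours to create new synthetic samples.
X_res, y_res = SMOTE(k_neighbors=5, random_state=42).fit_resample(X, y)
print(y.value_counts(), y_res.value_counts(), sep="\n")
```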

Why Genetic Algorithms?

Many common applications of predictive analytics involve complex relationships between features. Feature selection is the process of finding the most relevant variables for a predictive model. These techniques can be used to identify and remove irrelevant and redundant features that do not contribute to, or even reduce, the accuracy of the predictive model.

Voila! Now it makes sense. One of the most advanced algorithms for feature selection is the genetic algorithm. The genetic algorithm repeatedly modifies a population of individual solutions (each a subset S’ of the full feature set S). At each step, the genetic algorithm selects individuals at random from the current population to be parents and uses them to produce the children for the next generation. Over successive generations, the population “evolves” toward an optimal solution. You can apply the genetic algorithm to a variety of optimization problems that are not well suited to standard optimization algorithms.
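To make this concrete, here is a minimal sketch of a genetic algorithm for feature selection, assuming scikit-learn and NumPy. Each individual is a Boolean mask over the feature set S, and its fitness is the cross-validated score of a classifier trained on the selected subset S’. The tournament-selection and single-point-crossover choices are illustrative, not necessarily those used in the study.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)

def fitness(mask, X, y, scoring="accuracy"):
    """Cross-validated score of a classifier trained on the masked features."""
    if not mask.any():                      # an empty feature subset is worthless
        return 0.0
    clf = RandomForestClassifier(n_estimators=100, random_state=42)
    return cross_val_score(clf, X[:, mask], y, cv=10, scoring=scoring).mean()

def evolve(X, y, scoring="accuracy", pool_size=50, generations=10,
           crossover_rate=0.06, mutation_rate=0.001):
    n_features = X.shape[1]
    # Each individual is a Boolean mask over the feature set S.
    pop = rng.integers(0, 2, size=(pool_size, n_features)).astype(bool)
    for _ in range(generations):
        scores = np.array([fitness(ind, X, y, scoring) for ind in pop])
        # Tournament selection: the fitter of two random individuals becomes a parent.
        idx = [max(rng.integers(0, pool_size, 2), key=lambda i: scores[i])
               for _ in range(pool_size)]
        children = pop[idx].copy()
        # Single-point crossover on consecutive pairs of parents.
        for i in range(0, pool_size - 1, 2):
            if rng.random() < crossover_rate:
                cut = rng.integers(1, n_features)
                tail = children[i, cut:].copy()
                children[i, cut:] = children[i + 1, cut:]
                children[i + 1, cut:] = tail
        # Bit-flip mutation, applied independently to every gene.
        children ^= rng.random(children.shape) < mutation_rate
        pop = children
    scores = np.array([fitness(ind, X, y, scoring) for ind in pop])
    return pop[scores.argmax()]             # the best feature mask found
```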

Now we know most of the bits and pieces of the algorithm. Let's have a look at the architecture and how genetic algorithms play their part.

Before we get into the architecture let’s understand some of the terminologies:

  1. Mutation-rate: the probability that an individual gene (here, a bit in the feature mask) is flipped while producing a child.
  2. Crossover-rate: the probability that two parent solutions exchange parts of their chromosomes to produce offspring.
  3. Pool-size: the number of candidate solutions (the population) maintained in each generation.
  4. Generations: the number of iterations for which the population is evolved.

Architecture for Software Defect Prediction

The DSF (Dynamic Selection of Fitness Function) technique enables us to combine multiple feature-selection techniques into a single technique to build an enhanced classifier for defect prediction. It is analogous to creating an ensemble out of stand-alone machine-learning techniques. It is important to note here that DSF, unlike Voting and Validation, a popular ensemble technique, does not involve the simultaneous execution of all the component fitness functions when classifying a test class instance as defective or not defective. Instead, it takes into account the structural characteristics of the class instance and how likely it is that the features selected by a given fitness function are better suited for software defect prediction.

Feature selection is the task of selecting the minimum number of features needed to represent the data well enough to distinguish between the classes. Feature-subset selection is a central topic of research in classification and has been studied for many years. Selecting relevant features and eliminating irrelevant ones can improve classification accuracy, simplify the procedure and shorten the learning period.

A detailed working of the DSF is as follows:

(i) Let F = {F₁, F₂, …, Fₘ} be a collection of m different fitness functions. Let S be a software set consisting of N classes and their structural characteristics in the form of OO (object-oriented) metrics. During the training phase, for each of the N classes, each fitness function Fᵢ, i ∈ [1, m], is used to select features with a genetic algorithm run with a mutation rate of 0.001, a crossover rate of 0.06, a pool size of 50 and 10 generations.
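Reusing the hypothetical `evolve` sketch from earlier, this step could look roughly as follows; `X_res` and `y_res` are the SMOTE-balanced data from the preprocessing step, and the scorer names stand in for the fitness functions described later in the article.

```python
import numpy as np

# Run the GA once per fitness function with the parameters quoted above.
# "accuracy" and "f1" are built-in scikit-learn scorers; a G-mean scorer
# would need sklearn.metrics.make_scorer (sketched later in the article).
X_arr, y_arr = np.asarray(X_res), np.asarray(y_res)
masks = {name: evolve(X_arr, y_arr, scoring=name,
                      pool_size=50, generations=10,
                      crossover_rate=0.06, mutation_rate=0.001)
         for name in ["accuracy", "f1"]}
```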

(ii) For each fitness function Fᵢ, i ∈ [1, m], a Random Forest classifier is trained (the RF classifier was found to be the best machine-learning technique in the domain of defect prediction; furthermore, RF was used by 59% of the studies that used ensemble classifiers for defect prediction), using leave-one-out cross-validation or ten-fold cross-validation depending on the size of the data-set, with the features selected by Fᵢ as the independent variables and the presence or absence of defects as the dependent variable. Performance measures with respect to the labelled defect presence in each class instance are recorded, as sketched after the list below. The time complexity of this step depends on the following factors:
(a) The parameter settings of the random forest classifier
(b) The number of fitness functions in the set F
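A sketch of this step, continuing with the hypothetical names from the previous snippets (the 100-point threshold for leave-one-out is taken from step (iv) below):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import LeaveOneOut, cross_val_predict

def rf_predictions(X, y, mask):
    """Out-of-fold RF predictions on the feature subset chosen by one
    fitness function: leave-one-out CV for small data-sets, 10-fold otherwise."""
    cv = LeaveOneOut() if len(y) <= 100 else 10
    clf = RandomForestClassifier(n_estimators=100, random_state=42)
    return cross_val_predict(clf, X[:, mask], y, cv=cv)

# One prediction per class instance, per fitness function.
preds = {name: rf_predictions(X_arr, y_arr, m) for name, m in masks.items()}
```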

(iii) After the first two steps, each class of a software system is labelled with the predicted presence/absence of a defect by training RF on the features selected by each fitness function. We now have to choose the fitness function Fᵢ whose selected features are most likely to correctly predict the defect susceptibility of a class. For this purpose, I have taken a two-fold approach:

(a) If RF trained on the features selected by one fitness function correctly identifies the presence/absence of a defect in a class, then that class instance is labelled with that fitness function; otherwise

(b) the class instance is labelled with the fitness function that has the greatest AUC measure on that particular software.

(iv) A modified data-set, say D’, is thus generated, which relates the OO metrics of a class to the fitness function that should be chosen to effectively predict the defect susceptibility of that class. The independent variables in this case are the OO metrics of a class; the dependent variable is the fitness function selected for that class instance by the process described above. A Support Vector Classifier is trained on this data-set using leave-one-out cross-validation for data-sets with 100 or fewer points, and ten-fold cross-validation for larger data-sets.
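Steps (iii) and (iv) might be sketched as follows, still using the hypothetical names from above; `best_auc_fn` is assumed to hold the name of the fitness function with the highest AUC on this software system, computed separately.

```python
from sklearn.svm import SVC
from sklearn.model_selection import LeaveOneOut, cross_val_predict

# Step (iii): label each class instance with a fitness function. If several
# fitness functions classify it correctly, this sketch simply takes the first.
fn_labels = np.array([
    next((name for name in preds if preds[name][i] == y_arr[i]), best_auc_fn)
    for i in range(len(y_arr))
])

# Step (iv): train an SVC on D' = (OO metrics -> chosen fitness function).
cv = LeaveOneOut() if len(y_arr) <= 100 else 10
fn_choice = cross_val_predict(SVC(), X_arr, fn_labels, cv=cv)
```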
(v) The fitness function predicted by the SVC is used to select the attributes from that class instance. A trained Random Forest classifier then predicts the defect susceptibility of that particular class instance. The values thus obtained can be matched against the labels assigned to the classes in the original data-set to calculate various performance measures.

An important point to note is that the DSF technique does not involve simultaneous or sequential execution of all the component fitness functions; only the one fitness function deemed most suitable by the SVC is applied to the class instance, and the results are validated.
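Step (v), continuing the sketch; for simplicity it reuses the out-of-fold RF predictions from step (ii) rather than training a fresh Random Forest per instance.

```python
# For each class instance, consult only the fitness function the SVC chose.
final_pred = np.array([preds[fn_choice[i]][i] for i in range(len(y_arr))])
print("agreement with true labels:", (final_pred == y_arr).mean())
```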

Selection of Fitness Function

Three fitness functions are used; TD, TND, FD and FND denote the counts of true defective, true non-defective, false defective and false non-defective predictions:

(i) Accuracy = (TD + TND) / (TD + TND + FD + FND)

(ii) F1 Score = 2 ∗ (Precision ∗ Recall) / (Precision + Recall)

(iii) G-Mean = √(Precision ∗ Recall)
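A sketch of these three fitness functions as scikit-learn-compatible scorers, matching the formulas above (note that the G-mean here is the geometric mean of precision and recall, whereas elsewhere it is often defined over sensitivity and specificity):

```python
import numpy as np
from sklearn.metrics import (accuracy_score, f1_score, make_scorer,
                             precision_score, recall_score)

def g_mean(y_true, y_pred):
    # Geometric mean of precision and recall, as defined above.
    return np.sqrt(precision_score(y_true, y_pred) *
                   recall_score(y_true, y_pred))

# Usable as `scoring=` arguments for cross_val_score / the GA sketch.
SCORERS = {"accuracy": "accuracy", "f1": "f1", "g_mean": make_scorer(g_mean)}
```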

The genetic algorithms with these fitness functions were run on the preprocessed data, and the selected features for each fitness function were obtained.

Selected Features

A Random Forest classifier is trained on the features selected by these fitness functions. Based on the results obtained, a Support Vector Classifier is trained on the annotated data-set D’ and predictions are obtained. The performance of the individual fitness functions and of their combination is judged on the basis of the AUC scores obtained:

We can clearly see that, most of the time, the DSF scores better than the individual fitness functions.

Wrapping It Up!

  1. We can conclude that genetic algorithms are a powerful tool and can be applied to feature selection for a variety of problems.
  2. Ensemble techniques are, most of the time, more powerful than individual stand-alone classifiers.

If you found this helpful, click the 💚 so more people will see it here on Medium.
