Defect prediction datasets are often collected with a large number of metrics related to code complexity or the development process, which can leave them suffering from the curse of dimensionality. Motivated by the idea of search-based software engineering, this paper proposes SBFS, a novel search-based wrapper feature selection framework. SBFS first applies the SMOTE method to alleviate class imbalance in the dataset, then uses a genetic-algorithm-based feature selection method to select the optimal feature subset from the training set. The empirical study uses the NASA datasets as subjects and compares SBFS against three baselines: FW, a wrapper feature selection method based on forward selection; BW, a wrapper feature selection method based on backward selection; and Origin, which performs no feature selection. The results show that SBFS is no worse than Origin in 90% of cases, no worse than BW in 82.3% of cases, and no worse than FW in 69.3% of cases. In addition, applying SMOTE improves model performance in 71% of cases with a decision tree classifier, but in only 47% and 43% of cases with the Naive Bayes and Logistic regression classifiers, respectively.
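The two-stage pipeline described above (SMOTE, then a genetic-algorithm wrapper search over feature subsets) can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the nearest-centroid classifier, the balanced-accuracy fitness, the held-out validation split, the synthetic data, and every function name and parameter value below are assumptions made only for illustration.

```python
import random

def sq_dist(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))

def smote(X, y, minority=1, k=3, rng=None):
    """Simplified SMOTE for binary labels: interpolate between a minority sample
    and one of its k nearest minority neighbours until the classes are balanced."""
    rng = rng or random.Random(0)
    minority_rows = [x for x, t in zip(X, y) if t == minority]
    n_needed = len(X) - 2 * len(minority_rows)  # majority count minus minority count
    X_new, y_new = list(X), list(y)
    if len(minority_rows) < 2:
        return X_new, y_new
    for _ in range(max(n_needed, 0)):
        base = rng.choice(minority_rows)
        neighbours = sorted((m for m in minority_rows if m is not base),
                            key=lambda m: sq_dist(base, m))[:k]
        other = rng.choice(neighbours)
        gap = rng.random()
        X_new.append([b + gap * (o - b) for b, o in zip(base, other)])
        y_new.append(minority)
    return X_new, y_new

def centroid_predict(X_fit, y_fit, X_eval, mask):
    """Nearest-centroid classifier (an assumed stand-in) on features where mask == 1."""
    cols = [j for j, bit in enumerate(mask) if bit]
    centroids = {}
    for label in set(y_fit):
        rows = [x for x, t in zip(X_fit, y_fit) if t == label]
        centroids[label] = [sum(r[j] for r in rows) / len(rows) for j in cols]
    return [min(centroids, key=lambda c: sq_dist([x[j] for j in cols], centroids[c]))
            for x in X_eval]

def balanced_accuracy(y_true, y_pred):
    recalls = []
    for label in set(y_true):
        idx = [i for i, t in enumerate(y_true) if t == label]
        recalls.append(sum(y_pred[i] == label for i in idx) / len(idx))
    return sum(recalls) / len(recalls)

def fitness(mask, fit, val):
    """Wrapper fitness: train on `fit` with the masked features, score on `val`."""
    if not any(mask):
        return 0.0
    (X_fit, y_fit), (X_val, y_val) = fit, val
    return balanced_accuracy(y_val, centroid_predict(X_fit, y_fit, X_val, mask))

def ga_select(fit, val, n_features, pop_size=20, generations=30, p_mut=0.1, seed=0):
    """Genetic search over bit masks: tournament selection, one-point crossover,
    bit-flip mutation, and elitism. Requires n_features >= 2."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_features)] for _ in range(pop_size)]
    pop[0] = [1] * n_features  # assumption: seed the search with the full feature set
    cache = {}
    def score(mask):
        key = tuple(mask)
        if key not in cache:
            cache[key] = fitness(mask, fit, val)
        return cache[key]
    for _ in range(generations):
        children = [max(pop, key=score)[:]]  # elitism: keep the current best mask
        while len(children) < pop_size:
            p1 = max(rng.sample(pop, 3), key=score)
            p2 = max(rng.sample(pop, 3), key=score)
            cut = rng.randrange(1, n_features)
            child = p1[:cut] + p2[cut:]
            children.append([1 - b if rng.random() < p_mut else b for b in child])
        pop = children
    return max(pop, key=score)

def make_data(n, rng):
    """Synthetic data: 2 informative features, 4 noise features, about 20% positives."""
    X, y = [], []
    for i in range(n):
        label = 1 if i % 5 == 0 else 0
        informative = [rng.gauss(3.0 * label, 1.0) for _ in range(2)]
        noise = [rng.gauss(0.0, 1.0) for _ in range(4)]
        X.append(informative + noise)
        y.append(label)
    return X, y

rng = random.Random(42)
X_fit, y_fit = make_data(100, rng)
X_val, y_val = make_data(50, rng)
X_bal, y_bal = smote(X_fit, y_fit, rng=rng)            # step 1: rebalance classes
mask = ga_select((X_bal, y_bal), (X_val, y_val), 6)    # step 2: search feature subsets
print("selected mask:", mask)
print("validation score:", fitness(mask, (X_bal, y_bal), (X_val, y_val)))
```

Because this sketch seeds the population with the all-features mask and keeps the best individual in every generation, the selected subset can score no lower than the full feature set on the validation split. That guarantee does not carry over to unseen test data, which is where the comparison against Origin reported in the abstract would be made.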