Feature selection is a key technique for dimensionality reduction of high-dimensional data. Traditional dimensionality reduction techniques such as PCA merely transform the representation of the data and cannot express the degree of correlation within it. Information-theoretic measures proposed in recent years use an evaluation function to quantify the uncertainty of the data; although they capture the correlation among features fairly well, they do not fully account for the influence of the selected features on the whole sample space. To address these shortcomings, a feature selection algorithm based on the Bayesian harmony degree is proposed. The Bayesian harmony degree comes from the Bayesian Ying-Yang harmony learning theory and can estimate the joint probability distribution of the whole data space, so the selected features better reflect the variation of the entire sample space. The similarity between classes is measured by the change in the harmony degree, yielding feature combinations with low redundancy. Compared with traditional methods such as ReliefF and FCBF, the feature combinations obtained by the harmony measure are more effective for classification when the same number of features is selected.
Feature selection is a key technique for dimensionality reduction in high-dimensional data mining. Traditional dimensionality reduction techniques such as PCA merely convert the representation of the dataset and cannot reflect the degree of correlation within it. To select an optimal combination of features, an information measure is commonly used in feature selection by computing the uncertainty of the dataset, and the choice of metric function in such methods determines the resulting feature subsets. In contrast to methods based on information entropy, this paper presents a feature selection algorithm based on the Bayesian harmony measure. The Bayesian harmony measure is introduced from the Bayesian Ying-Yang harmony learning theory and can estimate the joint distribution of the input space. The increase in the harmony degree is used to measure the degree of correlation within the selected features, and the similarity between two classes can be estimated by the change of the harmony degree. A low-redundancy feature combination is then obtained by computing the amplification of the harmony degree over the entire system. Compared with traditional methods such as ReliefF and FCBF, the proposed method performs better when the same number of features is selected.
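To make the selection procedure concrete, the following is a minimal sketch of greedy forward feature selection driven by a harmony-style score. Since the abstract does not give the paper's exact formulation, the sketch assumes the common Gaussian-mixture instantiation of the BYY harmony functional, H = (1/N) Σ_n Σ_k p(k|x_n) [ln α_k + ln N(x_n|μ_k, Σ_k)], and the function names `harmony_score` and `select_features` are illustrative, not from the paper.

```python
# Hedged sketch: greedy forward feature selection scored by a Gaussian-mixture
# approximation of the BYY harmony degree. This is an assumed instantiation,
# not the paper's definitive algorithm.
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

def harmony_score(X, n_components=3, seed=0):
    """Fit a GMM on the candidate feature subset and return its harmony degree."""
    gmm = GaussianMixture(n_components=n_components, covariance_type="full",
                          random_state=seed).fit(X)
    post = gmm.predict_proba(X)                  # posterior p(k | x_n), shape (N, K)
    log_comp = np.column_stack([                 # ln N(x_n | mu_k, Sigma_k)
        multivariate_normal.logpdf(X, mean=m, cov=c)
        for m, c in zip(gmm.means_, gmm.covariances_)
    ])
    # H = mean over samples of sum_k p(k|x_n) * [ln alpha_k + ln N(x_n|mu_k, Sigma_k)]
    return float(np.mean(np.sum(post * (np.log(gmm.weights_) + log_comp), axis=1)))

def select_features(X, k):
    """Greedily add the feature whose inclusion most increases the harmony score."""
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < k and remaining:
        gains = {j: harmony_score(X[:, selected + [j]]) for j in remaining}
        best = max(gains, key=gains.get)
        selected.append(best)
        remaining.remove(best)
    return selected
```

Under this reading, a feature is kept only when it raises the harmony degree of the fitted joint model, which is one way the low-redundancy behavior described above can arise: a feature that duplicates information already in the selected subset yields little or no harmony gain.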