The statistical optimal sample size algorithm runs inefficiently when it determines the sample size for sampling large datasets, especially high-dimensional ones. In addition, the attributes of a high-dimensional dataset differ in importance, and some may be redundant. To address both problems, this paper proposes a statistical optimal sample size algorithm based on feature selection. Drawing on entropy theory, the algorithm constructs an entropy measure based on the similarity between objects to evaluate the importance of each feature, then selects a subset of important features according to a purpose-designed selection criterion, and finally runs the statistical optimal sample size algorithm on that subset. Experimental results show that sample sets drawn at the sample size given by the improved algorithm achieve relatively high accuracy in clustering, while the execution time of the algorithm drops noticeably, which verifies that the improved algorithm is effective and feasible.
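The feature-selection stage described above can be illustrated with a minimal sketch. The abstract does not give the exact entropy formula or the selection criterion, so everything below is an assumption: the similarity is taken as S_ij = exp(-alpha * D_ij) on Euclidean distances, alpha is chosen so that the similarity is 0.5 at the mean distance, a feature is scored by the entropy of the data with that feature removed, and the criterion is a plain top-k cut. The names `similarity_entropy`, `feature_importance`, and `select_features` are hypothetical, and the statistical optimal sample size step itself is not shown.

```python
import numpy as np


def similarity_entropy(X):
    """Entropy of pairwise similarities S_ij = exp(-alpha * D_ij).

    Assumed form, not taken from the paper. Well-separated clusters push
    similarities toward 0 or 1 and therefore give low entropy.
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    # Pairwise Euclidean distances, keeping each unordered pair once (i < j).
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    d = dist[np.triu_indices(n, k=1)]
    mean_d = d.mean()
    if mean_d == 0:
        # All objects coincide: every similarity is 1, so the entropy is 0.
        return 0.0
    alpha = np.log(2.0) / mean_d  # similarity equals 0.5 at the mean distance
    s = np.clip(np.exp(-alpha * d), 1e-12, 1.0 - 1e-12)  # avoid log(0)
    return float(-(s * np.log(s) + (1.0 - s) * np.log(1.0 - s)).sum())


def feature_importance(X):
    """Score each feature by the entropy of the data without that feature.

    Assumption: removing an important feature blurs the structure and raises
    the entropy, so a higher score means a more important feature.
    """
    X = np.asarray(X, dtype=float)
    # Min-max scale each column so that no feature dominates the distances.
    lo = X.min(axis=0)
    span = X.max(axis=0) - lo
    span[span == 0] = 1.0
    Xs = (X - lo) / span
    return np.array(
        [similarity_entropy(np.delete(Xs, f, axis=1)) for f in range(Xs.shape[1])]
    )


def select_features(X, k):
    """Hypothetical criterion: keep the k highest-scoring features."""
    scores = feature_importance(X)
    top = np.argsort(scores)[::-1][:k]
    return np.sort(top)
```

Under these assumptions, a sample size procedure would then be run on `X[:, select_features(X, k)]` instead of on the full data, which is where the reduction in execution time reported in the abstract would come from. The pairwise distance matrix costs O(n^2) memory, so for a genuinely large dataset this sketch would itself need to run on a subsample.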