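In high-dimensional gene expression profile data, only a small number of genes contribute to classification and diagnosis, while many redundant noise genes are unrelated to cancer classification; both factors degrade classification performance. Selecting genes closely associated with the classification task not only removes disease-irrelevant genes, reduces the time and space complexity of machine learning algorithms, and improves classification accuracy, but also yields feature genes that can serve as a basis for tumor gene diagnosis and for identifying targets for tumor drug therapy, lowering the cost of subsequent biological analysis. This paper proposes a gene selection method based on clustering and particle swarm optimization (PSO). Before the PSO search begins, the genes are clustered and the resulting clusters are screened; the centers of the selected clusters serve as the initial values for PSO, each selected cluster serves as a search space, and the classification accuracy of an extreme learning machine (ELM) is used as the fitness criterion for feature selection. The algorithm exploits both the ability of clustering to group genes preliminarily and the global optimization ability of PSO, and it overcomes the premature convergence and slow local convergence of conventional PSO. It can therefore determine the optimal gene subset efficiently while improving cancer classification accuracy.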
Gene expression data are highly valuable for understanding pathogenesis, diagnosing disease, and developing gene-level drugs. However, microarray data usually contain thousands of genes but only a small number of samples, which causes a severe curse of dimensionality and degrades diagnostic accuracy. Moreover, this creates difficulties for many classifiers and increases the cost of medical diagnosis. A new gene selection method based on clustering and particle swarm optimization (PSO) is proposed. First, the genes are partitioned using a clustering algorithm, and the clusters useful for classification are selected. Then a wrapper selection method based on PSO and an extreme learning machine (ELM) is used to select a compact gene subset with high classification accuracy from the previously selected genes. The method takes advantage of both clustering and PSO, and it can achieve better classification performance than other classical methods.
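To make the pipeline outlined above concrete, the following Python sketch chains the three stages: clustering the genes, selecting clusters, and running a PSO wrapper with ELM accuracy as fitness. It is a minimal illustration under several assumptions that the abstract does not specify: plain k-means as the clustering algorithm, a Fisher-like score for ranking clusters (two classes only), a binary PSO over the pooled genes of the kept clusters that is seeded with the gene nearest each cluster center, and a single hold-out split for the fitness. The names `elm_accuracy`, `kmeans`, and `select_genes` and all parameter values are hypothetical.

```python
import numpy as np

def elm_accuracy(X_tr, y_tr, X_te, y_te, n_hidden=20, n_classes=2, seed=0):
    """Basic ELM: random hidden layer, least-squares output weights; returns test accuracy."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(X_tr.shape[1], n_hidden))
    b = rng.normal(size=n_hidden)
    H = np.tanh(X_tr @ W + b)
    beta = np.linalg.pinv(H) @ np.eye(n_classes)[y_tr]  # one-hot targets
    pred = np.argmax(np.tanh(X_te @ W + b) @ beta, axis=1)
    return float(np.mean(pred == y_te))

def kmeans(points, k, rng, n_iter=20):
    """Plain Lloyd k-means (assumed clustering step); returns (labels, centers)."""
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        dist = ((points[:, None, :] - centers[None]) ** 2).sum(axis=-1)
        labels = dist.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return labels, centers

def select_genes(X, y, k=5, keep=2, n_particles=8, n_iter=10, seed=0):
    """X: samples x genes, y: integer labels in {0, 1}. Returns (gene indices, fitness)."""
    rng = np.random.default_rng(seed)
    genes = X.T  # one row per gene, so that genes (not samples) are clustered
    labels, centers = kmeans(genes, k, rng)

    # ASSUMPTION: rank clusters by a Fisher-like score of their mean expression profile.
    scores = np.full(k, -np.inf)
    for j in range(k):
        if np.any(labels == j):
            m = genes[labels == j].mean(axis=0)
            a, b = m[y == 0], m[y == 1]
            scores[j] = abs(a.mean() - b.mean()) / (a.std() + b.std() + 1e-12)
    kept = [j for j in np.argsort(scores)[::-1][:keep] if np.isfinite(scores[j])]

    # Search space: genes of the kept clusters; seeds: the gene nearest each center.
    pool = np.flatnonzero(np.isin(labels, kept))
    reps = []
    for j in kept:
        members = np.flatnonzero(labels == j)
        reps.append(members[((genes[members] - centers[j]) ** 2).sum(axis=1).argmin()])
    rep_mask = np.isin(pool, reps)

    # ASSUMPTION: fitness is ELM accuracy on a single hold-out split.
    order = rng.permutation(len(y))
    tr, te = order[: len(y) // 2], order[len(y) // 2:]

    def fitness(p):
        cols = pool[p > 0.5]
        if cols.size == 0:
            return 0.0
        return elm_accuracy(X[np.ix_(tr, cols)], y[tr], X[np.ix_(te, cols)], y[te])

    # Binary PSO: positions are 0/1 masks over the pool, velocities are real-valued.
    pos = (rng.random((n_particles, pool.size)) < 0.1).astype(float)
    pos[:, rep_mask] = 1.0  # every particle starts with the cluster representatives on
    vel = np.zeros_like(pos)
    pbest, pbest_fit = pos.copy(), np.array([fitness(p) for p in pos])
    g, g_fit = pbest[pbest_fit.argmax()].copy(), pbest_fit.max()
    for _ in range(n_iter):
        r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
        vel = np.clip(0.7 * vel + 1.5 * r1 * (pbest - pos) + 1.5 * r2 * (g - pos), -4, 4)
        pos = (rng.random(pos.shape) < 1.0 / (1.0 + np.exp(-vel))).astype(float)
        fits = np.array([fitness(p) for p in pos])
        better = fits > pbest_fit
        pbest[better], pbest_fit[better] = pos[better], fits[better]
        if pbest_fit.max() > g_fit:
            g, g_fit = pbest[pbest_fit.argmax()].copy(), pbest_fit.max()
    return pool[g > 0.5], float(g_fit)
```

Calling `select_genes(X, y)` on a samples-by-genes matrix with binary integer labels returns the indices of the selected genes and the hold-out accuracy of the best particle. A real study would likely replace the hold-out split with cross-validation and tune the number of clusters, the number of kept clusters, and the PSO coefficients.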