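To address how to analyze high-throughput SELDI-TOF mass spectrometry data effectively and how to screen for tumor-related protein sites, a feature selection method based on affinity propagation clustering is proposed. First, a t-test is used for preliminary screening of the SELDI data. Next, affinity propagation clustering and null-space LDA are used for dimensionality reduction and decorrelation. Finally, SVM-RFE is used for feature selection to identify the protein sites relevant to tumor discrimination. Four classifiers (SVM, KNN, NB, and J4.8) are used to estimate the classification performance of the algorithm. On the public ovarian cancer datasets OC-WCX2a and OC-WCX2b and the Zhejiang Cancer Hospital breast cancer dataset BC-WCX2a, the algorithm achieved classification rates of 96.43%, 99.66%, and 90.88%, sensitivities of 97.00%, 100%, and 96.17%, and specificities of 95.85%, 99.08%, and 81.92%, respectively, and selected 10 protein sites relevant to tumor discrimination for each dataset. The proposed algorithm achieves a good classification rate, effectively extracts protein mass spectrum sites with good discriminative power, and can assist in cancer diagnosis.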
To analyze high-throughput, high-resolution mass spectrometry data effectively and to capture cancer-related protein features from such data, a feature selection method based on affinity propagation clustering is proposed in this paper. First, a t-test is applied to the mass spectrometry data as an initial filter. Next, affinity propagation clustering and NS-LDA are used to reduce dimensionality and correlation. Third, SVM-RFE is used to select the features. Finally, four classifiers are used to estimate the performance of the algorithm. The proposed method was tested and evaluated on the ovarian cancer datasets OC-WCX2a and OC-WCX2b and on the breast cancer dataset BC-WCX2a. The classification rate reached 96.43%, 99.66%, and 90.88%; sensitivity reached 97.00%, 100%, and 96.17%; and specificity reached 95.85%, 99.08%, and 81.92%, respectively. Ten m/z features were selected for each dataset. The experimental results show that the method performs well, and it is expected to be useful in cancer diagnosis.
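The staged pipeline described in the abstract can be illustrated roughly as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: it uses SciPy and scikit-learn, clusters features by absolute correlation (the abstract does not specify the similarity measure), and omits the NS-LDA step entirely because no standard library routine for it is assumed here. The function name `select_features` and the parameters `n_ttest` and `n_final` are hypothetical.

```python
import numpy as np
from scipy.stats import ttest_ind
from sklearn.cluster import AffinityPropagation
from sklearn.feature_selection import RFE
from sklearn.svm import SVC


def select_features(X, y, n_ttest=50, n_final=10, random_state=0):
    """Return column indices of selected m/z features (hypothetical helper).

    X: (n_samples, n_features) intensity matrix; y: binary labels in {0, 1}.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)

    # Step 1: per-feature two-sample t-test; keep the n_ttest smallest p-values.
    _, pvals = ttest_ind(X[y == 0], X[y == 1], axis=0, equal_var=False)
    keep = np.argsort(pvals)[:n_ttest]

    # Step 2: cluster the surviving features (not the samples) with affinity
    # propagation and keep one exemplar per cluster to reduce redundancy.
    # Assumption: similarity between two features is their absolute correlation.
    sim = np.abs(np.corrcoef(X[:, keep], rowvar=False))
    ap = AffinityPropagation(affinity="precomputed", damping=0.9,
                             random_state=random_state)
    ap.fit(sim)
    centers = ap.cluster_centers_indices_
    # If the clustering did not converge, fall back to all filtered features.
    exemplars = keep[centers] if len(centers) > 0 else keep

    # (The NS-LDA step from the abstract is intentionally omitted here.)

    # Step 3: SVM-RFE, i.e. recursive feature elimination ranked by the
    # weights of a linear SVM, removing one feature per iteration.
    n_select = min(n_final, len(exemplars))
    rfe = RFE(SVC(kernel="linear"), n_features_to_select=n_select, step=1)
    rfe.fit(X[:, exemplars], y)
    return exemplars[rfe.support_]
```

Calling `select_features(X, y)` on a samples-by-m/z intensity matrix returns at most `n_final` column indices. To estimate classification rate, sensitivity, and specificity without optimistic bias, the selection would need to be repeated inside each cross-validation fold rather than run once on the full dataset.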