Objective To bicluster breast cancer gene expression profiles using an improved sparse non-negative matrix factorization (SparseNMF) and to identify genes and biological processes closely related to breast cancer. Methods Human breast cancer gene expression data covering 22 283 genes were denoised with wavelets, and a t-statistic was then used to preliminarily screen out 5 067 genes, which formed the input matrix for the improved SparseNMF. Biclustering further narrowed these to 81 significant genes closely related to breast cancer, and Cytoscape was used to construct a structure graph of the biological processes in which these 81 genes are involved. Results Genes related to breast cancer, genes possibly related to it, and the relationships among the biological processes in which these genes participate were identified. Conclusion Compared with other existing NMF algorithms, the improved SparseNMF offers better clustering performance, greater stability, and fewer iterations, and is suitable for extracting differentially expressed genes in breast cancer.
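The screening and biclustering steps described in the Methods can be illustrated with a minimal sketch. It is not the authors' improved SparseNMF, whose update rules the abstract does not give. It assumes a generic formulation instead: Welch's t-statistic for screening, then multiplicative updates for the objective 0.5·||V − WH||² + lam·ΣH + 0.5·eta·||W||². The toy matrix, the function names, and the parameters `lam`, `eta`, and `iters` are all hypothetical, and the wavelet denoising and Cytoscape steps are omitted.

```python
import math
import random

def welch_t(a, b):
    """Welch's t-statistic between two lists of expression values."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((x - ma) ** 2 for x in a) / (len(a) - 1)
    vb = sum((x - mb) ** 2 for x in b) / (len(b) - 1)
    return (ma - mb) / math.sqrt(va / len(a) + vb / len(b) + 1e-12)

def screen_genes(X, labels, top):
    """Return sorted indices of the `top` genes (rows) with the largest |t|."""
    scored = []
    for g, row in enumerate(X):
        a = [v for v, y in zip(row, labels) if y == 0]
        b = [v for v, y in zip(row, labels) if y == 1]
        scored.append((abs(welch_t(a, b)), g))
    scored.sort(reverse=True)
    return sorted(g for _, g in scored[:top])

def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    cols = list(zip(*B))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in A]

def sparse_nmf(V, k, lam=0.05, eta=0.05, iters=300, seed=0):
    """Factor V ~ W H (all entries >= 0) with an L1 penalty on H.

    Assumed generic multiplicative updates, not the paper's algorithm.
    """
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    eps = 1e-9
    for _ in range(iters):
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        # lam in the denominator shrinks H toward zero (sparsity).
        H = [[H[i][j] * num[i][j] / (den[i][j] + lam + eps)
              for j in range(m)] for i in range(k)]
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        # eta * W keeps W from growing to offset the shrinkage of H.
        W = [[W[i][j] * num[i][j] / (den[i][j] + eta * W[i][j] + eps)
              for j in range(k)] for i in range(n)]
    return W, H

def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])

def make_toy_data():
    """12 genes x 8 samples: genes 0-3 are high in class 0, genes 4-7 in class 1."""
    noise = [0.1, 0.3, 0.2, 0.4]
    labels = [0, 0, 0, 0, 1, 1, 1, 1]
    X = []
    for g in range(12):
        row = []
        for s in range(8):
            high = (g < 4 and s < 4) or (4 <= g < 8 and s >= 4)
            row.append((5.0 if high else 0.0) + noise[s % 4])
        X.append(row)
    return X, labels

X, labels = make_toy_data()
selected = screen_genes(X, labels, top=8)          # t-statistic screening
V = [X[g] for g in selected]                       # input matrix for the NMF
W, H = sparse_nmf(V, k=2)
gene_cluster = [argmax(row) for row in W]          # gene -> dominant factor
sample_cluster = [argmax(col) for col in zip(*H)]  # sample -> dominant factor
print(selected, gene_cluster, sample_cluster)
```

Assigning each gene to its dominant column of W and each sample to its dominant row of H clusters both dimensions from a single factorization, which is the sense in which NMF yields a biclustering. On real data the number of factors `k` and the penalty weights would need to be chosen by some stability or fit criterion.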