To address the excessive number of iterations and the uncertainty in threshold selection of the original label propagation algorithm, an improved label propagation algorithm is proposed and applied to the analysis of gene expression profile data. First, the high-dimensional gene expression profile data are represented as a weight matrix, and a label sequence indicating the class membership of the samples is defined, with a small number of samples marked as known. Next, the label sequence is updated by an iterative formula derived from the Gauss-Seidel iteration, and the solution of the label sequence is proved to converge. Finally, using positive and negative labels, the data are partitioned into classes according to the signs of the components of the label sequence. Experiments on leukemia and colon cancer data sets show that the proposed method is feasible and effective.
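The pipeline outlined above (weight matrix, label sequence, Gauss-Seidel style update, sign-based partition) can be illustrated with a minimal sketch. It is not the paper's exact formulation: the Gaussian kernel, the weighted-average update rule, the stopping tolerance, the toy data, and all function and parameter names (`gaussian_weights`, `propagate_labels`, `classify_by_sign`, `sigma`, `tol`) are assumptions made only for illustration.

```python
import math


def gaussian_weights(samples, sigma=1.0):
    """Build a symmetric weight matrix with a Gaussian kernel (assumed choice)."""
    n = len(samples)
    weights = [[0.0] * n for _ in range(n)]  # diagonal stays 0 (no self-loops)
    for i in range(n):
        for j in range(i + 1, n):
            sq_dist = sum((a - b) ** 2 for a, b in zip(samples[i], samples[j]))
            weights[i][j] = weights[j][i] = math.exp(-sq_dist / (2.0 * sigma ** 2))
    return weights


def propagate_labels(weights, known, tol=1e-8, max_sweeps=1000):
    """Update the label sequence with Gauss-Seidel style sweeps.

    `known` maps a sample index to +1.0 or -1.0; unlabeled entries start at 0.
    Each unlabeled entry is replaced by the weighted average of all other
    entries (an assumed update rule, not taken from the paper).
    """
    n = len(weights)
    labels = [known.get(i, 0.0) for i in range(n)]
    sweeps_used = 0
    for sweep in range(1, max_sweeps + 1):
        sweeps_used = sweep
        largest_change = 0.0
        for i in range(n):
            if i in known:
                continue  # labeled samples keep their fixed value
            degree = sum(weights[i])
            if degree == 0.0:
                continue  # isolated sample: nothing to propagate
            # Gauss-Seidel: entries already updated in this sweep are reused
            updated = sum(w * f for w, f in zip(weights[i], labels)) / degree
            largest_change = max(largest_change, abs(updated - labels[i]))
            labels[i] = updated
        if largest_change < tol:
            break
    return labels, sweeps_used


def classify_by_sign(labels):
    """Partition samples by sign; an exact 0 is assigned to +1 by convention here."""
    return [1 if value >= 0 else -1 for value in labels]


# Toy data standing in for expression profiles: two well-separated groups.
samples = [
    [0.0, 0.1, 0.0], [0.2, 0.0, 0.1], [0.1, 0.2, 0.1],
    [3.0, 3.1, 2.9], [3.2, 3.0, 3.1], [2.9, 3.1, 3.0],
]
known = {0: 1.0, 3: -1.0}  # one labeled sample per class

weights = gaussian_weights(samples)
labels, sweeps_used = propagate_labels(weights, known)
classes = classify_by_sign(labels)
print(classes)
```

Because the partition depends only on the sign of each component, no numeric threshold has to be tuned in this sketch, which mirrors the motivation stated for the positive and negative labeling.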