The k-means clustering algorithm is an iterative algorithm widely used in the cluster analysis of gene expression data. It usually represents the relationship between genes by a distance measure, which cannot effectively reflect the interdependence between genes. To address this, we propose an attribute clustering algorithm, k-modes, based on information theory, which overcomes this shortcoming. We also introduce the pseudo F-statistic, which serves two purposes: it allows points that partially overlap in space to be classified effectively, and it indicates the best number of clusters, thereby making up for a weakness of the k-modes clustering method. Together, these properties make the proposed method effective and lead to better clustering results.
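To illustrate how a pseudo F-statistic can indicate the number of clusters, the following is a minimal sketch. The abstract does not state which formula is used, so the sketch assumes the common Calinski-Harabasz form, F = (B / (k - 1)) / (W / (n - k)), where B and W are the between-cluster and within-cluster sums of squares. The function name `pseudo_f`, the squared Euclidean distance, and the toy data are all assumptions made for illustration; the proposed method may instead use an information-theoretic measure.

```python
from statistics import fmean


def pseudo_f(points, labels):
    """Pseudo F-statistic (Calinski-Harabasz form) of one partition.

    points: list of equal-length numeric tuples
    labels: cluster label for each point
    Assumes squared Euclidean distance (an illustrative choice).
    """
    n = len(points)

    # Group the points by their cluster label.
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    k = len(clusters)
    if k < 2 or k >= n:
        raise ValueError("the statistic is defined only for 2 <= k < n")

    def centroid(rows):
        # Coordinate-wise mean of a group of points.
        return [fmean(column) for column in zip(*rows)]

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    overall = centroid(points)
    between = 0.0  # B: between-cluster sum of squares
    within = 0.0   # W: within-cluster sum of squares
    for rows in clusters.values():
        center = centroid(rows)
        between += len(rows) * sq_dist(center, overall)
        within += sum(sq_dist(row, center) for row in rows)

    if within == 0:
        return float("inf")  # every point coincides with its centroid
    return (between / (k - 1)) / (within / (n - k))


# Toy data: two well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good = [0, 0, 0, 1, 1, 1]  # partition matching the two groups
bad = [0, 1, 0, 1, 0, 1]   # partition mixing the two groups
assert pseudo_f(points, good) > pseudo_f(points, bad)
```

Under these assumptions, one would run the clustering for each candidate number of clusters, compute the statistic for each resulting partition, and prefer the number of clusters that gives the largest value.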