Several typical supervised clustering methods, such as Gaussian mixture model-based supervised clustering (GMM), k-nearest-neighbor (KNN), binary support vector machines (SVMs), and multiclass support vector machines (MC-SVMs), were employed to classify computer simulation data and two real microarray expression datasets. False positives, false negatives, true positives, true negatives, clustering accuracy, and Matthews' correlation coefficient (MCC) were compared among these methods. The results are as follows: (1) In classifying thousands of gene expression profiles, the two GMM methods achieved the highest clustering accuracy and the lowest overall FP+FN error counts, under the assumption that the whole set of microarray data is a finite mixture of multivariate Gaussian distributions. Furthermore, when the number of training samples is very small, the GMM-II method outperforms the GMM-I method in clustering accuracy. (2) In general, the MC-SVMs show superior classification performance: they are more robust and more practical, less sensitive to the curse of dimensionality, second only to the GMM methods in clustering accuracy on thousands of gene expression profiles, and more robust than the other techniques when only a small number of high-dimensional gene expression samples is available. (3) Among the MC-SVMs, OVO and DAGSVM perform better on large sample sizes, whereas all five MC-SVM methods have very similar performance on moderate sample sizes; OVR, WW, and CS yield better results when sample sizes are small. It is therefore recommended that at least two candidate methods, chosen on the basis of the real data characteristics and experimental conditions, be applied and compared to obtain better clustering results.
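To illustrate the kind of comparison described above, the following minimal sketch (not the authors' implementation) fits a per-class Gaussian-mixture classifier, KNN, and several multiclass SVM schemes on synthetic data and reports clustering accuracy and MCC. The GMMClassifier class, the synthetic dataset, and the scikit-learn API choices are assumptions made for this example; DAGSVM and the WW formulation are omitted because they are not available in scikit-learn.

```python
# Illustrative sketch only: compares GMM-based supervised classification, KNN,
# and several multiclass SVM schemes on synthetic "expression-like" data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.mixture import GaussianMixture
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import accuracy_score, matthews_corrcoef


class GMMClassifier:
    """Hypothetical stand-in for GMM-based supervised clustering: fit one
    Gaussian mixture per class, assign each sample to the class whose
    mixture gives the highest log-likelihood."""

    def __init__(self, n_components=1):
        self.n_components = n_components

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.models_ = [
            GaussianMixture(n_components=self.n_components, random_state=0).fit(X[y == c])
            for c in self.classes_
        ]
        return self

    def predict(self, X):
        # Per-class log-likelihood of each sample; pick the best class.
        scores = np.column_stack([m.score_samples(X) for m in self.models_])
        return self.classes_[np.argmax(scores, axis=1)]


# Synthetic high-dimensional data as a stand-in for microarray expression profiles.
X, y = make_classification(n_samples=300, n_features=50, n_informative=10,
                           n_classes=3, n_clusters_per_class=1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

classifiers = {
    "GMM (per-class mixture)": GMMClassifier(n_components=1),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "MC-SVM (OVO)": SVC(kernel="linear", decision_function_shape="ovo"),
    "MC-SVM (OVR)": OneVsRestClassifier(SVC(kernel="linear")),
    "MC-SVM (Crammer-Singer)": LinearSVC(multi_class="crammer_singer", max_iter=10000),
}

for name, clf in classifiers.items():
    y_pred = clf.fit(X_tr, y_tr).predict(X_te)
    print(f"{name:26s} accuracy={accuracy_score(y_te, y_pred):.3f} "
          f"MCC={matthews_corrcoef(y_te, y_pred):.3f}")
```

On real microarray data the same loop would be run over the measured expression matrix and class labels, with false positives, false negatives, true positives, and true negatives taken from the confusion matrix of each classifier.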