Many semi-supervised classification algorithms have been proposed in recent years. In real learning tasks, however, it is difficult for researchers to decide which semi-supervised classification algorithm to choose, and little guidance exists on this question. Semi-supervised classification algorithms can be categorized by their data distribution assumptions. On this basis, taking the supervised Least Squares method (LS) as the baseline, this paper compares the classification performance of typical semi-supervised algorithms built on different assumptions: the Transductive Support Vector Machine (TSVM), which relies on the cluster assumption; Laplacian Regularized Least Squares Classification (LapRLSC), which relies on the manifold assumption; SemiBoost, which exploits both assumptions; and Implicitly Constrained Least Squares (ICLS), which makes no assumption. The conclusions are as follows: when the data distribution is known, a method built on the corresponding assumption attains high classification accuracy; when there is no prior knowledge of the data distribution and labeled samples are limited, TSVM can reach high classification accuracy; when labels are hard to obtain and classification safety is emphasized, ICLS is the preferred choice, with LapRLSC another good option.
Many semi-supervised classification algorithms have been proposed recently; however, it is hard to decide which one to use in real learning tasks, and there is no related guidance in the literature. Therefore, empirical comparisons of several typical algorithms have been performed to provide useful suggestions. Semi-supervised classification algorithms can be categorized by the data distribution assumption they adopt, so typical algorithms adopting different assumptions have been contrasted. Specifically, they are the Transductive Support Vector Machine (TSVM) using the cluster assumption, Laplacian Regularized Least Squares Classification (LapRLSC) using the manifold assumption, SemiBoost using both assumptions, and Implicitly Constrained Least Squares (ICLS) without any assumption, with supervised Least Squares Classification (LS) as the baseline. It is concluded that when the data distribution is known, the semi-supervised classification algorithm adopting the corresponding assumption leads to the best performance; without any prior knowledge about the data distribution, TSVM can be a good choice when the given labeled samples are extremely limited; when the labeled samples are not so scarce and learning safety is emphasized, ICLS is recommended, and LapRLSC is another good choice.
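The experimental protocol behind the comparison, training each method on a mostly unlabeled dataset and measuring accuracy against a supervised LS baseline, can be sketched as below. This is a minimal illustration under stated assumptions: TSVM, LapRLSC, SemiBoost, and ICLS are not available in scikit-learn, so `LabelPropagation` (a graph-based, manifold-style method) stands in as an illustrative semi-supervised learner, and `RidgeClassifier` stands in as the least-squares baseline.

```python
# Sketch of the comparison protocol: a supervised least-squares-style baseline
# versus a semi-supervised learner when most labels are hidden.
# NOTE: LabelPropagation and RidgeClassifier are illustrative stand-ins, not
# the TSVM/LapRLSC/SemiBoost/ICLS implementations compared in the paper.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.linear_model import RidgeClassifier
from sklearn.semi_supervised import LabelPropagation
from sklearn.metrics import accuracy_score

rng = np.random.RandomState(0)
X, y = make_moons(n_samples=400, noise=0.1, random_state=0)

# Hide most labels; scikit-learn marks unlabeled points with -1.
y_train = y.copy()
unlabeled = rng.rand(len(y)) > 0.05     # keep roughly 5% of the labels
y_train[unlabeled] = -1
labeled = ~unlabeled

# Supervised baseline: trained only on the few labeled points.
ls = RidgeClassifier().fit(X[labeled], y[labeled])

# Semi-supervised learner: uses labeled AND unlabeled points jointly.
ssl = LabelPropagation().fit(X, y_train)

print("LS  accuracy:", accuracy_score(y, ls.predict(X)))
print("SSL accuracy:", accuracy_score(y, ssl.predict(X)))
```

On data matching the method's assumption (here, two well-separated manifolds), the semi-supervised learner typically recovers much of the accuracy lost to missing labels, which mirrors the abstract's conclusion that an assumption matching the data distribution yields the best performance.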