Traditional classifiers are trained only on labeled data. Labeled instances, however, are often hard to obtain because labeling is expensive and time-consuming, which creates the so-called annotation bottleneck. Semi-supervised learning addresses this bottleneck by combining a large amount of unlabeled data with the labeled data to build a classifier that performs well. Because semi-supervised learning requires less manual effort while still achieving relatively high accuracy, it is of both theoretical and practical interest. When labeled data are very scarce, however, statistical methods cannot be used to evaluate the confidence of the labels assigned to unlabeled data. Building on a study of existing semi-supervised learning algorithms, we propose a two-phase co-training method based on kNN and SVM classifiers for this setting. Experiments show that the proposed method is effective.