To keep the information gain (IG) method, which is biased toward high-frequency words, from ignoring similarities between classes, a feature-distribution-based selection method is proposed to correct IG so that feature terms that truly carry high class-discriminative information are retained. In addition, the low efficiency of the expectation maximization (EM) algorithm is improved by gradually transferring unlabeled documents with high posterior class probability from the unlabeled document set to the labeled document set, effectively reducing the number of iterations. Test results show that the feature-distribution-based semi-supervised learning method achieves good classification effectiveness and performance on Reuter-21578 and Epinion.com, two data sets with different characteristics.
It is crucial for semi-supervised learning (SSL) to reduce the dimensionality of the feature space through feature selection. The popular information gain (IG) selection method, which is biased toward high-frequency words, ignores the similarity between classes, so the classification performance of IG-selected features is unstable. This paper puts forward a feature-distribution-based selection method that helps IG retain features carrying highly class-discriminative information. To address the inherent inefficiency of the expectation maximization (EM) algorithm, unlabeled documents with high posterior category probability are gradually transferred from the unlabeled collection to the labeled collection, markedly reducing the number of EM iterations. Finally, experimental evaluation on Reuter-21578 and Epinion.com, two data sets with different characteristics, shows that the semi-supervised learning method using feature distribution achieves strong performance under the micro-averaged F1 criterion.
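The document-transfer idea behind the improved EM can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes a multinomial Naive Bayes base classifier, and the function names, the confidence threshold of 0.9, and the toy word-count data are all hypothetical choices made here for demonstration.

```python
import numpy as np

def train_nb(X, y, n_classes, alpha=1.0):
    """Laplace-smoothed multinomial Naive Bayes parameters, in log space."""
    log_prior = np.log(np.array([(y == c).mean() for c in range(n_classes)]))
    log_cond = np.log(np.array([
        (X[y == c].sum(axis=0) + alpha) /
        (X[y == c].sum() + alpha * X.shape[1])
        for c in range(n_classes)
    ]))
    return log_prior, log_cond

def posterior(log_prior, log_cond, X):
    """Normalised posterior P(class | document) for each row of X."""
    logp = X @ log_cond.T + log_prior
    logp -= logp.max(axis=1, keepdims=True)   # stabilise before exp
    p = np.exp(logp)
    return p / p.sum(axis=1, keepdims=True)

def em_with_transfer(Xl, yl, Xu, n_classes=2, threshold=0.9, max_iter=20):
    """Each round: retrain on the labeled pool, then permanently move
    unlabeled documents whose top posterior reaches `threshold` into it."""
    Xl, yl, Xu = Xl.copy(), yl.copy(), Xu.copy()
    for _ in range(max_iter):
        log_prior, log_cond = train_nb(Xl, yl, n_classes)
        if len(Xu) == 0:
            break
        post = posterior(log_prior, log_cond, Xu)
        move = post.max(axis=1) >= threshold
        if not move.any():
            break  # no confidently classified documents left; stop early
        Xl = np.vstack([Xl, Xu[move]])
        yl = np.concatenate([yl, post[move].argmax(axis=1)])
        Xu = Xu[~move]
    return Xl, yl, Xu

# Hypothetical toy word-count data, for illustration only:
# class 0 favours word 0, class 1 favours word 1.
Xl = np.array([[3, 0, 1], [2, 1, 0], [0, 3, 1], [1, 2, 0]], dtype=float)
yl = np.array([0, 0, 1, 1])
Xu = np.array([[4, 0, 0], [0, 4, 1], [3, 0, 0]], dtype=float)

Xl2, yl2, Xu2 = em_with_transfer(Xl, yl, Xu)
```

Because transferred documents keep their assigned labels instead of being re-estimated every round, the labeled pool grows monotonically and the loop terminates in far fewer iterations than standard EM, which re-weights every unlabeled document on every pass.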