Supervised fake review detection is limited by the size of the annotated corpus. To make better use of unlabeled review data and improve the accuracy and generalization ability of the classifier, this paper proposes a fake review detection method based on semi-supervised active learning. First, review content features and reviewer behavioral features are defined and extracted, and the two feature types are combined to detect fake reviews. Second, an entropy-based active learning algorithm selects the review samples most helpful for learning and obtains their class labels; the newly labeled samples are merged into the training set of a Tri-training-based semi-supervised learning algorithm, which then learns from a large number of unlabeled reviews to improve classifier performance. Finally, experiments are carried out on domain review datasets. The results show that combining semi-supervised learning with active learning makes more effective use of unlabeled review data and thereby improves fake review detection.
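The entropy-based selection step can be illustrated with a minimal sketch. It assumes the current classifier outputs a class-probability vector for each unlabeled review and that the reviews with the highest prediction entropy are the ones sent for labeling. The function names, the two-class probability layout, and the sample values are hypothetical and are not taken from the paper.

```python
import math


def prediction_entropy(probs):
    """Shannon entropy (in nats) of one class-probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_most_uncertain(prob_rows, k):
    """Return indices of the k rows with the highest entropy, most uncertain first."""
    ranked = sorted(
        range(len(prob_rows)),
        key=lambda i: prediction_entropy(prob_rows[i]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical [P(fake), P(genuine)] outputs for four unlabeled reviews.
unlabeled_probs = [[0.95, 0.05], [0.55, 0.45], [0.10, 0.90], [0.50, 0.50]]

# Indices of the two reviews the classifier is least sure about.
query_indices = select_most_uncertain(unlabeled_probs, k=2)
```

Under these assumptions, the reviews at `query_indices` would be labeled by an annotator and added to the labeled set before the next round of Tri-training; how many samples are queried per round is a design choice the abstract does not specify.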