Deceptive Review Identification Based on PU Learning

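ISSN: 1000-1239
Journal: Journal of Computer Research and Development (计算机研究与发展)
Date: 2015.3.1
Pages: 639-648
Classification: TP391 [Automation and Computer Technology: Computer Application Technology; Automation and Computer Technology: Computer Science and Technology]
Author affiliation: [1] School of Computer Science, Wuhan University, Wuhan 430072
Funding: National Natural Science Foundation of China (61133012, 61173062, 61373108); Major Bidding Project of the National Philosophy and Social Science Foundation of China (11&ZD189)
Related project: Resource Construction and Statistical Analysis for Chinese Textual Inference (汉语文本推理的资源建设和统计分析研究)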

Keywords: deceptive reviews, supervised learning, positive and unlabeled (PU) learning, Dirichlet process mixture model (DPMM), multiple kernel learning (MKL)

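Chinese abstract (translated):

Identifying deceptive reviews is of both theoretical significance and practical value. Previous work has focused on heuristic strategies and traditional fully supervised learning algorithms. Recent research shows that humans cannot effectively identify deceptive reviews from prior knowledge, so manually annotated datasets inevitably contain a certain number of mislabeled examples; simply applying traditional fully supervised learning algorithms to deceptive review identification is therefore unsound. Examples that are easily mislabeled are called spy examples, and how their class labels are determined directly affects classifier performance. Based on a small number of truthful reviews and a large number of unlabeled reviews, a novel PU (positive and unlabeled) learning framework is proposed to identify deceptive reviews. First, a small number of high-confidence negative examples are identified from the unlabeled dataset. Second, multiple representative positive and negative examples are computed by integrating LDA (latent Dirichlet allocation) and K-means. Next, all spy examples are clustered with a Dirichlet process mixture model (DPMM), and a population-based strategy and an individual-based strategy are mixed to determine the class labels of the spy examples. Finally, a multiple kernel learning algorithm is used to train the final classifier. Numerical experiments confirm the effectiveness of the proposed algorithm, which outperforms the current baselines.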

English abstract:

Identifying deceptive reviews has important theoretical and practical value. Previous work focuses on heuristic rules or traditional supervised methods. Recent research has shown that humans cannot reliably identify deceptive reviews from prior knowledge, so a human-annotated dataset must contain some mislabeled examples. Because supervised learning depends on such human labeling, the problem remains highly challenging. Some ambiguous reviews (we call them spy examples) are easily mislabeled, and the key to identifying deceptive reviews is how to deal with these spy examples. Based on some truthful reviews and a large number of unlabeled reviews, a novel approach, called mixing population and individual nature PU learning, is proposed. First, some reliable negative examples are identified from the unlabeled dataset. Second, some representative positive and negative examples are generated by integrating latent Dirichlet allocation and K-means. Third, all spy examples are clustered into groups based on a Dirichlet process mixture model, and two schemes (population nature and individual nature) are mixed to determine the category labels of the spy examples. Finally, multiple kernel learning is used to build the final classifier. Experimental results demonstrate that the proposed methods can effectively identify deceptive reviews and outperform the current baselines.
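
The first step named in the abstract, extracting reliable negative examples from the unlabeled set, can be illustrated with a minimal sketch. The abstract does not say how the authors perform this step, so everything below is an assumption made for illustration: the centroid-similarity heuristic, the whitespace tokenizer, the function names, and the 0.2 threshold are hypothetical and are not the paper's method.

```python
import math
from collections import Counter


def bow(text):
    """Bag-of-words term counts for a lowercased, whitespace-tokenized text."""
    return Counter(text.lower().split())


def centroid(vectors):
    """Average a list of sparse term-count vectors into one sparse vector."""
    total = Counter()
    for vec in vectors:
        total.update(vec)
    n = len(vectors)
    return {term: count / n for term, count in total.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors (0.0 if either is empty)."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def reliable_negatives(positives, unlabeled, threshold=0.2):
    """Assumed heuristic: unlabeled texts that look least like the labeled
    positive (truthful) reviews are treated as reliable negatives."""
    pos_centroid = centroid([bow(text) for text in positives])
    return [
        text for text in unlabeled
        if cosine(bow(text), pos_centroid) < threshold
    ]


positives = [
    "the room was clean and the staff were friendly",
    "clean room friendly staff good breakfast",
]
unlabeled = [
    "friendly staff and clean room",
    "best hotel ever amazing luxury perfect",
]
negatives = reliable_negatives(positives, unlabeled)
```

In this toy data the second unlabeled review shares no terms with the positive reviews, so it falls below the threshold and is selected. Reviews that are neither clearly positive nor reliably negative would correspond to the ambiguous spy examples that the abstract handles in its later steps.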
