微博作为目前国内外最活跃的信息分享平台之一,其中却充斥着大量的垃圾内容。因此,如何从给定话题的微博数据中,过滤掉与话题不相关的垃圾微博、保留话题相关微博,成为迫切需要解决的问题。该文提出了一种半监督的中文微博过滤方法,基于朴素贝叶斯分类模型和最大期望算法,实现了利用少量标注数据的垃圾微博过滤算法,其优势是仅仅利用少量标注数据就可以获得较为理想的过滤性能。分别对十个话题140 000余条新浪微博数据进行过滤,该文提出的模型准确度和F值优于朴素贝叶斯和支持向量机模型。
Microblogging sites are one of the most popular information sharing platforms today. However, among the large amount of posted published every day, spare texts are seen everywhere: users utilize spam posts to advertise, broadcast, boast their own products, and defame their competitors. Therefore, filtering spare tweets is a criti- cal and fundamental problem. In this paper, we propose a semi-supervised algorithm based on Expectation Maximization and Naive Bayesian Classifier (EM-NB), which is able to filter spam tweets effectively using only a small amount of labeled data. The experimental results on more than 140 thousand tweets from Sina Weibo show that our method achieves higher accuracy and F-score than baselines.