Microblog space is flooded with advertising content, which seriously degrades web public opinion analysis, so detecting and filtering advertising microblogs has become an urgent problem. Based on an analysis of the characteristics of advertising microblogs over massive data, this paper proposes a two-step detection approach. First, three new features, "number of microblogs posted during inactive periods" (post frequency), "microblog repetition" (multiplicity), and "feature word-pair weight", are added to traditional text features, and a support vector machine (SVM) model classifies microblog texts to identify advertisers. Second, the differences between advertisers and legitimate users are analyzed, the "topic" feature of each advertiser is extracted, and microblog texts are filtered on a per-user basis to detect advertising microblogs. Experiments show that the method achieves 86.7% precision, 97.2% recall, and a 91.6% F-score, indicating that it detects advertising microblogs efficiently and accurately.
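As a quick consistency check on the reported metrics, the sketch below recomputes the F-score from 86.7% precision and 97.2% recall. It assumes the F-score is the balanced F1 measure (the harmonic mean of precision and recall), which the abstract does not state explicitly. The function name `f_score` is illustrative and not taken from the paper.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F-score (F1): harmonic mean of precision and recall.

    Assumption: the abstract's "F-score" is F1; the paper may define it differently.
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values reported in the abstract, expressed as fractions.
f = f_score(0.867, 0.972)
print(f"{f:.4f}")  # prints 0.9165
```

Under this assumption the result is about 0.9165, which agrees with the reported 91.6% up to rounding.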