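Microblogs carry valuable information tied to specific topics, such as hot spots of public opinion. Microblog data analysis (for example, topic detection) has therefore become an active research area. Because microblog content and form are highly unconstrained, such work suffers from heavy noise caused by spam posts and from the difficulty of extracting useful data. However, there is as yet no effective method for filtering Chinese spam microblogs on non-public topics. This paper proposes a spam microblog filtering method based on multi-view feature fusion. The method first builds rules from two views, the structure and the content of a microblog, then fuses them with the word segmentation results of the microblog text to construct composite features, which are used to filter spam microblogs. Experiments on a real data set show that the fused multi-view features clearly improve filtering performance.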
Because microblogs contain valuable information, microblog data analysis such as topic detection has become a research hotspot. Owing to the high flexibility of microblog content and form, noisy data is a major challenge for microblog analysis. However, no effective method has yet been developed for filtering Chinese spam microblogs on non-public topics. To fill this gap, a new method is proposed that fuses multi-angle features extracted from both the content and the structure of microblogs. The fused features are then used by classifiers to filter spam microblogs. Experiments on real data demonstrate that the fusion of multi-angle features effectively improves spam filtering performance.
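As a rough illustration of the fusion step described above, the following minimal sketch combines structure rules, content rules, and token counts into one feature dictionary. It is not the authors' implementation: the specific rules, the `AD_WORDS` list, the feature names, and the whitespace-based `tokenize` function (a stand-in for a real Chinese word segmenter) are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical structural patterns; the abstract does not list the actual rules.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@[\w\-]+")
HASHTAG_RE = re.compile(r"#[^#\s]+#")  # Weibo-style #topic# markers

# Hypothetical advertising vocabulary used by the content rule.
AD_WORDS = ("discount", "free", "click")


def tokenize(text):
    """Stand-in for a Chinese word segmenter: drop URLs, mentions and
    hashtags, then split the remaining text into lowercase word tokens."""
    for pattern in (URL_RE, MENTION_RE, HASHTAG_RE):
        text = pattern.sub(" ", text)
    return re.findall(r"\w+", text.lower())


def structure_features(text):
    """Rule-based features taken from the structure of the post."""
    return {
        "struct:has_url": float(bool(URL_RE.search(text))),
        "struct:n_mentions": float(len(MENTION_RE.findall(text))),
        "struct:n_hashtags": float(len(HASHTAG_RE.findall(text))),
    }


def content_features(tokens):
    """Rule-based features taken from the content of the post."""
    return {
        "content:n_ad_words": float(sum(t in AD_WORDS for t in tokens)),
        "content:is_short": float(len(tokens) < 3),
    }


def fused_features(text):
    """Fuse token counts with both rule views into one feature dict."""
    tokens = tokenize(text)
    features = {f"tok:{t}": float(c) for t, c in Counter(tokens).items()}
    features.update(structure_features(text))
    features.update(content_features(tokens))
    return features


if __name__ == "__main__":
    post = "@shop click http://t.cn/abc for a free discount #sale#"
    for name, value in sorted(fused_features(post).items()):
        print(name, value)
```

The key prefixes (`tok:`, `struct:`, `content:`) keep the views from colliding in the fused dictionary, which could then be vectorized and passed to any standard classifier; the choice of classifier is left open here, as it is in the abstract.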