To address the large volume of advertisements and other noise posts on microblogs, this paper proposes a user-classification filtering mechanism based on the C4.5 decision tree classification algorithm and a score-based filtering method based on feature values. Exploiting the real-time nature of microblog text and the time sensitivity of microblog topics, it also proposes a similarity calculation method based on a time parameter. Experimental results show that the approach filters noise and detects topics with better accuracy and efficiency than the traditional approach.
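The abstract does not give the formula for the time-based similarity, so the following is only an illustrative sketch of the general idea, not the paper's method. It assumes cosine similarity over term counts, scaled by an exponential decay in the time gap between two posts; the function names and the `half_life_hours` parameter are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def time_weighted_similarity(tokens_a, time_a, tokens_b, time_b,
                             half_life_hours=6.0):
    """Text similarity scaled down as the posting times move apart.

    Assumption (not from the paper): the weight halves every
    `half_life_hours` of separation. Times are Unix timestamps in seconds.
    """
    text_sim = cosine_similarity(Counter(tokens_a), Counter(tokens_b))
    gap_hours = abs(time_a - time_b) / 3600.0
    decay = 0.5 ** (gap_hours / half_life_hours)
    return text_sim * decay
```

Under these assumptions, two posts with identical wording score close to 1 when posted at the same moment and about half that when posted one half-life apart, so older posts contribute less to a topic cluster than recent ones.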