Web spam hinders users from obtaining information normally. To detect spam pages effectively, the distributions of web page content features and link features are analyzed. The analysis shows that the features of normal pages follow regular distributions, whereas those of spam pages are scattered. Based on this difference, a method is proposed that fits a distribution function to the feature distribution of normal pages and computes the difference between the observed proportion of normal and spam pages and the fitted distribution function. With this difference used as the threshold, a C4.5 decision tree is constructed to detect spam pages. Experimental results show that the method effectively reduces the number of misclassified normal pages and improves accuracy.
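The pipeline described above can be illustrated with a minimal sketch. It is not the paper's implementation. The abstract does not name the fitted distribution family, the binning, or the exact form of the difference, so the sketch assumes a Gaussian fit, fixed-width bins, and a signed difference (observed proportion minus fitted probability mass). A single split scored by gain ratio, the criterion associated with C4.5, stands in for a full C4.5 tree. All function names and sample values are hypothetical.

```python
import math
from collections import Counter

def fit_gaussian(values):
    """Fit mean and standard deviation to normal-page feature values (assumed family)."""
    mu = sum(values) / len(values)
    var = sum((v - mu) ** 2 for v in values) / len(values)
    return mu, math.sqrt(var)

def gaussian_cdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def bin_index(value, lo, width, n_bins):
    """Map a feature value to a fixed-width bin, clamping to the valid range."""
    return min(max(int((value - lo) / width), 0), n_bins - 1)

def excess_feature(normal, spam, lo=0.0, width=20.0, n_bins=25):
    """Per page: observed proportion of its bin minus the fitted probability mass."""
    mu, sigma = fit_gaussian(normal)
    pages = [(v, "normal") for v in normal] + [(v, "spam") for v in spam]
    counts = Counter(bin_index(v, lo, width, n_bins) for v, _ in pages)
    xs, ys = [], []
    for value, label in pages:
        b = bin_index(value, lo, width, n_bins)
        left, right = lo + b * width, lo + (b + 1) * width
        expected = gaussian_cdf(right, mu, sigma) - gaussian_cdf(left, mu, sigma)
        xs.append(counts[b] / len(pages) - expected)
        ys.append(label)
    return xs, ys

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def best_gain_ratio_threshold(xs, ys):
    """Try midpoints between distinct values; keep the split with the highest gain ratio."""
    base = entropy(ys)
    best_t, best_ratio = None, -1.0
    distinct = sorted(set(xs))
    for a, b in zip(distinct, distinct[1:]):
        t = (a + b) / 2
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        p = len(left) / len(ys)
        gain = base - p * entropy(left) - (1 - p) * entropy(right)
        split_info = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
        ratio = gain / split_info
        if ratio > best_ratio:
            best_t, best_ratio = t, ratio
    return best_t, best_ratio

# Hypothetical feature values (e.g. word counts): normal pages cluster, spam pages scatter.
normal_pages = [90, 95, 100, 100, 105, 110, 100, 95, 105, 100]
spam_pages = [10, 20, 200, 300, 400]

xs, ys = excess_feature(normal_pages, spam_pages)
threshold, ratio = best_gain_ratio_threshold(xs, ys)
predicted = ["spam" if x > threshold else "normal" for x in xs]
```

On this toy data, pages whose bins hold more pages than the fitted curve predicts fall above the learned threshold. Real feature distributions would likely need a different distribution family, finer bins, and a full tree over several such difference features.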