Web spam refers to the use of technical means to make a web page rank higher in search engine results than it deserves, which seriously degrades the quality of search results. Because Web spam datasets are severely imbalanced, this study proposes first balancing the dataset with the SMOTE over-sampling method and then training a classifier with the random forests (RF) algorithm. Comparative experiments against common single classifiers and ensemble learning classifiers show that the SMOTE + RF method performs comparatively well. The important parameters of the method are tuned based on the experimental results, and the reasons why the AUC value improves after applying SMOTE are analyzed. Experiments on the WEBSPAM UK2007 dataset show that the method markedly improves classification performance, and its AUC value exceeds the best result of the Web Spam Challenge 2008.
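
The balancing step can be illustrated with a minimal sketch of SMOTE-style over-sampling, written in plain Python. It is not the implementation used in the study: the function names `smote` and `balance`, the default `k=5`, the fixed seed, and the choice to over-sample until both classes are the same size are all assumptions made for illustration. Each synthetic sample is placed at a random position on the segment between a minority-class sample and one of its k nearest minority-class neighbours.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Return n_new synthetic samples interpolated between minority samples.

    Illustrative sketch only; names and defaults are assumptions.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than samples
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Indices of the other minority samples, nearest to the base first.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(base, minority[j]))
        neighbour = minority[rng.choice(others[:k])]
        # The new point lies on the segment between base and neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


def balance(X, y, minority_label=1, k=5, seed=0):
    """Over-sample the minority class until both classes have equal size."""
    minority = [x for x, label in zip(X, y) if label == minority_label]
    n_majority = len(X) - len(minority)
    n_new = max(0, n_majority - len(minority))
    new_points = smote(minority, n_new, k, seed) if n_new else []
    return list(X) + new_points, list(y) + [minority_label] * len(new_points)
```

In a pipeline of the kind described above, the balanced output of `balance` would be used only as training data for the random forest classifier, while the AUC would be measured on untouched test data, so that the synthetic samples do not influence the evaluation.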