The features of spam Web pages are first divided into two views: a content-feature view and a link-feature view. Canonical correlation analysis (CCA) and its improved variants are then applied for feature extraction, generating two new feature sets for each Web page. The two new feature sets are combined in different ways to produce single-view data, which serves as the training data for constructing classification algorithms. Experimental results show that treating spam Web pages as two-view data and applying multi-view canonical correlation analysis can effectively improve the accuracy of Web spam detection.
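As an illustration of the two-view pipeline described above, the following is a minimal sketch of plain CCA followed by two ways of merging the projected views into a single view. It is not the paper's implementation: the function names `cca_fit` and `combine`, the regularization constant `reg`, the synthetic data, and the two combination modes (concatenation and element-wise sum) are all assumptions made for the example, and the improved CCA variants mentioned in the abstract are not covered.

```python
import numpy as np

def cca_fit(X, Y, n_components, reg=1e-6):
    """Plain CCA via whitening and SVD (hypothetical helper).

    X: (n, p) content-view features, Y: (n, q) link-view features.
    reg is a small ridge term (an assumption) that keeps the
    covariance matrices positive definite.
    """
    mx, my = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - mx, Y - my
    n = X.shape[0]
    Sxx = Xc.T @ Xc / (n - 1) + reg * np.eye(X.shape[1])
    Syy = Yc.T @ Yc / (n - 1) + reg * np.eye(Y.shape[1])
    Sxy = Xc.T @ Yc / (n - 1)
    # Cholesky factors: Sxx = Lx Lx^T, Syy = Ly Ly^T
    Lx = np.linalg.cholesky(Sxx)
    Ly = np.linalg.cholesky(Syy)
    # Whitened cross-covariance T = Lx^{-1} Sxy Ly^{-T}
    T = np.linalg.solve(Lx, Sxy)
    T = np.linalg.solve(Ly, T.T).T
    U, s, Vt = np.linalg.svd(T, full_matrices=False)
    # Map the singular vectors back to the original feature spaces
    Wx = np.linalg.solve(Lx.T, U[:, :n_components])
    Wy = np.linalg.solve(Ly.T, Vt.T[:, :n_components])
    return mx, my, Wx, Wy, s[:n_components]

def combine(Zx, Zy, mode="concat"):
    """Merge two projected views into one (modes are illustrative)."""
    if mode == "concat":   # place the two feature sets side by side
        return np.hstack([Zx, Zy])
    if mode == "sum":      # add the two feature sets element-wise
        return Zx + Zy
    raise ValueError(f"unknown mode: {mode}")

# Synthetic two-view data sharing a 2-dimensional latent signal
rng = np.random.default_rng(0)
n = 500
latent = rng.normal(size=(n, 2))
X = latent @ rng.normal(size=(2, 6)) + 0.5 * rng.normal(size=(n, 6))
Y = latent @ rng.normal(size=(2, 4)) + 0.5 * rng.normal(size=(n, 4))

mx, my, Wx, Wy, corrs = cca_fit(X, Y, n_components=2)
Zx, Zy = (X - mx) @ Wx, (Y - my) @ Wy
serial = combine(Zx, Zy, "concat")    # shape (500, 4)
parallel = combine(Zx, Zy, "sum")     # shape (500, 2)
```

Either `serial` or `parallel` could then be passed, together with the spam labels, to any standard classifier as single-view training data; which combination and which classifier work best is an empirical question that this sketch does not address.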