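Building on link analysis for the automatic detection of spam pages, we propose a staged mechanism. A decision tree and a link analysis model are used to examine the Indegree and Outdegree of every node in Wikipedia, producing a candidate list, and a heuristic algorithm is introduced to reduce type I errors. A classifier is then designed to classify the candidate list: the TrustRank and SpamRank algorithms estimate, from a trusted seed set and a spam seed set respectively, each page's probability of being trustworthy and of being spam, thereby reducing type II errors. The resulting candidate set is then pushed to page editors, and the editors' judgments are fed back to train the model and adjust its weights. The results show that the staged detection model can detect spam pages automatically, with precision and recall reaching 78.3% and 94%, respectively.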
Web spam has been attacking Internet applications, from search engines to open-content sites, and hindering their performance. To combat web spam, Wikipedia decided to add the "NOFOLLOW" attribute to all external links. Unfortunately, this double-edged sword has also reduced the value of a Wikipedia link and dampened the enthusiasm of contributors. Based on data mining techniques, we propose a method, two-stage link spam filtering (TLSF), which can efficiently and effectively detect web spam. In the first stage, TLSF examines the link structure of the whole of Wikipedia and generates a candidate list of spam pages by inspecting Indegree, Outdegree, and the probability that a page belongs to a link farm. In the second stage, a classifier is used to detect web spam among the candidates generated in the first stage. TrustRank and SpamRank are combined in the second stage to narrow the results and remove non-spam pages that were listed by mistake. The weighting coefficients of the two-stage model are then trained on feedback from the confirmation step. The experiment, implemented in SAS 9.1.3, uses link data collected from Wikipedia.org/en, with more than 100 million observations. The results demonstrate that TLSF can effectively raise the accuracy of spam page detection.
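The two stages can be illustrated on a toy link graph. The following is only a minimal sketch under stated assumptions, not the SAS implementation described above. The function names, the degree thresholds (`min_out`, `max_in`), the damping factor, and the toy pages are all hypothetical. The spam-side score is a seed-biased PageRank run on the reversed graph, used here as a stand-in and not necessarily the SpamRank variant used in the experiment. The decision tree, the link-farm probability, and the feedback-trained weights are omitted.

```python
from collections import defaultdict


def degree_candidates(edges, min_out=3, max_in=1):
    """Stage 1 (illustrative): flag pages with many outlinks and few inlinks.

    The thresholds are hypothetical; the paper derives its own criteria.
    """
    indeg, outdeg = defaultdict(int), defaultdict(int)
    for src, dst in edges:
        outdeg[src] += 1
        indeg[dst] += 1
    nodes = set(indeg) | set(outdeg)
    return sorted(n for n in nodes if outdeg[n] >= min_out and indeg[n] <= max_in)


def seeded_rank(edges, seeds, damping=0.85, iters=50):
    """Seed-biased PageRank: random jumps land only on the seed set."""
    seeds = set(seeds)
    nodes = sorted({n for edge in edges for n in edge} | seeds)
    out = defaultdict(list)
    for src, dst in edges:
        out[src].append(dst)
    base = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    score = dict(base)
    for _ in range(iters):
        nxt = {n: (1 - damping) * base[n] for n in nodes}
        for n in nodes:
            targets = out[n] if out[n] else list(seeds)  # dangling mass returns to seeds
            share = damping * score[n] / len(targets)
            for m in targets:
                nxt[m] += share
        score = nxt
    return score


def two_stage_filter(edges, trusted_seeds, spam_seeds):
    """Stage 2 (illustrative): keep candidates whose spam score exceeds their trust score."""
    candidates = degree_candidates(edges)
    trust = seeded_rank(edges, trusted_seeds)
    # Assumption: spam propagates backwards, so pages linking to spam inherit spam score.
    reversed_edges = [(dst, src) for src, dst in edges]
    spam = seeded_rank(reversed_edges, spam_seeds)
    return [c for c in candidates if spam[c] > trust[c]]


# Hypothetical toy graph: "hub" is a legitimate index page, "farm" feeds a link farm.
edges = [
    ("home", "a"), ("home", "hub"), ("a", "home"), ("b", "home"),
    ("hub", "home"), ("hub", "a"), ("hub", "b"),
    ("farm", "s1"), ("farm", "s2"), ("farm", "s3"),
    ("s1", "farm"), ("s2", "s1"), ("s3", "s1"),
]
flagged = two_stage_filter(edges, trusted_seeds={"home"}, spam_seeds={"s1"})
```

In this toy graph the first stage lists both `hub` and `farm`, because each has three outlinks and a single inlink. The second stage drops `hub`, which receives trust from the seed `home` and no spam score, and keeps `farm`, which links to the spam seed and is unreachable from the trusted seed.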