
Spam Page Detection in Open Content Spaces

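ISSN: 1674-3644
Journal: Journal of Wuhan University of Science and Technology: Natural Science Edition (《武汉科技大学学报：自然科学版》)
Date: 0
Classification: TP182 [Automation and Computer Technology: Control Science and Engineering; Automation and Computer Technology: Control Theory and Control Engineering]
Author affiliations: [1] School of Economics and Management, Beihang University, Beijing 100083; [2] Center for Advanced Analytics and Business Intelligence, Texas Tech University, Lubbock, TX 79410, USA
Funding: Supported by the National Natural Science Foundation of China (70671007).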

Keywords: open content, anti-spam, knowledge discovery

Abstract (from the Chinese):

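A staged mechanism is proposed for automatically detecting spam pages based on link analysis. Decision trees and a link-analysis model are used to examine the indegree and outdegree of every node in Wikipedia, producing a candidate list, and a heuristic algorithm is introduced to reduce Type I errors. A classifier is then designed to classify the candidate list: the TrustRank and SpamRank algorithms start from a trusted seed set and a spam seed set, respectively, and estimate each page's trust probability and spam probability, which reduces Type II errors. The resulting candidate set is pushed to page editors, and the editors' judgments are fed back to train the model and adjust its weights. The results show that the staged detection model can detect spam pages automatically, with precision and recall reaching 78.3% and 94%, respectively.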
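The first stage can be illustrated with a minimal Python sketch, assuming a plain threshold rule in place of the decision tree and link-analysis model the abstract mentions. The function name `degree_candidates` and the thresholds `min_out` and `max_in` are hypothetical, chosen only for illustration; the record does not state the paper's actual rules or cut-offs.

```python
from collections import defaultdict


def degree_candidates(edges, min_out=3, max_in=1):
    """Flag pages that link out heavily but are rarely linked to.

    edges: iterable of (source_page, target_page) pairs.
    min_out, max_in: hypothetical thresholds, not values from the paper.
    """
    indegree = defaultdict(int)
    outdegree = defaultdict(int)
    nodes = set()
    for src, dst in edges:
        outdegree[src] += 1
        indegree[dst] += 1
        nodes.update((src, dst))
    # A page becomes a candidate when both degree conditions hold.
    return sorted(
        n for n in nodes
        if outdegree[n] >= min_out and indegree[n] <= max_in
    )


# Toy link graph: A, B, and C link to each other; S only links out.
edges = [
    ("A", "B"), ("B", "A"), ("A", "C"), ("C", "A"),
    ("S", "A"), ("S", "B"), ("S", "C"),
]
print(degree_candidates(edges))  # prints ['S']
```

A candidate list built this way is deliberately broad; the later classification stage is what removes legitimate pages that happen to match the degree pattern.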

English abstract:

Web spam has been attacking Internet applications, from search engines to open content, and hindering their performance. To combat web spam, Wikipedia decided to add the "NOFOLLOW" attribute to all external links. Unfortunately, this double-edged sword has also reduced the value of a Wikipedia link and dampened the enthusiasm of contributors. Based on data mining techniques, we propose a method, two-stage link spam filtering (TLSF), which can detect web spam efficiently and effectively. In the first stage, TLSF focuses on the structure of the whole of Wikipedia and generates a candidate list of spam pages by inspecting indegree, outdegree, and the probability that a page belongs to a link farm. In the second stage, a classifier is used to detect web spam among the candidates generated in the first stage. TrustRank and SpamRank are combined in the second stage to condense the results and remove non-spam pages that were listed by mistake. The weighting coefficients of the two-stage model are then trained on the basis of feedback from the affirmation process. The experiment, implemented in SAS 9.1.3, uses link data collected from Wikipedia.org/en with 100+ million observations. The results demonstrate that TLSF can effectively raise the accuracy of spam page detection.
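The seed-based scoring in the second stage might look roughly like the sketch below. It is an illustration under stated assumptions, not the paper's implementation (which was written in SAS): trust is spread with a TrustRank-style seed-biased PageRank, and the spam score is approximated by running the same propagation over reversed links from spam seeds, which is a swapped-in simplification rather than the SpamRank algorithm itself. The names `biased_pagerank` and `trust_and_spam_scores`, the damping factor, and the toy graph are all hypothetical.

```python
def biased_pagerank(edges, seeds, damping=0.85, iterations=50):
    """PageRank whose teleport step jumps only to the seed pages."""
    nodes = sorted({n for edge in edges for n in edge} | set(seeds))
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)
    seed_set = set(seeds)
    teleport = {n: (1.0 / len(seed_set) if n in seed_set else 0.0)
                for n in nodes}
    score = dict(teleport)
    for _ in range(iterations):
        nxt = {n: (1.0 - damping) * teleport[n] for n in nodes}
        for n in nodes:
            if out_links[n]:
                share = damping * score[n] / len(out_links[n])
                for dst in out_links[n]:
                    nxt[dst] += share
            else:
                # Dangling page: hand its mass back to the seeds.
                for m in nodes:
                    nxt[m] += damping * score[n] * teleport[m]
        score = nxt
    return score


def trust_and_spam_scores(edges, trusted_seeds, spam_seeds):
    """Return (trust, spam) score dictionaries for every page."""
    # Trust flows forward: pages linked from trusted pages gain trust.
    trust = biased_pagerank(edges, trusted_seeds)
    # Assumption: spam flows backward, so pages linking to spam gain it.
    reversed_edges = [(dst, src) for src, dst in edges]
    spam = biased_pagerank(reversed_edges, spam_seeds)
    return trust, spam


# Toy graph: G1 and G2 are legitimate, S1 and S2 form a small link farm,
# and candidate page X links into the farm.
edges = [
    ("G1", "G2"), ("G2", "G1"),
    ("S1", "S2"), ("S2", "S1"), ("S1", "G1"),
    ("X", "S1"),
]
trust, spam = trust_and_spam_scores(edges, ["G1"], ["S1"])
for page in sorted(trust):
    print(page, round(trust[page], 3), round(spam[page], 3))
```

In this toy graph, X receives no trust but a positive spam score, so a classifier that weighs the two scores would keep it on the spam list, while G1 and G2 would be dropped as false candidates.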
