Analysis on the Content Features and Their Correlation of Web Pages for Spam Detection
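  • ISSN: 1673-5447
  • Journal: China Communications
  • Date: 2015-03-10
  • Pages: 85-95
  • Classification: TP393.098 [Automation and Computer Technology: Computer Application Technology; Automation and Computer Technology: Computer Science and Technology] TP393.092 [Automation and Computer Technology: Computer Application Technology; Automation and Computer Technology: Computer Science and Technology]
  • Author affiliations: [1] Department of Computer Science, Shandong Normal University, Jinan 250014, China; [2] Shandong Provincial Key Laboratory for Novel Distributed Computer Software Technology
  • Funding: supported by the National Science Foundation of China (No. 61170145, 61373081); the Specialized Research Fund for the Doctoral Program of Higher Education of China (No. 20113704110001); the Technology and Development Project of Shandong (No. 2013GGX10125); the Taishan Scholar Project of Shandong, China
  • Related project: Research on Web spam identification technology based on feature modeling optimization and discriminative learning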
Abstract:

In the global information era, people acquire more and more information from the Internet, but the quality of search results is severely degraded by the presence of web spam. Web spam is one of the serious problems facing search engines, and many methods have been proposed for spam detection. We exploit the content features of non-spam pages in contrast to those of spam pages. The content features of non-spam pages consistently exhibit many statistical regularities, whereas those of spam pages exhibit very few, because spam pages are generated randomly in order to increase their page rank. In this paper, we summarize the distribution regularities of content features for non-spam pages and propose probability formulae for the entropy and for independent n-grams, respectively. Furthermore, we put forward formulae for the correlation among multiple features. Among them, the notable content features may be used as auxiliary information for spam detection.
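
The abstract names entropy and independent n-grams as content features but does not reproduce the paper's formulae. The following is only a minimal sketch of how such features are commonly computed, not the authors' method: it assumes whitespace-tokenized page text, Shannon entropy over the page's own n-gram distribution, and an add-one-smoothed reference corpus for the independent n-gram score. The function names and the `reference_counts` parameter are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_entropy(tokens, n=1):
    """Shannon entropy (bits) of the page's empirical n-gram distribution."""
    grams = ngrams(tokens, n)
    if not grams:
        return 0.0
    total = len(grams)
    counts = Counter(grams)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def independent_ngram_score(tokens, reference_counts, n=1):
    """Average negative log2 probability of the page's n-grams.

    Each n-gram is scored independently against a reference corpus
    (assumption: add-one smoothing, with one extra slot for unseen n-grams).
    """
    grams = ngrams(tokens, n)
    if not grams:
        return 0.0
    slots = len(reference_counts) + 1
    total = sum(reference_counts.values()) + slots
    log_probs = (math.log2((reference_counts.get(g, 0) + 1) / total) for g in grams)
    return -sum(log_probs) / len(grams)


# Hypothetical usage: build reference counts from known non-spam text,
# then compute both features for a page under inspection.
reference = Counter(ngrams("the quick brown fox jumps over the lazy dog".split(), 2))
page = "cheap pills cheap pills cheap pills".split()
features = (ngram_entropy(page, 2), independent_ngram_score(page, reference, 2))
```

Under these assumptions, a higher score from `independent_ngram_score` means the page's n-grams are less typical of the reference text; how such values relate to spam in the paper's data is not stated in the abstract.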

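Journal information
  • China Communications (English edition)
  • Chinese core journal of science and technology
  • Supervising body: China Association for Science and Technology
  • Sponsor: China Institute of Communications
  • Editor-in-chief: Liu Fuli
  • Address: Room 608, 6th Floor, 80 Guangqumennei Street, Dongcheng District, Beijing
  • Postal code: 100062
  • Email: editor@ezcom.cn
  • Telephone: 010-64553845
  • International standard serial number: ISSN 1673-5447
  • Domestic serial number: CN 11-5439/TN
  • Postal distribution code: 2-539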
  • Citations: 187