Research on Clustering-Based Spam Detection Techniques

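Journal: Journal of Shandong University (Natural Science) (山东大学学报(理学版))
Date: 2011.5.5
Pages: 71-76
Classification: TP311 [Automation and Computer Technology: Computer Software and Theory; Automation and Computer Technology: Computer Science and Technology]
Author affiliations: [1] School of Informatics, Guangdong University of Foreign Studies, Guangzhou 510420, Guangdong; [2] School of International Business Administration, Guangdong University of Foreign Studies, Guangzhou 510006, Guangdong; [3] College of Science, Naval University of Engineering, Wuhan 430033, Hubei
Funding: National Natural Science Foundation of China (61070061); Natural Science Foundation of Guangdong Province (9151026005000002); Guangdong Province High-Level Talent Project; Guangdong University of Foreign Studies Graduate Innovation Team Project (10GWCXTD-08)
Related project: Research on Learning Algorithms for Imbalanced Data and Their Applications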

Keywords: spam detection; kNN text categorization; single-pass clustering; incremental modeling

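Chinese abstract (translated):

As the volume of spam keeps rising, identifying spam effectively has become an important problem. To overcome the shortcomings of k-nearest neighbor (kNN) classification in spam detection, this paper proposes an improved kNN detection method based on clustering. First, a single-pass clustering algorithm based on the minimum-distance principle partitions the training email set into hyperspheres of nearly equal size, each containing texts from one or more categories. Second, a voting mechanism labels the resulting clusters: each cluster takes the category that holds the most texts in it, and the detection model consists of these labeled clusters. Finally, incoming emails are classified automatically using the nearest-neighbor idea. Experimental results show that the method substantially reduces the number of email similarity computations and outperforms TiMBL, Naive Bayesian, and Stacking. The method also allows the detection model to be updated incrementally, which gives it some practical value.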

English abstract:

With the surge of email spam, detecting it has become an important and urgent problem. To cope with the defects of kNN spam detection, an improved kNN spam detection approach based on clustering is proposed. First, using the least distance principle, the training email text samples are divided into several hyperspheres of approximately the same radius, and the texts contained in each hypersphere come from one or more categories. Second, the clusters (hyperspheres) are tagged using a majority voting mechanism: each cluster is tagged with the category that contains the most texts in the cluster, and the detection model consists of the tagged clusters. Finally, incoming email texts are classified with the kNN approach. Experimental results show that the proposed approach substantially reduces the text similarity computation and performs better than TiMBL, Naive Bayesian, and Stacking. Furthermore, the detection model constructed by the proposed approach can be updated incrementally, which makes it feasible for real-world applications.
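
The three steps described in the abstract (single-pass clustering, majority-vote tagging, nearest-neighbor detection) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the Euclidean distance, the `radius` threshold, the running-mean centroid, the use of a single nearest cluster, and all class and function names are assumptions made for the example. The abstract only states that clustering follows a least-distance principle and that the model can be updated incrementally.

```python
import math
from collections import Counter


class Cluster:
    """One hypersphere: a running sum of member vectors plus label counts."""

    def __init__(self, vector, label):
        self.total = list(vector)        # element-wise sum of member vectors
        self.count = 1                   # number of members
        self.labels = Counter([label])   # label frequencies inside the cluster

    @property
    def centroid(self):
        return [s / self.count for s in self.total]

    @property
    def label(self):
        # Majority vote: the cluster takes its most frequent member label.
        return self.labels.most_common(1)[0][0]

    def add(self, vector, label):
        self.total = [s + v for s, v in zip(self.total, vector)]
        self.count += 1
        self.labels[label] += 1


def nearest(clusters, vector):
    """Return the cluster whose centroid is closest to the vector."""
    return min(clusters, key=lambda c: math.dist(c.centroid, vector))


def update(clusters, vector, label, radius):
    """Single-pass step (assumed rule): join the nearest cluster if it lies
    within `radius`, otherwise start a new cluster. Calling this on new
    labeled mail is also how the model would be updated incrementally."""
    if clusters:
        best = nearest(clusters, vector)
        if math.dist(best.centroid, vector) <= radius:
            best.add(vector, label)
            return
    clusters.append(Cluster(vector, label))


def fit(samples, radius):
    """Build the model with one pass over (vector, label) training pairs."""
    clusters = []
    for vector, label in samples:
        update(clusters, vector, label, radius)
    return clusters


def predict(clusters, vector):
    """Classify a message by the tag of its nearest cluster."""
    return nearest(clusters, vector).label


# Toy 2-D feature vectors; real input would be text feature vectors.
training = [
    ([0.0, 0.0], "ham"),
    ([0.1, 0.0], "ham"),
    ([5.0, 5.0], "spam"),
    ([5.1, 4.9], "spam"),
    ([0.0, 0.1], "spam"),   # a minority label inside the "ham" cluster
]
model = fit(training, radius=1.0)
```

Under these assumptions, classifying a message compares it against one centroid per cluster instead of against every training email, which is where the reduction in similarity computations reported in the abstract would come from.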
