As the volume of spam email continues to grow, identifying it effectively has become an important and urgent problem. To overcome the shortcomings of k-nearest neighbor (kNN) classification in spam detection, this paper proposes an improved kNN detection method based on clustering. First, a one-pass clustering algorithm based on the minimum-distance principle partitions the training emails into hyperspheres of approximately equal radius, each containing texts from one or more categories. Second, each cluster (hypersphere) is labeled by majority vote, that is, with the category to which most of its texts belong, so the resulting detection model consists of labeled clusters. Finally, incoming emails are classified automatically by combining this model with nearest-neighbor classification. Experimental results show that the method substantially reduces the number of email similarity computations and performs better than TiMBL, Naive Bayesian, and Stacking. Moreover, the detection model can be updated incrementally, which makes the method practical for real-world applications.
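The three steps can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on several assumptions the abstract does not state: emails are already represented as dense numeric vectors, distance is Euclidean, each cluster is summarized by the running mean of its members, and a fixed `radius` threshold decides whether a sample joins its nearest cluster or starts a new one. All names (`Cluster`, `one_pass_cluster`, `classify`, `radius`) are hypothetical.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class Cluster:
    """A hypersphere summarized by its centroid and per-category counts."""

    def __init__(self, vector, label):
        self.centroid = list(vector)
        self.size = 1
        self.label_counts = Counter([label])

    def add(self, vector, label):
        # Update the centroid as a running mean, so no member needs storing.
        self.size += 1
        self.centroid = [c + (x - c) / self.size
                         for c, x in zip(self.centroid, vector)]
        self.label_counts[label] += 1

    @property
    def label(self):
        # Majority vote: the category with the most texts in the cluster.
        return self.label_counts.most_common(1)[0][0]


def one_pass_cluster(samples, radius, clusters=None):
    """Scan the samples once, assigning each to its nearest cluster.

    Passing an existing `clusters` list updates the model incrementally.
    """
    clusters = [] if clusters is None else clusters
    for vector, label in samples:
        nearest = min(clusters,
                      key=lambda c: euclidean(c.centroid, vector),
                      default=None)
        if nearest is not None and euclidean(nearest.centroid, vector) <= radius:
            nearest.add(vector, label)
        else:
            clusters.append(Cluster(vector, label))
    return clusters


def classify(clusters, vector, k=1):
    """Label a new email by voting among the k nearest labeled clusters."""
    nearest = sorted(clusters,
                     key=lambda c: euclidean(c.centroid, vector))[:k]
    votes = Counter(c.label for c in nearest)
    return votes.most_common(1)[0][0]


# Toy two-dimensional data, invented purely for illustration.
train = [
    ([0.0, 0.0], "ham"),
    ([0.1, 0.1], "ham"),
    ([5.0, 5.0], "spam"),
    ([5.1, 4.9], "spam"),
    ([0.2, 0.0], "spam"),  # minority label inside the "ham" cluster
]
model = one_pass_cluster(train, radius=1.0)
prediction_near_origin = classify(model, [0.3, 0.2])
prediction_far = classify(model, [4.8, 5.2])
```

In this sketch a new email is compared against cluster centroids rather than against every training email, which is where the reduction in similarity computations would come from. Calling `one_pass_cluster` again with new samples and the existing `model` extends the model without reclustering.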