Naive Bayes is a widely used content-based spam filtering algorithm, but the traditional Naive Bayes filter suffers from the uncertainty of judging e-mail by content alone and from an incomplete representation of the message. To address these shortcomings, this paper analyzes how the fields of the e-mail header differ between ham and spam, extracts non-characteristic information from them, and improves the Naive Bayes algorithm by combining feature information with this non-characteristic information. Experimental results show that, compared with the method that uses feature information only, the improved Naive Bayes classification approach achieves higher spam recall and precision, demonstrating the advantages of covering more of the e-mail's information and compensating for the weaknesses of content-based judgment.
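
The idea of combining content features with header-derived information can be illustrated with a minimal sketch. It is not the method evaluated in this paper: the abstract does not list the header attributes or the combination rule, so the three header indicators below (empty `To`, empty `Subject`, `From` without `@`), the multinomial model with Laplace smoothing, and the toy training messages are all assumptions made for illustration.

```python
import math
from collections import Counter


def header_tokens(headers):
    """Hypothetical header indicators (assumed, not the paper's attributes)."""
    tokens = []
    if not headers.get("To"):
        tokens.append("HDR:empty_to")
    if not headers.get("Subject"):
        tokens.append("HDR:empty_subject")
    if "@" not in headers.get("From", ""):
        tokens.append("HDR:malformed_from")
    return tokens


def featurize(headers, body):
    """Represent a message as body words plus header indicator tokens."""
    return body.lower().split() + header_tokens(headers)


class NaiveBayes:
    """Multinomial Naive Bayes with Laplace smoothing (an assumed variant)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.class_docs = Counter()   # label -> number of training messages
        self.token_counts = {}        # label -> Counter of token frequencies
        self.vocab = set()

    def fit(self, samples):
        for tokens, label in samples:
            self.class_docs[label] += 1
            self.token_counts.setdefault(label, Counter()).update(tokens)
            self.vocab.update(tokens)
        return self

    def log_scores(self, tokens):
        total_docs = sum(self.class_docs.values())
        scores = {}
        for label, n_docs in self.class_docs.items():
            counts = self.token_counts[label]
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for tok in tokens:
                if tok in self.vocab:  # tokens never seen in training are skipped
                    score += math.log((counts[tok] + self.alpha) / denom)
            scores[label] = score
        return scores

    def predict(self, tokens):
        scores = self.log_scores(tokens)
        return max(scores, key=scores.get)


# Toy training data, invented purely for illustration.
train = [
    (featurize({"From": "alice@example.com", "To": "bob@example.com",
                "Subject": "meeting"}, "project meeting tomorrow"), "ham"),
    (featurize({"From": "carol@example.com", "To": "bob@example.com",
                "Subject": "report"}, "quarterly report attached"), "ham"),
    (featurize({"From": "promo", "To": "", "Subject": ""},
               "win free prize now"), "spam"),
    (featurize({"From": "offers", "To": "", "Subject": "deal"},
               "free offer click now"), "spam"),
]
model = NaiveBayes().fit(train)

# A message whose body carries no known words: only the header tokens
# contribute evidence to the decision.
neutral_body = featurize({"From": "promo", "To": "", "Subject": ""}, "hello")
print(model.predict(neutral_body))
```

In this sketch the header indicators simply enter the model as additional tokens, so a message with an uninformative body can still be classified from its header. Whether the paper weights the two kinds of information differently cannot be determined from the abstract.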