Existing Bayesian spam filters process the text feature vector of a message with the Bernoulli model, which cannot distinguish the importance of individual features. This leads to a low recall rate in mail classification, and legitimate mail remains at risk of being misclassified as spam. To address these problems, a Bayesian spam filtering algorithm based on the multinomial model and a low-risk strategy is proposed. The algorithm uses the multinomial model to weight the features and thereby distinguish their importance in mail classification. It then compares the probabilities that a message belongs to the spam class and to the legitimate-mail class under a low-risk strategy, which reduces the risk of misclassifying legitimate mail. Experimental results show that, for different numbers of feature items, the algorithm effectively improves the precision and recall of mail classification and reduces the risk of misclassifying legitimate mail. In addition, its performance is stable, with little fluctuation, when filtering messages that contain a large number of text characters.
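The two ideas summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the exact weighting scheme or the low-risk decision rule. The sketch assumes that the feature weight is the raw term frequency used by a standard multinomial naive Bayes model with Laplace smoothing, and that "low risk" means a message is labeled spam only when its spam posterior exceeds its legitimate-mail posterior by a hypothetical factor `risk_factor`. The names `train`, `log_scores`, `classify` and the toy corpus are likewise illustrative.

```python
import math
from collections import Counter

CLASSES = ("spam", "ham")  # "ham" stands for legitimate mail


def train(docs):
    """Fit a multinomial naive Bayes model on (tokens, label) pairs."""
    counts = {c: Counter() for c in CLASSES}
    n_docs = Counter()
    for tokens, label in docs:
        counts[label].update(tokens)  # term frequencies, not just presence
        n_docs[label] += 1
    vocab = set(counts["spam"]) | set(counts["ham"])
    total_docs = sum(n_docs.values())
    model = {"vocab": vocab, "log_prior": {}, "log_like": {}}
    for c in CLASSES:
        model["log_prior"][c] = math.log(n_docs[c] / total_docs)
        # Laplace (add-one) smoothing over the vocabulary
        denom = sum(counts[c].values()) + len(vocab)
        model["log_like"][c] = {
            w: math.log((counts[c][w] + 1) / denom) for w in vocab
        }
    return model


def log_scores(model, tokens):
    """Unnormalized log posterior of each class for one message."""
    # Each known word contributes in proportion to how often it occurs,
    # which is what separates the multinomial from the Bernoulli model.
    tf = Counter(t for t in tokens if t in model["vocab"])
    return {
        c: model["log_prior"][c]
        + sum(n * model["log_like"][c][w] for w, n in tf.items())
        for c in CLASSES
    }


def classify(model, tokens, risk_factor=10.0):
    """Label as spam only if P(spam|x) > risk_factor * P(ham|x) (assumed rule)."""
    s = log_scores(model, tokens)
    return "spam" if s["spam"] - s["ham"] > math.log(risk_factor) else "ham"


# Toy corpus, invented purely for illustration.
TRAIN = [
    (["win", "money", "now"], "spam"),
    (["free", "money", "offer"], "spam"),
    (["meeting", "schedule", "today"], "ham"),
    (["project", "meeting", "notes"], "ham"),
]
model = train(TRAIN)
print(classify(model, ["free", "money", "money"]))
print(classify(model, ["money"]))
print(classify(model, ["money"], risk_factor=1.0))
```

With `risk_factor=1.0` the rule reduces to an ordinary maximum-posterior decision. Larger values trade some spam recall for fewer legitimate messages flagged as spam, which is the kind of trade-off a low-risk strategy targets.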