To improve the coverage and timeliness of spam samples, and to reduce the computational complexity and lag of spam filtering systems, a spam sample collection method is proposed that applies the honeypot principle to the behavioral characteristics of spam sending. A honeypot-account evaluation formula is introduced, and a honeypot-account selection algorithm is designed and implemented on the basis of this formula. The algorithm dynamically selects a certain number of accounts on the e-mail server as honeypots and builds a honeypot set; mail samples are then collected from this set at regular intervals and used as the learning corpus for the filtering system. Experiments show that the coverage rate of the spam samples collected with this method exceeds 98%. Because samples are collected periodically, the corpus stays current, which in turn improves the system's ability to filter spam.
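The select-then-collect flow summarized above can be illustrated with a minimal Python sketch. The abstract does not state the evaluation formula, so the sketch takes the scoring function as a caller-supplied argument; the names `select_honeypots`, `collect_samples`, `score`, `k`, and `inbox` are hypothetical and chosen only for illustration, and ranking accounts by score and keeping the top `k` is an assumed reading of "selecting a certain number of accounts", not the paper's confirmed algorithm.

```python
from typing import Callable, Dict, Iterable, List, Set


def select_honeypots(
    accounts: Iterable[str],
    score: Callable[[str], float],
    k: int,
) -> Set[str]:
    """Return the k highest-scoring accounts as the honeypot set.

    `score` stands in for the honeypot-account evaluation formula,
    which is not given in the abstract (assumption).
    """
    ranked = sorted(accounts, key=score, reverse=True)
    return set(ranked[:k])


def collect_samples(
    honeypots: Set[str],
    inbox: Dict[str, List[str]],
) -> List[str]:
    """Gather every message delivered to a honeypot account.

    `inbox` maps an account name to its received messages; accounts
    are visited in sorted order so the result is deterministic.
    """
    samples: List[str] = []
    for account in sorted(honeypots):
        samples.extend(inbox.get(account, []))
    return samples


if __name__ == "__main__":
    # Toy scores and mailboxes, invented purely for this example.
    scores = {"a": 0.9, "b": 0.1, "c": 0.7}
    inbox = {"a": ["m1"], "b": ["m4"], "c": ["m2", "m3"]}
    honeypots = select_honeypots(scores, lambda acct: scores[acct], k=2)
    print(collect_samples(honeypots, inbox))
```

Re-running `select_honeypots` before each collection round is one way to keep the honeypot set dynamic, since account scores may change as mail arrives.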