Spam filtering is the task of deciding whether an email is spam or legitimate mail (ham). Traditional approaches represent an email in the vector space model and use feature selection methods such as information gain to extract a subset of words that stand for its content. Such representations carry little semantic information because they do not take the relationships among words into account. This paper proposes a new method that represents email features by combining the traditional approach with a term co-occurrence model, and then applies the cross covering algorithm to classify emails and obtain an email classifier. Experiments show that the proposed filtering algorithm significantly improves filtering performance compared with traditional methods, and that the dimensions selected by the term co-occurrence model are more representative than those selected by traditional methods.
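As a rough illustration of the combined representation described above, the following Python sketch appends term co-occurrence features to a plain bag-of-words vector. The abstract does not specify how co-occurrence is measured or how pairs are selected, so document-level co-occurrence, the `min_count` threshold, the function names, and the toy emails are all assumptions made for this example. Information gain selection and the cross covering classifier are not shown.

```python
from collections import Counter
from itertools import combinations


def tokenize(text):
    """Lowercase whitespace tokenizer (an assumption; any tokenizer would do)."""
    return text.lower().split()


def cooccurrence_counts(docs):
    """For each unordered term pair, count the documents containing both terms."""
    counts = Counter()
    for doc in docs:
        terms = sorted(set(tokenize(doc)))
        counts.update(combinations(terms, 2))
    return counts


def select_pairs(counts, min_count):
    """Keep pairs that co-occur in at least min_count documents (hypothetical rule)."""
    return sorted(pair for pair, c in counts.items() if c >= min_count)


def build_features(docs, vocab, pairs):
    """Binary vector per document: one slot per vocabulary term, then one per pair."""
    vectors = []
    for doc in docs:
        terms = set(tokenize(doc))
        term_part = [1 if t in terms else 0 for t in vocab]
        pair_part = [1 if a in terms and b in terms else 0 for a, b in pairs]
        vectors.append(term_part + pair_part)
    return vectors


# Toy data, invented purely for illustration.
emails = [
    "win free money now",
    "free money offer",
    "meeting agenda for monday",
    "monday meeting notes",
]
vocab = ["free", "money", "meeting", "monday"]  # stands in for selected terms

pair_counts = cooccurrence_counts(emails)
pairs = select_pairs(pair_counts, min_count=2)
features = build_features(emails, vocab, pairs)
```

In this sketch, each resulting vector could then be passed to any classifier. The pair slots are what let the representation capture that two words tend to appear together, which single-term features alone cannot express.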