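Automatic Email classification has become a research hotspot in the automatic processing of semi-structured text. Building on an in-depth study of existing automatic Email classification methods, this paper proposes an automatic Email classification method based on SVM and domain-integrated features. The work has two main parts. First, SVM is introduced into automatic Email classification, and the choice of kernel function and parameters in the SVM learning algorithm is discussed. Second, because a feature representation based on word frequency alone struggles to represent the main content of an Email accurately, domain knowledge is introduced into the Email feature representation, and on that basis a feature representation that combines domain knowledge with word frequency is proposed for Email classification. The method adds manually summarized domain features to the word-frequency features, so that the main content of an Email is represented more accurately and the average F-score of Email classification improves. Experiments verify that the method based on SVM and domain-integrated features effectively improves the accuracy of automatic Email classification.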
Analyzing and organizing Email messages is a challenging application of Web and text mining techniques. Building on a study of existing Email classification methods, this paper puts forward a novel automatic Email classification method based on support vector machines (SVM) and knowledge-based hybrid features. We first apply SVM learning algorithms to Email classification and investigate the effects of different kernel functions and parameter selection. Because a feature representation based on word frequency alone cannot represent the topic of an Email precisely, this paper presents a hybrid feature representation method for Email classification. It adds knowledge-based features to bag-of-words features to improve the F-score in Email classification. Experimental results show that this method can effectively improve Email classification accuracy.
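The hybrid representation described above can be sketched as a word-frequency vector with a few hand-written domain features appended. This is a minimal illustration under stated assumptions, not the paper's implementation: the abstract does not list the manually summarized domain features, so the cues in `DOMAIN_CUES`, the tokenizer, and all function names here are hypothetical.

```python
import re
from collections import Counter

# Hypothetical domain cues. The paper's manually summarized features are
# not given in the abstract; these three only illustrate the idea.
DOMAIN_CUES = {
    "has_meeting_word": lambda t: float(bool(re.search(r"\b(meeting|agenda)\b", t, re.I))),
    "has_price": lambda t: float(bool(re.search(r"[$€]\s?\d", t))),
    "has_url": lambda t: float("http://" in t or "https://" in t),
}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only (an assumed, simple tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def build_vocabulary(emails):
    """Map every token seen in the training emails to a column index."""
    vocab = sorted({tok for email in emails for tok in tokenize(email)})
    return {tok: i for i, tok in enumerate(vocab)}


def hybrid_features(text, vocab):
    """Return relative word frequencies followed by the domain cue values."""
    counts = Counter(tokenize(text))
    total = sum(counts.values()) or 1  # avoid division by zero on empty text
    bow = [0.0] * len(vocab)
    for tok, count in counts.items():
        if tok in vocab:
            bow[vocab[tok]] = count / total
    domain = [cue(text) for cue in DOMAIN_CUES.values()]
    return bow + domain


emails = [
    "Meeting agenda for Monday",
    "Buy now for $ 5 at http://example.com",
]
vocab = build_vocabulary(emails)
vectors = [hybrid_features(email, vocab) for email in emails]
```

Each vector has `len(vocab) + len(DOMAIN_CUES)` entries. Vectors of this shape could then be passed to any SVM implementation, for example scikit-learn's `SVC`, where the kernel and the regularization parameter `C` would be the settings to compare.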