Feature representations based on word frequency alone struggle to capture the main content of an email, which leads to poor overall classification performance as measured by F-score. To address this problem, this paper introduces domain knowledge into email feature representation and proposes a hybrid method that combines domain knowledge with word frequency for email classification. The method augments bag-of-words features with manually summarized domain features, so that the main content of an email is represented more accurately and the average F-score of classification improves. In classification tests on 1080 emails using the email subject line, the method raises the average F-score by 12.28% and 23.08% compared with the word-frequency-based and the knowledge-based feature representation methods, respectively, reaching an average F-score of 69.33%.
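The idea of appending knowledge-based features to a bag-of-words vector can be illustrated with a minimal sketch. The abstract does not list the actual domain features, so the rules in `DOMAIN_RULES`, the tokenizer, and the function names below are hypothetical placeholders, and `f_score` assumes the balanced F-score (harmonic mean of precision and recall), which the abstract does not spell out.

```python
import re
from collections import Counter

# Hypothetical domain rules (not taken from the paper): each maps a
# feature name to a predicate evaluated on the email subject line.
DOMAIN_RULES = {
    "is_reply": lambda s: s.lower().startswith("re:"),
    "is_forward": lambda s: s.lower().startswith(("fw:", "fwd:")),
    "mentions_meeting": lambda s: "meeting" in s.lower(),
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def hybrid_features(subject, vocabulary):
    """Concatenate word-frequency features with binary domain features."""
    counts = Counter(tokenize(subject))
    word_part = [counts[word] for word in vocabulary]
    domain_part = [int(rule(subject)) for rule in DOMAIN_RULES.values()]
    return word_part + domain_part


def f_score(precision, recall):
    """Balanced F-score; assumed here, since the abstract gives no formula."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    vocab = ["meeting", "project", "invoice"]
    print(hybrid_features("Re: project meeting on Friday", vocab))
```

The resulting vector can be passed to any standard classifier. Keeping the two feature groups in one vector lets the classifier weigh word counts and domain cues together, which is the intuition behind the hybrid representation.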