To detect spam among massive volumes of email, an email classification method based on the Hadoop platform is proposed. Unlike traditional content-based spam detection, the proposed method statistically analyzes email sending and receiving records on the MapReduce framework to extract behavioral features for each email account. A Random Forests classifier is then implemented in parallel on MapReduce, trained on samples carrying these behavioral features, and used to classify emails. Experimental results show that the Hadoop-based method greatly improves the efficiency of classifying large-scale email.
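The feature-extraction step can be illustrated with a minimal, single-process sketch of the map and reduce phases. It is not the authors' implementation: the record format (a `(sender, recipient)` pair per message), the function names, and the three features shown are all assumptions chosen for illustration, and the in-memory `shuffle` merely stands in for the grouping that Hadoop performs between the two phases.

```python
from collections import defaultdict

# Assumed log format: one (sender, recipient) pair per delivered message.


def map_record(record):
    """Map step: emit (account, (event, peer)) for both ends of a message."""
    sender, recipient = record
    yield sender, ("sent", recipient)
    yield recipient, ("received", sender)


def shuffle(pairs):
    """Group values by key, standing in for the framework's shuffle phase."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_account(account, events):
    """Reduce step: aggregate one account's events into illustrative features."""
    sent_to = [peer for kind, peer in events if kind == "sent"]
    received_from = [peer for kind, peer in events if kind == "received"]
    return account, {
        "sent": len(sent_to),
        "received": len(received_from),
        "distinct_recipients": len(set(sent_to)),
    }


def extract_features(records):
    """Run map, shuffle, and reduce over all records; return features per account."""
    pairs = (pair for record in records for pair in map_record(record))
    return dict(reduce_account(k, v) for k, v in shuffle(pairs).items())


records = [
    ("a@x.org", "b@x.org"),
    ("a@x.org", "c@x.org"),
    ("b@x.org", "a@x.org"),
    ("a@x.org", "b@x.org"),
]
features = extract_features(records)
```

Each per-account feature dictionary could then serve as one sample for a classifier. On a real cluster the map and reduce functions would run as distributed tasks rather than in one process.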