Automatic text classification is an important branch of natural language processing, and a large number of models and algorithms have been developed for it; among them, research based on Naive Bayes (NB) has remained a sustained focus of the field. This paper presents a systematic survey of research on automatic text classification based on Naive Bayes. It discusses the classic Naive Bayes classification methods, including the multinomial model and the multivariate Bernoulli model. It analyzes in detail the classic feature selection methods as well as a number of improved feature selection methods, including ALOFT. It also reviews improved NB algorithms from perspectives such as weighting and avoiding smoothing. Finally, it proposes the main directions for further improving NB.