Naive Bayes classifiers applied to text classification routinely suffer from the unseen-feature-word problem: because the distribution of words in a corpus follows Zipf's law, the resulting data sparsity cannot be overcome simply by enlarging the training corpus. To address this, data-smoothing algorithms from statistical language modeling are introduced into naive Bayes: a portion of the probability mass is "discounted" from the observed words and redistributed to the unobserved ones, yielding compensating probabilities for the missing feature words and mitigating the effect of sparse data. The experimental data come from the National 863 Evaluation on text classification. In the open test with stop words removed, naive Bayes with Good-Turing smoothing outperforms the Laplace rule by 3.05% and the Lidstone method by 1.00%. When feature words are selected by cross entropy, naive Bayes with Good-Turing smoothing even exceeds the Maximum Entropy model by 1.95%. These results indicate that data smoothing helps overcome the unseen-feature-word problem caused by data sparsity.
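As an illustration of the discounting idea described above (a minimal sketch, not the authors' implementation; the function and variable names and the zero-count fallback are assumptions), the following Python fragment estimates Good-Turing-smoothed word probabilities P(w | class) for one class of a naive Bayes classifier, redistributing the mass of singleton words to the unseen part of the vocabulary.

from collections import Counter

def good_turing_word_probs(class_tokens, vocabulary):
    """Good-Turing-smoothed estimates of P(word | class) for naive Bayes.

    class_tokens : word tokens observed in the documents of one class
    vocabulary   : full feature-word vocabulary, so unseen words also get mass
    """
    counts = Counter(class_tokens)             # r: raw count of each seen word
    freq_of_freq = Counter(counts.values())    # N_r: number of words occurring r times
    total = sum(counts.values()) or 1          # N: total tokens (guard against empty class)

    def adjusted(r):
        # Good-Turing adjusted count r* = (r + 1) * N_{r+1} / N_r;
        # fall back to the raw count when N_{r+1} is zero.
        if freq_of_freq.get(r) and freq_of_freq.get(r + 1):
            return (r + 1) * freq_of_freq[r + 1] / freq_of_freq[r]
        return r

    unseen = [w for w in vocabulary if w not in counts]
    # The mass discounted from seen words (N_1 / N) is shared evenly by unseen words.
    p_unseen = (freq_of_freq.get(1, 0) / total) / max(len(unseen), 1)

    probs = {w: adjusted(r) / total for w, r in counts.items()}
    probs.update({w: p_unseen for w in unseen})
    return probs

In practice the N_r counts are themselves sparse for large r, so implementations typically smooth them (for example with the simple Good-Turing log-linear fit) and renormalize the distribution before combining the class-conditional probabilities in the usual naive Bayes log-sum.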