基于类的统计语言模型是解决统计模型数据稀疏问题的重要方法.传统的统计聚类方法基于含婪原则,常以语料的似然函数或困惑度(perplexity)作为评价标准.这种传统的聚类方法的主要缺点是聚类速度慢,初值对结果影响大,易陷入局部最优.本文利用互信息定义一种词相似度,基于相似度,提出一种自下而上的分层聚类算法.实验证明,该算法在计算复杂度和聚类效果上比传统的基于贪婪原则的统计聚类算法都有明显的改进.在提高预测能力方面,提出一种新的基于类的可变长语言模型(Vail—gram)的生成方法.
Cluster-based statistic language model is an important method to solve the problem of sparse data. Conventional statistical clustering methods usually base on greedy principle. The common Metric for evaluating a clustering algorithm is the likelihood function or perplexity of the corpus. Conventional clustering algorithms often converge to a local optimum, so global optimum is not guaranteed,and initial choices can influence final result. The author tries to solve above problems in this paper, and presents a definition of word similarity by utilizing mutual information. Based on word similarity, a bottom-up hierarchical clustering algorithm is proposed. Experiments show that word clustering algorithm based on similarity is better than conventional greedy clustering method in speed and performance. At the same time, a new method to create the vari-gram language model is presented.