Word clustering is an important natural language processing problem in fields such as speech recognition and intelligent information retrieval. This paper implements a symmetric clustering model based on mutual information and, to address that model's failure to take word order into account, proposes a new asymmetric clustering model. According to the position of the clustered word relative to the other words, the asymmetric model comprises two sub-models: a conditional clustering model and a predictive clustering model. Experiments on a large-scale data set show that the asymmetric clustering model is more effective for word clustering than the symmetric clustering model.
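The abstract does not spell out the clustering objective, so the following is only a minimal sketch of how a mutual-information criterion might distinguish the two settings. It assumes the standard class-based bigram formulation, in which the quantity of interest is the average mutual information between the cluster of the preceding word and the cluster of the current word. The function name `average_mutual_information` and the parameters `cond_cluster` (clusters applied to the preceding, conditioning word) and `pred_cluster` (clusters applied to the word being predicted) are hypothetical names chosen for illustration, not taken from the paper.

```python
import math
from collections import Counter


def average_mutual_information(tokens, cond_cluster, pred_cluster):
    """Average mutual information (in bits) between adjacent cluster pairs.

    cond_cluster maps the preceding word to its cluster and pred_cluster
    maps the current word to its cluster (both are assumed mappings).
    Passing the same mapping twice gives the symmetric case; passing two
    different mappings gives an asymmetric one.
    """
    # Count cluster bigrams over all adjacent word pairs.
    pairs = Counter(
        (cond_cluster[prev], pred_cluster[curr])
        for prev, curr in zip(tokens, tokens[1:])
    )
    total = sum(pairs.values())

    # Marginal counts for the left (conditioning) and right (predicted) side.
    left, right = Counter(), Counter()
    for (a, b), n in pairs.items():
        left[a] += n
        right[b] += n

    # Sum p(a, b) * log2( p(a, b) / (p(a) * p(b)) ) over observed pairs.
    return sum(
        (n / total) * math.log2(n * total / (left[a] * right[b]))
        for (a, b), n in pairs.items()
    )
```

Under these assumptions, a symmetric model would search for a single word-to-cluster mapping used on both sides, whereas an asymmetric model would be free to choose `cond_cluster` and `pred_cluster` independently, which is one way word order can enter the criterion.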