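When extracting pattern structures from text data, clustering algorithms ignore the potential complementarity among the information in multiple languages, so the resulting pattern structures cannot fully reflect the intrinsic information of the data. To address this problem, this paper proposes a multilingual document clustering algorithm based on the parallel information bottleneck. First, the bag-of-words model is used to construct a relevant variable for each language of the text data. Then, the multiple relevant variables are introduced into the parallel information bottleneck method; by maximally preserving the information between the pattern structure and the multiple relevant variables, the resulting pattern structure reflects the information of all the languages in the data. Finally, a draw-and-merge method based on information theory is proposed to optimize the objective function of the algorithm and to guarantee convergence to a local optimum. Experiments show that the proposed algorithm handles the multilingual information of text data effectively and outperforms both single-language clustering algorithms and two existing classes of clustering algorithms that can handle multilingual text information.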
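The three steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a uniform prior over documents, takes the objective to be a weighted sum of mutual informations I(T; Y_i) between the cluster variable T and each language's relevant variable Y_i, and uses a sequential draw-and-merge loop in the style of the sequential information bottleneck. The function names, the `weights` parameter, and the requirement that every document has at least one word in each language are all assumptions made for illustration.

```python
import numpy as np

def mutual_information(joint):
    """I(T;Y) in nats for a joint distribution p(t, y) given as a 2-D array."""
    pt = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float((joint[mask] * np.log(joint[mask] / (pt @ py)[mask])).sum())

def objective(labels, joints, weights, k):
    """Weighted sum of I(T;Y_i) over the language views (assumed objective)."""
    total = 0.0
    for joint, w in zip(joints, weights):
        cluster_joint = np.zeros((k, joint.shape[1]))
        # p(t, y) is the sum of p(x, y) over the documents x assigned to t.
        np.add.at(cluster_joint, labels, joint)
        total += w * mutual_information(cluster_joint)
    return total

def draw_and_merge(counts, k, weights=None, max_sweeps=20, seed=0):
    """Cluster documents described by one bag-of-words matrix per language.

    counts: list of (n_docs, vocab_size_i) count matrices, one per language.
    Returns (labels, objective value at the final assignment).
    """
    rng = np.random.default_rng(seed)
    n = np.asarray(counts[0]).shape[0]
    joints = []
    for c in counts:
        c = np.asarray(c, dtype=float)
        # p(x, y) = p(x) p(y|x) with uniform p(x) = 1/n (assumption).
        joints.append(c / c.sum(axis=1, keepdims=True) / n)
    weights = [1.0] * len(joints) if weights is None else list(weights)
    labels = rng.permutation(n) % k  # every cluster starts non-empty if n >= k
    best = objective(labels, joints, weights, k)
    for _ in range(max_sweeps):
        changed = False
        for x in rng.permutation(n):
            old = labels[x]
            if np.sum(labels == old) == 1:
                continue  # do not empty a cluster
            # Draw x out of its cluster and score merging it into each cluster.
            scores = []
            for t in range(k):
                labels[x] = t
                scores.append(objective(labels, joints, weights, k))
            new = int(np.argmax(scores))
            if scores[new] > best + 1e-12:
                labels[x], best, changed = new, scores[new], True
            else:
                labels[x] = old
        if not changed:
            break  # no single move improves the objective: a local optimum
    return labels, best

# Toy usage: six documents, each with an English and a Chinese word-count row.
english = [[5, 1, 0, 0], [4, 2, 0, 0], [6, 0, 1, 0],
           [0, 0, 5, 2], [0, 1, 4, 3], [1, 0, 3, 5]]
chinese = [[3, 0, 1], [4, 1, 0], [2, 0, 0],
           [0, 1, 4], [0, 0, 5], [1, 0, 3]]
labels, score = draw_and_merge([english, chinese], k=2)
print(labels, score)
```

Because a document is moved only when the move strictly increases the objective, and the objective is bounded above, the loop stops at an assignment that no single draw-and-merge step can improve. Recomputing the full objective for every candidate move keeps the sketch short but is far slower than an incremental merge-cost update would be.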
The potential complementarity between different languages is ignored when traditional clustering algorithms discover the hidden structures in a document collection. Thus, the obtained patterns cannot reflect the latent information in the collection. To address this problem, a multilingual document clustering algorithm based on the parallel information bottleneck (ML-IB) is proposed. First, the relevant variables for the multiple languages are constructed according to the bag-of-words model. Then, the multiple relevant variables are incorporated into the parallel information bottleneck, and the relevant information between the data patterns and the multiple relevant variables is maximally preserved. Finally, to optimize the objective function of ML-IB, a draw-and-merge method based on information theory is proposed to guarantee the convergence of ML-IB to a locally optimal solution. Extensive experimental