A simple and efficient algorithm is presented for automatically clustering Chinese words by topic. The algorithm first splits corpus sentences at the enumeration comma (、) to extract coordinate Chinese words, and builds a co-citation graph with the extracted words as nodes. It then generates several locality-sensitive hashing (LSH) signature combinations for each word node. Finally, any two words that share at least one LSH signature combination are assigned to the same topic class. Experiments show that the algorithm is fast and easy to parallelize, and that, given a massive corpus, it runs efficiently and produces good clustering results.
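The three steps can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several details are assumptions: the abstract does not name the LSH family, so MinHash over each word's neighbor set (including the word itself) is used here; the constants `NUM_HASHES` and `BAND_SIZE` and all function names are hypothetical; each clause is assumed to consist only of the enumerated words (no word segmentation is performed); and matches are merged transitively with a union-find structure.

```python
import hashlib
import re
from collections import defaultdict

NUM_HASHES = 12  # assumed number of MinHash functions per word
BAND_SIZE = 3    # assumed number of signatures in one combination


def extract_lists(sentences):
    """Step 1a: return the word lists joined by '、' in each clause."""
    lists = []
    for sentence in sentences:
        # Split on other punctuation so a list never crosses a clause boundary.
        for clause in re.split(r"[,。;!?\s]+", sentence):
            words = [w for w in clause.split("、") if w]
            if len(words) >= 2:
                lists.append(words)
    return lists


def build_graph(lists):
    """Step 1b: co-citation graph; words in the same list are neighbors."""
    neighbors = defaultdict(set)
    for words in lists:
        for w in words:
            neighbors[w].update(words)  # assumption: the set includes w itself
    return neighbors


def _hash(seed, token):
    """Deterministic 64-bit hash of a token under a given seed."""
    digest = hashlib.md5(f"{seed}:{token}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def signature(tokens):
    """Step 2: MinHash signature of a neighbor set (assumed LSH family)."""
    return [min(_hash(seed, t) for t in tokens) for seed in range(NUM_HASHES)]


def cluster_words(sentences):
    """Step 3: merge words that share at least one signature combination."""
    neighbors = build_graph(extract_lists(sentences))
    parent = {w: w for w in neighbors}

    def find(w):
        while parent[w] != w:
            parent[w] = parent[parent[w]]  # path compression
            w = parent[w]
        return w

    buckets = {}
    for word, tokens in neighbors.items():
        sig = signature(tokens)
        for start in range(0, NUM_HASHES, BAND_SIZE):
            # A combination is a band of consecutive signatures plus its position.
            key = (start, tuple(sig[start:start + BAND_SIZE]))
            first = buckets.setdefault(key, word)
            parent[find(word)] = find(first)

    clusters = defaultdict(set)
    for w in parent:
        clusters[find(w)].add(w)
    return list(clusters.values())
```

Calling `cluster_words` on a list of sentences returns a list of word sets, one per topic class. Each word's signatures depend only on its own neighbor set, so the signature step can be computed independently per word, which is consistent with the claim that the algorithm is easy to parallelize.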