Because lexical knowledge can improve text clustering, we propose a text similarity measure that uses linear interpolation to combine cosine similarity with quantified relationships between dictionary words. Before clustering, the quantified relationship between a dictionary entry and the words in its definition is constructed under the assumption that an entry and its definition are semantically equivalent; these relationship values serve as the knowledge used in clustering. Within the framework of the k-means algorithm, the interpolated similarity measure markedly improves the performance of the text clustering system. The experimental results suggest that quantified word relationships derived from an ordinary dictionary may contribute to future research on text clustering.
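The interpolation idea can be illustrated with a minimal Python sketch. The abstract does not give the exact formula, so everything here is an assumption made for illustration: the weight `lam`, the `relations` table of entry-to-definition-word strengths, and the way `expand_with_relations` spreads term weights onto related words are all hypothetical choices, not the authors' method.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def expand_with_relations(vec, relations):
    """Add weight to words related to each term.

    `relations` maps a dictionary entry to {definition word: strength}.
    This expansion scheme is an assumption, not taken from the paper.
    """
    expanded = dict(vec)
    for term, weight in vec.items():
        for related, strength in relations.get(term, {}).items():
            expanded[related] = expanded.get(related, 0.0) + weight * strength
    return expanded


def lexical_similarity(a, b, relations):
    """Cosine similarity after dictionary-based expansion (hypothetical)."""
    return cosine_similarity(
        expand_with_relations(a, relations),
        expand_with_relations(b, relations),
    )


def interpolated_similarity(a, b, relations, lam=0.5):
    """Linear interpolation: lam * cosine + (1 - lam) * lexical term."""
    return lam * cosine_similarity(a, b) + (1.0 - lam) * lexical_similarity(
        a, b, relations
    )


# Toy data: two documents that share no surface words but whose
# entries are both defined using the word "vehicle".
relations = {"car": {"vehicle": 0.8}, "automobile": {"vehicle": 0.8}}
doc1 = {"car": 1.0}
doc2 = {"automobile": 1.0}

print(cosine_similarity(doc1, doc2))
print(interpolated_similarity(doc1, doc2, relations, lam=0.5))
```

In this toy case the plain cosine similarity is zero because the documents share no terms, while the interpolated score is positive because both terms are linked to the same definition word. In a k-means setting, such a function would take the place of plain cosine similarity when assigning documents to the nearest centroid.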