Clustering search results is an effective way to improve retrieval performance, and how to measure the similarity between documents is the key factor affecting clustering quality. To address the dual nature of XML documents, which carry both content and structure, this paper proposes an extended vector space model that integrates content and structural semantics, analyzes the features that influence similarity measurement, and on that basis puts forward a semantic similarity measure for XML documents that combines content and structural semantics. Because the IEEE CS data set provides no category information for individual documents, the paper also introduces two concepts derived from the distribution of relevant documents, the cluster-relevant ratio and the document-relevant distribution, and uses them to evaluate clustering quality. Experiments on the IEEE CS data set show that the proposed method is feasible and yields better clustering quality than comparable similarity measures and traditional methods.