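Recent studies have shown that topic language models improve information retrieval performance, but they still do not resolve several hard problems in information retrieval, such as data sparseness, synonymy, polysemy, and the smoothing of terms seen and unseen in a document. These problems are especially important in domain-specific literature retrieval, for example large-scale biological literature retrieval. This paper proposes a new cluster-based topic language model for biological literature retrieval. The work has two main parts. First, documents are represented by concepts from an ontology and fuzzy clustering is performed on this representation; the resulting clusters are taken as the topics of the collection, and the probability that a document belongs to a topic is determined by the fuzzy similarity between the document and the cluster. Second, the EM algorithm is used to estimate the probability that a topic generates a term. Integrating these methods into a language model yields the proposed model. The model describes the distribution of terms over topics and the probability that a document belongs to a topic. Using ontology concepts partly addresses the synonymy problem, and because a term can be generated by different topics, the polysemy problem is also partly addressed. The method was tested on the TREC 2004/05 Genomics Track data sets; compared with a simple language model and existing topic language models, retrieval performance improved to some extent.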
Recent research shows that topic language models improve the performance of information retrieval (IR), but several problems remain unsolved, including data sparseness, synonymy, polysemy, and the smoothing of seen and unseen terms. These problems matter throughout IR, and especially in domain-specific literature retrieval such as retrieval of biological literature. In this paper, a new cluster-based topic language model is proposed. The work has two main aspects. First, documents are represented by ontology concepts and clustered on that representation using Fuzzy C-Means; the resulting clusters are treated as the topics of the document collection, and the probability of a document generating a topic is estimated from the similarity between the document and each cluster. Second, the probability of a topic generating a word is estimated with the Expectation Maximization algorithm. Finally, integrating these two estimates into the aspect model yields our topic language model. The new model describes the probability distribution of words over topics and the probability of a document generating a topic. Moreover, it partly solves the synonymy and polysemy problems. The method was evaluated on the TREC 2004/05 Genomics Track collections. Experiments show that it improves retrieval performance compared with a simple language model.
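The pipeline summarized above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract gives no formulas, so the standard Fuzzy C-Means membership with Euclidean distance, the choice to hold p(topic | document) fixed during EM, the absence of collection-level smoothing, and every function name and toy value below are assumptions made purely for illustration.

```python
import math


def fuzzy_memberships(doc_vec, centers, m=2.0):
    """Standard Fuzzy C-Means membership of one document in each cluster.

    Assumption: Euclidean distance over concept vectors, fuzzifier m > 1.
    """
    dists = [math.dist(doc_vec, c) for c in centers]
    # A document lying exactly on a center belongs fully to that center.
    if any(d == 0.0 for d in dists):
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        return [h / sum(hits) for h in hits]
    power = 2.0 / (m - 1.0)
    inv = [d ** -power for d in dists]
    total = sum(inv)
    return [v / total for v in inv]


def estimate_topic_word(counts, theta, n_iter=50):
    """EM estimate of p(word | topic) with p(topic | doc) fixed at theta.

    counts[d][w] is the frequency of word w in document d.
    """
    n_docs, n_words, n_topics = len(counts), len(counts[0]), len(theta[0])
    # Uniform start; the fixed, document-specific theta breaks the symmetry.
    phi = [[1.0 / n_words] * n_words for _ in range(n_topics)]
    for _ in range(n_iter):
        # Tiny floor keeps every row normalizable and every probability > 0.
        new = [[1e-12] * n_words for _ in range(n_topics)]
        for d in range(n_docs):
            for w in range(n_words):
                if counts[d][w] == 0:
                    continue
                # E-step: posterior of each topic for the pair (d, w).
                joint = [theta[d][t] * phi[t][w] for t in range(n_topics)]
                z = sum(joint)
                # M-step accumulation: expected counts per topic and word.
                for t in range(n_topics):
                    new[t][w] += counts[d][w] * joint[t] / z
        phi = [[v / sum(row) for v in row] for row in new]
    return phi


def query_log_likelihood(query, d, theta, phi):
    """Aspect-model score: sum over query words of log sum_t p(w|t) p(t|d)."""
    return sum(
        math.log(sum(theta[d][t] * phi[t][w] for t in range(len(phi))))
        for w in query
    )


# Hypothetical toy data: 3 documents, 2 ontology concepts, 4 vocabulary words.
concept_vecs = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
centers = [[0.9, 0.1], [0.1, 0.9]]  # assumed output of a prior FCM run
counts = [[4, 2, 0, 0], [0, 0, 3, 5], [2, 1, 1, 2]]

theta = [fuzzy_memberships(v, centers) for v in concept_vecs]
phi = estimate_topic_word(counts, theta)
query = [0, 1]  # word indices
ranking = sorted(
    range(len(counts)),
    key=lambda d: query_log_likelihood(query, d, theta, phi),
    reverse=True,
)
print(ranking)
```

Because a word's probability under a document is a mixture over topics, a document can receive a nonzero score for a query word it never contains, which is one way such a model smooths unseen terms.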