To overcome the limitations of probabilistic latent semantic indexing (PLSI) in storage efficiency and query speed, the probabilistic latent semantic dictionary (PLSD) is introduced. It is a matrix of term-to-term relationships that replaces the term-document matrix. A document score calculation method and a method for computing the probability of each term in the dictionary are proposed; together they retrieve the terms related to a query, from which a new query is generated. The time complexity of a PLSD query is shown to be much lower than that of a PLSI query. Experimental results show that introducing PLSD eliminates the dependence of probabilistic latent semantic analysis on documents. By adjusting the document threshold and pruning terms, PLSD can significantly reduce the storage space occupied by the retrieval system and improve query speed while preserving retrieval precision.
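The abstract does not give the formulas behind the dictionary, so the following is only a minimal sketch of the general idea, under stated assumptions. It assumes the term-to-term matrix is obtained from fitted PLSI parameters as P(v|w) = Σ_z P(v|z) P(z|w), with P(z|w) obtained by Bayes' rule; this is one common construction and not necessarily the one used in the paper. The function names (`term_term_matrix`, `prune`, `expand_query`), the threshold value, and the toy parameters are all hypothetical.

```python
from typing import Dict, List


def term_term_matrix(p_w_given_z: List[List[float]], p_z: List[float]) -> List[List[float]]:
    """Build m[w][v] = P(v|w) from PLSI parameters (assumed construction).

    p_w_given_z[z][w] is P(w|z); p_z[z] is P(z). No documents are needed.
    """
    n_topics = len(p_z)
    n_terms = len(p_w_given_z[0])
    # Marginal term probability: P(w) = sum_z P(w|z) P(z)
    p_w = [sum(p_w_given_z[z][w] * p_z[z] for z in range(n_topics))
           for w in range(n_terms)]
    matrix = []
    for w in range(n_terms):
        # Bayes' rule: P(z|w) = P(w|z) P(z) / P(w)
        p_z_given_w = [p_w_given_z[z][w] * p_z[z] / p_w[w] if p_w[w] > 0 else 0.0
                       for z in range(n_topics)]
        # P(v|w) = sum_z P(v|z) P(z|w)
        matrix.append([sum(p_w_given_z[z][v] * p_z_given_w[z] for z in range(n_topics))
                       for v in range(n_terms)])
    return matrix


def prune(matrix: List[List[float]], threshold: float) -> List[Dict[int, float]]:
    """Keep only entries >= threshold, stored sparsely to save space."""
    return [{v: p for v, p in enumerate(row) if p >= threshold} for row in matrix]


def expand_query(query: List[int], dictionary: List[Dict[int, float]], top_k: int) -> List[int]:
    """Append the top_k terms most strongly related to the query terms."""
    scores: Dict[int, float] = {}
    for w in query:
        for v, p in dictionary[w].items():
            scores[v] = scores.get(v, 0.0) + p
    # Sort by descending score; break ties by term index for determinism.
    ranked = sorted(scores, key=lambda v: (-scores[v], v))
    extra = [v for v in ranked if v not in query][:top_k]
    return list(query) + extra


# Toy vocabulary (hypothetical): 0=car, 1=engine, 2=bank, 3=loan; two topics.
p_w_given_z = [[0.5, 0.5, 0.0, 0.0],
               [0.0, 0.0, 0.5, 0.5]]
p_z = [0.5, 0.5]
dictionary = prune(term_term_matrix(p_w_given_z, p_z), threshold=0.1)
new_query = expand_query([0], dictionary, top_k=1)  # [0, 1]: "car" gains "engine"
```

In this sketch the dictionary grows with the square of the vocabulary size before pruning and is independent of the number of documents, which is the kind of trade-off the abstract describes; the pruning threshold then controls how sparse the stored matrix becomes.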