Quickly and effectively finding similar documents in a massive document collection is an important and time-consuming problem. Existing document similarity search algorithms first identify a set of candidate documents and then rank the candidates by relevance to find the most relevant ones. This paper proposes Hub-N, a document similarity search algorithm based on document topology. Hub-N transforms the document similarity search problem into a graph search problem and applies corresponding pruning techniques to narrow the range of documents scanned, which improves search efficiency. Experiments verify the effectiveness and feasibility of the algorithm.