In document similarity detection, coarse granularity reduces accuracy, while overly fine granularity greatly increases computation time. For similarity detection among fund project documents, this paper proposes a fast, fine-grained document similarity detection method based on the b-bit Minwise Hash algorithm. Documents are first preprocessed to extract the body text and to generate grouped fingerprint features. A fine-grained grouped fingerprint index structure is then built, the Hamming distance is used to compute the similarity between documents, and the similarity information is stored and displayed as XML documents. A system implementation verifies the effectiveness of the method and shows improved retrieval efficiency.
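The pipeline described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The 5-character shingles, the BLAKE2b-based hash family, the paragraph-level grouping, and all function names are assumptions chosen for illustration. The resemblance estimate uses the simplified b-bit correction that holds when the shingle sets are small relative to the hash space.

```python
import hashlib


def shingles(text, k=5):
    """Return the set of character k-grams of text, ignoring whitespace."""
    text = "".join(text.split())
    return {text[i:i + k] for i in range(max(len(text) - k + 1, 1))}


def _hash64(shingle, seed):
    """Seeded 64-bit hash (assumed stand-in for a random permutation)."""
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def bbit_fingerprint(text, num_hashes=128, b=2, k=5):
    """Keep only the lowest b bits of each of num_hashes minwise hash values."""
    grams = shingles(text, k)
    mask = (1 << b) - 1
    return [min(_hash64(g, seed) for g in grams) & mask
            for seed in range(num_hashes)]


def hamming_distance(fp_a, fp_b):
    """Number of fingerprint positions whose b-bit values differ."""
    return sum(1 for x, y in zip(fp_a, fp_b) if x != y)


def estimate_resemblance(fp_a, fp_b, b=2):
    """Estimate resemblance from the match rate, removing the 2**-b chance
    of an accidental b-bit collision (simplified sparse-data correction)."""
    match_rate = 1 - hamming_distance(fp_a, fp_b) / len(fp_a)
    chance = 2.0 ** -b
    return max(0.0, (match_rate - chance) / (1 - chance))


def paragraph_fingerprints(document, **kwargs):
    """Hypothetical grouping step: one fingerprint per non-empty paragraph."""
    parts = [p for p in document.split("\n\n") if p.strip()]
    return [bbit_fingerprint(p, **kwargs) for p in parts]


if __name__ == "__main__":
    a = "This project studies fast similarity detection for research grant proposals using compact fingerprints."
    b_text = "This project studies fast similarity detection for research grant documents using compact fingerprints."
    fa, fb = bbit_fingerprint(a), bbit_fingerprint(b_text)
    print(hamming_distance(fa, fb), estimate_resemblance(fa, fb))
```

Comparing fingerprints paragraph by paragraph, rather than once per document, is one way to obtain the fine-grained evidence the method reports. A smaller b shrinks the stored fingerprint but raises the chance-collision rate, so more hash functions are needed for the same estimation accuracy.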