Text similarity computation is widely used in text applications such as plagiarism detection, automatic question answering, and text clustering. However, most traditional methods are language-dependent and spend considerable time analyzing documents and extracting feature items. To address these shortcomings, this paper proposes a language-independent similarity algorithm for long texts based on random n-Grams (Random n-Gram, denoted R-Gram), which exploits both the fine-grained detection of short n-Grams and the efficient detection of long n-Grams. Experimental results show that the R-Gram-based text similarity algorithm is fast, simple to operate, and flexible in precision control, making it well suited to similarity computation for long texts.
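The abstract does not state how R-Grams are drawn or how matches are scored, so the following Python sketch is only an illustration of the general idea, not the authors' algorithm. It assumes that character n-grams of random length are sampled from one text and that similarity is the fraction of those samples found verbatim in the other text. The function names `sample_r_grams` and `r_gram_similarity` and the parameters `k`, `min_n`, `max_n`, and `seed` are all hypothetical.

```python
import random


def sample_r_grams(text, k=200, min_n=3, max_n=12, seed=0):
    """Sample k character n-grams of random length from text.

    Assumption: lengths are drawn uniformly from [min_n, max_n] and start
    positions uniformly over the text. The source does not specify this.
    """
    rng = random.Random(seed)  # fixed seed keeps runs reproducible
    if len(text) < min_n:
        return []  # text too short to yield any n-gram
    grams = []
    for _ in range(k):
        n = rng.randint(min_n, min(max_n, len(text)))
        start = rng.randint(0, len(text) - n)
        grams.append(text[start:start + n])
    return grams


def r_gram_similarity(a, b, **kwargs):
    """Fraction of n-grams sampled from a that occur verbatim in b.

    Assumption: a simple containment score in [0, 1]; the measure is
    asymmetric, so r_gram_similarity(a, b) may differ from (b, a).
    """
    grams = sample_r_grams(a, **kwargs)
    if not grams:
        return 0.0
    hits = sum(1 for g in grams if g in b)
    return hits / len(grams)
```

Because the sketch works on raw characters, it needs no tokenizer or language-specific preprocessing. Under these assumptions, lowering `min_n` makes matching finer-grained, while raising `max_n` lets each successful match cover more text, and `k` trades speed against the stability of the estimate.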