Copy detection for digital documents is an effective means of protecting intellectual property and improving the efficiency of information retrieval. This paper proposes a document copy detection method based on fingerprints and semantic features. The fingerprint extraction algorithm and the corresponding overlap measures are introduced. Syntactic parsing and semantic analysis are combined on the basis of the concept descriptions in HowNet, and part-of-speech information and semantic rules are used to resolve ambiguities. A frame-based hierarchical representation is used to describe the semantic features of a sentence. The proposed method is compared with existing methods in terms of detection accuracy on three test sets. The experimental results show that the method can effectively detect text copied in a variety of ways.
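The abstract does not specify the fingerprint extraction algorithm or the exact overlap measure, so the following is only a minimal illustrative sketch of the general idea, not the authors' method. It assumes word-level k-grams hashed with MD5 as fingerprints and a containment ratio as the overlap measure; the function names `fingerprints` and `overlap` and the parameter `k` are hypothetical.

```python
import hashlib


def fingerprints(text, k=3):
    """Return the set of hashed word k-grams of `text` (assumed fingerprint scheme)."""
    words = text.lower().split()
    if len(words) < k:
        # Too short for a full k-gram: hash the words that are present, if any.
        grams = [" ".join(words)] if words else []
    else:
        grams = [" ".join(words[i:i + k]) for i in range(len(words) - k + 1)]
    # Keep the first 32 bits of the MD5 digest as a compact fingerprint.
    return {int(hashlib.md5(g.encode("utf-8")).hexdigest()[:8], 16) for g in grams}


def overlap(candidate, source, k=3):
    """Containment: fraction of the candidate's fingerprints also found in the source."""
    fc = fingerprints(candidate, k)
    fs = fingerprints(source, k)
    return len(fc & fs) / len(fc) if fc else 0.0
```

Under these assumptions, a high `overlap(candidate, source)` value suggests that much of the candidate was copied verbatim from the source. Purely lexical fingerprints of this kind miss paraphrased copies, which is the gap the semantic features described in the abstract are meant to address.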