Text similarity computation is the foundation of, and a key step in, other text information processing tasks, and its accuracy and efficiency directly affect their results. This paper proposes improved versions of the DF (document frequency) algorithm and the TF-IDF algorithm. On the one hand, the improved DF algorithm exploits DF's linear time complexity, which makes it well suited to large-scale text processing, and adds an appropriate number of keywords to compensate for DF's tendency to filter out some useful information by mistake. On the other hand, the weight that each feature term receives in the feature selection stage is used to weight the TF-IDF method, which enlarges the scale of the document set without adding overhead and also improves the precision of the similarity computation.
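The abstract does not give the formulas, so the following Python sketch shows only one plausible reading of the two ideas: a DF threshold with a protected keyword list, and a TF-IDF weight multiplied by a per-term weight carried over from feature selection. The function names, the `min_df` threshold, the smoothed IDF form `log(N / df) + 1`, the cosine measure, and the `selection_weight` values are all assumptions made for illustration and are not taken from the paper.

```python
import math
from collections import Counter


def document_frequency(docs):
    """Count, for each term, how many documents contain it (one pass, linear time)."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    return df


def select_features(df, min_df, keywords=()):
    """Keep terms whose DF reaches min_df, plus protected keywords.

    The keyword list stands in for the paper's "added keywords": terms that a
    plain DF threshold would wrongly discard (assumed mechanism).
    """
    kept = {term for term, count in df.items() if count >= min_df}
    kept.update(term for term in keywords if term in df)
    return kept


def weighted_tfidf(tokens, features, df, n_docs, selection_weight):
    """TF-IDF vector over the selected features, scaled by a selection-stage weight.

    selection_weight is a hypothetical dict mapping term -> weight; terms that
    are missing from it default to 1.0.
    """
    tf = Counter(term for term in tokens if term in features)
    total = sum(tf.values())
    vector = {}
    for term, count in tf.items():
        idf = math.log(n_docs / df[term]) + 1.0  # assumed IDF variant
        vector[term] = (count / total) * idf * selection_weight.get(term, 1.0)
    return vector


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy corpus of pre-tokenized documents (illustrative data only).
docs = [
    ["text", "similarity", "vector", "space"],
    ["text", "similarity", "cosine", "measure"],
    ["image", "pixel", "edge", "filter"],
    ["image", "pixel", "cosine", "transform"],
]

df = document_frequency(docs)
# "vector" has DF = 1 and would be dropped by min_df=2; the keyword list keeps it.
features = select_features(df, min_df=2, keywords=["vector"])
selection_weight = {"vector": 2.0}  # hypothetical selection-stage weight

vectors = [weighted_tfidf(d, features, df, len(docs), selection_weight) for d in docs]
sim_related = cosine(vectors[0], vectors[1])    # documents sharing "text", "similarity"
sim_unrelated = cosine(vectors[0], vectors[2])  # documents with no shared features
```

In this sketch the selection-stage weight is only a multiplier on an already computed TF-IDF value, so it adds no extra pass over the corpus. That is one way to read the claim that the weighting incurs no additional overhead, but the paper's actual weighting scheme may differ.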