As economic globalization deepens, exchange and cooperation between China and other countries are becoming increasingly frequent. Different countries emphasize different aspects when reporting on events large and small, so the degree to which their news content matches also varies. Traditional text similarity methods suffer from excessively high dimensionality and high computational complexity. An analysis of news texts shows that news reports are characterized by five basic elements: when, where, what, why, and who. Based on this property, we propose a cross-language news text similarity method that incorporates news elements. The method takes full account of the influence that the feature words for the five news elements have on text similarity; it reduces interference from low-similarity texts and mitigates the efficiency problems of traditional text similarity computation. The method extracts the news elements from each news text, uses translation tools and word sense disambiguation to unify the elements extracted from different languages into Chinese, and groups the elements into sets by category. It then applies set similarity computation and data fusion to calculate the similarity between two news texts. Experiments show that the method achieves a reasonable level of efficiency and accuracy in cross-language news text similarity computation, which indicates that the method is feasible.
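The final stage of the pipeline described above (per-element sets, set similarity, then fusion) can be illustrated with a minimal sketch. It assumes the news elements have already been extracted and unified into Chinese. The abstract does not specify the set similarity measure or the fusion rule, so Jaccard similarity and a weighted average are stand-in assumptions here, and the names `ELEMENTS`, `jaccard`, and `news_similarity` are hypothetical.

```python
from typing import Dict, Optional, Set

# The five news elements named in the abstract.
ELEMENTS = ("when", "where", "what", "why", "who")


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two sets (an assumed choice of set similarity).

    Two empty sets carry no evidence of a match, so they score 0.0.
    """
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def news_similarity(
    doc_a: Dict[str, Set[str]],
    doc_b: Dict[str, Set[str]],
    weights: Optional[Dict[str, float]] = None,
) -> float:
    """Fuse per-element similarities with a weighted average.

    The weighted average is an assumed fusion rule; equal weights
    are used when none are supplied.
    """
    if weights is None:
        weights = {e: 1.0 for e in ELEMENTS}
    total = sum(weights[e] for e in ELEMENTS)
    score = sum(
        weights[e] * jaccard(doc_a.get(e, set()), doc_b.get(e, set()))
        for e in ELEMENTS
    )
    return score / total


# Illustrative element sets, already unified into Chinese (made-up data).
doc_zh = {
    "when": {"3月"},
    "where": {"北京"},
    "what": {"会谈", "贸易"},
    "why": {"合作"},
    "who": {"总理"},
}
doc_translated = {
    "when": {"3月"},
    "where": {"北京"},
    "what": {"会谈"},
    "why": set(),
    "who": {"总理"},
}

print(news_similarity(doc_zh, doc_translated))
```

Comparing five small sets instead of full term vectors is what keeps the dimensionality low in this sketch; the per-element weights would need to be tuned or learned for real data.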