With the steady advance of office informatization and the spread of networked daily life, new kinds of unstructured text data, such as enterprise product issue descriptions, Web user comments, and text-based communications, are growing and accumulating rapidly. This raises new requirements and challenges for retrieving, accurately and efficiently, the text that meets users' real needs. The retrieval model has a decisive influence on both retrieval accuracy and efficiency. In recent years, many emerging methods have been incorporated into text retrieval models, making the models themselves increasingly complex and blurring the boundaries between traditional models. Starting from the retrieval requirements of unstructured text data, this paper summarizes the definition and general framework of retrieval models, and then classifies them according to the mathematical theory used to compute the similarity of retrieval terms. Based on this classification, it elaborates the evolution of each class of models and analyzes their advantages, disadvantages, and applicable scenarios. Finally, it discusses the challenges that massive text retrieval models face in the new environment and reflects on related research problems.