Research on a Full-Text Retrieval System for Large-Scale Corpora

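ISSN: 1003-6938
Journal: Library and Information (《图书与情报》)
Date: 0
Classification: G356 [Cultural Sciences: Information Science]
Author affiliations: [1] School of Chinese Language and Literature, Nanjing Normal University, Nanjing, Jiangsu 210029; [2] School of Tibetan Language and Culture, Northwest Minzu University, Lanzhou, Gansu 730030
Funding: This paper is one of the research outputs of the Jiangsu Provincial Social Science Fund project "Research on the Development of General-Purpose Corpus Processing and Application Tools" (grant no. 07YYB003) and the 2005 National Social Science Fund key project "Research on the Construction of a Tibetan Corpus" (grant no. 05AYY001).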

Keywords: corpus, full-text retrieval, automatic word segmentation

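Chinese abstract (translated):

As corpora keep growing in scale and corpus-based applied research steadily expands, full-text retrieval has become an indispensable component of a corpus system. This paper studies the index model, retrieval algorithm, construction of retrieval expressions, automatic word segmentation, and system composition of a full-text retrieval system for large-scale corpora. To meet the needs of language information processing and applied research on large-scale corpora, the authors developed a Chinese information processing system called "CIPP". The system currently provides full-text retrieval, automatic word segmentation, and linguistic statistics; on corpora on the order of ten million characters, its average full-text retrieval time is under 1 second.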

English abstract:

Recent years have seen great expansion both in the scale of corpora and in the application of corpus technology, and full-text search has become an indispensable component of a corpus system. This paper reports research on the index model, search algorithm, search expressions, automatic Chinese word segmentation, and system structure of large-scale corpus systems. It also describes CIPP, a Chinese information processing system implemented for this purpose. The system supports full-text search, automatic Chinese word segmentation, and linguistic statistics. The average time for a full-text search on corpora of around ten million characters is less than 1 second.
