As corpora grow in size and corpus-based applied research broadens, full-text search has become an indispensable component of a corpus system. This paper studies the index model, search algorithm, construction of search expressions, automatic Chinese word segmentation, and system architecture of a full-text search system for large-scale corpora. To meet the needs of language information processing and applied research on large-scale corpora, we developed CIPP, a Chinese information processing system. The system currently provides full-text search, automatic word segmentation, and language statistics. On corpora on the order of ten million Chinese characters, its average full-text search time is under 1 second.