This paper describes the design principles and implementation of an intelligent information retrieval system for multi-format documents at an electric power company. The system converts PDF, HTML, XLS, and DOC files into ".txt" files by calling COM components from PHP and jar packages from Java. The converted text is then processed with Chinese word segmentation, and a summary of each ".txt" document is generated by an automatic summarization method based on sentence features. The retrieval module performs full-text retrieval over a word-based index and uses the vector space model as its information retrieval model, returning the summary together with highly relevant sentences.
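The retrieval step can be illustrated with a minimal sketch of vector space ranking over a word index. It is not the system's actual implementation. The function names (`tokenize`, `build_index`, `idf`, `rank_sentences`), the smoothed TF-IDF weighting, and the whitespace tokenizer standing in for Chinese word segmentation are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Assumption: whitespace splitting stands in for a Chinese word segmenter.
    return text.lower().split()


def build_index(sentences):
    """Build an inverted word index: term -> {sentence_id: term frequency}."""
    index = defaultdict(dict)
    for sid, sentence in enumerate(sentences):
        for term, tf in Counter(tokenize(sentence)).items():
            index[term][sid] = tf
    return index


def idf(index, term, n_sentences):
    # Smoothed inverse document frequency (an assumed weighting); always positive.
    df = len(index.get(term, {}))
    return math.log((1 + n_sentences) / (1 + df)) + 1.0


def rank_sentences(query, sentences, top_k=3):
    """Return up to top_k (sentence, cosine score) pairs, best match first."""
    index = build_index(sentences)
    n = len(sentences)

    # Squared length of each sentence's TF-IDF vector.
    sq_norms = [0.0] * n
    for term, postings in index.items():
        weight = idf(index, term, n)
        for sid, tf in postings.items():
            sq_norms[sid] += (tf * weight) ** 2

    # TF-IDF vector of the query.
    q_weights = {t: tf * idf(index, t, n) for t, tf in Counter(tokenize(query)).items()}
    q_norm = math.sqrt(sum(w * w for w in q_weights.values()))

    # Accumulate dot products by walking only the postings of the query terms.
    dots = defaultdict(float)
    for term, q_weight in q_weights.items():
        term_idf = idf(index, term, n)
        for sid, tf in index.get(term, {}).items():
            dots[sid] += q_weight * tf * term_idf

    scored = [(dot / (q_norm * math.sqrt(sq_norms[sid])), sid) for sid, dot in dots.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(sentences[sid], score) for score, sid in scored[:top_k]]


if __name__ == "__main__":
    docs = [
        "transformer maintenance schedule for substation",
        "billing policy for residential customers",
        "substation transformer fault report",
    ]
    for sentence, score in rank_sentences("transformer fault", docs):
        print(f"{score:.3f}  {sentence}")
```

Because scores are accumulated from the inverted index, only sentences that share at least one term with the query are scored, which is the usual reason for pairing a word index with vector space ranking.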