In desktop search engines, handling binary file formats usually requires a separate parser for each format, which is complex and hard to maintain. Starting from an analysis of the open-source search engine Lucene, this paper proposes a desktop search engine framework based on Tika and Lucene that processes documents in different binary formats through a unified application programming interface. The entire framework is open source, and its modules are loosely coupled, so it is easy to extend. The implementation is built on the latest Lucene 4.1 release and provides full-text search over the documents in a desktop system. For index performance, in contrast to the two traditional approaches of parameter configuration tuning and memory buffer tuning, the framework uses the latest DWPT (documents writer per thread) technique. Experimental results show that indexing performance improves by 35%.
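The per-thread writer idea behind DWPT can be illustrated with a small toy model. The sketch below is not Lucene's implementation and does not use the Lucene or Tika APIs. It only assumes the general idea that each indexing thread buffers documents in a private in-memory segment and publishes that segment independently, so threads do not contend on a shared buffer. The names `PerThreadWriter`, `ToyIndex`, and `max_buffered_docs` are hypothetical, and the input strings stand in for text that a unified extraction layer would have produced from binary files.

```python
import threading
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


class PerThreadWriter:
    """Private in-memory buffer owned by exactly one indexing thread (hypothetical)."""

    def __init__(self, max_buffered_docs):
        self.max_buffered_docs = max_buffered_docs
        self.postings = defaultdict(set)  # term -> set of document ids
        self.doc_count = 0

    def add(self, doc_id, text):
        # No lock is needed here: only the owning thread touches this buffer.
        for term in text.lower().split():
            self.postings[term].add(doc_id)
        self.doc_count += 1
        return self.doc_count >= self.max_buffered_docs  # True means "flush me"

    def drain(self):
        # Hand over the buffered postings as one segment and start a fresh buffer.
        segment = dict(self.postings)
        self.postings = defaultdict(set)
        self.doc_count = 0
        return segment


class ToyIndex:
    """Toy index in which every thread flushes its own segments (hypothetical)."""

    def __init__(self, max_buffered_docs=2):
        self.max_buffered_docs = max_buffered_docs
        self.segments = []                 # published segments, searchable
        self._lock = threading.Lock()      # guards only the shared lists
        self._local = threading.local()    # holds one writer per thread
        self._writers = []

    def _writer(self):
        writer = getattr(self._local, "writer", None)
        if writer is None:
            writer = PerThreadWriter(self.max_buffered_docs)
            self._local.writer = writer
            with self._lock:
                self._writers.append(writer)
        return writer

    def add_document(self, doc_id, text):
        writer = self._writer()
        if writer.add(doc_id, text):
            self._publish(writer.drain())

    def _publish(self, segment):
        if segment:
            with self._lock:
                self.segments.append(segment)

    def close(self):
        # Call only after all indexing threads have finished.
        for writer in self._writers:
            self._publish(writer.drain())

    def search(self, term):
        hits = set()
        for segment in self.segments:
            hits |= segment.get(term.lower(), set())
        return sorted(hits)


# Usage: index four already-extracted documents from three worker threads.
docs = {
    1: "Lucene indexes text",
    2: "Tika extracts text",
    3: "desktop search",
    4: "Lucene search",
}
index = ToyIndex(max_buffered_docs=2)
with ThreadPoolExecutor(max_workers=3) as pool:
    list(pool.map(lambda item: index.add_document(*item), docs.items()))
index.close()
print(index.search("lucene"))
```

The only shared state is the list of published segments, which is touched briefly when a segment is handed over. How many segments result depends on how the documents are distributed across threads, but the search results do not. This toy model says nothing about the 35% figure reported above, which comes from the paper's own experiments.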