With the rapid development of Web technology, managing and searching massive volumes of data has become particularly important. Because this information is heterogeneous and dynamic, information integration depends on Web crawlers to fetch pages automatically for further processing. At the same time, some internal enterprise materials must remain confidential while still being available to different internal staff, and this tension between openness and restriction has become a bottleneck for enterprise development. To address this, the paper departs from the traditional form of resource sharing and provides enterprises with an efficient, convenient, and confidential resource-sharing management platform, the Enterprise Search Engine (ESE). It proposes a design and implementation method for an ESE over Deep Web pages based on a topical crawler, together with an indexing and search system built on the open-source Java library Lucene. The system was deployed experimentally on Deep Web sites in the telecommunications industry; in operation, the results met the design targets, and the system proved useful for search within that industry. Finally, the paper outlines future work on search precision, search speed, and countermeasures against spam pages and ranking fraud.