To improve the efficiency and accuracy of Web page crawling in domain-specific focused crawlers, an extended topic feature library (ETFL) is introduced into the crawler's Web page filtering algorithm. Each Web page is abstracted as a set of tag block nodes. A topic feature library extension algorithm expands the static feature items to generate the extended topic feature library, and a Web page topic feature item extraction algorithm extracts feature items from each page. While the crawler fetches pages, they are filtered by a Web page relevance decision method based on the extended topic feature library. The algorithm compensates for the lack of semantic-level page processing in traditional Web page filtering algorithms based on static keyword items. Results from a real project show that introducing the extended topic feature library into a focused crawler effectively improves crawling precision and that the approach is highly usable.
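The abstract does not give the algorithms themselves, so the following is only a minimal sketch of the pipeline it outlines, under stated assumptions. The tag weights in `TAG_WEIGHTS`, the hand-written `related` term map used for expansion, the `decay` weight for expanded terms, whitespace tokenization, and cosine similarity as the relevance measure are all hypothetical stand-ins, not the authors' method.

```python
import math
from collections import Counter
from html.parser import HTMLParser

# Hypothetical per-tag weights: text in titles and headings counts more.
TAG_WEIGHTS = {"title": 3.0, "h1": 2.0, "a": 1.5, "p": 1.0}


class BlockParser(HTMLParser):
    """Collects (tag, text) pairs for text found inside weighted tag blocks."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open weighted tags
        self.blocks = []  # (tag, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag in TAG_WEIGHTS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in TAG_WEIGHTS and tag in self.stack:
            # Pop up to and including the matching open tag.
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack:
            self.blocks.append((self.stack[-1], text))


def extract_features(html):
    """Return a term -> weight mapping built from the page's tag blocks."""
    parser = BlockParser()
    parser.feed(html)
    features = Counter()
    for tag, text in parser.blocks:
        for term in text.lower().split():  # assumption: whitespace tokens
            features[term] += TAG_WEIGHTS[tag]
    return features


def expand_library(static_terms, related, decay=0.5):
    """Extend static feature items with related terms at a reduced weight."""
    library = {term: 1.0 for term in static_terms}
    for term in static_terms:
        for rel in related.get(term, []):
            library.setdefault(rel, decay)
    return library


def relevance(features, library):
    """Cosine similarity between page features and the feature library."""
    dot = sum(w * library.get(t, 0.0) for t, w in features.items())
    norm_f = math.sqrt(sum(w * w for w in features.values()))
    norm_l = math.sqrt(sum(w * w for w in library.values()))
    return dot / (norm_f * norm_l) if norm_f and norm_l else 0.0


def is_relevant(html, library, threshold=0.1):
    """Filter decision: keep the page if its relevance reaches the threshold."""
    return relevance(extract_features(html), library) >= threshold
```

In this sketch, a page that mentions only "spider" and "scraping" scores zero against a static library containing just "crawler", but scores above zero once the library is expanded with those related terms. That is the kind of gap in static keyword filtering that the abstract describes.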