General-purpose web crawlers have made it far easier to find information on the WWW. Because they are comprehensive rather than domain-specific, however, they fall short in precision and speed; topic-oriented (focused) crawlers can compensate for these shortcomings. This paper addresses two problems in topic-oriented web crawling: how to define the topic adequately, and how to effectively order the links waiting in the crawler's download queue, so that many links to relevant pages are obtained while only a few irrelevant pages are visited. Drawing on the semi-structured information in web pages, it proposes a new content-based crawling strategy. Experimental results show that the strategy is an effective way to find topic-relevant pages.
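To make the two problems concrete, the following is a minimal sketch of a best-first link queue ordered by a content-based relevance score. It is an illustration under stated assumptions, not the strategy proposed in the paper: the topic terms, the field weights, the cosine-similarity scoring, and the names `TOPIC`, `FIELD_WEIGHTS`, `relevance`, and `Frontier` are all hypothetical.

```python
import heapq
import itertools
import math
import re
from collections import Counter

# Hypothetical topic definition: term -> weight (not taken from the paper).
TOPIC = {"crawler": 1.0, "focused": 0.8, "topic": 0.6}

# Hypothetical weights for semi-structured fields surrounding a link.
FIELD_WEIGHTS = {"title": 3.0, "anchor": 2.0, "body": 1.0}


def tokenize(text):
    """Split text into lowercase alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def relevance(fields, topic=TOPIC):
    """Cosine similarity between the topic vector and a field-weighted term vector."""
    vec = Counter()
    for name, text in fields.items():
        weight = FIELD_WEIGHTS.get(name, 1.0)
        for tok in tokenize(text):
            vec[tok] += weight
    dot = sum(w * vec[term] for term, w in topic.items())
    norm = (math.sqrt(sum(v * v for v in vec.values()))
            * math.sqrt(sum(w * w for w in topic.values())))
    return dot / norm if norm else 0.0


class Frontier:
    """Queue of links waiting to be downloaded; the highest score pops first."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._order = itertools.count()  # tie-breaker keeps insertion order

    def push(self, url, score):
        if url in self._seen:  # ignore links that were already queued
            return
        self._seen.add(url)
        # heapq is a min-heap, so the score is negated.
        heapq.heappush(self._heap, (-score, next(self._order), url))

    def pop(self):
        neg_score, _, url = heapq.heappop(self._heap)
        return url, -neg_score

    def __len__(self):
        return len(self._heap)
```

In this sketch, a crawler loop would call `relevance` on the title, anchor text, and surrounding text of each extracted link, push the link with that score, and always download the result of `pop` next, so links judged irrelevant sink to the back of the queue.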