Traditional focused crawlers cannot perform a topical crawl when the topic is unknown or when no training set for that topic is available. To give focused crawlers better topic adaptability, a clustering-based adaptive topic model is proposed. It guides the crawler through a topical crawl starting from only a small number of initial URLs that share the same topic, where the topic itself is not known in advance. The topic center vector is obtained by clustering the initial pages, and the position of the topic center is updated as relevant pages are found. URLs are ordered with a best-first strategy. Based on this model, a user-customized topic focused crawler was implemented. Comparative experiments show that a crawler using the model achieves a relatively high harvest rate.
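The pipeline summarized above (topic center from the initial pages, center updates from relevant pages, best-first URL ordering, harvest rate as the metric) can be illustrated with a minimal Python sketch. This is not the authors' implementation, and several simplifications are assumptions made only for illustration: a single centroid of term-frequency vectors stands in for the clustering step, relevance is cosine similarity against a hypothetical threshold of 0.3, a URL's priority is the score of the page that linked to it, and a hypothetical in-memory dictionary `WEB` replaces real page fetching. All function, class, and variable names are invented here.

```python
import heapq
import math
import re
from collections import Counter


def vectorize(text):
    """Bag-of-words term-frequency vector (assumption: no TF-IDF, no stemming)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term -> weight mappings."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class TopicModel:
    """Topic center kept as the running mean of all absorbed page vectors."""

    def __init__(self, seed_texts, threshold=0.3):  # threshold is hypothetical
        self.threshold = threshold
        self.count = 0
        self.center = {}
        for text in seed_texts:
            self._absorb(vectorize(text))

    def _absorb(self, vec):
        # Incremental mean: center += (vec - center) / count
        self.count += 1
        for term in set(self.center) | set(vec):
            old = self.center.get(term, 0.0)
            self.center[term] = old + (vec.get(term, 0.0) - old) / self.count

    def score(self, text):
        return cosine(vectorize(text), self.center)

    def observe(self, text):
        """Score a page; if it is relevant, move the topic center toward it."""
        s = self.score(text)
        relevant = s >= self.threshold
        if relevant:
            self._absorb(vectorize(text))
        return s, relevant


def crawl(web, seeds, budget, threshold=0.3):
    """Best-first crawl over `web` (url -> (text, links)); returns (visited, harvest rate)."""
    model = TopicModel([web[u][0] for u in seeds], threshold)
    seen = set(seeds)
    frontier = []  # min-heap of (-parent_score, url), so the best score pops first

    def enqueue(links, parent_score):
        for link in links:
            if link in web and link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-parent_score, link))

    for u in seeds:
        enqueue(web[u][1], model.score(web[u][0]))

    visited, hits = [], 0
    while frontier and len(visited) < budget:
        _, url = heapq.heappop(frontier)
        text, links = web[url]
        score, relevant = model.observe(text)
        visited.append(url)
        hits += relevant
        enqueue(links, score)
    # Harvest rate: relevant pages divided by pages fetched (seeds excluded).
    return visited, (hits / len(visited) if visited else 0.0)


# Hypothetical toy web graph standing in for real HTTP fetches.
WEB = {
    "seed1": ("python crawler fetches web pages", ["a", "b"]),
    "seed2": ("focused crawler ranks web pages by topic", ["c"]),
    "a": ("a focused web crawler follows topic pages", ["d"]),
    "b": ("recipe for chocolate cake with butter", ["e"]),
    "c": ("crawler frontier orders pages by topic score", []),
    "d": ("web crawler pages and topic vectors", []),
    "e": ("baking bread with flour and yeast", []),
}

if __name__ == "__main__":
    order, harvest = crawl(WEB, ["seed1", "seed2"], budget=10)
    print(order, harvest)
```

In this sketch the topic is never named: the crawler receives only the seed URLs, and the center drifts toward whatever vocabulary the relevant pages share. A real system would likely need a proper clustering step to discard outlier seeds and a weighting scheme such as TF-IDF; those choices are left open here.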