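To address the large volume, rapid updating, broad content, and download difficulty of video media on today's networks, as well as the slow processing, low concurrency, and slow downloads of single-machine video crawler systems, this paper proposes a video crawler system based on the Hadoop framework that provides highly concurrent processing and fast crawling. The MapReduce computation model implements the page fetching, analysis, deduplication, and download tasks, and the Hadoop Distributed File System (HDFS) stores the results of each stage of computation. A multi-replica backup mechanism allows the task set to be transferred when a node leaves, so the stability and effectiveness of the whole system are not affected. Experimental results show that the fully distributed Hadoop-based video crawler system clearly outperforms both the non-Hadoop system and the pseudo-distributed system, in video download rate per unit time and in the number of web pages crawled.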
Video content on today's networks is voluminous, wide-ranging, and frequently updated, and video crawler systems that run on a single PC (personal computer) suffer from slow processing, low concurrency, and slow downloads. To address these problems, a video content crawler system based on the Hadoop framework is proposed to achieve highly concurrent processing and fast crawling. The MapReduce computation model is used to implement crawling, analysis, duplicate removal, downloading, and other computing tasks, and the Hadoop Distributed File System (HDFS) provides the storage that accompanies the computation model. Experiments demonstrate that the Hadoop-based video content crawler system achieves a significantly higher download speed and crawls significantly more web pages than the single-machine and pseudo-distributed systems.
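
The duplicate-removal task mentioned above can be illustrated with a minimal in-memory sketch of the map, shuffle, and reduce steps. This is not the system's actual implementation: the function names (`map_phase`, `shuffle`, `reduce_phase`, `dedup_round`), the fragment-stripping normalization rule, and the example URLs are all assumptions made for illustration, and a real Hadoop job would read its input from and write its output to HDFS rather than Python dictionaries.

```python
from collections import defaultdict
from urllib.parse import urldefrag


def map_phase(page_url, extracted_links):
    """Map step (hypothetical): emit (normalized_link, source_page) per link."""
    for link in extracted_links:
        # Assumed normalization: trim whitespace and drop the #fragment part.
        normalized, _fragment = urldefrag(link.strip())
        if normalized:
            yield normalized, page_url


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(url, sources, already_crawled):
    """Reduce step (hypothetical): emit each unseen URL once.

    `sources` lists the pages that linked to `url`; it is unused here but
    shows where per-URL information would be available.
    """
    if url not in already_crawled:
        yield url


def dedup_round(pages, already_crawled):
    """Run one map/shuffle/reduce round and return the new crawl frontier."""
    pairs = []
    for page_url, links in pages.items():
        pairs.extend(map_phase(page_url, links))
    grouped = shuffle(pairs)
    frontier = []
    for url in sorted(grouped):  # sorted only to make the output deterministic
        frontier.extend(reduce_phase(url, grouped[url], already_crawled))
    return frontier


if __name__ == "__main__":
    pages = {
        "http://example.com/a": [
            "http://example.com/v/1",
            "http://example.com/v/2#t=10",
        ],
        "http://example.com/b": [
            "http://example.com/v/2",
            "http://example.com/v/3",
        ],
    }
    print(dedup_round(pages, already_crawled={"http://example.com/v/3"}))
```

Because every occurrence of a URL is routed to the same reduce key, the duplicate check needs no shared state between workers, which is one reason deduplication fits the MapReduce model.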