微博正逐步成为公共信息传播的主要媒体,高效地获取微博数据则显得至关重要,分析微博数据有助于研究者及时了解舆情信息.由于传统网页爬虫无法获取完整的微博信息,微博API又有诸多限制,因此针对新浪微博,设计了一种基于P2P技术的微博爬虫系统.该系统避免了新浪API的功能和连接限制,使用基于模拟登录的网页爬虫,根据用户的地理位置信息划分任务,实现连续高效的数据采集.通过与其他架构的试验比较,证明本系统具有良好的性能,能为舆情分析提供数据支持.
Microblog is becoming the main media to spread public information. Analyzing microblog data can contribute to timely knowing public information for researchers. Therefore, it is important to effectively collect microblog data. To solve the problems that the traditional web clawer could not inquire whole information and the API had lots of restrictions,a distributed crawler system was designed based on P2 P for SINA microblog. The crawler was based on simulated login technology and assigned tasks according to user position information to efficiently collect data continuously. The comparison results with other structures show that the proposed system has good performance to provide adequate data.