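To collect, comprehensively and in real time, the meteorological data scattered across the Internet and to meet the data needs of research departments in every industry, field, and discipline, this paper proposes an efficient focused crawler built on Cheerio and developed with the MEAN Stack full-stack technology, making full use of the high-performance I/O of Node.js to gather meteorological information quickly. The technology stack is also combined with geographic information system (GIS), data visualization, and cloud computing technologies: GIS functions such as data storage, query, automatic mapping, and statistical analysis are used to analyze and process the information, and an application system that can crawl and store massive amounts of data and provide real-time meteorological data was built on the Alibaba Cloud platform. The system offers convenient search and query functions and is highly practical. Drawing on this meteorological data crawler solution, the paper analyzes and implements in depth the development framework, project architecture, and core crawler techniques (crawl-target strategy, web page analysis algorithm, multithreaded concurrent computation, etc.) of a MEAN Stack data crawler.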
To collect, comprehensively and in real time, the meteorological data scattered across the Internet and to meet the data needs of research departments in various industries, fields, and disciplines, an efficient focused crawler was developed based on the MEAN (MongoDB + Express + AngularJS + Node.js) full-stack technology and Cheerio, a fast, flexible JavaScript Document Object Model (DOM) module. The crawler gathers weather information quickly, and the information is analyzed and processed through the GIS functions of data storage, query, automatic mapping, statistical analysis, and forecasting. An application system that updates and forecasts meteorological data in real time was deployed on an Alibaba Cloud server; it also provides massive data storage and convenient search and query. The resulting web application is efficient and practical: it not only offers an effective solution for collecting scattered online data but also presents the data intuitively using HTML5 data visualization technology. In an actual project, it supplied a large amount of supporting data and examples to weather-related fields such as forestry and preventive medicine. GIS data visualization is a constantly evolving concept whose boundaries are expanding fast. In the Internet age, and especially with the globalization of information, the long-term value of data has gained growing recognition, from small companies to national policy-making. It is therefore important to understand what GIS data visualization really is and how it can help.