With the development of Internet technology, data volume has grown explosively, and a single machine can no longer store, organize, and analyze such massive data. Given this situation, building a distributed computing platform is of great significance for future research and experimental teaching. This paper describes in detail how to build a distributed computing platform in a laboratory environment and compares the performance of Hadoop and Spark. The work covers the installation and deployment of Hadoop and Spark clusters, the setup of a Spark integrated development environment, and a comparison of the time taken to run K-means clustering on the same dataset on both platforms. The results provide practical guidance for constructing a distributed computing platform.