With the development of Internet technology, data volume has grown explosively, and a single machine can no longer store, organize, and analyze such massive data. Given this situation, building a distributed computing platform is of great significance for future research and experimental teaching. This paper describes in detail how to build a distributed computing platform in a laboratory environment and compares the performance of Hadoop and Spark. The work covers the installation and deployment of Hadoop and Spark clusters, the setup of a Spark integrated development environment, and a comparison of the time taken to run K-means clustering on the same dataset on both platforms. The results provide practical guidance for constructing a distributed computing platform.