

Research on OLAP Algorithms Based on Attributes and Relations

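ISSN: 1673-629X
Journal: Computer Technology and Development (计算机技术与发展)
Date: 2014.6.21
Pages: 99-102
Classification: TP301 [Automation and Computer Technology: Computer System Architecture; Automation and Computer Technology: Computer Science and Technology]
Author affiliation: School of Computer Science and Technology, Shandong Jianzhu University, Jinan 250101, Shandong, China
Funding: National Natural Science Foundation of China (61170052)
Related project: Research on OLAP Models and Algorithms for Constrained Analysis Objects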

Keywords: big data, MapReduce, load balancing, sampling

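Abstract (translated from the Chinese):

With the rapid development of the mobile Internet and the Internet of Things, data volumes are growing explosively, and we have entered the era of big data. MapReduce is a distributed computing framework capable of processing massive data sets, and it has become a research focus in the big data field. However, the performance of MapReduce depends heavily on how the data is distributed. When the data is skewed, MapReduce's default hash partitioning cannot guarantee load balance across nodes in the Reduce phase, and heavily loaded nodes delay the final completion time of the job. To address this problem, a sampling method is used. Before the user's job is executed, a MapReduce job is run to sample the input in parallel; once the frequency distribution of the keys has been obtained, it is combined with data locality to produce a load-balanced data assignment strategy. An experimental platform was built and a WordCount example was tested on it. The experimental results show that the sampling-based data partitioning strategy outperforms MapReduce's default hash partitioning, and that sampling-based partitioning combined with data locality outperforms sampling-based partitioning that ignores data locality.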

English abstract:

With the rapid development of the mobile Internet and the Internet of Things, data volumes are growing explosively, and we have entered the era of big data. As a distributed computing framework, MapReduce can process massive data sets and has become a focus of big data research. However, the performance of MapReduce depends heavily on the distribution of the data. When the data is skewed, MapReduce's default hash partition function cannot guarantee load balance, and the job completion time is determined by the node with the most data to process. To solve this problem, sampling is used: a MapReduce job samples the input before the user's job is executed. Once the key distribution is known, a load-balanced data partition is achieved that also exploits data locality. A WordCount example is tested on an experimental platform. The results show that sampling-based partitioning outperforms hash partitioning, and that sampling-based partitioning with data locality outperforms sampling-based partitioning without it.
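
The idea of partitioning from a sampled key distribution can be illustrated with a small single-process sketch. It is not the paper's algorithm: the abstract does not give the assignment rule, so the greedy "heaviest key to the least-loaded reducer" heuristic below is an assumption, the function names are hypothetical, and the data-locality step is left out entirely. The hash baseline uses CRC32 only to stay deterministic across runs.

```python
import random
import zlib
from collections import Counter


def sample_key_counts(records, rate, seed=0):
    """Estimate key frequencies from a Bernoulli sample of the input records."""
    rng = random.Random(seed)
    return Counter(key for key in records if rng.random() < rate)


def hash_partition(keys, num_reducers):
    """Baseline: assign each key by a deterministic hash modulo the reducer count."""
    return {k: zlib.crc32(k.encode("utf-8")) % num_reducers for k in keys}


def greedy_partition(key_counts, num_reducers):
    """Assumed heuristic: place keys, heaviest first, on the least-loaded reducer."""
    loads = [0] * num_reducers
    assignment = {}
    # Sort by descending count, then by key, so the result is deterministic.
    for key, count in sorted(key_counts.items(), key=lambda kv: (-kv[1], kv[0])):
        target = min(range(num_reducers), key=lambda r: loads[r])
        assignment[key] = target
        loads[target] += count
    return assignment


def reducer_loads(assignment, key_counts, num_reducers):
    """Total number of records each reducer would receive under an assignment."""
    loads = [0] * num_reducers
    for key, count in key_counts.items():
        loads[assignment[key]] += count
    return loads


if __name__ == "__main__":
    # A skewed WordCount-style input: a few words dominate.
    words = ["the"] * 50 + ["a"] * 30 + ["of"] * 10 + ["map"] * 5 + ["reduce"] * 5
    true_counts = Counter(words)
    sampled = sample_key_counts(words, rate=0.5)
    # Keys missed by the sample fall back to the hash assignment.
    plan = hash_partition(true_counts, 2)
    plan.update(greedy_partition(sampled, 2))
    print("hash loads:   ", reducer_loads(hash_partition(true_counts, 2), true_counts, 2))
    print("sampled loads:", reducer_loads(plan, true_counts, 2))
```

In a real MapReduce job the resulting key-to-reducer map would have to be supplied through a custom partitioner; the sketch only compares the reducer loads the two assignments would produce.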
