To address the problem of sampling big data, this paper uses MapReduce to produce a fixed-size sample whose content satisfies a given predicate. Because the default Hadoop [1] execution depends on the size of the input and wastes cluster resources, the paper extends the default Hadoop setup so that a job can dynamically manage its resource consumption on demand. Experimental results show that the proposed policy outperforms the default Hadoop policy, demonstrating that sampling big data with MapReduce is feasible and effective.
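The idea of predicate-based, fixed-size sampling with on-demand input consumption can be illustrated with a minimal single-process sketch. It is not the paper's Hadoop implementation: the names `map_split` and `sample_with_predicate` are hypothetical, the input splits are plain Python iterables, and the "sample" is simply the first `k` matching records rather than a statistically random one. The point it shows is that the amount of input read depends on how quickly the predicate is satisfied, not on the total input size.

```python
from typing import Callable, Iterable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def map_split(split: Iterable[T], predicate: Callable[[T], bool], limit: int) -> List[T]:
    """Hypothetical map task: emit at most `limit` records of one split
    that satisfy the predicate."""
    matches: List[T] = []
    for record in split:
        if len(matches) >= limit:
            break
        if predicate(record):
            matches.append(record)
    return matches


def sample_with_predicate(
    splits: Sequence[Iterable[T]],
    predicate: Callable[[T], bool],
    k: int,
) -> Tuple[List[T], int]:
    """Hypothetical driver: feed splits to map tasks one at a time and stop
    as soon as k matching records have been collected.

    Returns the sample and the number of splits that were actually read.
    """
    sample: List[T] = []
    splits_read = 0
    for split in splits:
        if len(sample) >= k:
            break  # enough matches: the remaining splits are never processed
        sample.extend(map_split(split, predicate, k - len(sample)))
        splits_read += 1
    return sample, splits_read


# Assumed toy input: three splits of 100 integers each.
splits = [range(0, 100), range(100, 200), range(200, 300)]
sample, splits_read = sample_with_predicate(splits, lambda x: x % 10 == 0, k=12)
print(len(sample), splits_read)
```

In this toy run the third split is never touched, which is the kind of saving the dynamic policy targets; a default job would schedule map tasks for every split regardless of how many matching records are needed.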