The scale of big data poses great challenges to data storage, management, and analysis, making high-efficiency, low-cost big data processing technology a research hotspot in both academia and industry. To improve the execution efficiency of the collaborative filtering algorithm, this paper decomposes the algorithm's execution steps under the MapReduce architecture and analyzes its performance defects. Given that Spark is well suited to iterative and interactive tasks, an improvement strategy of migrating the algorithm from the MapReduce platform to the Spark platform is proposed. The implementation flow of the algorithm on Spark is designed, and its efficiency is further improved through parameter tuning and memory optimization. Experimental results show that, compared with its MapReduce counterpart, the algorithm based on Spark's DAG scheduling reduces redundant HDFS I/O operations by more than 65%, while execution efficiency and energy efficiency improve by nearly 200% and 50%, respectively.
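To make the workload concrete, the core collaborative filtering step that benefits from Spark's in-memory caching can be sketched as below. This is a minimal, illustrative pure-Python sketch: the abstract does not name the exact variant, so item-based filtering with cosine similarity is an assumption, and the function and data names are hypothetical. On Spark, each pairwise similarity would be computed over cached RDD partitions rather than re-read from HDFS in every iteration, which is the source of the I/O savings reported above.

```python
from math import sqrt

def cosine_similarity(ratings_a, ratings_b):
    """Cosine similarity between two items' rating vectors.

    Each argument is a dict mapping user id -> rating; only users who
    rated both items contribute to the dot product.
    """
    common = set(ratings_a) & set(ratings_b)
    if not common:
        return 0.0
    dot = sum(ratings_a[u] * ratings_b[u] for u in common)
    norm_a = sqrt(sum(r * r for r in ratings_a.values()))
    norm_b = sqrt(sum(r * r for r in ratings_b.values()))
    return dot / (norm_a * norm_b)

# Toy rating data: item id -> {user id: rating} (hypothetical example values)
item_ratings = {
    "item1": {"u1": 4.0, "u2": 5.0, "u3": 1.0},
    "item2": {"u1": 4.0, "u2": 5.0},
}

sim = cosine_similarity(item_ratings["item1"], item_ratings["item2"])
```

In a MapReduce implementation, each pass over the item-pair space re-reads the rating matrix from HDFS; in the Spark design described here, the rating data would instead be cached in memory once and reused across iterations.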