不动点迭代广泛存在于数据挖掘和机器学习算法中,这些算法已应用到诸如社会网络分析、高性能计算、推荐系统、搜索引擎、模式识别等诸多领域中.在云计算环境中,利用MapReduce编程模型所带来的便利,通过普通的PC集群运行相应的迭代算法,可以提高迭代算法的执行效率.但由于数据的快速变化,每当数据发生改变,整个迭代算法也需要重新运行,这将会导致大量的运算资源浪费和性能损失.文中研究基于原始迭代结果和新增数据的增量迭代计算DELTA(Delta data based incrEmentaL iTerAtive computing),并提出DELTA模型以解决上述问题.文中理论证明了DELTA模型的正确性,阐述了其适用范围,并列举了PageRank、K-means和Descendant Query算法在DELTA模型中的运用.文中还扩展HaLoop为ΔHaLoop框架,使其支持增量式的迭代计算.通过一系列的测试用例,对DELTA模型功能、性能进行了分析和讨论,实验结果表明DELTA模型在获得准确的迭代结果的基础上性能优势明显.文中提出的DELTA模型能够适应多数迭代算法,对云计算环境下的迭代计算的应用和优化起到推动作用.
The fixed point iterative algorithms widely exist in the area of data mining and machine learning, which have been applied in many fields, such as social network analysis, high- performance computing and recommended system. In cloud computing environment, we can utilize the convenience brought by MapReduce to improve the efficiency of iterative algorithms on big data through running the algorithm on larger PC-cluster. However, the entire iterative algorithm has to be re-executed when new data is introduced, which cause large amount of computing resource wastes and performance losses. In this paper, the original iterative results new data based incremental iterative computing, which is named as DELTA(Delta data based incrEmentaL iTerAtive computing), is well studied, and the corresponding DELTA model is proposed. We prove the correctness of the model, and describe the application scope. Then, the application cases of DELTA model applying on the iterative algorithms are enumerated, such like PageRank,K-means and Descendant Query. Finally, AHaLoop is implemented by extending HaLoop to support DELTA model. A series of test cases are designed to analyze the DELTA model on functionality and performance. The results show that the model improves the iteration performance without any loss of accuracy. The DELTA model proposed in this paper can adapt many iterative algorithms, which promotes the application and optimization of iterative algorithms in cloud computing environment.