To address the inability of the Range partition algorithm to optimize two-table joins on heavily skewed data sets, we propose an improved skew join algorithm. The algorithm treats skewed and non-skewed data separately, uses replication and broadcasting to send data to every Reduce node, and completes all join operations in a single round of Map/Reduce tasks. It effectively balances the workload of each Reduce node, thereby resolving the impact of severe data skew on two-table join performance. Comparison with the traditional partition join algorithm shows that the proposed algorithm is effective.
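The separate treatment of skewed and non-skewed data can be illustrated with a minimal single-process sketch. It is not the paper's implementation: the abstract does not specify how skewed keys are detected or routed, so the sketch assumes a frequency threshold on the left table, round-robin scattering of skewed left tuples, and broadcasting of the matching right tuples to every reducer. The names `skew_join`, `num_reducers`, and `skew_threshold` are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import count


def skew_join(left, right, num_reducers, skew_threshold):
    """Simulate one map/reduce round joining two lists of (key, value) pairs.

    Returns the joined tuples and the number of left tuples per reducer.
    """
    # Assumption: a key is "skewed" when it occurs more than
    # skew_threshold times in the left table.
    freq = Counter(key for key, _ in left)
    skewed = {key for key, n in freq.items() if n > skew_threshold}

    # One bucket per simulated Reduce node.
    buckets = [{"L": [], "R": []} for _ in range(num_reducers)]

    # Map phase, left table: scatter skewed tuples round-robin,
    # hash-partition the rest.
    round_robin = count()
    for key, value in left:
        if key in skewed:
            target = next(round_robin) % num_reducers
        else:
            target = hash(key) % num_reducers
        buckets[target]["L"].append((key, value))

    # Map phase, right table: broadcast tuples with skewed keys to every
    # reducer, hash-partition the rest.
    for key, value in right:
        if key in skewed:
            for bucket in buckets:
                bucket["R"].append((key, value))
        else:
            buckets[hash(key) % num_reducers]["R"].append((key, value))

    # Reduce phase: each reducer runs a local hash join on its bucket.
    output, loads = [], []
    for bucket in buckets:
        index = defaultdict(list)
        for key, value in bucket["R"]:
            index[key].append(value)
        for key, left_value in bucket["L"]:
            for right_value in index.get(key, []):
                output.append((key, left_value, right_value))
        loads.append(len(bucket["L"]))
    return output, loads
```

Under these assumptions, a left table holding 90 tuples of one hot key and 3 reducers yields exactly 30 hot tuples per reducer, whereas plain hash partitioning would send all 90 to a single reducer. The cost is that the right-table tuples for the hot key are copied to every reducer, so the sketch only pays off when the skewed key is far more frequent in one table than in the other.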