When Hadoop processes multi-table joins, each round writes a large volume of intermediate results to local disk, which severely reduces the processing efficiency of the system. To address this problem, a "Replace-Query" method is proposed. The method builds indexes on the joined tables and replaces the tuple sets that would otherwise be output with index information in the intermediate results. The tables then take part in the whole multi-table join in index form, which reduces the I/O cost of the intermediate results, and the full records are recovered afterwards by querying the indexes. A buffer pool, secondary sort, and multithreading are used to manage the index information and speed up index lookups. Finally, a comparative experiment against the original Hadoop was designed on the TPC-H data set. The results show that the method reduces storage space by 35.5% and improves running efficiency by 12.9%.
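The replace-then-query idea can be illustrated with a small in-memory sketch. This is not the paper's Hadoop implementation: the function names (build_index, join_on_ids, recover), the use of row positions as index entries, and the toy tables are all assumptions made for illustration, and the buffer pool, secondary sort, and multithreading are omitted. The sketch only shows the core idea that the intermediate result carries compact index entries instead of full tuples, and that the full records are rebuilt by lookup at the end.

```python
from collections import defaultdict


def build_index(table):
    """Assign each tuple a compact row id (hypothetical index format)."""
    return {row_id: row for row_id, row in enumerate(table)}


def join_on_ids(left_index, left_key, right_index, right_key):
    """Equi-join two indexed tables, emitting only (left_id, right_id) pairs.

    The pairs stand in for the intermediate result: they reference the
    tuples through the index rather than copying the tuples themselves.
    """
    buckets = defaultdict(list)
    for rid, row in right_index.items():
        buckets[row[right_key]].append(rid)

    pairs = []
    for lid, row in left_index.items():
        for rid in buckets.get(row[left_key], []):
            pairs.append((lid, rid))
    return pairs


def recover(pairs, left_index, right_index):
    """'Query' step: look up each id pair to rebuild the full joined records."""
    return [left_index[lid] + right_index[rid] for lid, rid in pairs]


# Toy data, assumed for illustration only.
customers = [(1, "Alice"), (2, "Bob")]                        # (cust_id, name)
orders = [(100, 1, 25.0), (101, 2, 40.0), (102, 1, 15.5)]     # (order_id, cust_id, total)

cust_idx = build_index(customers)
order_idx = build_index(orders)

intermediate = join_on_ids(cust_idx, 0, order_idx, 1)   # small id pairs only
joined = recover(intermediate, cust_idx, order_idx)      # full records restored
```

In this sketch the intermediate list holds three pairs of integers, while the recovered output holds three five-field records. The saving grows with tuple width, which is presumably why the approach helps for wide tables in multi-round joins.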