To address frequent data redundancy and inefficient data reuse, this paper combines column-oriented storage with parallel processing to optimize the data reuse strategy. It builds a MapReduce-based parallel processing model for data reuse and proposes a parallel data reuse algorithm that combines an improved CSM pattern matching algorithm with the data filtering algorithms used in data mining. The algorithm uses pattern matching on data attributes to determine the correspondence between attribute columns, applies data checks to verify that reusing an attribute column is feasible, and then filters the attribute columns accordingly, yielding a parallel data reuse strategy. In a data warehouse under a big data environment, experiments were run on data drawn from the large-scale benchmark data sets SSB and TPCH. Storage volume and processing time fell by 17% and 35%, respectively, which verifies that the parallel data reuse strategy is more efficient than an ordinary data reuse strategy in both data storage and data processing time.
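The three steps named in the abstract (matching attribute columns, checking the data, and filtering the columns to reuse) can be illustrated with a minimal Python sketch. It is not the paper's improved CSM algorithm, whose details the abstract does not give: name similarity via `difflib` stands in for pattern matching, a value-overlap ratio stands in for the data check, and plain `map` calls stand in for a MapReduce job. The thresholds, function names, and sample tables are all hypothetical.

```python
from difflib import SequenceMatcher
from itertools import product

# Hypothetical thresholds; the abstract does not state any values.
NAME_THRESHOLD = 0.6
OVERLAP_THRESHOLD = 0.8


def name_similarity(a, b):
    """Similarity of two column names in [0, 1] (stand-in for pattern matching)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def map_match(pair):
    """Map step: emit (source column, (target column, score)) for plausible matches."""
    (src_name, src_vals), (dst_name, dst_vals) = pair
    if not src_vals or not dst_vals:
        return []
    same_type = type(src_vals[0]) is type(dst_vals[0])
    score = name_similarity(src_name, dst_name)
    if same_type and score >= NAME_THRESHOLD:
        return [(src_name, (dst_name, score))]
    return []


def value_overlap(src_vals, dst_vals):
    """Data check: fraction of distinct source values that also occur in the target."""
    src, dst = set(src_vals), set(dst_vals)
    return len(src & dst) / len(src) if src else 0.0


def reduce_select(candidates, source, target):
    """Reduce step: per source column, keep the best match that passes the data check."""
    best = {}
    for src_name, (dst_name, score) in candidates:
        if value_overlap(source[src_name], target[dst_name]) < OVERLAP_THRESHOLD:
            continue
        if src_name not in best or score > best[src_name][1]:
            best[src_name] = (dst_name, score)
    return {src: dst for src, (dst, _) in best.items()}


def reusable_columns(source, target):
    """Return {source column: target column} for columns whose data can be reused."""
    pairs = product(source.items(), target.items())
    candidates = [kv for emitted in map(map_match, pairs) for kv in emitted]
    return reduce_select(candidates, source, target)


# Column-oriented tables: each key is a column name, each value the column's data.
source = {"cust_key": [1, 2, 3, 4], "cust_name": ["ann", "bob", "cy", "dee"]}
target = {
    "custkey": [1, 2, 3, 4, 5],
    "c_name": ["ann", "bob", "cy", "dee"],
    "price": [9.5, 3.0],
}
mapping = reusable_columns(source, target)
```

Because every column pair is scored independently, the map step could in principle be distributed across workers, with the reduce step grouping candidates by source column; the sketch runs sequentially only to stay self-contained.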