Data reuse is an important means of saving storage space and improving query efficiency in data warehouse management. Column-store technology stores the data of each attribute contiguously, which greatly improves the performance of read-optimized analytical applications such as data warehouses and also makes data reuse more feasible and flexible. This paper therefore proposes a data reuse strategy for column-store data warehouses. First, schema matching is used to discover candidate reusable columns and to exclude the large number of columns that cannot be reused; the candidates are then screened and filtered further, which greatly reduces the complexity of detecting reusable data. Second, for the data confirmed as reusable, column-store-based reuse methods are given for raw data columns, compressed data columns, and index data columns. Finally, a technique for executing queries over the reused data is presented. Experiments on a large-scale data warehouse benchmark data set confirm that the strategy reduces storage volume, shortens data loading time, and improves query performance.
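The two-stage detection described above (a cheap schema-level pass to find candidates, followed by a content-level filter) can be illustrated with a minimal sketch. This is not the paper's algorithm: the `Column` type, the name-and-type matching rule, the SHA-256 fingerprint, and the catalog layout are all assumptions made for illustration, and the sketch treats a column as reusable only when it is identical, value for value and in the same order, to a column that is already stored.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    """Hypothetical in-memory stand-in for one stored column."""
    table: str
    name: str
    dtype: str
    values: tuple


def schema_candidates(columns):
    """Stage 1 (assumed rule): columns sharing a name and type are candidates."""
    groups = {}
    for col in columns:
        groups.setdefault((col.name.lower(), col.dtype), []).append(col)
    # Columns with no schema-level partner are excluded without reading data.
    return [group for group in groups.values() if len(group) > 1]


def fingerprint(col):
    """Cheap content summary: row count plus a hash over the values."""
    digest = hashlib.sha256()
    for value in col.values:
        digest.update(repr(value).encode("utf-8"))
        digest.update(b"\x00")  # separator between consecutive values
    return len(col.values), digest.hexdigest()


def detect_reusable(columns):
    """Stage 2: keep only candidates whose contents match a stored column.

    Returns a catalog mapping (table, name) of each redundant column to the
    (table, name) of the physical column that can serve it.
    """
    catalog = {}
    for group in schema_candidates(columns):
        stored = {}
        for col in group:
            fp = fingerprint(col)
            match = stored.get(fp)
            # Confirm with a full comparison to rule out hash collisions.
            if match is not None and match.values == col.values:
                catalog[(col.table, col.name)] = (match.table, match.name)
            else:
                stored.setdefault(fp, col)
    return catalog


def resolve(catalog, table, name):
    """Query-time lookup: redirect a column reference to its physical copy."""
    return catalog.get((table, name), (table, name))
```

Under these assumptions, a query that references a redundant column would call `resolve` first and read the shared physical copy; compressed and index columns would need their own reuse rules, which this sketch does not cover.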