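XML has become the standard for data storage and data exchange in a wide range of network applications. The greatest difficulty in managing XML data is that structure and data are stored together, which produces a large amount of redundancy and greatly increases the cost of storing, exchanging, and processing XML data. Compressing XML documents can alleviate this problem to some extent. However, most existing XML compression methods compress only the redundant information within a single document. This paper exploits the similarity among XML documents and proposes XCluster, a query-supporting method for the compressed storage of multiple XML documents. XCluster first clusters the XML document set hierarchically using an improved pq-gram approximate distance on rooted, ordered, labeled trees. It then merges the structures of the documents in each resulting cluster into a structural representative, which is compressed with dictionary encoding. At the same time, it merges the value content that appears under the same tag in different documents and compresses it with an encoding chosen according to its data type. Experimental results show that, on both real and generated multi-document XML data sets, XCluster achieves better compression and query efficiency than XGrind and XQilla.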
XML is the de facto standard for data exchange and data storage in network applications. The main problem in managing XML data is the redundancy caused by mixing structure with data, which raises the cost of storing, exchanging, and processing XML data. Data compression techniques can reduce this redundancy. However, most existing XML compression methods reduce only the redundancy within each single XML document and ignore the redundancy among XML documents. This paper presents XCluster, a new XML compression method that exploits the similarity among XML documents. Queries can be evaluated directly on the compressed XML documents that XCluster generates. XCluster first clusters the input XML documents hierarchically, using an improved pq-gram approximate distance between rooted, ordered, labeled trees. It then compresses the structures in each cluster of XML documents by merging them into a single representative structure. Finally, it places the data of nodes that share a tag into the same bucket and encodes each bucket with an algorithm suited to its data type. Extensive experiments on both real and synthetic data sets show that XCluster outperforms XGrind and XQilla in both compression ratio and query processing efficiency.
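The clustering step can be illustrated with a minimal sketch. It uses the standard pq-gram distance on tag trees, not the improved variant that XCluster relies on, whose details are not given here. The function names (`pq_gram_profile`, `pq_gram_distance`, `cluster_documents`), the defaults p = 2 and q = 3, the average-linkage rule, and the 0.6 merge threshold are all assumptions made for illustration.

```python
from collections import Counter
import xml.etree.ElementTree as ET

DUMMY = "*"  # placeholder label for the padding nodes of the extended tree


def pq_gram_profile(root, p=2, q=3):
    """Return the bag (multiset) of pq-grams of a tag tree.

    Each pq-gram is a tuple of p stem labels (an anchor node and its
    ancestors) followed by q base labels (consecutive children).
    """
    profile = Counter()

    def visit(node, stem):
        stem = stem[1:] + (node.tag,)      # shift the anchor into the stem
        base = (DUMMY,) * q
        children = list(node)
        if not children:                   # leaf: one gram with an all-dummy base
            profile[stem + base] += 1
            return
        for child in children:             # slide the base window over the children
            base = base[1:] + (child.tag,)
            profile[stem + base] += 1
            visit(child, stem)
        for _ in range(q - 1):             # trailing dummy padding
            base = base[1:] + (DUMMY,)
            profile[stem + base] += 1

    visit(root, (DUMMY,) * p)
    return profile


def pq_gram_distance(xml_a, xml_b, p=2, q=3):
    """Standard pq-gram distance in [0, 1]; 0 means identical profiles."""
    pa = pq_gram_profile(ET.fromstring(xml_a), p, q)
    pb = pq_gram_profile(ET.fromstring(xml_b), p, q)
    shared = sum((pa & pb).values())       # size of the bag intersection
    total = sum(pa.values()) + sum(pb.values())
    return 1.0 - 2.0 * shared / total


def cluster_documents(docs, threshold=0.6, p=2, q=3):
    """Agglomerative clustering (average linkage, an assumed choice).

    Merging stops once the closest pair of clusters is farther apart
    than `threshold`. Returns lists of indices into `docs`.
    """
    n = len(docs)
    dist = {(i, j): pq_gram_distance(docs[i], docs[j], p, q)
            for i in range(n) for j in range(i + 1, n)}

    def linkage(a, b):
        pairs = [dist[min(i, j), max(i, j)] for i in a for j in b]
        return sum(pairs) / len(pairs)

    clusters = [[i] for i in range(n)]
    while len(clusters) > 1:
        d, ia, ib = min((linkage(a, b), ia, ib)
                        for ia, a in enumerate(clusters)
                        for ib, b in enumerate(clusters) if ia < ib)
        if d > threshold:
            break
        clusters[ia] = clusters[ia] + clusters[ib]
        del clusters[ib]
    return clusters
```

Under these assumptions, documents that share most of their tag structure end up in one cluster, whose structures could then be merged into a single representative, while structurally unrelated documents stay apart.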