XCluster: A Clustering-Based Compression Method for Multiple XML Documents That Supports Queries

ISSN: 1000-1239
Journal: Journal of Computer Research and Development (《计算机研究与发展》)
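Classification: TP391 [Automation and Computer Technology: Computer Application Technology; Automation and Computer Technology: Computer Science and Technology]
Author affiliation: [1] School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001
Funding: National Basic Research Program of China ("973" Program) (2006CB303000); Key Program of the National Natural Science Foundation of China (60533110); National Natural Science Foundation of China (60703012, 60773068); Heilongjiang Province Youth Science and Technology Special Fund (QC06C033); National High Technology Research and Development Program of China ("863" Program) (2009AA01Z149); NSFC/RGC Joint Research Fund (60831160525)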

Keywords: tree-structured XML, XML compression, pq-gram, hierarchical clustering, query processing

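Chinese abstract (translated):

XML has become the standard for data storage and data exchange in all kinds of network applications. The greatest difficulty in managing XML data is that structure and data are stored together, which produces a large amount of redundancy and greatly increases the cost of storing, exchanging, and processing XML data. Compressing XML documents can alleviate this problem to some extent, but most existing XML compression methods only compress the redundant information within a single document. Exploiting the similarity between XML documents, this paper proposes XCluster, a query-supporting method for the compressed storage of multiple XML documents. XCluster first clusters the XML document collection hierarchically, using an improved pq-gram approximate distance defined on rooted, ordered, labeled XML trees. It then merges the structures of the documents in each resulting cluster into a structural representative, which it compresses with dictionary encoding. At the same time, it merges the value content found under the same tag in different documents and compresses it with an encoding chosen according to its data type. Experimental results show that, on both real and generated multi-document XML datasets, XCluster achieves better compression and query efficiency than XGrind and XQilla.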
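
The clustering step described in the abstract above depends on a tree distance between documents. As a rough illustration only, the sketch below computes the standard pq-gram distance between two XML documents. It is not the improved variant that the paper proposes, whose details the abstract does not give. The function names `pq_gram_profile` and `pq_gram_distance`, the defaults `p=2` and `q=3`, and the use of `*` as the dummy label are assumptions made for this example.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def pq_gram_profile(root, p=2, q=3):
    """Return the bag of pq-grams (tuples of p + q labels) of an element tree.

    Each pq-gram joins a stem (a node and its p - 1 nearest ancestors) with a
    base (q consecutive children). Missing positions are padded with "*".
    """
    profile = Counter()

    def visit(node, ancestors):
        # Slide the stem window down by one level.
        stem = (ancestors + (node.tag,))[-p:]
        base = ("*",) * q
        children = list(node)
        if not children:
            # A leaf contributes a single pq-gram with an all-dummy base.
            profile[stem + base] += 1
            return
        for child in children:
            # Slide the base window across the ordered children.
            base = base[1:] + (child.tag,)
            profile[stem + base] += 1
            visit(child, stem)
        # Flush the window with q - 1 trailing dummies.
        for _ in range(q - 1):
            base = base[1:] + ("*",)
            profile[stem + base] += 1

    visit(root, ("*",) * p)
    return profile


def pq_gram_distance(xml_a, xml_b, p=2, q=3):
    """Standard pq-gram distance in [0, 1]: 0 for identical profiles."""
    profile_a = pq_gram_profile(ET.fromstring(xml_a), p, q)
    profile_b = pq_gram_profile(ET.fromstring(xml_b), p, q)
    shared = sum((profile_a & profile_b).values())  # bag intersection size
    total = sum(profile_a.values()) + sum(profile_b.values())
    return 1.0 - 2.0 * shared / total
```

Pairwise distances computed this way could serve as the input to an agglomerative (hierarchical) clustering routine, so that structurally similar documents end up in the same cluster before their structures are merged.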

English abstract:

XML is the de facto standard for data exchange and data storage in network applications. The main problem in managing XML data is the redundancy caused by mingling structure and data, which makes storing, exchanging, and processing XML data costly. Data compression techniques can reduce this redundancy. However, most existing XML compression methods only reduce the redundancy within each single XML document and ignore the redundancy among XML documents. This paper presents XCluster, a new XML compression method that exploits the similarity among XML documents. Queries can be evaluated directly on the compressed XML documents that XCluster generates. XCluster first clusters the input XML documents hierarchically, using an improved pq-gram approximate distance between rooted ordered labeled trees. It then compresses the structures in each cluster of XML documents by merging them into a representative structure. Finally, it puts the data of nodes with the same tag into the same bucket and encodes the data in each bucket with an algorithm suited to the data type. Extensive experiments on both real and synthetic datasets show that XCluster outperforms XGrind and XQilla in both compression ratio and query processing efficiency.
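
The last step of the method, grouping values by tag and choosing an encoding by data type, can be pictured with a small sketch. This is a simplified illustration under stated assumptions, not the paper's actual encoder: the function names `bucket_values` and `encode_bucket` are hypothetical, only two types are distinguished (integers and other strings), and plain dictionary encoding stands in for whatever type-specific algorithms XCluster uses.

```python
from collections import defaultdict
import xml.etree.ElementTree as ET


def bucket_values(xml_docs):
    """Collect non-empty element text from all documents, grouped by tag."""
    buckets = defaultdict(list)
    for doc in xml_docs:
        for elem in ET.fromstring(doc).iter():
            text = (elem.text or "").strip()
            if text:
                buckets[elem.tag].append(text)
    return dict(buckets)


def encode_bucket(values):
    """Choose a simple encoding from the inferred type of a bucket.

    Buckets whose values all parse as integers are stored as numbers.
    Any other bucket is dictionary-encoded: distinct strings are stored
    once and each value is replaced by its index in that dictionary.
    """
    try:
        return ("int", [int(v) for v in values])
    except ValueError:
        dictionary = sorted(set(values))
        index = {value: i for i, value in enumerate(dictionary)}
        return ("dict", dictionary, [index[v] for v in values])


docs = [
    "<book><title>XML</title><year>2009</year></book>",
    "<book><title>XML</title><year>2010</year></book>",
]
buckets = bucket_values(docs)
encoded = {tag: encode_bucket(values) for tag, values in buckets.items()}
```

Because values of one tag from different documents land in the same bucket, repeated strings such as a shared title are stored once, which is the cross-document redundancy the abstract refers to.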
