COX: A High-Compression-Ratio Compression Technique for Chinese XML Documents

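Journal: 计算机工程与应用 (Computer Engineering and Applications)
Date: 0
Pages: 1725-1732
Language: Chinese
Classification: TP391 [Automation and Computer Technology: Computer Application Technology; Automation and Computer Technology: Computer Science and Technology]
Author affiliations: [1] School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan 430074; [2] Institute of Computer Application, China Academy of Engineering Physics, Mianyang, Sichuan 621900
Funding: Joint Fund of the National Natural Science Foundation of China and the China Academy of Engineering Physics (No. 10876012)
Related project: Research on compressed and encrypted storage technology for massive data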

Keywords: Chinese XML document, data compression, Chinese word segmentation, dictionary

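Chinese abstract (translated):

Widely used XML compression algorithms do not take the characteristics of Chinese into account. Combining the characteristics of Chinese and of XML, this paper proposes COX, a high-compression-ratio algorithm suited to Chinese XML documents. Chinese word segmentation is applied to the XML document, word frequencies are counted to obtain a sorted dictionary, and high-frequency and long words are then compressed and encoded following the idea of Huffman coding. After the XML document is parsed, its elements are classified, and elements of the same type are placed in the same container. The algorithm also gives special treatment to numeric data. Experimental results show that, compared with general-purpose compression software, COX achieves better compression, although compression and decompression are somewhat slower.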
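
The dictionary and coding step described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the text has already been split into words by some Chinese word segmenter, it builds an ordinary Huffman code over all words (the abstract says only high-frequency and long words are coded), and the function names `build_huffman_codes`, `encode`, and `decode` are hypothetical.

```python
import heapq
from collections import Counter
from itertools import count


def build_huffman_codes(tokens):
    """Return {token: bitstring} for a list of already-segmented words."""
    freq = Counter(tokens)  # word-frequency dictionary
    if not freq:
        return {}
    if len(freq) == 1:
        # A single distinct token still needs a one-bit code.
        return {next(iter(freq)): "0"}

    tie = count()  # tie-breaker so the heap never has to compare dicts
    heap = [(n, next(tie), {tok: ""}) for tok, n in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Merge the two least frequent subtrees, prefixing their codes.
        n1, _, left = heapq.heappop(heap)
        n2, _, right = heapq.heappop(heap)
        merged = {tok: "0" + code for tok, code in left.items()}
        merged.update({tok: "1" + code for tok, code in right.items()})
        heapq.heappush(heap, (n1 + n2, next(tie), merged))
    return heap[0][2]


def encode(tokens, codes):
    """Concatenate the code of each token into one bit string."""
    return "".join(codes[tok] for tok in tokens)


def decode(bits, codes):
    """Invert encode(); works because Huffman codes are prefix-free."""
    inverse = {code: tok for tok, code in codes.items()}
    out, buf = [], ""
    for bit in bits:
        buf += bit
        if buf in inverse:
            out.append(inverse[buf])
            buf = ""
    return out


# Hypothetical segmenter output for a short piece of element text.
tokens = ["压缩", "算法", "压缩", "文档", "压缩", "算法"]
codes = build_huffman_codes(tokens)
bits = encode(tokens, codes)
```

In this sketch the most frequent word receives the shortest code, which is the property a frequency-sorted dictionary is meant to exploit. A real compressor would also have to store the dictionary alongside the bit stream so that the decoder can rebuild the codes.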

English abstract:

To overcome the shortcoming of current XML compression algorithms, which do not distinguish between Chinese characters and English words, this paper presents COX, a Chinese-oriented XML compressor with a high compression ratio. Input documents are preprocessed with Chinese word segmentation, a sorted dictionary is obtained by counting word frequencies, and high-frequency and long words are then encoded using Huffman coding. The items in an XML document are classified by parsing the document, and items with the same class tag are sent to the same container. In addition, numerical data receive special handling in COX. Experimental results show that, compared with general-purpose compression algorithms, COX achieves a higher compression ratio when the XML documents contain more Chinese words, at the cost of longer compression and decompression times.
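
The container step can be sketched in a similar spirit. The abstract does not say how items are classified or what the special handling of numerical data consists of, so the choices here are assumptions: elements are grouped by tag name, and values that are plain ASCII integers are stored as integers in a separate set of containers. The names `split_into_containers` and `is_plain_integer` are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def is_plain_integer(value):
    """True only for ASCII digit strings that survive an int round trip."""
    return value.isascii() and value.isdigit() and str(int(value)) == value


def split_into_containers(xml_text):
    """Group element text by tag; integer values go to numeric containers."""
    text_containers = defaultdict(list)
    numeric_containers = defaultdict(list)
    root = ET.fromstring(xml_text)
    for elem in root.iter():  # document order
        value = (elem.text or "").strip()
        if not value:
            continue  # structural element without text of its own
        if is_plain_integer(value):
            numeric_containers[elem.tag].append(int(value))
        else:
            text_containers[elem.tag].append(value)
    return dict(text_containers), dict(numeric_containers)


# Hypothetical sample document with Chinese tag names and content.
sample = (
    "<库>"
    "<书><名>数据压缩</名><价>45</价></书>"
    "<书><名>中文分词</名><价>30</价></书>"
    "</库>"
)
text_containers, numeric_containers = split_into_containers(sample)
```

Grouping values by tag places similar data side by side, which generally helps a downstream coder. This sketch discards the document structure, so a complete compressor would also need to record the element order in order to reconstruct the original file.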
