Widely used XML compression algorithms do not take the characteristics of Chinese text into account. Combining the characteristics of Chinese and of XML, this paper proposes COX, a Chinese-oriented XML compressor with a high compression ratio. The input XML document is first preprocessed with Chinese word segmentation; word frequencies are then counted to obtain a sorted dictionary, and high-frequency and long words are encoded using the idea of Huffman coding. After the document is parsed, its elements are classified, and elements of the same type are placed in the same container. COX also applies special handling to numeric data. Experimental results show that, compared with general-purpose compression software, COX achieves a higher compression ratio when the XML documents contain more Chinese words, at the cost of longer compression and decompression times.
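The abstract names two building blocks that are easy to illustrate: grouping element content into per-type containers, and assigning Huffman codes from a word-frequency dictionary. The following is a minimal sketch of those two ideas only, not the COX implementation. It assumes the text has already been segmented into words (the segmenter is out of scope here), it codes every word by frequency rather than reproducing the paper's rule for selecting high-frequency and long words, and it omits the numeric-data handling. The function names `group_into_containers`, `build_huffman_codes`, and `encode` are hypothetical.

```python
import heapq
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict
from itertools import count


def group_into_containers(xml_text):
    """Group non-empty text content by element tag: one list per tag."""
    containers = defaultdict(list)
    for elem in ET.fromstring(xml_text).iter():
        text = (elem.text or "").strip()
        if text:
            containers[elem.tag].append(text)
    return dict(containers)


def build_huffman_codes(freqs):
    """Return {word: bitstring} for a {word: frequency} mapping."""
    if not freqs:
        return {}
    if len(freqs) == 1:
        # A single symbol still needs a one-bit code.
        return {next(iter(freqs)): "0"}
    tiebreak = count()  # unique counter so the dicts are never compared
    heap = [(f, next(tiebreak), {word: ""}) for word, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Merge the two least frequent subtrees, prefixing their codes.
        f1, _, codes1 = heapq.heappop(heap)
        f2, _, codes2 = heapq.heappop(heap)
        merged = {w: "0" + c for w, c in codes1.items()}
        merged.update({w: "1" + c for w, c in codes2.items()})
        heapq.heappush(heap, (f1 + f2, next(tiebreak), merged))
    return heap[0][2]


def encode(words, codes):
    """Concatenate the Huffman codes of an already segmented word list."""
    return "".join(codes[w] for w in words)


# Illustrative input; the document and the segmentation are made up.
doc = (
    "<books>"
    "<book><title>数据压缩</title><price>35</price></book>"
    "<book><title>数据结构</title><price>42</price></book>"
    "</books>"
)
containers = group_into_containers(doc)

# Assumed output of a word segmenter applied to the "title" container.
words = ["数据", "压缩", "数据", "结构"]
codes = build_huffman_codes(Counter(words))
bits = encode(words, codes)
```

Here the most frequent word, 数据, receives the shortest code, and the titles and prices end up in separate containers, so that similar values sit next to each other before coding.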