A DNA Sequence Data Compression Algorithm Based on Bioinformatics Features

Journal: A DNA Sequence Data Compression Algorithm Based on Bioinformatics Features, Acta Electronica Sinica (电子学报), 2011, Vol. 39(5), pp. 991-
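Classification: TP391 [Automation and Computer Technology: Computer Application Technology; Automation and Computer Technology: Computer Science and Technology]
Author affiliations: [1] College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, Guangdong 518060; [2] College of Biomedical Engineering and Instrument Science, Zhejiang University, Hangzhou, Zhejiang 310027; [3] Department of Electrical Engineering and Electronics, University of Liverpool, Liverpool, L69 3GJ, UK
Funding: National Natural Science Foundation of China (No. 60872125); Fok Ying Tung Education Foundation Fund for Young Teachers in Higher Education Institutions (basic research project); Shenzhen Basic Research Project (Outstanding Young Scholar Award); Natural Science Foundation of Guangdong Province
Related project: Research on DNA sequence data compression algorithms based on approximate repeat vectors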

Keywords: DNA sequence data compression, bioinformatics, sequence regroup, approximate repeat fragment, Lempel-Ziv-Markov chain algorithm (LZMA)

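Abstract (translated from the Chinese):

This paper introduces biological features and biological meaning into the compression of DNA sequence data and proposes BioLZMA, a compression algorithm based on bioinformatics features. In BioLZMA, a DNA sequence is split and regrouped into four sets according to the biological meaning of its components: the coding sequence (CDS) set, the intron sequence set, the RNA sequence set, and the set of remaining sequences. Each set is preprocessed with a compression strategy targeted at the specific biological features of its sequences and is then compressed and encoded with the LZMA algorithm. Experimental results show that BioLZMA outperforms existing DNA sequence compression methods on benchmark test sequences. In particular, for long sequences with clear bioinformatics features, the algorithm achieves a relatively high compression ratio in a relatively short time.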

English abstract:

This paper proposes BioLZMA, a novel DNA sequence data compression algorithm based on bioinformatics features. In BioLZMA, the DNA sequence data is sliced and regrouped into four clusters according to biological meaning: the coding sequence cluster, the intron cluster, the RNA cluster, and the residual cluster. Each cluster is pre-processed with a targeted compression strategy and then compressed separately with the LZMA algorithm. Experimental results demonstrate that BioLZMA performs better than previous DNA compression algorithms on benchmark sequences. In particular, on long DNA sequences with significant bioinformatics features, BioLZMA achieves a higher compression ratio with little computation time.
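
The regroup-then-compress idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the annotation format, the function names `regroup` and `compress_groups`, and the toy sequence are all assumptions made for illustration. The abstract does not describe the per-group preprocessing strategies, so the sketch passes each group to LZMA unchanged, and it omits the position index that would be needed to restore the original order on decompression.

```python
import lzma

# The four groups named in the abstract.
GROUPS = ("cds", "intron", "rna", "residual")


def regroup(sequence, annotations):
    """Split a sequence into the four groups (hypothetical helper).

    annotations: non-overlapping (start, end, kind) tuples, end exclusive,
    with kind in {"cds", "intron", "rna"}. Bases that no annotation covers
    are collected in "residual".
    """
    parts = {name: [] for name in GROUPS}
    pos = 0
    for start, end, kind in sorted(annotations):
        if start > pos:
            parts["residual"].append(sequence[pos:start])
        parts[kind].append(sequence[start:end])
        pos = end
    if pos < len(sequence):
        parts["residual"].append(sequence[pos:])
    return {name: "".join(chunks) for name, chunks in parts.items()}


def compress_groups(groups):
    """Compress each group separately with LZMA (hypothetical helper).

    Assumption: no group-specific preprocessing is applied here, because
    the abstract does not say what those strategies are.
    """
    return {
        name: lzma.compress(text.encode("ascii"))
        for name, text in groups.items()
    }


# Toy input: the sequence and its annotations are made up for illustration.
seq = "ATGGCCTAA" + "GTAAGTTTCAG" + "GGCTTAGC" + "TTTT"
annotations = [(0, 9, "cds"), (9, 20, "intron"), (20, 28, "rna")]

groups = regroup(seq, annotations)
packed = compress_groups(groups)
```

Each entry of `packed` can be restored independently with `lzma.decompress`. On a toy input this short the LZMA container overhead dominates, so the sketch shows only the data flow, not any compression gain.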
