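With the completion of genome sequencing for many organisms, a large volume of DNA sequence data has become available, which greatly facilitates the search for horizontally transferred genes within genomes. Gene sequence feature analysis was combined with support vector machines (SVM) to detect horizontally transferred genes by analyzing differences in gene sequence features. Building on previous work, absolute codon usage frequency (FCU) was chosen as the sequence feature, mainly because it carries information on both the codon usage bias of a gene and the amino acid composition of the protein it encodes; a support vector machine that uses this information to analyze and predict horizontally transferred genes can achieve higher prediction accuracy. In addition, a new strand-specific prediction method was proposed, in which genes on the leading strand and the lagging strand of a bacterial genome are treated separately and horizontally transferred genes are predicted for each strand independently. The results show that the basic prediction method outperforms the scoring algorithm based on octanucleotide frequencies proposed by Tsirigos et al., which currently gives the best prediction results, with a relative improvement in hit rate of up to 31.47%, and that the strand-specific method achieves even better results in predicting horizontally transferred genes.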
Horizontal gene transfer (HGT), also called lateral gene transfer (LGT), is any process in which an organism transfers genetic material to another organism that is not its offspring. With the growth of available genomic data, it has become easier to study how to detect genes in a given genome that are products of horizontal transfer. Because few known horizontal gene transfers have been documented in the three bacterial genomes under consideration, experiments were carried out that simulated gene transfer by artificially inserting phage genes. By combining feature analysis of gene sequences with support vector machines (SVM), a novel method was developed for identifying horizontally transferred genes in three fully sequenced bacterial genomes (Escherichia coli K12, Borrelia burgdorferi, Bacillus cereus ZK). Based on our previous work, absolute codon usage frequency (FCU) was selected as the sequence feature because it inherently fuses both codon usage bias and amino acid composition signals. In addition, a second computational method was proposed that takes strand asymmetry into account and predicts horizontal gene transfers on the leading strand and the lagging strand of each genome separately. To reduce the effect of chance in simulating gene transfer through artificial insertion of phage genes, the transfer-and-recover experiment was repeated 100 times, and the arithmetic mean of each measure was reported for each genome to evaluate the algorithm's performance. Ten-fold cross-validation was used for both parameter selection and accuracy estimation. For the C-Support Vector Classification (C-SVC) type, the best results were obtained with the radial basis function kernel with γ=100, while for the one-class SVM type the best performance was obtained with a polynomial kernel of degree three.
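As a rough illustration of the sequence feature described above, the following minimal sketch computes a 64-dimensional codon frequency vector for one coding sequence. It assumes that FCU means the count of each codon divided by the total number of codons in the gene; the abstract does not spell out the definition, so this normalization, the function name `codon_frequencies`, and the fixed alphabetical codon order are illustrative assumptions rather than the authors' implementation.

```python
from itertools import product

# All 64 codons in a fixed (alphabetical) order so that every gene maps
# to a vector with the same layout. The ordering is an assumption.
CODONS = ["".join(p) for p in product("ACGT", repeat=3)]


def codon_frequencies(cds):
    """Return 64 codon frequencies (count / total codons) for a coding sequence.

    Hypothetical helper: assumes the sequence starts in frame.
    """
    cds = cds.upper()
    counts = dict.fromkeys(CODONS, 0)
    total = 0
    # Walk the sequence in frame, ignoring a trailing partial codon.
    for i in range(0, len(cds) - len(cds) % 3, 3):
        codon = cds[i:i + 3]
        if codon in counts:  # skip codons containing ambiguous bases such as N
            counts[codon] += 1
            total += 1
    if total == 0:
        return [0.0] * len(CODONS)
    return [counts[c] / total for c in CODONS]


if __name__ == "__main__":
    vector = codon_frequencies("ATGAAAAAATAA")  # codons: ATG, AAA, AAA, TAA
    print(len(vector), vector[CODONS.index("AAA")])
```

Vectors of this kind could then be passed to any SVM implementation, for example an RBF-kernel classifier with γ=100 or a one-class model with a degree-three polynomial kernel as reported above; which library the authors used is not stated here.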
The performance of the approach was compared with that of Tsirigos' method, which is one of the best predictive approaches to date for detecting horizontal trans