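With genome sequencing now complete for most organisms, gene identification has become particularly important. CpG islands have important biological significance in the genome, so identifying them helps in identifying genes. Most existing models for locating CpG islands suffer from shortcomings such as label bias or the need for independence assumptions. We therefore propose a new method for identifying the location of CpG islands based on the conditional random fields (CRFs) model. The method transforms the problem of locating CpG islands into a sequence labeling problem and, based on the properties of CpG island locations, designs corresponding algorithms for model construction, training, and decoding. For an input sequence, the algorithm determines the most probable label sequence and thereby identifies the locations of CpG islands. Tests on data from a standard database show that the algorithm is feasible and efficient, and that it achieves higher accuracy than the HMM method.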
Now that the genomes of many organisms have been sequenced, gene prediction has become one of the most important tasks. CpG islands are of great biological significance in the genome, and identifying their locations is helpful for gene prediction. Existing models have well-known shortcomings: generative models must make strong independence assumptions, while maximum entropy Markov models and other non-generative models exhibit the label-bias problem. To overcome these shortcomings, we present a novel method for identifying the location of CpG islands based on the conditional random fields (CRFs) model. The method transforms the problem of locating CpG islands into a sequence labeling problem. Based on the properties of CpG island locations, we design corresponding methods for model construction, training, and decoding. We also design the corresponding feature functions and obtain their weights, which define the joint distribution over the label sequence given the observation sequence, through a learning procedure on training data. Using the resulting distribution model, we determine the label sequence with maximum probability and thereby identify the location of CpG islands. We test our algorithm on data sets from a standard database. The experimental results show that our algorithm is practicable and efficient, and that it achieves higher accuracy than the HMM method.
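
As an illustration of the decoding step described above, the following minimal Python sketch labels each base as inside (I) or outside (N) a CpG island and recovers the highest-scoring label sequence with the Viterbi algorithm over a linear-chain model. The label set, the feature functions, and all weights (`TRANSITION`, `EMISSION`, `CPG_BONUS`) are hypothetical values chosen by hand for illustration only; they are not the feature functions or learned weights of the method presented here.

```python
from typing import Dict, List, Tuple

LABELS: Tuple[str, ...] = ("I", "N")  # I = inside a CpG island, N = outside

# Hypothetical weights, hand-picked for illustration (not learned from data).
TRANSITION: Dict[Tuple[str, str], float] = {
    ("I", "I"): 1.0, ("I", "N"): -1.0,
    ("N", "I"): -1.0, ("N", "N"): 1.0,
}
EMISSION: Dict[Tuple[str, str], float] = {
    ("I", "C"): 0.5, ("I", "G"): 0.5, ("I", "A"): -0.5, ("I", "T"): -0.5,
    ("N", "C"): -0.5, ("N", "G"): -0.5, ("N", "A"): 0.5, ("N", "T"): 0.5,
}
CPG_BONUS = 1.0  # assumed weight of a "this base is part of a CG dinucleotide" feature


def state_score(seq: str, i: int, label: str) -> float:
    """Weighted sum of the state features at position i.

    Unlike an HMM emission, a CRF state feature may inspect neighbouring
    observations; here it checks whether base i belongs to a CG dinucleotide.
    """
    score = EMISSION[(label, seq[i])]
    in_cpg = seq[i:i + 2] == "CG" or (i > 0 and seq[i - 1:i + 1] == "CG")
    if in_cpg:
        score += CPG_BONUS if label == "I" else -CPG_BONUS
    return score


def viterbi(seq: str) -> List[str]:
    """Return the label sequence with the highest total score for seq."""
    if not seq:
        return []
    best = [{lab: state_score(seq, 0, lab) for lab in LABELS}]
    back: List[Dict[str, str]] = []
    for i in range(1, len(seq)):
        scores: Dict[str, float] = {}
        pointers: Dict[str, str] = {}
        for cur in LABELS:
            # Best previous label for the current label.
            prev = max(LABELS, key=lambda p: best[-1][p] + TRANSITION[(p, cur)])
            scores[cur] = (best[-1][prev] + TRANSITION[(prev, cur)]
                           + state_score(seq, i, cur))
            pointers[cur] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final label.
    path = [max(LABELS, key=lambda lab: best[-1][lab])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


def islands(labels: List[str]) -> List[Tuple[int, int]]:
    """Convert a label sequence into half-open [start, end) island intervals."""
    spans: List[Tuple[int, int]] = []
    start = None
    for i, lab in enumerate(labels):
        if lab == "I" and start is None:
            start = i
        elif lab != "I" and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(labels)))
    return spans


if __name__ == "__main__":
    seq = "ATATAT" + "CGCGCGCG" + "ATATAT"
    print(islands(viterbi(seq)))  # prints [(6, 14)]
```

Because the normalization term of a CRF does not depend on the label sequence, maximizing the unnormalized score as done here yields the same labeling as maximizing the conditional probability. In practice the weights would be estimated from annotated training sequences rather than set by hand.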