
Semi-Supervised Learning of Promoter Sequences Based on EM

ISSN: 1000-1239
Journal: Journal of Computer Research and Development (《计算机研究与发展》)
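Classification: TP391.4 [Automation and Computer Technology: Computer Application Technology; Automation and Computer Technology: Computer Science and Technology]
Author affiliations: [1] School of Computer Science and Technology, Yantai University, Yantai, Shandong 264005; [2] International College, Qingdao University, Qingdao, Shandong 266071
Funding: National Natural Science Foundation of China (60772028); Natural Science Foundation of Shandong Province (Y2006G22, Y2008G08)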

Keywords: Markov model, maximum likelihood estimation, promoter recognition, transition probability, semi-supervised learning

Abstract (Chinese version):

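Promoter prediction is important for locating genes. Many promoter prediction algorithms have been proposed, using strategies such as signal search, content search, and CpG island search. Promoter classification methods based on Markov models have also been studied; in those methods the transition probabilities are obtained directly by counting over the labeled training sequences. This paper introduces semi-supervised learning into promoter sequence analysis and derives maximum likelihood estimation formulas for parameters such as the transition probabilities. In the experiments, the gene sequence segments to be tested are mixed with the labeled training samples, and the estimated parameter values are used to classify the segments. Good promoter recognition results are obtained with only a small amount of labeled sample data.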
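
The abstract does not reproduce the paper's formulas. As a hedged illustration only, the block below writes out the textbook EM updates for a two-class mixture of first-order Markov chains, which is one standard way to obtain maximum likelihood estimates of transition probabilities from labeled and unlabeled sequences together. All symbols are assumptions introduced here: N_n(i,j) is the number of i-to-j transitions in sequence n, y_n is the label of a labeled sequence, gamma_n(c) is the weight of sequence n in class c, pi_c is the class prior, and a^{(c)}_{ij} is the class-c transition probability. The initial-state term is omitted for brevity.

```latex
% Supervised baseline: count transitions in labeled sequences of class c
\hat{a}^{(c)}_{ij} = \frac{\sum_{n:\, y_n = c} N_n(i,j)}
                          {\sum_{n:\, y_n = c} \sum_{k} N_n(i,k)}

% E-step: labeled sequences keep their label, unlabeled ones get a posterior
\gamma_n(c) =
\begin{cases}
  \mathbb{1}[y_n = c], & n \text{ labeled} \\[4pt]
  \dfrac{\pi_c \prod_{i,j} \bigl(a^{(c)}_{ij}\bigr)^{N_n(i,j)}}
        {\sum_{c'} \pi_{c'} \prod_{i,j} \bigl(a^{(c')}_{ij}\bigr)^{N_n(i,j)}},
        & n \text{ unlabeled}
\end{cases}

% M-step: weighted counts over all N sequences
a^{(c)}_{ij} = \frac{\sum_{n} \gamma_n(c)\, N_n(i,j)}
                    {\sum_{n} \gamma_n(c) \sum_{k} N_n(i,k)},
\qquad
\pi_c = \frac{1}{N} \sum_{n} \gamma_n(c)
```

With no unlabeled sequences, the M-step reduces to the supervised counting estimate in the first line.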

Abstract (English version):

Eukaryotic promoter prediction is one of the most important problems in DNA sequence analysis. A promoter is a short subsequence upstream of a transcriptional start site (TSS) in a DNA sequence. Predicting the position of a promoter approximately locates the TSS and helps guide biological experiments. Most proposed prediction algorithms are based on search strategies, such as search by signal, search by content, or search by CpG island, but their performance is still limited by low sensitivity and high false-positive rates. Promoter classification based on Markov chains has been shown to be effective for promoter prediction; in these methods, parameters such as transition probabilities are computed by counting over the labeled samples. In this paper, semi-supervised learning is introduced into promoter sequence analysis to improve classification accuracy by combining labeled and unlabeled sequences, and the maximum likelihood estimation formulas for the transition probabilities are derived. In simulation experiments, each long genomic sequence is truncated into short segments, which are mixed with labeled data and classified according to the calculated probabilities. Comparison with several known prediction algorithms shows that semi-supervised learning of promoter sequences based on the EM algorithm is effective when the amount of labeled data is small, and that the value of Fi is higher than that of predictions based on labeled samples.
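
The procedure summarized in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes a first-order Markov chain over A/C/G/T for each of two classes, ignores the initial-state probability, and adds pseudocounts (which the abstract does not mention) to avoid zero probabilities. The function names `transition_counts`, `m_step`, `posterior`, and `fit` are hypothetical.

```python
import math

ALPHABET = "ACGT"
INDEX = {base: i for i, base in enumerate(ALPHABET)}


def transition_counts(seq):
    """Count first-order transitions N[i][j] within one DNA sequence."""
    counts = [[0.0] * 4 for _ in range(4)]
    for a, b in zip(seq, seq[1:]):
        counts[INDEX[a]][INDEX[b]] += 1.0
    return counts


def m_step(count_list, weights, pseudo=1.0):
    """Weighted estimate of a 4x4 transition matrix (pseudocounts are an assumption)."""
    total = [[pseudo] * 4 for _ in range(4)]
    for counts, w in zip(count_list, weights):
        for i in range(4):
            for j in range(4):
                total[i][j] += w * counts[i][j]
    return [[v / sum(row) for v in row] for row in total]


def log_likelihood(counts, trans):
    """Log-probability of the observed transitions under one chain."""
    return sum(counts[i][j] * math.log(trans[i][j])
               for i in range(4) for j in range(4))


def posterior(counts, trans0, trans1, prior1):
    """E-step for one sequence: P(class 1 | sequence), computed in log space."""
    l1 = math.log(prior1) + log_likelihood(counts, trans1)
    l0 = math.log(1.0 - prior1) + log_likelihood(counts, trans0)
    m = max(l0, l1)
    e1, e0 = math.exp(l1 - m), math.exp(l0 - m)
    return e1 / (e0 + e1)


def fit(labeled, unlabeled, n_iter=20):
    """labeled: list of (sequence, label), label 1 = promoter, 0 = non-promoter.
    unlabeled: list of sequences.
    Returns (trans0, trans1, prior1, posteriors of the unlabeled sequences)."""
    counts = ([transition_counts(s) for s, _ in labeled]
              + [transition_counts(s) for s in unlabeled])
    n_lab = len(labeled)
    fixed = [float(y) for _, y in labeled]
    # First pass uses labeled data only: unlabeled weights are 0 in both classes.
    resp1 = fixed + [0.0] * len(unlabeled)
    resp0 = [1.0 - y for y in fixed] + [0.0] * len(unlabeled)
    post = []
    for _ in range(n_iter + 1):
        # M-step: re-estimate both chains and the (smoothed) class prior.
        trans1 = m_step(counts, resp1)
        trans0 = m_step(counts, resp0)
        prior1 = (sum(resp1) + 1.0) / (sum(resp1) + sum(resp0) + 2.0)
        # E-step: soft labels for unlabeled sequences; labeled ones stay fixed.
        post = [posterior(c, trans0, trans1, prior1) for c in counts[n_lab:]]
        resp1 = fixed + post
        resp0 = [1.0 - r for r in resp1]
    return trans0, trans1, prior1, post


# Toy, made-up data purely for illustration (not from the paper).
toy_labeled = [("GCGCGCGCGCGC", 1), ("ATATATATATAT", 0)]
toy_unlabeled = ["CGCGCGCGCG", "TATATATATA"]
_, _, _, toy_post = fit(toy_labeled, toy_unlabeled, n_iter=5)
print(toy_post)
```

A segment would then be called a promoter when its posterior exceeds a chosen threshold such as 0.5; the threshold and the segment length are tuning choices that the abstract does not specify.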
