To test the statistical significance of sequence motifs in bioinformatics, a Bayesian hypothesis testing method based on the maximum-likelihood criterion is proposed. The significance test of a motif is recast as a goodness-of-fit test for the multinomial distribution. A Dirichlet distribution is chosen as the prior of the multinomial distribution, and its hyperparameters are estimated with the Newton-Raphson algorithm so that the predictive distribution of the data is maximized. Applying Bayes' theorem then yields a Bayes factor for model selection, which serves as the measure of the motif's statistical significance. This approach avoids the difficulty, inherent in classical multinomial tests, of constructing a test statistic and deriving its exact distribution under the null hypothesis. Experiments were carried out on 107 alignments of transcription factor binding sites from the JASPAR database and 100 randomly generated alignments, with the Pearson product-moment correlation coefficient used as a criterion of test quality. The results indicate that the Bayesian test performed better on average than several classical motif significance tests.
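As an illustration of the Bayes factor described above, the following is a minimal sketch rather than the authors' implementation. It assumes the Dirichlet hyperparameters `alpha` are already given (the Newton-Raphson estimation step is omitted), that alignment columns are independent, and that the null hypothesis is a fixed background multinomial. The function names and the example counts are hypothetical.

```python
from math import lgamma, log


def log_marginal_dirichlet(counts, alpha):
    """Log marginal likelihood of one column's counts under a Dirichlet prior.

    The multinomial coefficient is omitted; it is the same under both
    hypotheses and cancels in the Bayes factor.
    """
    n = sum(counts)
    a0 = sum(alpha)
    result = lgamma(a0) - lgamma(a0 + n)
    for c, a in zip(counts, alpha):
        result += lgamma(a + c) - lgamma(a)
    return result


def log_likelihood_null(counts, background):
    """Log likelihood of one column's counts under fixed background frequencies."""
    return sum(c * log(p) for c, p in zip(counts, background) if c > 0)


def log_bayes_factor(columns, alpha, background):
    """Sum of per-column log Bayes factors (motif model versus background).

    Assumes columns are independent, which is a simplification.
    """
    return sum(
        log_marginal_dirichlet(col, alpha) - log_likelihood_null(col, background)
        for col in columns
    )


# Hypothetical inputs: counts of A, C, G, T in each alignment column.
alpha = [0.5, 0.5, 0.5, 0.5]            # assumed hyperparameters, not estimated
background = [0.25, 0.25, 0.25, 0.25]   # assumed uniform background
conserved = [[10, 0, 0, 0], [0, 9, 1, 0]]
flat = [[3, 2, 3, 2], [2, 3, 2, 3]]

lbf_conserved = log_bayes_factor(conserved, alpha, background)
lbf_flat = log_bayes_factor(flat, alpha, background)
```

Under these assumptions, a positive log Bayes factor favors the motif model and a negative one favors the background, so strongly conserved columns should score higher than columns that resemble the background.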