To address the problem of testing the significance of motifs in biological sequences, a Bayesian test based on moment estimation was proposed. The significance test of a motif was converted into a goodness-of-fit test of a multinomial distribution. A Dirichlet distribution was chosen as the prior of the multinomial distribution, and its hyperparameters were estimated by the method of moments. Applying Bayes' theorem then yielded a Bayes factor, which was used to assess the statistical significance of the motif. The proposed method avoided the difficulty, found in classical multinomial tests, of constructing a test statistic and deriving its exact distribution under the null hypothesis. Experiments were run on 107 alignments of transcription factor binding sites from the JASPAR database and 100 randomly generated alignments, with the Pearson product-moment correlation coefficient taken as a criterion of test quality. The results indicated that the Bayesian test performed better than some classical motif testing methods, such as the fast Fourier transform method.
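The pipeline summarized above (column counts, moment-estimated Dirichlet hyperparameters, Bayes factor) can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give their formulas, so the sketch assumes a standard Dirichlet-multinomial marginal likelihood under the alternative, a fixed background distribution under the null, and a pooled method-of-moments fit. All function names, the small floor on the hyperparameters, and the four-letter DNA alphabet are hypothetical choices made for illustration.

```python
import math

ALPHABET = "ACGT"  # assumption: ungapped DNA alignments


def column_counts(alignment):
    """Count A/C/G/T occurrences in each column of an ungapped alignment."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all sequences must have the same length")
    return [[sum(seq[j] == base for seq in alignment) for base in ALPHABET]
            for j in range(length)]


def moment_estimate_dirichlet(proportions):
    """Fit Dirichlet hyperparameters to proportion vectors by matching moments.

    Uses E[p_i] = alpha_i / A and Var[p_i] = m_i (1 - m_i) / (A + 1),
    pooled over components to solve for the precision A = sum(alpha).
    """
    n, k = len(proportions), len(proportions[0])
    if n < 2:
        raise ValueError("need at least two proportion vectors")
    means = [sum(p[i] for p in proportions) / n for i in range(k)]
    variances = [sum((p[i] - means[i]) ** 2 for p in proportions) / (n - 1)
                 for i in range(k)]
    total_var = sum(variances)
    if total_var <= 0:
        raise ValueError("proportions show no variance; moments are undefined")
    precision = sum(m * (1 - m) for m in means) / total_var - 1
    if precision <= 0:
        raise ValueError("moment estimate gave a non-positive precision")
    # Hypothetical floor so that lgamma never receives a zero hyperparameter.
    return [max(precision * m, 1e-6) for m in means]


def log_bayes_factor(counts, alpha, background):
    """Log Bayes factor of a Dirichlet-multinomial model against a fixed
    background multinomial for one column (the multinomial coefficient cancels)."""
    total = sum(counts)
    a_sum = sum(alpha)
    log_m1 = math.lgamma(a_sum) - math.lgamma(a_sum + total)
    log_m1 += sum(math.lgamma(a + c) - math.lgamma(a)
                  for a, c in zip(alpha, counts))
    log_m0 = sum(c * math.log(q) for c, q in zip(counts, background))
    return log_m1 - log_m0
```

Under these assumptions, a score for a whole motif could be formed by summing `log_bayes_factor` over the columns returned by `column_counts`, which treats columns as independent. A positive value favors the Dirichlet-multinomial alternative over the background model; how the original method combines columns or chooses the background is not stated in the abstract.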