A new mixture Gibbs sampling algorithm is presented for motif discovery in biological sequences. The algorithm learns a mixture motif model by likelihood maximization, using a greedy strategy that adds new motifs to the mixture model one at a time. Two sampling methods, site sampling and motif sampling, are designed and applied alternately. To speed up the search, the input dataset is partitioned hierarchically using kd-trees. Experimental results indicate that the proposed algorithm has a significant advantage in identifying larger groups of motifs characteristic of biological sequence families. In addition, it offers better diagnostic capabilities by building more powerful statistical motif models, which improves sequence classification accuracy.
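To make the site sampling idea concrete, the following is a minimal sketch of a classical single-motif Gibbs site sampler. It is not the mixture algorithm described above: it omits motif sampling, the greedy addition of motifs, and the kd-tree partitioning. The DNA alphabet, the uniform background distribution, the pseudocount, and all function names (`build_pwm`, `site_sampling_step`, `gibbs_site_sampler`) are assumptions chosen for illustration.

```python
import random

ALPHABET = "ACGT"  # assumption: DNA sequences; a protein alphabet would work the same way


def build_pwm(sequences, positions, width, skip, pseudocount=1.0):
    """Column-wise residue frequencies from the current sites, leaving out sequence `skip`."""
    counts = [{a: pseudocount for a in ALPHABET} for _ in range(width)]
    for i, (seq, pos) in enumerate(zip(sequences, positions)):
        if i == skip:
            continue
        for j in range(width):
            counts[j][seq[pos + j]] += 1.0
    pwm = []
    for col in counts:
        total = sum(col.values())
        pwm.append({a: c / total for a, c in col.items()})
    return pwm


def site_sampling_step(sequences, positions, width, rng):
    """One sweep: resample the site in each sequence given the sites in all other sequences."""
    background = 1.0 / len(ALPHABET)  # assumption: uniform background model
    for i, seq in enumerate(sequences):
        pwm = build_pwm(sequences, positions, width, skip=i)
        weights = []
        for start in range(len(seq) - width + 1):
            # Likelihood ratio of the window under the motif model vs. the background.
            w = 1.0
            for j in range(width):
                w *= pwm[j][seq[start + j]] / background
            weights.append(w)
        # Draw the new start position with probability proportional to its weight.
        positions[i] = rng.choices(range(len(weights)), weights=weights)[0]
    return positions


def gibbs_site_sampler(sequences, width, sweeps=200, seed=0):
    """Run a fixed number of sweeps from random initial sites; returns one start per sequence."""
    rng = random.Random(seed)
    positions = [rng.randrange(len(s) - width + 1) for s in sequences]
    for _ in range(sweeps):
        site_sampling_step(sequences, positions, width, rng)
    return positions
```

Calling `gibbs_site_sampler(["ACGTACGTTTGACA", "TTGACAGGGCCCAA", "CCCTTGACATTTAA"], width=6)` returns one sampled start position per sequence. Because the procedure is stochastic, the returned sites depend on the seed and the number of sweeps, and a single run is not guaranteed to converge on the shared motif.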