For a given organism, genes that are co-expressed over a contiguous time interval are likely to be taking part in the same biological process or to be linked by a regulatory relationship. Most existing biclustering algorithms for gene expression data, however, were designed for unordered samples, and few address the case in which the samples follow a sequential order. This paper therefore proposes TCBicluster, a time-continuous biclustering algorithm that mines maximal coherent-evolution co-expressed gene sets from time-series microarray gene expression data. TCBicluster first generates the constant-row co-expressed gene sets at every time point. It then builds a weighted range multigraph whose vertices are the time points and whose edges are the co-expressed gene sets that satisfy the coherence requirement between two adjacent time points. Finally, it mines biclusters from this graph by extending runs of contiguous time points, considering only the next adjacent vertex as a candidate at each step, and applies pruning strategies to improve efficiency. Experimental results show that TCBicluster mines maximal coherent-evolution co-expression biclusters more effectively than RAP and CC-TSB, and that it offers higher mining efficiency and good scalability.
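The idea of extending contiguous time points with pruning can be illustrated by a minimal Python sketch. It is not the TCBicluster implementation: the abstract gives no details of the graph construction, the coherence test, or the pruning rules, so everything below is an assumption made for illustration. In particular, the function names, the `threshold` used to label each change between adjacent time points as up (`U`), down (`D`), or flat (`N`), and the `min_genes` and `min_steps` parameters are hypothetical, and the sketch skips the per-time-point constant-row step by grouping genes directly by their trend.

```python
from collections import defaultdict


def trend_symbols(matrix, threshold):
    """Label each change between adjacent time points as 'U', 'D' or 'N'.

    `matrix` maps a gene name to its list of expression values over time.
    The threshold-based labelling is an illustrative assumption.
    """
    symbols = {}
    for gene, values in matrix.items():
        row = []
        for a, b in zip(values, values[1:]):
            delta = b - a
            row.append("U" if delta > threshold else "D" if delta < -threshold else "N")
        symbols[gene] = row
    return symbols


def contiguous_trend_biclusters(matrix, threshold=0.5, min_genes=2, min_steps=2):
    """Return maximal (first_time, last_time, genes) groups with a shared trend.

    Hypothetical simplification: from every start transition, gene groups are
    extended one adjacent transition at a time and split by trend symbol.
    """
    symbols = trend_symbols(matrix, threshold)
    n_steps = len(next(iter(symbols.values()))) if symbols else 0
    candidates = []
    for start in range(n_steps):
        groups = [frozenset(symbols)]
        for step in range(start, n_steps):
            next_groups = []
            for genes in groups:
                split = defaultdict(set)
                for gene in genes:
                    split[symbols[gene][step]].add(gene)
                for subset in split.values():
                    # Pruning: a group that is already too small cannot grow
                    # again, because extension only ever removes genes.
                    if len(subset) >= min_genes:
                        next_groups.append(frozenset(subset))
            if not next_groups:
                break
            groups = next_groups
            if step - start + 1 >= min_steps:
                candidates.extend((start, step, genes) for genes in groups)

    def dominated(cand):
        # A candidate is not maximal if another one covers its time range
        # with at least the same genes.
        s, e, g = cand
        return any(
            other != cand and other[0] <= s and e <= other[1] and g <= other[2]
            for other in candidates
        )

    # Transition index e joins time points e and e + 1, so report e + 1.
    return sorted((s, e + 1, sorted(g)) for s, e, g in candidates if not dominated((s, e, g)))


demo = {
    "g1": [1, 2, 3, 2],
    "g2": [2, 3, 4, 1],
    "g3": [5, 4, 3, 4],
    "g4": [1, 2, 3, 4],
}
print(contiguous_trend_biclusters(demo))
```

On the toy `demo` data, the genes `g1`, `g2`, and `g4` rise together over the first three time points, while only `g1` and `g2` stay coherent across all four, so both groups are reported as maximal. The quadratic dominance check is kept for readability; a real implementation would need a more efficient maximality test.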