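To cluster nonlinearly correlated data objects, quadratic mutual information from generalized information theory is introduced as the similarity measure. Matrix theory is used to reduce the cost of computing quadratic mutual information, and, combined with a sliding window technique, a nonlinear correlation model for time series data is established. On this basis, MI-TSB, a deterministic biclustering algorithm for time series gene expression data, is proposed. The algorithm converts time series data into abstract character sequences and inserts them into an MI-generalized suffix tree, which avoids enumerating all combinations and allows all clustering results to be indexed quickly. Experimental results show that MI-TSB runs efficiently and successfully clusters nonlinearly correlated objects; annotating the clustering results with Gene Ontology also confirms their biological significance.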
Biclustering algorithms cluster correlated patterns in subspaces. However, most existing biclustering algorithms address only linearly correlated patterns or certain linearly similar patterns, leaving untouched the nonlinearly correlated patterns that are often hidden in real data sets. In this paper, a novel biclustering algorithm called MI-TSB is proposed to find and report all nonlinearly correlated patterns in time series gene expression data. First, an efficient formula for computing quadratic mutual information is derived using matrix theory. Then, based on quadratic mutual information and the sliding window technique, a nonlinear similarity model for time series data and a simple variant of the generalized suffix tree are introduced. Using the suffix tree as its index structure, the MI-TSB algorithm explores all biclusters effectively and efficiently. Compared with general biclustering algorithms, one of the most important advantages of MI-TSB is its ability to discover nonlinearly correlated patterns within a sliding window. Experiments on a real gene expression data set and a synthetic data set show that MI-TSB successfully discovers nonlinearly correlated patterns that cannot be found by other ordinary biclustering algorithms. In addition, gene annotation with Gene Ontology demonstrates that MI-TSB can find biologically meaningful results.
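The idea of scoring nonlinear dependence inside a sliding window can be illustrated with a minimal sketch. It is not the matrix formula derived in the paper, which the abstract does not state. It assumes the Euclidean-distance form of quadratic mutual information, estimated from a normalized 2-D histogram as the squared Frobenius norm of the difference between the joint probability matrix and the outer product of its marginals. The function names `ed_qmi` and `sliding_qmi` and the parameters `bins` and `width` are hypothetical and chosen only for illustration.

```python
import numpy as np


def ed_qmi(x, y, bins=4):
    """Histogram estimate of Euclidean-distance quadratic mutual information.

    Assumption: QMI is taken as sum_ij (P[i, j] - px[i] * py[j])**2, where P
    is the normalized joint histogram of x and y. This is an illustrative
    estimator, not the formula from the paper.
    """
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    joint = joint / joint.sum()           # joint probability matrix P
    px = joint.sum(axis=1)                # marginal distribution of x
    py = joint.sum(axis=0)                # marginal distribution of y
    diff = joint - np.outer(px, py)       # deviation from independence
    return float(np.sum(diff ** 2))       # squared Frobenius norm


def sliding_qmi(x, y, width, bins=4):
    """Score every window of `width` consecutive time points of two series."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return [
        ed_qmi(x[s:s + width], y[s:s + width], bins=bins)
        for s in range(len(x) - width + 1)
    ]


if __name__ == "__main__":
    # y depends on x purely nonlinearly: the Pearson correlation is close to
    # zero, while the quadratic mutual information estimate is clearly positive.
    x = np.linspace(-1.0, 1.0, 200)
    y = x ** 2
    print("Pearson correlation:", np.corrcoef(x, y)[0, 1])
    print("QMI estimate:", ed_qmi(x, y))
    print("Number of windows:", len(sliding_qmi(x, y, width=50)))
```

In this sketch a histogram estimator is used because it is short and easy to check; a kernel-based estimator, or the paper's own matrix formulation, would replace `ed_qmi` without changing the sliding window loop.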