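To address the problem of mining multi-view patterns in data, we propose NrMIB, a non-redundant multi-view clustering algorithm based on the information bottleneck (IB) method. On the one hand, the algorithm follows the IB principle and maximizes the information preserved in the clustering result, which ensures high-quality clusterings; on the other hand, it minimizes the mutual information between the clustering result and the known data partitions, which ensures that the new clustering is non-redundant with respect to those known partitions. NrMIB is suitable for analyzing both co-occurrence data and non-co-occurrence data in Euclidean space, can discover both linearly and non-linearly separable patterns, and needs no additional parameters to estimate the amount of information in Euclidean space. Experimental results on synthetic data pattern recognition, face recognition, and document clustering show that NrMIB effectively discovers the multiple reasonable partitions contained in the data, and that it outperforms traditional single-view clustering algorithms as well as three existing non-redundant multi-view clustering algorithms.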
Typical clustering algorithms output a single partition of the data. However, in real-world applications, data can often be interpreted in many different ways and admit different reasonable partitions from multiple views. Instead of committing to one clustering solution, we introduce a novel algorithm, NrMIB (non-redundant multi-view information bottleneck), which provides the user with several non-redundant clustering solutions from multiple views. Our approach employs the information bottleneck (IB) method, which aims to maximize the relevant information preserved by the clustering results, to ensure the quality of the clustering solutions, while the mutual information between the clustering labels and the known data partitions is minimized to ensure that the new clustering solutions are non-redundant. By adopting mutual information and the MeanNN differential entropy to estimate the preserved information, NrMIB can be used to analyze both co-occurrence data and Euclidean space data. In addition, the algorithm is suitable for analyzing high-dimensional data and can discover both linear and non-linear cluster shapes. We perform experiments on synthetic data pattern recognition, face recognition, and document clustering to assess our method against a wide range of clustering algorithms from the literature. The experimental results show that the proposed NrMIB algorithm can discover the multiple reasonable partitions residing in the data, and that the performance of NrMIB is superior to that of the three non-redundant multi-view clustering algorithms examined here.
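To make the trade-off described above concrete, the following Python sketch scores a candidate clustering of co-occurrence data by the relevant information it preserves minus a penalty for its mutual information with a known partition. The abstract does not state the exact objective, so the difference form, the trade-off weight `beta`, and all function names here are illustrative assumptions rather than the authors' formulation. The sketch covers only the co-occurrence case and does not implement the MeanNN entropy estimate or any optimization procedure.

```python
import numpy as np

def mi_from_joint(joint):
    """Mutual information (in nats) of a joint distribution given as a 2-D array."""
    joint = np.asarray(joint, dtype=float)
    joint = joint / joint.sum()
    # Product of the two marginals, same shape as the joint.
    outer = joint.sum(axis=1, keepdims=True) * joint.sum(axis=0, keepdims=True)
    nz = joint > 0  # skip zero cells, using the convention 0 * log 0 = 0
    return float(np.sum(joint[nz] * np.log(joint[nz] / outer[nz])))

def label_joint(a, b):
    """Contingency table of two label vectors of equal length."""
    _, ai = np.unique(np.asarray(a).ravel(), return_inverse=True)
    _, bi = np.unique(np.asarray(b).ravel(), return_inverse=True)
    table = np.zeros((ai.max() + 1, bi.max() + 1))
    np.add.at(table, (ai, bi), 1.0)
    return table

def cluster_feature_joint(counts, labels):
    """Sum co-occurrence rows within each cluster: counts over (cluster, feature)."""
    counts = np.asarray(counts, dtype=float)
    _, ti = np.unique(np.asarray(labels).ravel(), return_inverse=True)
    table = np.zeros((ti.max() + 1, counts.shape[1]))
    np.add.at(table, ti, counts)
    return table

def nr_score(counts, labels, known, beta=1.0):
    """Hypothetical score: preserved information minus beta times redundancy."""
    relevance = mi_from_joint(cluster_feature_joint(counts, labels))
    redundancy = mi_from_joint(label_joint(labels, known))
    return relevance - beta * redundancy

# Toy co-occurrence matrix with two independent views:
# features 0-1 encode the known partition, features 2-3 encode a second one.
counts = np.array([
    [1, 0, 1, 0],
    [1, 0, 0, 1],
    [0, 1, 1, 0],
    [0, 1, 0, 1],
])
known = [0, 0, 1, 1]        # partition already known to the user
redundant = [0, 0, 1, 1]    # candidate that repeats the known partition
alternative = [0, 1, 0, 1]  # candidate that follows the second view

score_redundant = nr_score(counts, redundant, known)
score_alternative = nr_score(counts, alternative, known)
```

In this toy case both candidates preserve the same amount of information about the features, but only the alternative has zero mutual information with the known partition, so it receives the higher score under the assumed objective.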