The sIB algorithm has previously been applied only to co-occurrence data, so it cannot directly analyze categorical data that do not take the form of co-occurrences of two variables X and Y. To address this problem, this paper proposes CD-sIB, an sIB-based algorithm that analyzes categorical data automatically. Exploiting the facts that categorical data are discrete and that each attribute takes a finite number of distinct values, CD-sIB extends the attributes of the dataset and binarizes them, then computes the joint distribution of X and Y from the occurrences of attribute values. This makes the sIB algorithm effectively applicable to categorical data. Experimental results show that CD-sIB clearly outperforms GAClust and K-modes, two existing algorithms for clustering categorical data, and that it is particularly strong in both efficiency and precision on categorical data whose attributes are highly generalized and whose class distribution is relatively balanced.
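The attribute extension and binarization step can be illustrated with a minimal sketch. The abstract does not give the exact construction, so everything below is an assumption made for illustration: X is taken to be the set of data objects, Y the set of (attribute, value) pairs, and the joint distribution is the normalized binary occurrence matrix. The function names `binarize` and `joint_distribution` and the toy records are hypothetical and do not come from the paper.

```python
def binarize(records):
    """Expand every categorical attribute into one binary column per
    (attribute index, value) pair. Assumed reading of the abstract's
    'attribute extension and binarization', not the paper's exact method."""
    features = sorted({(j, v) for row in records for j, v in enumerate(row)})
    index = {f: k for k, f in enumerate(features)}
    matrix = [[0] * len(features) for _ in records]
    for i, row in enumerate(records):
        for j, v in enumerate(row):
            matrix[i][index[(j, v)]] = 1  # value v occurs in attribute j of object i
    return features, matrix


def joint_distribution(matrix):
    """Normalize occurrence counts so that all entries sum to 1,
    giving an estimate of p(x, y) over objects x and binary features y."""
    total = sum(sum(row) for row in matrix)
    return [[count / total for count in row] for row in matrix]


# Hypothetical toy dataset: two categorical attributes (color, size).
records = [
    ("red", "small"),
    ("red", "large"),
    ("blue", "small"),
]

features, matrix = binarize(records)
p_xy = joint_distribution(matrix)

for feature, column in zip(features, zip(*p_xy)):
    print(feature, [round(p, 3) for p in column])
```

Under these assumptions, each object contributes one occurrence per attribute, so the resulting matrix has the same co-occurrence form that sIB expects as input.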