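Most existing research on cluster analysis of interval-valued symbolic data assumes that individuals are uniformly distributed within each interval, which often does not match reality. To address this problem, this paper studies a K-means clustering method for interval-valued symbolic data with a general distribution. It defines generally distributed interval-valued symbolic data and studies their descriptive statistics based on empirical distribution theory. Building on the Hausdorff distance and taking into account the distribution of the individuals contained in each interval, it proposes a new distance measure for interval-valued symbolic data, and then gives a K-means clustering algorithm for generally distributed interval-valued symbolic data. The validity of the method is evaluated through random simulation experiments. The results show that, under all of the designed experimental conditions, the K-means clustering algorithm that accounts for a general distribution is more effective than the K-means clustering algorithm under the uniform-distribution assumption. Finally, the method is applied to a cluster analysis of automobiles, which further demonstrates its advantages in solving practical problems.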
Existing clustering methods for interval data mostly assume that the data are uniformly distributed across the interval. However, this is not always realistic. Taking this into account, this paper studies the k-means clustering method for interval data with a general distribution. A definition of generally distributed interval data is proposed, and their descriptive statistics are studied based on empirical distribution theory. On the basis of the Hausdorff distance, the paper puts forward a new distance for interval data that considers the point data contained in the intervals. Based on this, we present a k-means clustering algorithm for generally distributed interval symbolic data. A simulation experiment is conducted to evaluate the validity of our method. The results show that, compared with analysis methods for uniform interval symbolic data, the analysis methods for generally distributed interval symbolic data are more effective under all the conditions designed in our experiment. Finally, the method is illustrated with a real-data example, which shows the advantages of our method in practical applications.
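For orientation, the following is a minimal sketch of the classical Hausdorff distance between closed intervals, which is the starting point the abstract names, together with a nearest-prototype assignment step of the kind used in k-means. It is not the distribution-aware distance proposed in the paper, whose formula the abstract does not state. The function names (`hausdorff_interval`, `hausdorff_object`, `assign`) and the choice to sum per-variable distances are assumptions made for illustration only.

```python
from typing import List, Sequence, Tuple

Interval = Tuple[float, float]  # (lower bound, upper bound)


def hausdorff_interval(x: Interval, y: Interval) -> float:
    """Classical Hausdorff distance between closed intervals [a, b] and [c, d].

    For intervals on the real line this reduces to max(|a - c|, |b - d|).
    It uses only the endpoints and ignores how points are spread inside.
    """
    (a, b), (c, d) = x, y
    return max(abs(a - c), abs(b - d))


def hausdorff_object(u: Sequence[Interval], v: Sequence[Interval]) -> float:
    """Distance between two objects described by several interval variables.

    Assumption: per-variable distances are simply summed.
    """
    return sum(hausdorff_interval(p, q) for p, q in zip(u, v))


def assign(objects: Sequence[Sequence[Interval]],
           prototypes: Sequence[Sequence[Interval]]) -> List[int]:
    """k-means style assignment step: index of the nearest prototype per object."""
    return [
        min(range(len(prototypes)),
            key=lambda k: hausdorff_object(obj, prototypes[k]))
        for obj in objects
    ]
```

Because this baseline depends only on interval endpoints, two intervals with identical bounds but very different internal distributions are at distance zero, which is the limitation that motivates a distance using the point data contained in the intervals.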