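According to the No Free Lunch theorem, no single clustering algorithm can solve all problems perfectly. Algorithm recommendation is an effective way to address this issue, and its core is the measurement of similarity between datasets. We therefore propose a new method for computing dataset similarity: several attributes that reveal the intrinsic distribution and structure of a dataset are extracted, and the distances between the corresponding attributes of two datasets are computed to obtain a similarity measure. First, a statistical feature vector and a binarized vector are selected. The dataset is then partitioned, and the distances from the points in each partition to its center, together with the robust path-based distances between point pairs, are computed to obtain the compactness and connectedness of the dataset. The parameters of the four attributes are then obtained by training a BP network, which yields the dataset similarity measure. Eight artificial datasets and eight UCI datasets are selected to build the dataset library, and seven representative clustering algorithms form the algorithm library. Experiments on a selection of UCI datasets show that the proposed method performs well.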
According to the No Free Lunch theorem, no clustering algorithm can solve all problems, and it is difficult for users to select a suitable algorithm when many clustering algorithms are available. An algorithm recommendation system is a potential solution. In this paper, we propose a framework for clustering algorithm recommendation. First, a dataset library and an algorithm library are constructed, and the mapping between datasets and algorithms is established by evaluating the performance of the algorithms on the datasets. We then devise a dataset similarity measure by computing four attributes of each dataset (statistical characteristics, a binary vector, compactness, and connectedness) and weighting these attributes with a BP network. For an input dataset, we find the most similar dataset in the library using this similarity measure. Finally, the recommended clustering algorithm is obtained from the mapping between datasets and algorithms. In the proposed framework, eight artificial datasets and eight real UCI datasets are selected to construct the dataset library, and seven representative clustering algorithms form the algorithm library. Experiments on some UCI datasets demonstrate that the proposed recommendation framework achieves satisfactory performance.
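The recommendation step described above (find the most similar library dataset, then look up its mapped algorithm) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The attribute names, the use of a Euclidean distance per attribute, the weight values, and the two library entries are all hypothetical and chosen only for illustration. In the paper the weights are obtained from a BP network, and the attribute values are computed from the data.

```python
import numpy as np

def weighted_distance(a, b, weights):
    """Weighted sum of per-attribute Euclidean distances (an assumed form)."""
    total = 0.0
    for key, w in weights.items():
        diff = np.atleast_1d(np.asarray(a[key], dtype=float)) \
             - np.atleast_1d(np.asarray(b[key], dtype=float))
        total += w * float(np.linalg.norm(diff))
    return total

def recommend(query, library, weights):
    """Return the most similar library dataset and its mapped algorithm."""
    best = min(
        library,
        key=lambda name: weighted_distance(query, library[name]["attrs"], weights),
    )
    return best, library[best]["algorithm"]

# Hypothetical weights; the paper learns these with a BP network.
weights = {"stat": 0.4, "binary": 0.2, "compactness": 0.2, "connectedness": 0.2}

# Hypothetical library: attribute values and best-performing algorithm per dataset.
library = {
    "blobs": {
        "attrs": {"stat": [0.1, 0.2], "binary": [1, 0, 1],
                  "compactness": 0.3, "connectedness": 0.9},
        "algorithm": "k-means",
    },
    "rings": {
        "attrs": {"stat": [0.8, 0.7], "binary": [0, 1, 1],
                  "compactness": 0.9, "connectedness": 0.2},
        "algorithm": "spectral clustering",
    },
}

# Hypothetical attributes of a new input dataset.
query = {"stat": [0.15, 0.25], "binary": [1, 0, 1],
         "compactness": 0.35, "connectedness": 0.8}

name, algorithm = recommend(query, library, weights)
print(name, algorithm)
```

A smaller weighted distance is treated here as higher similarity, so the nearest library entry determines the recommendation.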