With the rapid development of the information industry, the unsupervised nature of clustering has made it a highly effective tool for data analysis. Good clustering results depend on an effective and accurate similarity measure. Because different features contribute differently when describing the similarity between data points, it is necessary to use prior knowledge (e.g., constrained data provided by users) to assess the importance of each feature and to apply it in the similarity computation to obtain more accurate results. Conventional feature weighting methods overlook two problems: (1) the constrained data are very likely to be non-uniformly distributed in the feature space; (2) the constrained data may contain inconsistencies. These two problems prevent conventional weighting methods from producing accurate results, and in some cases prevent them from running at all. This paper therefore proposes a novel constraint-based feature weighting method that addresses both problems. For the first, the constrained data are partitioned into several equivalence classes, and a parameter called the "distribution coefficient" is computed to even out their distribution. For the second, the constrained data are connected to form an undirected graph, and a parameter called the "confidence" is computed to measure and reduce the effect of inconsistency among the constrained data. These two parameters are then combined in the feature weighting function to obtain an accurate similarity measure. Experimental results show that the proposed method can use constrained data to capture the contribution of each feature to the similarity computation, and that it can be applied to any clustering algorithm to improve clustering accuracy.
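The two preprocessing ideas described above can be illustrated with a minimal sketch. It is written under stated assumptions and is not the paper's implementation. It assumes that the constrained data take the form of pairwise must-link and cannot-link constraints over point indices, that equivalence classes are the connected components of the must-link graph, and that a cannot-link pair falling inside a single class counts as an inconsistency. The function names (`equivalence_classes`, `inconsistent_cannot_links`, `weighted_distance`) are hypothetical. The abstract does not give the formulas for the "distribution coefficient" or the "confidence", so the sketch does not attempt to compute them; the feature weights are simply taken as an input.

```python
import math
from collections import defaultdict


def equivalence_classes(n_points, must_link):
    """Group constrained points into classes (assumed: must-link components)."""
    parent = list(range(n_points))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, b in must_link:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Only points that appear in at least one must-link pair are grouped.
    groups = defaultdict(list)
    for i in sorted({p for pair in must_link for p in pair}):
        groups[find(i)].append(i)
    classes = sorted(groups.values())
    class_of = {p: k for k, members in enumerate(classes) for p in members}
    return classes, class_of


def inconsistent_cannot_links(class_of, cannot_link):
    """Return cannot-link pairs whose endpoints share an equivalence class."""
    return [
        (a, b)
        for a, b in cannot_link
        if a in class_of and b in class_of and class_of[a] == class_of[b]
    ]


def weighted_distance(x, y, weights):
    """Feature-weighted Euclidean distance; the weights are assumed given."""
    return math.sqrt(sum(w * (xi - yi) ** 2 for w, xi, yi in zip(weights, x, y)))


# Toy constraints over five points (illustrative data, not from the paper).
must_link = [(0, 1), (1, 2), (3, 4)]
cannot_link = [(0, 2), (2, 3)]

classes, class_of = equivalence_classes(5, must_link)
conflicts = inconsistent_cannot_links(class_of, cannot_link)
print(classes)
print(conflicts)
print(weighted_distance((0.0, 0.0), (3.0, 4.0), (1.0, 0.0)))
```

In this toy example the pair (0, 2) is flagged because points 0 and 2 are joined through point 1 by must-link constraints while also being declared cannot-link. A confidence-style parameter would presumably down-weight such pairs, although how the paper does so is not specified in the abstract.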