各个点在数据内部的组织结构中自然地扮演着3种不同的结构性角色,分别是毂、质心和野值.在基于邻域的聚类算法中,邻域密度因子能够识别分离数据集中的毂、质心和野值.但是,邻域密度因子对有噪声和重叠的数据往往失效.为了解决该问题,引入了基于多项式核的邻域密度因子,并在有向树框架下,提出了一种结构化的数据聚类算法,其计算复杂度线性于输入数据的大小.对带有噪声和重叠的数据集,该算法能够找到所有显著的、任意形状的不均衡聚类.在人工和真实数据集上的实验结果都证实了该算法的有效性和快速性.
Within the internal organization of the data, the data points respectively play three different structural roles: the hub, centroid and outlier. The neighborhood-based density factor (NDF) used in the neighborhood based clustering (NBC) algorithm has the ability of identifying which points act as hubs, centriods or outliers in separated-well data set. However, NDF often works poorly in the circumstances of noise and overlapping. This paper introduces a polynomial kernel based neighborhood density factor (PKNDF) to address this issue. Relying on the PKNDF, a structural data clustering algorithm is further presented which can find all salient clusters with arbitrary shapes and unbalanced sizes in a noisy or overlapping data set. It builds clusters into the framework of directed trees in graph theory and thereby each point is scanned only once in the process of clustering. Hence, its computational complexity is nearly linear in the size of the input data. Experimental results on both synthetic and real-world datasets have demonstrated its effectiveness and efficiency.