Outlier detection based on reverse k-nearest neighbours identifies outliers well from a global perspective. However, computing the k-nearest neighbours of every data point in the initial stage requires O(kN^2) distance calculations in the basic algorithm, which makes it unsuitable for large data sets. In addition, the choice of the parameter k strongly affects which points in the data set are identified as outliers. To address these problems, we determine the value of k adaptively and then propose a fast mining algorithm that uses the triangle inequality of metric spaces to prune candidates in advance, reducing the number of distance calculations between data points during outlier detection. Theoretical analysis and experimental results demonstrate the feasibility and efficiency of the algorithm.
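The pruning idea can be illustrated with a minimal sketch. It is not the paper's algorithm: it assumes a single fixed reference point (pivot), Euclidean distance, and a fixed k rather than the adaptive choice described above. The names `knn_with_pruning` and `reverse_knn_counts` are hypothetical. For a query q, a candidate x, and a pivot p, the triangle inequality gives d(q, x) >= |d(q, p) - d(x, p)|, so a candidate whose lower bound already reaches the current k-th best distance can be skipped without computing d(q, x).

```python
import heapq
import math


def knn_with_pruning(points, k):
    """Return (neighbors, n_dist).

    neighbors[i] lists the indices of the k nearest neighbours of points[i],
    closest first. n_dist counts the point-to-point distance evaluations made
    during the search (the n pivot distances are not included).
    """
    n = len(points)
    pivot = points[0]  # assumption: any fixed reference point may serve as pivot
    to_pivot = [math.dist(p, pivot) for p in points]
    order = sorted(range(n), key=to_pivot.__getitem__)  # indices by pivot distance
    rank = {idx: pos for pos, idx in enumerate(order)}

    neighbors, n_dist = [], 0
    for i in range(n):
        heap = []  # max-heap of the k best so far, stored as (-distance, index)
        lo, hi = rank[i] - 1, rank[i] + 1
        # Walk outwards from i in pivot-distance order, so that candidates
        # are visited in non-decreasing order of their lower bound.
        while lo >= 0 or hi < n:
            lb_lo = to_pivot[i] - to_pivot[order[lo]] if lo >= 0 else math.inf
            lb_hi = to_pivot[order[hi]] - to_pivot[i] if hi < n else math.inf
            if lb_lo <= lb_hi:
                j, lb = order[lo], lb_lo
                lo -= 1
            else:
                j, lb = order[hi], lb_hi
                hi += 1
            # Triangle inequality: d(i, j) >= lb. Once lb reaches the k-th
            # best distance, no remaining candidate can improve the result.
            if len(heap) == k and lb >= -heap[0][0]:
                break
            d = math.dist(points[i], points[j])
            n_dist += 1
            if len(heap) < k:
                heapq.heappush(heap, (-d, j))
            elif d < -heap[0][0]:
                heapq.heapreplace(heap, (-d, j))
        neighbors.append([j for _, j in sorted(heap, reverse=True)])
    return neighbors, n_dist


def reverse_knn_counts(neighbors):
    """counts[j] = number of points that have j among their k nearest neighbours."""
    counts = [0] * len(neighbors)
    for nbrs in neighbors:
        for j in nbrs:
            counts[j] += 1
    return counts
```

Under these assumptions, points with a small reverse-neighbour count are outlier candidates: a point far from every cluster appears in no other point's neighbour list and so receives a count of zero. The threshold on the count, like the choice of pivot, is left open here and would have to follow the paper's own criterion.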