离群点检测是数据管理领域中的热点问题之一,在许多方面都有着广泛应用,如信用卡诈骗、网络入侵检测、环境监测等.目前现有的离群点检测算法大多针对集中式的处理环境.但随着数据规模的不断增长,传统的集中式算法处理效率受限,无法满足用户日益增长的需求.针对上述问题,文中提出了一种新型的分布式离群点检测算法.首先,在数据存储阶段(即预处理),提出了BDSP(Balance Driven Spatial Partitioning)数据划分算法.该算法可以有效地均衡每个计算节点的工作负载,并实现良好的过滤效果.此外,为划分所得到的每个块设计了一种全新的编码方式,可以快速地确定块与块之间的相邻关系,降低网络开销.基于BDSP算法,提出了BOD(BDSP-based Outlier Detection)分布式离群点检测算法.该算法包括2个步骤:在每个计算节点本地,利用R树索引进行批量过滤,快速地计算离群点并得到本地候选集;利用BDSP中提供的块编码确定需要相互通信的节点,使用少量的网络开销得到最终结果.最后,通过大量实验验证了文中所提出的BDSP和BOD算法的有效性.实验结果表明,相对于现有算法,文中算法可以显著地提高计算效率并大幅降低网络开销.
Outlier detection is a very hot topic in the area of data management, whose target is to find the objects which are very different from the rest of the data. The techniques of outlier detection can be applied to many fields such as credit card fraud detection, network intrusion detection, environment monitoring and so on. There have been a lot of scholars focusing on developing effective techniques to detect outliers, and a number of excellent approaches have been proposed. Unfortunately, most of the existing methods for outlier computing focus on the centralized processing environment. However, as the data volume increases, the processing efficiency of the traditional centralized methods becomes quite limited and cannot meet the users increasing requirements. To solve the problem above, in this paper, a novel distributed outlier detection algorithm is proposed for computing outliers in large-scale data sets. Firstly, in the data storage stage (i.e., preprocessing), the Balance Driven Spatial Partitioning algorithm (BDSP) is proposed to segment the whole data set into a number of small subsets and allocate these subsets to the corresponding computing nodes. The BDSP algorithm can effectively balance the workload of each computing node and achieve a good filtering effect. Furthermore, a new encoding method encoding method, BOD determines the computing nodes that need to communicate to each other, and outputs the final result from the candidate set with a small amount of network overhead. At last, in the experiments, we use a real data set and a series of synthetic data sets to verify the efficiency and effectiveness of BDSP and BOD proposed in this paper. The experimental results show that comparing with the previous approaches, our proposed algorithms can significantly improve the computation efficiency of outlier detection in a distributed environment and drastically reduce the network communication cost in the distributed processing.