位置:成果数据库 > 期刊 > 期刊详情页
BOD:一种高效的分布式离群点检测算法
  • ISSN号:0254-4164
  • 期刊名称:《计算机学报》
  • 时间:0
  • 分类:TP301[自动化与计算机技术—计算机系统结构;自动化与计算机技术—计算机科学与技术]
  • 作者机构:东北大学信息科学与工程学院,沈阳110004
  • 相关基金:国家“九七三”重点基础研究发展规划项目基金(2012CB316201); 国家自然科学基金面上项目(61033007,61472070)资助
中文摘要:

离群点检测是数据管理领域中的热点问题之一,在许多方面都有着广泛应用,如信用卡诈骗、网络入侵检测、环境监测等.目前现有的离群点检测算法大多针对集中式的处理环境.但随着数据规模的不断增长,传统的集中式算法处理效率受限,无法满足用户日益增长的需求.针对上述问题,文中提出了一种新型的分布式离群点检测算法.首先,在数据存储阶段(即预处理),提出了BDSP(Balance Driven Spatial Partitioning)数据划分算法.该算法可以有效地均衡每个计算节点的工作负载,并实现良好的过滤效果.此外,为划分所得到的每个块设计了一种全新的编码方式,可以快速地确定块与块之间的相邻关系,降低网络开销.基于BDSP算法,提出了BOD(BDSP-based Outlier Detection)分布式离群点检测算法.该算法包括2个步骤:在每个计算节点本地,利用R树索引进行批量过滤,快速地计算离群点并得到本地候选集;利用BDSP中提供的块编码确定需要相互通信的节点,使用少量的网络开销得到最终结果.最后,通过大量实验验证了文中所提出的BDSP和BOD算法的有效性.实验结果表明,相对于现有算法,文中算法可以显著地提高计算效率并大幅降低网络开销.

英文摘要:

Outlier detection is a very hot topic in the area of data management, whose target is to find the objects which are very different from the rest of the data. The techniques of outlier detection can be applied to many fields such as credit card fraud detection, network intrusion detection, environment monitoring and so on. There have been a lot of scholars focusing on developing effective techniques to detect outliers, and a number of excellent approaches have been proposed. Unfortunately, most of the existing methods for outlier computing focus on the centralized processing environment. However, as the data volume increases, the processing efficiency of the traditional centralized methods becomes quite limited and cannot meet the users increasing requirements. To solve the problem above, in this paper, a novel distributed outlier detection algorithm is proposed for computing outliers in large-scale data sets. Firstly, in the data storage stage (i.e., preprocessing), the Balance Driven Spatial Partitioning algorithm (BDSP) is proposed to segment the whole data set into a number of small subsets and allocate these subsets to the corresponding computing nodes. The BDSP algorithm can effectively balance the workload of each computing node and achieve a good filtering effect. Furthermore, a new encoding method encoding method, BOD determines the computing nodes that need to communicate to each other, and outputs the final result from the candidate set with a small amount of network overhead. At last, in the experiments, we use a real data set and a series of synthetic data sets to verify the efficiency and effectiveness of BDSP and BOD proposed in this paper. The experimental results show that comparing with the previous approaches, our proposed algorithms can significantly improve the computation efficiency of outlier detection in a distributed environment and drastically reduce the network communication cost in the distributed processing.

同期刊论文项目
同项目期刊论文
期刊信息
  • 《计算机学报》
  • 北大核心期刊(2011版)
  • 主管单位:中国科学院
  • 主办单位:中国计算机学会 中国科学院计算技术研究所
  • 主编:孙凝晖
  • 地址:北京中关村科学院南路6号
  • 邮编:100190
  • 邮箱:cjc@ict.ac.cn
  • 电话:010-62620695
  • 国际标准刊号:ISSN:0254-4164
  • 国内统一刊号:ISSN:11-1826/TP
  • 邮发代号:2-833
  • 获奖情况:
  • 中国期刊方阵“双效”期刊
  • 国内外数据库收录:
  • 美国数学评论(网络版),荷兰文摘与引文数据库,美国工程索引,美国剑桥科学文摘,日本日本科学技术振兴机构数据库,中国中国科技核心期刊,中国北大核心期刊(2004版),中国北大核心期刊(2008版),中国北大核心期刊(2011版),中国北大核心期刊(2014版),中国北大核心期刊(2000版)
  • 被引量:48433