A Partition-Based Outlier Detection Algorithm

Journal: Journal of Software (软件学报), 2006, Vol. 17(5): 1009-1016. EI: 06289996303
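Classification: TP181 [Automation and Computer Technology: Control Science and Engineering; Automation and Computer Technology: Control Theory and Control Engineering]
Author affiliations: [1] School of Information Science and Engineering, Northeastern University, Shenyang, Liaoning 110006; [2] School of Information and Control Engineering, Shenyang Jianzhu University, Shenyang, Liaoning 110015
Funding: Supported by the National Natural Science Foundation of China under Grant Nos. 60473073, 60573090, 60173051; the Foundation of Teaching and Research Award Program for Outstanding Young Teachers in Higher Education Institutions of China; the Natural Science Foundation of Liaoning Province of China under Grant No. 20052006; the Key Technologies Plan of Liaoning Education Department of China under Grant No. 05L354
Related project: Research on nonlinear real-time data stream management techniques supporting embedded computing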

Keywords: data mining, outlier detection, partition, CD-Tree (cell dimension tree), cell-based algorithm

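Abstract (translated from the Chinese original):

An outlier is a data object that does not share the general characteristics of the data. Partition-based methods divide the space over which the data points of a dataset are distributed into a set of disjoint hyper-rectangular cells, map each data object to its cell, and then use the statistics of each cell to find outliers. Because most real datasets are highly skewed, partitioning produces a large number of empty cells that degrade algorithm performance. To address this, a new index structure, CD-Tree (cell dimension tree), is proposed for indexing the non-empty cells. To optimize the CD-Tree structure and to guide the partitioning of the data, the concept of partition-based skew of data (SOD) is proposed. Based on CD-Tree and SOD, a new outlier detection algorithm is designed. Experimental results show that, compared with the cell-based algorithm, the new algorithm significantly improves both efficiency and the number of dimensions it can handle effectively.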

Abstract (English):

Outliers are objects that do not comply with the general behavior of the data. The partition method divides the data space into a set of non-overlapping rectangular cells by partitioning every dimension into intervals of equal length. Statistical information about the cells is then used to find knowledge in datasets. Real-life datasets exhibit very large data skew, so partitioning produces many empty cells, which hurts the efficiency of the algorithms. An efficient index structure called CD-Tree (cell dimension tree) is designed for indexing cells. Moreover, to guide partitioning and to optimize the structure of CD-Tree, the concept of SOD (skew of data) is proposed to measure the degree of data skew. Finally, a CD-Tree-based algorithm for outlier detection is designed on top of CD-Tree and SOD. The experimental results show that, compared with the cell-based algorithm on real-life datasets, the CD-Tree-based algorithm is markedly more efficient and handles a larger maximum number of dimensions.
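
The partition idea described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's CD-Tree or SOD: the record does not give their definitions, so a plain dictionary stands in for an index of non-empty cells, and the outlier rule (a point is flagged when its cell and the immediately adjacent cells together hold at most `max_count` points) is a simplified assumption chosen for illustration. The names `cell_of`, `build_cell_counts`, `neighborhood_count`, `find_outliers`, `cells_per_dim`, and `max_count` are all hypothetical.

```python
from collections import Counter
from itertools import product
from math import floor


def cell_of(point, mins, widths, cells_per_dim):
    """Map a point to the integer coordinates of its grid cell.

    Points lying on the upper boundary are clamped into the last cell.
    """
    return tuple(
        min(floor((x - lo) / w), cells_per_dim - 1)
        for x, lo, w in zip(point, mins, widths)
    )


def build_cell_counts(points, cells_per_dim):
    """Split every dimension into equal-length intervals and count points per cell.

    Only non-empty cells are stored. The dictionary is an illustrative
    stand-in for an index of non-empty cells, not the paper's CD-Tree.
    """
    dims = len(points[0])
    mins = [min(p[d] for p in points) for d in range(dims)]
    maxs = [max(p[d] for p in points) for d in range(dims)]
    # Fall back to width 1.0 when a dimension has zero extent.
    widths = [((hi - lo) / cells_per_dim) or 1.0 for lo, hi in zip(mins, maxs)]
    counts = Counter(cell_of(p, mins, widths, cells_per_dim) for p in points)
    return counts, mins, widths


def neighborhood_count(cell, counts):
    """Sum the point counts of a cell and its immediately adjacent cells."""
    total = 0
    for offset in product((-1, 0, 1), repeat=len(cell)):
        neighbor = tuple(c + o for c, o in zip(cell, offset))
        total += counts.get(neighbor, 0)  # empty cells are simply absent
    return total


def find_outliers(points, cells_per_dim=4, max_count=1):
    """Flag points whose cell neighborhood holds at most max_count points.

    Assumed rule for illustration only; the paper's actual criterion
    is not stated in this record.
    """
    counts, mins, widths = build_cell_counts(points, cells_per_dim)
    outliers = [
        p for p in points
        if neighborhood_count(cell_of(p, mins, widths, cells_per_dim), counts)
        <= max_count
    ]
    return outliers, counts


if __name__ == "__main__":
    sample = [(0.0, 0.0), (0.5, 0.5), (1.0, 1.0), (0.2, 0.8), (0.9, 0.1), (10.0, 10.0)]
    outliers, counts = find_outliers(sample, cells_per_dim=4, max_count=1)
    print("outliers:", outliers)
    print("non-empty cells:", len(counts), "of", 4 ** 2)
```

On the skewed sample above, most of the 16 grid cells hold no points at all, which is the situation the abstract describes: storing only non-empty cells keeps the bookkeeping proportional to the occupied part of the space rather than to the full grid. Note that the neighbor enumeration still grows as 3^d with the number of dimensions d, so this sketch says nothing about how the paper achieves its reported gains in dimensionality.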
