
Counter-Based Algorithms for Mining Frequent Items over Data Streams

ISSN: 1000-1239
Journal: 《计算机研究与发展》 (Journal of Computer Research and Development)
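Classification: TP391.41 [Automation and Computer Technology: Computer Application Technology; Automation and Computer Technology: Computer Science and Technology]
Author affiliation: [1] School of Computer Science, Fudan University, Shanghai 201203
Funding: Specialized Research Fund for the Doctoral Program of Higher Education (20090071120092); IBM CRL UR fund project (JSA201007005)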

Keywords: frequent item, Top-K, data stream, data mining, frequency estimation

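Abstract (translated from the Chinese):

Mining frequent items over data streams has received wide attention. Classical frequent-item mining algorithms find the frequent items reasonably well, but their estimates of those items' frequencies often carry large errors. The SRoEC (segment rotative efficient count), SReEC (segment reserve efficient count), and RFreq (reserve frequent) algorithms address this problem. They inherit the counter-based approach, partition the counters, and define corresponding operations on them, aiming to improve the accuracy of frequency counts and reduce the influence of "noise". Experiments and data analysis show that these algorithms not only guarantee that every item whose frequency exceeds the threshold is found, but also greatly improve the accuracy of the frequency counts for frequent items. At the same space cost, on both simulated and real data sets, the algorithms achieve a higher frequency accuracy rate, a lower frequency deviation rate, and a higher frequency retention rate. The advantage is more pronounced when the data distribution is relatively flat.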

English abstract:

Mining frequent items over data streams has drawn great attention, and a large number of efficient algorithms have been proposed over the past decades. Although the classical algorithms are well suited to finding frequent items, they usually do not perform well when estimating the items' approximate frequencies. To solve this problem, we introduce a series of counter-based algorithms called SRoEC (segment rotative efficient count), SReEC (segment reserve efficient count), and RFreq (reserve frequent). They divide the counters used in classical algorithms and define operations on those counters to improve the accuracy of item frequencies and avoid the effect of low-frequency items. As the experiments show, these algorithms correctly find the Top-K items above the threshold and return approximate frequencies that are as accurate as possible. Both analysis and experiments demonstrate that, under the same space cost, these algorithms achieve a higher count accuracy rate, a lower frequency error rate, and a higher frequency reserve rate on both simulated and real data sets when compared with the two best classical algorithms to date (the Frequent algorithm and the Space Saving algorithm). Among them, the RFreq algorithm shows obvious advantages. Moreover, the algorithms perform much better than the classical ones when the data distribution is smooth.
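To make the problem statement concrete, here is a minimal sketch of the classical Frequent algorithm (also known as Misra-Gries), one of the two baselines named in the abstract. It is not SRoEC, SReEC, or RFreq, whose details this record does not give. The function name `frequent`, the parameter `k`, and the toy stream are assumptions made for illustration only.

```python
from collections import Counter


def frequent(stream, k):
    """Classical Frequent (Misra-Gries) summary using at most k - 1 counters.

    Illustrative baseline only; this is not the paper's SRoEC/SReEC/RFreq.
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement every counter and drop the zeros.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


# Hypothetical toy stream: 'a' and 'b' are frequent, the rest is "noise".
stream = list("aabacadaeafabbgbhb")
k = 3
estimates = frequent(stream, k)
exact = Counter(stream)

for item, estimate in sorted(estimates.items()):
    # Estimates never exceed the exact counts, and the gap is at most n / k.
    print(item, "estimated:", estimate, "exact:", exact[item])
```

On this toy stream the summary keeps the frequent items but reports counts well below the true ones, because every low-frequency item that arrives while the counters are full decrements all of them. That gap between finding an item and estimating its frequency is the weakness the abstract attributes to the classical counter-based algorithms.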

Associated project

Research on key technologies for anomaly monitoring and prediction over large-scale data streams


Journal information

《计算机研究与发展》 (Journal of Computer Research and Development)
Chinese Science and Technology Core Journal

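Supervising body: Chinese Academy of Sciences
Sponsor: Institute of Computing Technology, Chinese Academy of Sciences
Editor-in-chief: Xu Zhiwei (徐志伟)
Address: Institute of Computing Technology, Chinese Academy of Sciences, No. 6 Kexueyuan South Road, Beijing
Postal code: 100190
Email: crad@ict.ac.cn
Phone: 010-62620696, 62600350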

International standard serial number: ISSN 1000-1239
Domestic serial number: CN 11-1777/TP
Postal distribution code: 2-654

Awards:
100 Outstanding Chinese Academic Journals (2001-2007); 2008 China Quality Science...; "Double Benefit" journal of the China Periodical Matrix

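Indexed in domestic and international databases:
Abstracts Journal (Russia); Abstracts and Citation Database (Netherlands); Engineering Index (USA); Japan Science and Technology Agency database (Japan); Chinese Science and Technology Core Journals (China); Peking University Core Journals (China; 2000, 2004, 2008, 2011, and 2014 editions)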

Citations: 40349