Mining Frequent Itemsets in Correlated Uncertain Databases

ISSN: 1000-9000
Journal: Journal of Computer Science and Technology (English edition)
Classification: TP311.13 [Automation and Computer Technology: Computer Software and Theory; Automation and Computer Technology: Computer Science and Technology] TP391.41 [Automation and Computer Technology: Computer Application Technology; Automation and Computer Technology: Computer Science and Technology]
Author affiliations: [1] State Key Laboratory of Software Development Environment, School of Computer Science and Engineering, Beihang University, Beijing 100191, China; [2] Department of Computer Science and Engineering, The Hong Kong University of Science and Technology, Hong Kong, China
Funding: This work is partially supported by the Hong Kong RGC Project under Grant No. N_HKUST637/13, the National Basic Research 973 Program of China under Grant No. 2014CB340303, the National Natural Science Foundation of China under Grant Nos. 61328202 and 61300031, Microsoft Research Asia Gift Grant, Google Faculty Award 2013, and Microsoft Research Asia Fellowship 2012.

Keywords: frequent itemset, data mining, database, frequent pattern mining, probabilistic model, uncertain data, correlation, pervasive computing, probabilistic frequent itemset

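Abstract (translated from the Chinese):

With the recent growth of the Internet of Things (IoT) and pervasive computing, large volumes of uncertain data, such as RFID data, sensor data, and real-time video data, have been collected. Frequent pattern mining over uncertain data, one of the most basic problems in uncertain data mining, has drawn considerable attention from the database and data mining communities. Several solutions exist, but most assume that the data are independent, which does not hold in many real-world settings. Methods built on the independence assumption can therefore produce inaccurate results on correlated uncertain data. This paper addresses the problem of mining frequent itemsets over correlated uncertain data, where correlation may exist between any pair of uncertain data objects (transactions). We propose a new probabilistic model, the Correlated Frequent Probability model (CFP model), which represents the probability distribution of support in a given correlated uncertain dataset. From the support distribution derived from the CFP model, we observe that some probabilistic frequent itemsets are frequent only within a few highly positively correlated transactions. Itemsets that are globally probabilistic frequent are more meaningful because they remove the influence of noise and correlation in the data. To reduce redundant frequent itemsets, we further propose a new type of pattern, the global probabilistic frequent itemset: an itemset that is frequent in every group of transactions when the whole correlated uncertain database is partitioned into disjoint groups according to correlation. To speed up mining, we also design a dynamic programming solution together with two pruning and bounding techniques. Extensive experiments on real and synthetic datasets verify the effectiveness and efficiency of the proposed model and algorithms.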

English abstract:

Recently, with the growing popularity of the Internet of Things (IoT) and pervasive computing, a large amount of uncertain data, e.g., RFID data, sensor data, and real-time video data, has been collected. As one of the most fundamental issues in uncertain data mining, uncertain frequent pattern mining has attracted much attention in the database and data mining communities. Although there are some solutions for uncertain frequent pattern mining, most of them assume that the data is independent, which is not true in most real-world scenarios. Therefore, current methods that are based on the independence assumption may generate inaccurate results for correlated uncertain data. In this paper, we focus on the problem of mining frequent itemsets over correlated uncertain data, where correlation can exist between any pair of uncertain data objects (transactions). We propose a novel probabilistic model, called the Correlated Frequent Probability model (CFP model), to represent the probability distribution of support in a given correlated uncertain dataset. Based on the distribution of support derived from the CFP model, we observe that some probabilistic frequent itemsets are frequent only in several transactions with high positive correlation. In particular, itemsets that are globally probabilistic frequent are more significant because they eliminate the influence of noise and correlation in the data. In order to reduce redundant frequent itemsets, we further propose a new type of pattern, called global probabilistic frequent itemsets, to identify itemsets that are always frequent in each group of transactions when the whole correlated uncertain database is divided into disjoint groups based on their correlation. To speed up the mining process, we also design a dynamic programming solution, as well as two pruning and bounding techniques. Extensive experiments on both real and synthetic datasets verify the effectiveness and efficiency of the proposed model and algorithms.
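
Illustrative note: the abstract does not spell out the CFP model or its dynamic program, so the sketch below is not the authors' method. It shows only the independence-based baseline that the abstract criticizes, under the common convention that an itemset is probabilistic frequent when P(support >= min_sup) >= tau. The function names and the parameters `probs`, `min_sup`, and `tau` are assumptions introduced for illustration.

```python
from typing import List


def support_distribution(probs: List[float]) -> List[float]:
    """Return dist where dist[k] = P(support == k).

    Assumes (hypothetically) that transaction i contains the itemset with
    probability probs[i], independently of every other transaction.
    """
    dist = [1.0]  # with zero transactions, support is 0 with certainty
    for p in probs:
        nxt = [0.0] * (len(dist) + 1)
        for k, mass in enumerate(dist):
            nxt[k] += mass * (1.0 - p)   # transaction does not contain the itemset
            nxt[k + 1] += mass * p       # transaction contains the itemset
        dist = nxt
    return dist


def frequent_probability(probs: List[float], min_sup: int) -> float:
    """P(support >= min_sup) under the independence assumption."""
    return sum(support_distribution(probs)[min_sup:])


def is_probabilistic_frequent(probs: List[float], min_sup: int, tau: float) -> bool:
    """True when the frequent probability reaches the threshold tau."""
    return frequent_probability(probs, min_sup) >= tau


if __name__ == "__main__":
    # Two transactions, each containing the itemset with probability 0.5.
    print(frequent_probability([0.5, 0.5], 2))
```

For two transactions with probability 0.5 each, the independent model gives P(support >= 2) = 0.25. If the two transactions were perfectly positively correlated (both contain the itemset or neither does), the same probability would be 0.5, which is the kind of gap the abstract attributes to ignoring correlation.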
