Data inconsistency is a major data-quality problem in relational databases. Finding inconsistent data requires detecting functional dependency violations. In a centralized database, violations can be detected with SQL-based techniques, but detection in a distributed setting is far more challenging, especially with big data. Detecting functional dependency violations over distributed data usually requires shipping data from one site to another, and different data migration methods affect detection efficiency differently. This paper proposes an equivalence-class-based approach for detecting violations of multiple functional dependencies in a distributed environment, and provides a response-time cost model for the detection. Because the task allocation problem in distributed functional dependency violation detection is NP-hard, an optimal solution is hard to obtain in polynomial time; we therefore transform the problem of minimizing the response time of inconsistency detection into an integer programming problem and give an approximately optimal solution. For different cluster sizes and numbers of functional dependencies, we propose different task allocation strategies, and we apply dynamic load balancing during detection, which improves both load balance and detection efficiency. Experiments on real-world and synthetic datasets show that, compared with centralized detection and a naive Hadoop-based method, the proposed approach detects violations markedly more efficiently and scales well with data size, number of nodes, and number of functional dependencies.
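To make the notion of an equivalence class concrete, the following is a minimal single-machine sketch in Python, not the distributed algorithm proposed in the paper. It assumes tuples are represented as dictionaries; the function name `find_fd_violations` and the `zip`/`city` sample data are hypothetical and chosen only for illustration. Tuples that agree on the left-hand side of a functional dependency X -> Y form one equivalence class, and a class violates the dependency if its tuples carry more than one distinct value for Y.

```python
from collections import defaultdict


def find_fd_violations(rows, lhs, rhs):
    """Return the equivalence classes that violate the FD lhs -> rhs.

    rows: iterable of dicts (one dict per tuple)
    lhs, rhs: lists of attribute names
    The result maps each violating LHS value to the set of RHS values seen.
    """
    classes = defaultdict(set)
    for row in rows:
        # Tuples with the same LHS values fall into the same equivalence class.
        key = tuple(row[a] for a in lhs)
        classes[key].add(tuple(row[a] for a in rhs))
    # A class with more than one distinct RHS value violates the dependency.
    return {key: vals for key, vals in classes.items() if len(vals) > 1}


# Hypothetical sample relation for the FD zip -> city.
rows = [
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "Newark"},
    {"zip": "94105", "city": "San Francisco"},
]
violations = find_fd_violations(rows, ["zip"], ["city"])
```

In a distributed setting, tuples of the same equivalence class may reside on different nodes, which is why data must be shipped between sites before a check of this kind can be completed.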