Data provenance describes the origin of data and how it evolves over time. Minimizing the propagation of identifying attributes from the tables referenced in a query remains a pressing open problem. We build an equal-value propagation list (EPL) that describes the equi-joins in a query and their transitivity and, based on the EPL, propose a naive identifier propagation method that propagates provenance information efficiently. However, because an identifying attribute can determine the values of non-identifying attributes through equi-joins, simply propagating the values of every table's identifying attributes propagates provenance data redundantly. To avoid this redundancy, we propose a complete identifier propagation lattice together with pruning strategies for it, and give a lattice-pruning-based optimal identifier propagation method that propagates provenance information at minimum cost. Experimental results on the TPC-H benchmark and the synthetic dataset IAP-DB show that the proposed identifier-propagation-based provenance tracing method propagates data provenance efficiently in relational databases.
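The redundancy argument can be illustrated with a minimal sketch. It is not the EPL structure or the lattice-pruning method of this work. It only assumes that equi-join transitivity can be modeled as equivalence classes of attributes (via union-find), and that an identifier need not be propagated when an attribute equal to it is already in the query output. The names `EquiJoinClasses` and `identifiers_to_propagate`, and the TPC-H-style attribute names in the usage note, are hypothetical and chosen for illustration.

```python
from collections import defaultdict


class EquiJoinClasses:
    """Union-find over attribute names (illustrative stand-in, not the EPL).

    Attributes connected by equi-join predicates, directly or transitively,
    end up in the same equivalence class.
    """

    def __init__(self):
        self.parent = {}

    def find(self, attr):
        # Register unseen attributes as their own class.
        self.parent.setdefault(attr, attr)
        root = attr
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every node on the path at the root.
        while self.parent[attr] != root:
            self.parent[attr], attr = root, self.parent[attr]
        return root

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra

    def classes(self):
        groups = defaultdict(set)
        for attr in list(self.parent):
            groups[self.find(attr)].add(attr)
        return list(groups.values())


def identifiers_to_propagate(join_predicates, identifiers, output_attrs):
    """Return the identifiers that still have to be added to the output.

    Assumption of this sketch: an identifier is redundant when some attribute
    in its equi-join class is already part of the query output (or has
    already been selected for propagation).
    """
    eq = EquiJoinClasses()
    for left, right in join_predicates:
        eq.union(left, right)

    covered = {eq.find(a) for a in output_attrs}
    extra = []
    for ident in identifiers:
        root = eq.find(ident)
        if root not in covered:
            extra.append(ident)
            covered.add(root)
    return extra
```

As a usage example, take the joins `orders.o_custkey = customer.c_custkey` and `lineitem.l_orderkey = orders.o_orderkey`, with `orders.o_custkey` already in the output. The call `identifiers_to_propagate(joins, ["customer.c_custkey", "orders.o_orderkey"], ["orders.o_custkey", "lineitem.l_quantity"])` then selects only `orders.o_orderkey`, since the customer key is already recoverable from the output. Choosing a minimum-cost set among several candidates, which is what the lattice-based method targets, is beyond this sketch.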