针对可还原数据扰动(retrievable general additive data perturbation,RGADP)算法在保护数据库隐私时会影响数据挖掘结果的问题,提出一种利用贝叶斯原理在扰动数据上进行分类的方法。该方法分析RGADP算法过程,利用贝叶斯原理,根据扰动数据推算原始数据的概率分布,用估算的概率分布重构数据,并对重构数据进行分类以提高分类的正确性。实验结果表明:该方法估算出的概率分布与原始数据概率分布接近,且重构数据的分类正确率相比扰动数据而言平均可提高4%以上,其更接近原始数据的分类正确率,从而有效地降低了扰动算法对数据分类的影响;该方法的运行时间与数据量和数据分组数成正比,重构10 000条数据的运行时间在200ms以内,因此该方法也具有较高的效率。
A classification method for perturbed data using the Bayesian rule is presented to solve the problem that the result of data mining is affected when the retrievable general additive data perturbation(RGADP)algorithm is used to preserve privacy in database.The process of RGADP algorithm is analyzed,and the Bayesian rule is used to estimate the probability distribution of original data from the perturbed data.Then,new data are reconstructed from the estimated probability distribution and are classified to increase the accuracy of classification.Experimental results show that the probability distribution estimated by the proposed method is close to the original probability distribution.Comparison with the classification accuracy of perturbed data shows that the classification accuracy of the reconstructed data increases by more than 4% in average,and is closer to the original classification accuracy.Thus,the method can effectively reduce the effect of the perturbation algorithm on classification.Moreover,the running time of the method is proportional to the amount of data and the number of groups.The method costs less than 200 ms to reconstruct 10 thousands data,and has a high efficiency.