针对现有数据发布隐私保护保护算法中的“局部最优”划分问题,提出了一种基于KD树最优投影划分的k匿名算法.首先,在全局范围内对每一个属性维度进行遍历,根据投影距离方差值衡量每个维度的离散度,并确定最优维度;然后,在最优属性维度上,计算其划分系数值,并确定最优划分点.进一步引入一种改进的KD树结构,与传统的KD树结点是一个数据点不同,新设计的KD树中的每个结点均是一个集合.用经过划分点并垂直于最优维度的超平面将一个结点分成两部分,分别作为其左、右孩子结点.最后通过理论分析证明了本文算法的正确性,用实验比较和验证了算法的性能,实验结果显示所提算法平均概化范围减小10%~22%,能够实现更优的划分和更好的数据集可用性.
Data should be provided to relevant decision makers and researchers.Privacy protection should be carried out before the release because the data contains personal sensitive and privacy information.k-anonymity algorithm is an important algorithm in privacy protection.And partition is a significant method in k-anonymity mechanism.For sake of solving the problem of“local optimization”in the existing privacy preserving algorithm for data publishing,a k-anonymity algorithm based on optimized proj ection partition of KD-tree is proposed in this paper.Firstly,the optimized attribute dimension is determined by the dispersion degree according to the proj ection distance variance value in the global domain.Then,in the optimal attribute dimension,the value of the partition coefficient is calculated,and the optimal partition point is determined.Furthermore,an improved KD-tree structure is introduced in this paper.And this structure can solve the problem efficiently.The new designed KD-tree node is a collection rather than a data point in traditional KD-tree.A KD-tree node is divided into two parts as its left child node and its right child node by the hyper plane which pass through the dividing point and is perpendicular to the optimal dimensional.Finally,the validity of the proposed algorithm is verified by theoretical analysis and the performance of the algorithm is demonstrated by comparing experiments.The experimental results show that the proposed algorithm can reduce the average generalization range by 22%to 1 0%,which can achieve a better division and better data set availability.