针对传统隐私保护方法无法应对任意背景知识下恶意分析的问题,提出了分布式环境下满足差分隐私的k-means算法。该算法利用Map Reduce计算框架,由主任务控制k-means迭代执行;指派Mapper分任务独立并行计算各数据片中每条记录与聚类中心的距离并标记其属于的聚类;指派Reducer分任务计算同一聚类中的记录数量num和属性向量之和sum,并利用Laplace机制产生的噪声扰动num和sum,进而实现隐私保护。根据差分隐私的组合特性,从理论角度证明整个算法满足ε-差分隐私保护。实验结果证明了该方法在提高隐私性和时效性的情况下,保证了较好的可用性。
Aiming at the problem that traditional privacy preserving methods were unable to deal with malign analysis with arbitrary background knowledge, a k-means algorithm preserving differential privacy in distributed environment was proposed. This algorithm was under the computing framework of MapReduce. The host tasks were obligated to control the iterations of k-means. The Mapper tasks were appointed to compute the distances between all the records and cluster- ing centers and to mark the records with the clusters to which the records belong. The Reducer tasks were appointed to compute the numbers of records which belong to the same clusters and the sums of attributes vectors, and to disturb the numbers and the sums with noises made by Laplace mechanism, in order to achieve differential privacy preserving. Based on the combinatorial features of differential privacy, theoretically prove that this algorithm is able to fulfill e-differentially private. The experimental results demonstrate that this method can remain available in the process of preserving privacy and improving efficiency.