Cluster analysis is an important research area in data mining. Because it can reveal the internal structure of data and support deeper analysis or preprocessing, it is used in many fields, including image processing and pattern recognition. However, if organizations that hold large datasets (such as medical institutions) use mining tools to extract personal information from user data, users' sensitive information may be leaked. To address this threat, a DPk-medoids clustering algorithm based on differential privacy protection is proposed, drawing on the properties of differential privacy. Before each release of the true center points, the algorithm adds noise to them using the Laplace mechanism and then releases only the noisy center points, which protects personal privacy to a certain degree while preserving the effectiveness of the clustering. Simulation experiments on real datasets show that the proposed algorithm adapts to datasets of different sizes and dimensionalities and that, once the privacy budget reaches a certain value, the ratio of the effectiveness of DPk-medoids to that of the original clustering algorithm lies in the range 0.9 to 1.
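The noise-then-release step described above can be illustrated with a minimal sketch. It is not the paper's exact procedure. The following choices are assumptions made only for illustration: the data are scaled to [0, 1]^d, the L1 sensitivity of a medoid is bounded by d, the privacy budget is split evenly across iterations, and the initial centers are drawn independently of the data. The function names `laplace_noise` and `dp_k_medoids` are hypothetical, and the sketch is not accompanied by a privacy proof.

```python
import random


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # follows a Laplace(0, scale) distribution.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def l1_distance(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def dp_k_medoids(points, k, epsilon, iterations=5, seed=0):
    """Illustrative k-medoids that releases only Laplace-perturbed centers.

    Assumes every point lies in [0, 1]^d (hypothetical preprocessing).
    """
    rng = random.Random(seed)
    d = len(points[0])
    eps_per_iter = epsilon / iterations   # assumption: even budget split
    scale = d / eps_per_iter              # assumption: L1 sensitivity <= d

    # Data-independent initial centers, so nothing is released unperturbed.
    released = [[rng.random() for _ in range(d)] for _ in range(k)]

    for _ in range(iterations):
        # Assign each point to the nearest released (noisy) center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: l1_distance(p, released[c]))
            clusters[nearest].append(p)

        updated = []
        for j, members in enumerate(clusters):
            if not members:
                # Empty cluster: keep the previously released center.
                updated.append(released[j])
                continue
            # True medoid: the member with the smallest total distance
            # to the other members of its cluster.
            medoid = min(
                members,
                key=lambda m: sum(l1_distance(m, q) for q in members),
            )
            # Perturb the true medoid before it is released.
            updated.append([x + laplace_noise(scale, rng) for x in medoid])
        released = updated

    return released
```

In this sketch a smaller `epsilon` yields a larger noise scale and therefore less accurate centers, which is consistent with the observation that clustering effectiveness approaches that of the original algorithm only once the privacy budget is large enough.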