Hierarchical clustering plays a very important role in image processing, intrusion detection, and bioinformatics, and is one of the most actively studied topics in data mining. Current parallel hierarchical clustering algorithms are slow when processing large data sets. To overcome this shortcoming, this paper proposes a new parallel data preprocessing algorithm based on hierarchical clustering. In the best case, the algorithm reduces the original input data to one tenth of its original size, which in turn reduces the total parallel hierarchical clustering time. Experimental results on test data sets show that preprocessing with this algorithm significantly reduces the running time of hierarchical clustering.