With the rapid development of science and technology, the problem of massive data processing has attracted wide attention. Combining the K-means clustering algorithm with the Hadoop platform is a reliable way to handle massive data. This paper briefly introduces Hadoop and the K-means algorithm, as well as the parallel implementation of K-means clustering based on MapReduce. It then describes several existing optimization methods for the K-means algorithm on the Hadoop platform. Finally, future research directions are discussed.