The test-time complexity of the K-nearest neighbor (KNN) algorithm is at least linear in the number of training samples, which makes it inefficient on large data sets. To address this problem, this paper proposes a fast KNN classification algorithm for big data. The algorithm introduces a training stage into KNN: a clustering method of linear complexity partitions the large training set into blocks. At test time, the algorithm finds the block nearest to the test sample and uses it as the new training set for KNN classification. This greatly reduces the testing cost of KNN and makes it applicable to large data sets. Experiments show that the proposed algorithm classifies markedly faster than classical KNN while keeping comparable accuracy.
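The two stages described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the clustering method, so plain k-means (linear in the number of samples per iteration) is assumed here, "nearest block" is assumed to mean the block with the nearest centroid, and the names `kmeans`, `BlockKNN`, `n_blocks`, and `predict_one` are hypothetical.

```python
import numpy as np
from collections import Counter


def kmeans(X, n_blocks, n_iter=20, seed=0):
    """Plain Lloyd's k-means (an assumed stand-in for the paper's clustering).

    Each iteration costs O(n * n_blocks * d), i.e. linear in the sample count n.
    """
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=n_blocks, replace=False)].astype(float)
    for _ in range(n_iter):
        # Assign every sample to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(n_blocks):
            members = X[labels == j]
            if len(members):  # keep the old center if a block is empty
                centers[j] = members.mean(axis=0)
    # Final assignment so that labels match the returned centers.
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return centers, dists.argmin(axis=1)


class BlockKNN:
    """Hypothetical sketch: cluster once at training time, search one block per query."""

    def __init__(self, n_blocks=10, k=5):
        self.n_blocks = n_blocks
        self.k = k

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y)
        centers, labels = kmeans(X, self.n_blocks)
        # Drop empty blocks so every stored centroid has samples behind it.
        keep = [j for j in range(self.n_blocks) if np.any(labels == j)]
        self.centers_ = centers[keep]
        self.blocks_ = [(X[labels == j], y[labels == j]) for j in keep]
        return self

    def predict_one(self, x):
        x = np.asarray(x, dtype=float)
        # Step 1: pick the block whose centroid is closest to the query.
        j = np.linalg.norm(self.centers_ - x, axis=1).argmin()
        Xb, yb = self.blocks_[j]
        # Step 2: ordinary KNN with majority vote, restricted to that block.
        k = min(self.k, len(Xb))
        nearest = np.argsort(np.linalg.norm(Xb - x, axis=1))[:k]
        return Counter(yb[nearest].tolist()).most_common(1)[0][0]
```

In this sketch a query compares against the block centroids and then only against the samples of one block, instead of against every training sample. True neighbors that fall in an adjacent block are missed, which is one reason the accuracy can be close to, rather than identical with, that of classical KNN.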