KNN is one of the top ten data mining algorithms and performs well in text classification. The traditional KNN method assigns the same weight to every sample and therefore ignores the fact that different samples contribute differently to classification. To address this problem, a sample importance principle is proposed, and a KNN classifier is constructed on the basis of this principle. A random walk algorithm is used to identify samples near class boundaries and to compute a boundary value for each sample. An importance score is then generated for each sample from its boundary value, and sample importance is combined with the KNN method to form a new classification model, SI-KNN. Experiments on Chinese and English text corpora show that the SI-KNN classifier achieves some improvement over the traditional KNN method.
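The abstract does not give the formulas behind SI-KNN, so the following Python sketch only illustrates the general idea and should not be read as the authors' implementation. It rests on several assumptions that do not come from the source: the boundary value of a sample is taken to be the probability that a short random walk on a k-nearest-neighbor graph, started at that sample, ends on a sample of a different class; the importance score is taken to be one minus the boundary value (whether boundary samples should be weighted down or up is not stated in the abstract); and votes are weighted by cosine similarity times importance. The function names `boundary_values` and `si_knn_predict`, the parameters `k` and `steps`, and the toy data are all hypothetical.

```python
import numpy as np

def cosine_similarity_matrix(A, B):
    """Pairwise cosine similarity between the rows of A and the rows of B."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    A = A / np.maximum(np.linalg.norm(A, axis=1, keepdims=True), 1e-12)
    B = B / np.maximum(np.linalg.norm(B, axis=1, keepdims=True), 1e-12)
    return A @ B.T

def boundary_values(X, y, k=2, steps=2):
    """Assumed definition: probability that a `steps`-step random walk on the
    k-nearest-neighbor graph, started at a sample, ends on a sample of a
    different class. Requires k < number of samples."""
    y = np.asarray(y)
    n = len(y)
    S = cosine_similarity_matrix(X, X)
    np.fill_diagonal(S, -np.inf)                 # a sample is not its own neighbor
    nbrs = np.argsort(-S, axis=1)[:, :k]         # indices of the k most similar samples
    W = np.zeros((n, n))
    W[np.repeat(np.arange(n), k), nbrs.ravel()] = 1.0   # unweighted directed edges
    P = W / W.sum(axis=1, keepdims=True)         # row-stochastic transition matrix
    Pt = np.linalg.matrix_power(P, steps)        # transition probabilities after `steps` steps
    different = (y[:, None] != y[None, :]).astype(float)
    return (Pt * different).sum(axis=1)

def si_knn_predict(X_train, y_train, importance, X_test, k=2):
    """KNN vote in which each neighbor counts similarity * importance
    (an assumed way of combining sample importance with KNN)."""
    y_train = np.asarray(y_train)
    classes = np.unique(y_train)
    S = cosine_similarity_matrix(X_test, X_train)
    preds = []
    for sims in S:
        idx = np.argsort(-sims)[:k]              # k nearest training samples
        votes = [np.sum(sims[idx] * importance[idx] * (y_train[idx] == c))
                 for c in classes]
        preds.append(classes[int(np.argmax(votes))])
    return np.array(preds)

# Toy data (hypothetical): two classes of 2-D "term weight" vectors.
# Samples 3 and 7 lie close to the other class, i.e. near the class boundary.
X_train = np.array([[5, 0.5], [4, 0.2], [6, 1], [3, 2.5],
                    [0.5, 5], [0.2, 4], [1, 6], [2.5, 3]])
y_train = np.array([0, 0, 0, 0, 1, 1, 1, 1])

boundary = boundary_values(X_train, y_train, k=2, steps=2)
importance = 1.0 - boundary                      # assumption: boundary samples count less
X_test = np.array([[4, 1], [1, 4]])
preds = si_knn_predict(X_train, y_train, importance, X_test, k=2)
print("boundary values:", boundary)
print("predictions:", preds)
```

In this sketch, samples deep inside a class receive a boundary value of zero and keep full weight, while the two samples near the other class receive a positive boundary value and contribute less to each vote. Replacing `1.0 - boundary` with a different mapping is the natural place to experiment if boundary samples are meant to be emphasized instead.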