The traditional KNN text classification algorithm is simple, popular, non-parametric, and easy to implement. However, it must repeatedly compute the similarity between each test document and the training samples, so its efficiency degrades as the number of documents grows. To improve the efficiency of traditional KNN in text classification, this paper proposes an improved KNN algorithm based on clustering. An improved χ² statistic is first used to extract text features; the text set is then grouped into several clusters by a clustering method; finally, an improved KNN method classifies texts against these clusters. Comparative experiments and analysis show that the method classifies text well.
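The three stages named in the abstract (χ² feature selection, clustering, KNN over clusters) can be illustrated with a minimal sketch. The abstract does not specify the improved χ² statistic or the improved KNN rule, so this sketch assumes the standard χ² statistic, a small per-class k-means with cosine similarity, and a majority vote over the nearest cluster centroids. All function names and the toy corpus are hypothetical and serve only as illustration.

```python
import math
from collections import Counter

def chi_square(docs, labels, term, cls):
    """Standard chi-square score of (term, class) from a 2x2 contingency table."""
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        has_term = term in tokens
        if has_term and label == cls:
            a += 1          # term present, in class
        elif has_term:
            b += 1          # term present, other class
        elif label == cls:
            c += 1          # term absent, in class
        else:
            d += 1          # term absent, other class
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0

def select_features(docs, labels, top_k):
    """Keep the top_k terms ranked by their maximum chi-square over all classes."""
    vocab = sorted({t for doc in docs for t in doc})
    classes = sorted(set(labels))
    score = {t: max(chi_square(docs, labels, t, c) for c in classes) for t in vocab}
    return sorted(vocab, key=lambda t: (-score[t], t))[:top_k]

def vectorize(tokens, features):
    """Term-frequency vector restricted to the selected features."""
    counts = Counter(tokens)
    return [float(counts[f]) for f in features]

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv) if nu and nv else 0.0

def kmeans(vectors, k, iters=10):
    """Tiny deterministic k-means (assumption: the first k vectors are the seeds)."""
    centroids = [list(v) for v in vectors[:k]]
    for _ in range(iters):
        groups = [[] for _ in centroids]
        for v in vectors:
            best = max(range(len(centroids)), key=lambda i: cosine(v, centroids[i]))
            groups[best].append(v)
        centroids = [
            [sum(col) / len(g) for col in zip(*g)] if g else c
            for g, c in zip(groups, centroids)
        ]
    return centroids

def build_clusters(vectors, labels, per_class):
    """Cluster each class separately; every centroid carries its class label."""
    clusters = []
    for cls in sorted(set(labels)):
        members = [v for v, l in zip(vectors, labels) if l == cls]
        for centroid in kmeans(members, min(per_class, len(members))):
            clusters.append((centroid, cls))
    return clusters

def classify(vector, clusters, k=1):
    """KNN over cluster centroids instead of over every training document."""
    nearest = sorted(clusters, key=lambda cl: cosine(vector, cl[0]), reverse=True)[:k]
    votes = Counter(cls for _, cls in nearest)
    return votes.most_common(1)[0][0]

# Toy corpus (hypothetical data, for illustration only).
docs = [
    ["goal", "team", "match"], ["team", "coach", "goal"], ["match", "score", "team"],
    ["stock", "market", "price"], ["price", "bank", "stock"], ["market", "trade", "bank"],
]
labels = ["sport"] * 3 + ["finance"] * 3

features = select_features(docs, labels, top_k=6)
vectors = [vectorize(d, features) for d in docs]
clusters = build_clusters(vectors, labels, per_class=2)
sport_pred = classify(vectorize(["team", "goal", "win"], features), clusters)
finance_pred = classify(vectorize(["bank", "price"], features), clusters)
```

In this sketch a test document is compared with at most `per_class` centroids per class rather than with every training document, which is the kind of saving the clustering step is meant to provide. The cluster count and `k` are free parameters here, not values taken from the paper.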