To address the curse of dimensionality and dataset noise in text classification, this study proposes a text classification algorithm that combines a nonlinear dimensionality reduction algorithm with the k-nearest neighbor (k-NN) algorithm. The algorithm first denoises the dataset, then applies locally linear embedding (LLE), a nonlinear manifold learning method, to recover the low-dimensional manifold structure embedded in the high-dimensional data and thereby reduce its dimensionality. The processed text data are then used to train a k-NN classifier. Experimental results show that the algorithm effectively improves text classification accuracy.
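The three stages named in the abstract (denoising, locally linear embedding, k-NN) can be sketched roughly as follows with scikit-learn. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not say how noise is removed, so the neighbor-agreement filter in the hypothetical helper `remove_noisy_samples` is a stand-in, and the TF-IDF features, the function name `fit_lle_knn`, and all parameter values (`n_components`, `n_neighbors`, `k`) are likewise assumptions chosen for the example.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.manifold import LocallyLinearEmbedding
from sklearn.neighbors import KNeighborsClassifier, NearestNeighbors


def remove_noisy_samples(X, y, k=3):
    """Return a boolean mask of samples to keep.

    Assumed denoising rule (not taken from the paper): a sample is kept
    when at least half of its k nearest neighbors share its label.
    """
    y = np.asarray(y)
    nn = NearestNeighbors(n_neighbors=k).fit(X)
    # Calling kneighbors() without X excludes each point from its own neighbors.
    neighbor_idx = nn.kneighbors(return_distance=False)
    agreement = (y[neighbor_idx] == y[:, None]).mean(axis=1)
    return agreement >= 0.5


def fit_lle_knn(docs, labels, n_components=2, n_neighbors=5, k=3):
    """Fit the denoise -> LLE -> k-NN pipeline and return a predict function."""
    # Step 0 (assumed): represent documents as dense TF-IDF vectors.
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(docs).toarray()
    y = np.asarray(labels)

    # Step 1: drop training samples flagged as noisy.
    keep = remove_noisy_samples(X, y, k=k)
    X, y = X[keep], y[keep]

    # Step 2: recover a low-dimensional embedding with LLE.
    lle = LocallyLinearEmbedding(n_neighbors=n_neighbors, n_components=n_components)
    Z = lle.fit_transform(X)

    # Step 3: train the k-NN classifier on the embedded data.
    knn = KNeighborsClassifier(n_neighbors=k).fit(Z, y)

    def predict(new_docs):
        # New documents go through the same vectorizer and LLE mapping.
        X_new = vectorizer.transform(new_docs).toarray()
        return knn.predict(lle.transform(X_new))

    return predict


if __name__ == "__main__":
    # Toy corpus invented for illustration only.
    docs = [
        "the team won the football match",
        "the football team scored a late goal",
        "the coach praised the team after the match",
        "a goal in the final match won the cup",
        "the player scored a goal for the team",
        "the coach trained the football player",
        "the new phone has a fast processor",
        "the laptop processor runs the software fast",
        "the software update fixed the phone bug",
        "a fast laptop needs a new processor",
        "the developer fixed a bug in the software",
        "the developer tested the new laptop",
    ]
    labels = ["sports"] * 6 + ["tech"] * 6
    predict = fit_lle_knn(docs, labels)
    print(predict(["the team scored a goal", "the new software for the laptop"]))
```

Fitting LLE on the training set and mapping unseen documents with `transform` keeps test data out of the embedding step. On a corpus this small the embedding is not meaningful, so the toy run only shows how the stages connect.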