The nearest neighbor classifier assumes that class-conditional probabilities are locally constant, an assumption that breaks down in high-dimensional feature spaces. Consequently, applying a k-nearest-neighbor (KNN) classifier in such a space without adjusting the feature weights introduces severe bias. In this paper, a sensitivity method based on a feedforward neural network is first used to obtain initial feature weights and to perform a second stage of dimensionality reduction. Under the initial weights, the training samples are partitioned into small regions according to inter-sample similarity using an SS-tree, which is then used to find k0 approximate nearest neighbors of the sample to be classified. New feature weights are computed from these k0 approximate neighbors using the chi-square distance, and the final k nearest neighbors are searched under the new weights. At a small additional time cost, this method yields a clear improvement in text categorization accuracy.
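The two-stage scheme described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the SS-tree neighbor search is replaced by a linear scan, and the chi-square-based weight computation is stood in for by a simple between-class spread of feature means among the k0 neighbors (all function names and the re-weighting rule are assumptions for illustration).

```python
from collections import Counter

def weighted_dist(a, b, w):
    """Weighted squared Euclidean distance between feature vectors a and b."""
    return sum(wi * (ai - bi) ** 2 for wi, ai, bi in zip(w, a, b))

def two_stage_knn(X, y, query, init_w, k0, k):
    """Classify `query` in two stages: find k0 approximate neighbors under
    the initial weights, re-estimate feature weights from that local
    neighborhood, then take a k-NN majority vote under the new weights.
    The local re-weighting below (between-class spread of feature means
    among the k0 neighbors) is an illustrative stand-in for the paper's
    chi-square-distance-based computation."""
    n_feat = len(query)
    # Stage 1: k0 neighbors under the initial weights. The paper performs
    # this search with an SS-tree; a linear scan stands in here.
    near = sorted(range(len(X)),
                  key=lambda i: weighted_dist(X[i], query, init_w))[:k0]
    # Stage 2: per-feature relevance from class-mean separation in the
    # local neighborhood; features that separate the classes get weight.
    classes = set(y[i] for i in near)
    new_w = []
    for f in range(n_feat):
        overall = sum(X[i][f] for i in near) / k0
        spread = 0.0
        for c in classes:
            members = [X[i][f] for i in near if y[i] == c]
            spread += len(members) * (sum(members) / len(members) - overall) ** 2
        new_w.append(spread + 1e-6)  # small floor keeps the metric well-defined
    # Final k-NN majority vote under the re-estimated weights.
    near2 = sorted(range(len(X)),
                   key=lambda i: weighted_dist(X[i], query, new_w))[:k]
    return Counter(y[i] for i in near2).most_common(1)[0][0]
```

With a noisy second feature, the equally weighted vote can be dominated by noise, while the re-weighted second pass recovers the informative feature; this mirrors the bias argument made above.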