东篱科研大数据发现系统（DRDS）

位置：成果数据库 > 期刊 > 期刊详情页

属性序下的粗糙集与KNN相结合的英文文本分类研究

ISSN号：1001-7011
期刊名称：《黑龙江大学自然科学学报》
时间：0
分类：TN181[电子电信—物理电子学]
作者机构：[1]北京信息科技大学计算机学院,北京100101
相关基金：国家自然科学基金资助项目（61272375）;北京市教育委员会科技计划科研项目（KM201411232015）;北京市属高等学校创新团队建设与教师职业发展计划项目（IDHT20130519）;北京信息科技大学开放实验室课题（ICDD2015）

作者：朱敏玲[1]

关键词：粗糙集, KNN, 属性序, 文本分类, 属性约简, rough set , KNN , attribute sequence , text classification , attribute reduction

中文摘要：

由于KNN（K Nearest Neighbor）文本分类器的待分类文本数据维数和计算次数较高,其耗费的时间和空间成本也很高,故引入粗糙集的属性约简算法对待分类的数据进行预处理。提出了基于属性序的处理方法和算法,解决粗糙集属性约简中的NP-hard问题。降低算法计算量从算法本身和运算技巧两个层面出发：在粗糙集区分矩阵的关键环节正区域计算上提出递减式计算方法,减少等价类的计算工作量;运用去停止词的查表法、位置信息在属性序中的引入及倒排索引的检索方法等来进一步降低系统的运行时间和空间成本。通过实验验证,经过粗糙集约简处理的KNN分类系统在分类的准确度、召回率与F值上与没有约简的KNN分类器效果相当,但是系统的时间和空间成本大幅降低。

英文摘要：

Due to the high data dimensionality and high computing frequency of the KNN （K Nearest Neighbor） text classifier, which leads to a very high cost of time and space, a novel method is proposed to reduce the cost of the KNN, in which the data is pretreated by the attribute reduction algorithm of rough set, and then an attribute sequence based treatment method and algorithm is employed to solve the NP- hard problem of the rough set attribute reduction. The efforts on the optimized algorithm and operation techniques are made to reduce the amount of calculation. On one hand, the algorithm is optimized by decreasing the workload of the equivalence class, which is realized by employing a novel decreasing type calculation method in the calculation of positive area. It is the key of the discernibility matrix of rough set. On the other hand, the operating cost of whole system is further reduced by using calculation methods including the look-up table in cutting stop word, the location information in attribute sequence, and inverted index retrieval. The method is tested, and the results show no great difference with the KNN classifiers without rough set reduction in classification accuracy, recall rate, and F values, however time and space cost of the system is reduced significantly.

同期刊论文项目