由于KNN(K Nearest Neighbor)文本分类器的待分类文本数据维数和计算次数较高,其耗费的时间和空间成本也很高,故引入粗糙集的属性约简算法对待分类的数据进行预处理。提出了基于属性序的处理方法和算法,解决粗糙集属性约简中的NP-hard问题。降低算法计算量从算法本身和运算技巧两个层面出发:在粗糙集区分矩阵的关键环节正区域计算上提出递减式计算方法,减少等价类的计算工作量;运用去停止词的查表法、位置信息在属性序中的引入及倒排索引的检索方法等来进一步降低系统的运行时间和空间成本。通过实验验证,经过粗糙集约简处理的KNN分类系统在分类的准确度、召回率与F值上与没有约简的KNN分类器效果相当,但是系统的时间和空间成本大幅降低。
Due to the high data dimensionality and high computing frequency of the KNN (K Nearest Neighbor) text classifier, which leads to a very high cost of time and space, a novel method is proposed to reduce the cost of the KNN, in which the data is pretreated by the attribute reduction algorithm of rough set, and then an attribute sequence based treatment method and algorithm is employed to solve the NP- hard problem of the rough set attribute reduction. The efforts on the optimized algorithm and operation techniques are made to reduce the amount of calculation. On one hand, the algorithm is optimized by decreasing the workload of the equivalence class, which is realized by employing a novel decreasing type calculation method in the calculation of positive area. It is the key of the discernibility matrix of rough set. On the other hand, the operating cost of whole system is further reduced by using calculation methods including the look-up table in cutting stop word, the location information in attribute sequence, and inverted index retrieval. The method is tested, and the results show no great difference with the KNN classifiers without rough set reduction in classification accuracy, recall rate, and F values, however time and space cost of the system is reduced significantly.