Hierarchical classification is an effective approach to Web text classification when the number of categories is large or the category structure is complex. One weakness is that, even when a document is assigned to the correct top-level category, precision drops at the subcategory level because sibling subcategories have much in common. This overlap of features among subcategories under the same parent is inherent to the hierarchical structure. To address this limitation, and drawing on the strong performance of KNN, we propose a Web text classification method that combines a hierarchical structure with KNN. The method first builds a hierarchical (tree-structured) model. At classification time, it obtains from the model the k0 categories most similar to the document, draws a subset of representative samples from the training documents of those k0 categories, and applies the KNN algorithm to them. An improved similarity calculation then determines the final category. Experiments show that the combined method achieves good classification performance on Web text.
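The pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration under several assumptions, not the authors' implementation: the tree is flattened to leaf categories named `parent/child`, each category is summarized by the centroid of its training vectors, "representative samples" are taken to be the documents closest to their category centroid, and a plain similarity-weighted vote stands in for the improved similarity calculation, which the abstract does not specify. The function names, the parameters `reps` and `k`, and the toy data are all hypothetical.

```python
import math
from collections import defaultdict

def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def centroid(docs):
    """Average term-weight vector of a non-empty list of sparse vectors."""
    total = defaultdict(float)
    for d in docs:
        for t, w in d.items():
            total[t] += w
    return {t: w / len(docs) for t, w in total.items()}

def classify(doc, training, k0=2, reps=2, k=3):
    """training maps a leaf category name to its list of document vectors."""
    # Step 1: keep the k0 categories whose centroid is most similar to doc.
    centroids = {c: centroid(docs) for c, docs in training.items()}
    candidates = sorted(centroids,
                        key=lambda c: cosine(doc, centroids[c]),
                        reverse=True)[:k0]
    # Step 2 (assumption): representative samples are the `reps` training
    # documents closest to their own category centroid.
    pool = []
    for c in candidates:
        ranked = sorted(training[c],
                        key=lambda d: cosine(d, centroids[c]),
                        reverse=True)
        pool.extend((c, d) for d in ranked[:reps])
    # Step 3: KNN restricted to the pooled representatives.
    neighbours = sorted(pool, key=lambda cd: cosine(doc, cd[1]),
                        reverse=True)[:k]
    # Step 4 (stand-in for the paper's improved similarity calculation):
    # each neighbour votes for its category with weight equal to its similarity.
    scores = defaultdict(float)
    for c, d in neighbours:
        scores[c] += cosine(doc, d)
    return max(scores, key=scores.get)

# Toy data: two sibling subcategories under "sports" and one under "tech".
training = {
    "sports/football": [{"goal": 1, "match": 1}, {"goal": 1, "league": 1}],
    "sports/tennis": [{"serve": 1, "match": 1}, {"serve": 1, "racket": 1}],
    "tech/hardware": [{"cpu": 1, "chip": 1}, {"chip": 1, "board": 1}],
}

print(classify({"serve": 1, "match": 1, "racket": 1}, training))  # sports/tennis
```

Restricting KNN to representatives of the k0 candidate categories keeps the neighbour search small, while the final vote gives sibling subcategories that share features a second, finer-grained comparison.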