Associative classification based on frequent patterns is a recently proposed approach that builds classification rules from the patterns occurring frequently in each category and uses these rules to classify new texts. When applied to text classification, however, existing associative classification methods have two shortcomings. First, the frequent patterns used to build the rules record only whether a feature word occurs in a text and ignore how often it occurs. Second, when a large number of rules is generated, rule pruning is needed to keep classification efficient, and accuracy drops noticeably after pruning. To address both problems, this paper proposes a text classification method that uses frequent patterns with term frequency and organizes the rules in a classification rule tree. The results show that introducing term frequency improves the accuracy of associative classification, and that the classification rule tree markedly reduces classification time without lowering classification quality. Together, these two measures remedy the shortcomings of existing associative classification when applied to text. In a comparison with three typical text classification methods, associative classification outperforms Bayes, kNN (k nearest neighbor), and SVM (support vector machines) in low-dimensional feature spaces, making it a promising text classification method.
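To make the role of term frequency concrete, the following is a minimal Python sketch, not the algorithm of this paper. The abstract does not say how term frequency enters the patterns, so everything here is an assumption made for illustration: the encoding of each document as items `(term, k)` meaning "the term occurs at least k times", the cap on `k`, the brute-force mining of patterns of at most two items, the thresholds `min_sup` and `min_conf`, the confidence-sum voting, and the toy data. The classification rule tree is not reproduced, since the abstract gives no details of its structure.

```python
from collections import defaultdict
from itertools import combinations


def to_items(term_counts, cap=2):
    """Encode a document as items (term, k): 'term occurs at least k times'.

    Assumed encoding for illustration; k is capped to keep the item set small.
    """
    return frozenset(
        (term, k)
        for term, count in term_counts.items()
        for k in range(1, min(count, cap) + 1)
    )


def mine_rules(docs, labels, min_sup=0.5, min_conf=0.8, max_len=2):
    """Brute-force mining of rules (pattern -> label); fine only for toy data."""
    item_docs = [to_items(d) for d in docs]
    by_class = defaultdict(list)
    for items, label in zip(item_docs, labels):
        by_class[label].append(items)

    rules = []
    for label, class_docs in by_class.items():
        # Candidate patterns: every item subset of size <= max_len seen in the class.
        candidates = set()
        for items in class_docs:
            for n in range(1, max_len + 1):
                candidates.update(
                    frozenset(c) for c in combinations(sorted(items), n)
                )
        for pattern in candidates:
            in_class = sum(pattern <= items for items in class_docs)
            overall = sum(pattern <= items for items in item_docs)
            support = in_class / len(class_docs)   # frequency within the class
            confidence = in_class / overall        # purity across all classes
            if support >= min_sup and confidence >= min_conf:
                rules.append((pattern, label, confidence))
    return rules


def classify(rules, term_counts):
    """Sum the confidence of all matching rules per class; None if none match."""
    items = to_items(term_counts)
    scores = defaultdict(float)
    for pattern, label, confidence in rules:
        if pattern <= items:
            scores[label] += confidence
    return max(scores, key=scores.get) if scores else None


# Toy corpus (hypothetical): "match" occurs in both classes, but repeatedly
# only in the sports documents.
docs = [
    {"match": 2, "team": 1},
    {"match": 3, "goal": 1},
    {"match": 1, "stock": 2},
    {"match": 1, "bank": 1},
]
labels = ["sports", "sports", "finance", "finance"]
rules = mine_rules(docs, labels)

print(classify(rules, {"match": 2}))             # repeated "match"
print(classify(rules, {"match": 1, "bank": 1}))  # single "match" plus "bank"
print(classify(rules, {"match": 1}))             # single "match" alone
```

In this toy corpus the word "match" appears in every document, so a presence-only pattern on it has confidence 0.5 and is discarded, whereas the pattern "match occurs at least twice" is found only in the sports documents and survives as a rule. This is the kind of distinction that a presence-only pattern cannot express.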