Frequent-Pattern Text Classification Based on a Classification Rule Tree

ISSN: 1000-9825
Journal: Journal of Software (《软件学报》)
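Classification: TP18 [Automation and Computer Technology: Control Science and Engineering; Automation and Computer Technology: Control Theory and Control Engineering]
Author affiliations: [1] Department of Computer and Information Technology, Fudan University, Shanghai 200433; [2] College of Mathematics and Computer Science, Fuzhou University, Fuzhou, Fujian 350002
Funding: Supported by the National Natural Science Foundation of China under Grant No. 60173027; the Science and Technology Foundation of the Education Office of Fujian Province of China under Grant No. JB02069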

Authors: Chen Xiaoyun (陈晓云) [1,2], Chen Yi (陈袆) [1], Wang Lei (王雷) [1], Li Ronglu (李荣陆) [1], Hu Yunfa (胡运发) [1]

Keywords: frequent pattern, text categorization, term frequency, association rule, classification rule

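Chinese abstract:

Association classification based on frequent patterns is a classification method that has emerged in recent years. It uses the patterns that occur frequently in each category to construct classification rules and then classifies new text with those rules. Existing association classification methods, however, have two shortcomings when applied to text classification. First, the frequent patterns used to construct classification rules consider only whether a feature term occurs in a text and therefore ignore how often it occurs. Second, when many rules are generated, rule pruning is needed to improve classification efficiency, and classification accuracy drops markedly after pruning. To address this, a frequent-pattern text classification method with term frequency, based on a classification rule tree, is proposed. The results show that introducing term frequency improves the accuracy of association classification, and that using a classification rule tree markedly speeds up classification while ensuring that classification quality does not degrade. These two measures remedy the shortcomings of existing association classification when applied to text. A comparison with three typical text classification methods shows that, in low-dimensional feature spaces, association classification outperforms Bayes, kNN (k nearest neighbor), and SVM (support vector machines), making it a promising text classification method.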

English abstract:

An association categorization approach based on frequent patterns has recently been proposed; it builds classification rules from the frequent patterns of each category and uses those rules to classify new text. The method has two shortcomings when applied to text data: first, it ignores information about a word's frequency in a text; second, when a large number of rules is generated, the rule pruning needed to improve classification efficiency leads to an obvious drop in accuracy. Therefore, a text categorization algorithm based on frequent patterns with term frequency is presented. This study shows that word frequency helps improve the accuracy of association categorization, and that the classification rule tree improves the efficiency of association classification. Experimental results show that association classification performs better than three typical text classification methods, Bayes, kNN (k nearest neighbor), and SVM (support vector machines), so it is a promising text classification method.
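The abstract names two ideas: frequent patterns that carry term frequency, and a rule tree that avoids scanning every rule at classification time. The record gives no algorithmic detail, so the following is only a minimal sketch under stated assumptions, not the authors' method. Assumptions: term frequency is encoded as items `(word, k)` meaning "the word occurs at least k times"; patterns are enumerated by brute force rather than by a real frequent-pattern miner; the rule tree is a plain prefix tree over sorted patterns; a document is assigned the class with the highest summed rule confidence. All function names and the toy data are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def to_items(tokens, max_level=2):
    """Frequency-aware items: (word, k) means "word occurs at least k times"."""
    counts = Counter(tokens)
    return sorted((w, k) for w, c in counts.items()
                  for k in range(1, min(c, max_level) + 1))


def mine_rules(docs, labels, min_support=2, min_conf=0.6, max_len=2):
    """Brute-force class association rules (assumed stand-in for a real miner)."""
    per_class = defaultdict(Counter)   # label -> pattern -> document count
    overall = Counter()                # pattern -> document count, all classes
    for tokens, label in zip(docs, labels):
        items = to_items(tokens)
        for n in range(1, max_len + 1):
            # combinations() keeps the sorted order, so patterns are sorted tuples
            for pattern in combinations(items, n):
                per_class[label][pattern] += 1
                overall[pattern] += 1
    rules = []
    for label, counts in per_class.items():
        for pattern, count in counts.items():
            conf = count / overall[pattern]
            if count >= min_support and conf >= min_conf:
                rules.append((pattern, label, conf))
    return rules


def build_rule_tree(rules):
    """Prefix tree over sorted patterns; a rule is stored where its pattern ends."""
    root = {"children": {}, "rules": []}
    for pattern, label, conf in rules:
        node = root
        for item in pattern:
            node = node["children"].setdefault(
                item, {"children": {}, "rules": []})
        node["rules"].append((label, conf))
    return root


def classify(tree, tokens):
    """Sum the confidence of every rule whose pattern is contained in the text."""
    items = to_items(tokens)
    scores = Counter()

    def walk(node, start):
        for label, conf in node["rules"]:
            scores[label] += conf
        # Only follow branches whose next item also appears in the document.
        for i in range(start, len(items)):
            child = node["children"].get(items[i])
            if child is not None:
                walk(child, i + 1)

    walk(tree, 0)
    return scores.most_common(1)[0][0] if scores else None


# Toy corpus (hypothetical data, for illustration only).
docs = [
    "goal match goal team".split(),
    "team goal goal coach".split(),
    "market stock market price".split(),
    "stock price market market".split(),
]
labels = ["sports", "sports", "finance", "finance"]
tree = build_rule_tree(mine_rules(docs, labels))
print(classify(tree, "goal goal score".split()))
```

Because the tree is walked only along items present in the document, rules whose first item is absent are never visited. That is the intuition behind faster classification without pruning rules, though the tree described in the paper may be organized differently.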

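Project associated with this journal paper

Research on Key Technologies of Chinese Text Databases for Massive Information Management

Journal papers: 6; Books: 17

Journal papers from the same project

Image Mosaic Techniques, Computer Science, 30(6)

Document Feature Selection Based on a Minimum Term-Frequency Threshold

A Comparison of Hierarchy Construction Methods Based on the Confusion Matrix

A Rule Generation Method for Kernel Fuzzy Classifiers

A Clustering Algorithm for Topic Extraction from BBS Corpora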

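Journal information

Journal of Software (《软件学报》)
Peking University Core Journal (2011 edition)

Supervising organization: Chinese Academy of Sciences
Sponsors: Institute of Software, Chinese Academy of Sciences; China Computer Federation
Editor-in-chief: Zhao Chen (赵琛)
Address: Institute of Software, Chinese Academy of Sciences, P.O. Box 8718, Beijing
Postal code: 100190
Email: jos@iscas.ac.cn
Telephone: 010-62562563

International standard serial number: ISSN 1000-9825
Domestic serial number: CN 11-2560/TP
Postal distribution code: 82-367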

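Awards:
Selected in 2001 as a "Double Hundred Journal" of the China Periodical Phalanx (中国期刊方阵); awarded First Prize for Outstanding Science and Technology Journals of the Chinese Academy of Sciences in 2000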

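Indexed in domestic and international databases:
Abstract Journal (Russia), Mathematical Reviews, online edition (USA), Index Copernicus (Poland), Mathematics Abstracts (Germany), Abstract and Citation Database (Netherlands), Engineering Index (USA), Cambridge Scientific Abstracts (USA), Science Abstracts database (UK), Japan Science and Technology Agency database (Japan), China Science and Technology Core Journals (China), Peking University Core Journals (China; 2000, 2004, 2008, 2011, and 2014 editions)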

Citations: 54609