In text association classification, the distribution of feature words in the training samples strongly affects the classification result: the same association classification algorithm can perform very differently on different sample sets, and accuracy drops noticeably when feature words are unevenly distributed. To improve the stability of text association classifiers, this paper applies self-adaptive weighting and designs and implements two algorithms: WARC, an association classification algorithm based on rule weighting, and SWARC, an association classification algorithm based on sample weighting. WARC adaptively weights the classification rules to balance rules of uneven strength, whereas SWARC adaptively adjusts the weights of the training samples, addressing at its root the uneven distribution of feature words across categories. Experimental results show that weight adjustment clearly improves text classification quality for both WARC and SWARC, and that the improvement is especially pronounced for SWARC.