Because of the inductive bias in how traditional decision-tree construction selects expansion attributes, attributes with more values tend to be chosen first, yielding overly large trees with poor generalization; such trees therefore need to be simplified. Pruning is one form of simplification and is divided into pre-pruning and post-pruning. This paper studies branch merging, a pre-pruning strategy in which, during tree growth, two or more branches of the same node are merged into a single branch before growth continues. It examines the impact of branch merging on decision-tree induction, specifically whether an appropriately chosen merging strategy can improve a tree's comprehensibility, reduce its complexity, and raise its generalization accuracy. Based on information gain, the paper analyzes the complexity of a decision tree before and after branch merging, and designs and implements two branch-merging algorithms: SSID, based on the proportion of positive examples, and MCID, based on maximum gain compensation. Experimental results show that, in both comprehensibility and generalization accuracy, the trees produced by SSID and MCID are significantly better than those built by the widely used See5 system (the improved version of C4.5).
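For reference, the many-valued-attribute bias described above follows directly from the standard information-gain criterion used in ID3-family learners; the abstract does not restate it, so a brief reconstruction:

$$\mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v), \qquad \mathrm{Entropy}(S) = -\sum_{i} p_i \log_2 p_i.$$

An attribute with many distinct values partitions $S$ into many small, nearly pure subsets $S_v$, so the weighted-entropy term shrinks and the gain is inflated; in the extreme, an ID-like attribute with one value per example achieves maximal gain while generalizing not at all.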
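To make the branch-merging idea concrete: the abstract gives only the criterion behind SSID (the proportion of positive examples), not the algorithm itself, so the following is a minimal illustrative sketch rather than the paper's SSID. It merges sibling branches of one node whose positive-example proportions fall within a threshold before tree growth continues; the function names, data layout, and threshold value are all assumptions made for illustration.

```python
def positive_ratio(examples):
    """Proportion of positive examples in a branch's subset."""
    if not examples:
        return 0.0
    return sum(1 for ex in examples if ex["label"] == 1) / len(examples)

def merge_similar_branches(partition, threshold=0.1):
    """Illustrative pre-pruning step (hypothetical, not the paper's SSID):
    sibling branches whose positive-example ratios differ by less than
    `threshold` are merged into one branch before tree growth continues.

    `partition` maps attribute value -> list of examples routed to that
    branch; the result maps tuples of merged attribute values to examples.
    """
    # Sort branches by positive ratio so that similar branches are adjacent.
    branches = sorted(partition.items(), key=lambda kv: positive_ratio(kv[1]))
    merged = []
    for value, examples in branches:
        if merged and abs(positive_ratio(merged[-1][1])
                          - positive_ratio(examples)) < threshold:
            # Fold this branch into the previous group of attribute values.
            prev_values, prev_examples = merged[-1]
            merged[-1] = (prev_values + (value,), prev_examples + examples)
        else:
            merged.append(((value,), list(examples)))
    return dict(merged)

if __name__ == "__main__":
    # Toy node: an attribute with four values; branches "a" and "d" have
    # identical positive ratios and get merged, shrinking the node to three
    # branches, which is the size reduction branch merging targets.
    partition = {
        "a": [{"label": 1}, {"label": 1}, {"label": 0}],  # ratio 0.67
        "b": [{"label": 1}, {"label": 1}, {"label": 1}],  # ratio 1.00
        "c": [{"label": 0}, {"label": 0}],                # ratio 0.00
        "d": [{"label": 1}, {"label": 1}, {"label": 0}],  # ratio 0.67
    }
    for values, examples in merge_similar_branches(partition).items():
        print(values, round(positive_ratio(examples), 2))
```

On this toy node the merged branch covers the union of the two subsets, so subsequent growth proceeds from a smaller, simpler node; how the actual SSID and MCID choose which branches to merge is specified in the paper itself.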