Machine learning research has produced a large number of classification algorithms, so choosing a suitable algorithm for a given dataset has become an important research problem. Reference [8] recently proposed a new dataset discretization method for characterizing datasets and achieved good results in algorithm recommendation. Building on that work, this paper uses interaction information theory to describe the cooperative relationships between attributes and between attributes and class labels, and proposes two feature structures: one based on two-variable and one based on three-variable interaction information. Performance experiments with 12 classification algorithms on 98 datasets from the UCI repository show that, compared with the method of [8], both methods clearly improve the precision and hit rate of the recommended algorithms. Moreover, for datasets with poor adaptability, the three-variable interaction information method is more effective.
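
As a rough illustration of the two quantities named above, the following Python sketch estimates two-variable and three-variable interaction information from discrete columns using empirical entropies. It is not the paper's feature construction, which is not given here. The function names are hypothetical, and the sign convention I(X;Y;C) = I(X;Y|C) - I(X;Y) is an assumption; some authors use the opposite sign.

```python
from collections import Counter
from math import log2


def entropy(*columns):
    """Joint Shannon entropy in bits of one or more equal-length discrete columns."""
    rows = list(zip(*columns))
    n = len(rows)
    counts = Counter(rows)
    return -sum((c / n) * log2(c / n) for c in counts.values())


def mutual_information(x, y):
    """Two-variable interaction information: I(X;Y) = H(X) + H(Y) - H(X,Y)."""
    return entropy(x) + entropy(y) - entropy(x, y)


def interaction_information(x, y, c):
    """Three-variable interaction information, assumed convention:
    I(X;Y;C) = I(X;Y|C) - I(X;Y), expanded into joint entropies.
    Positive values indicate synergy, negative values redundancy."""
    return (entropy(x, y) + entropy(x, c) + entropy(y, c)
            - entropy(x) - entropy(y) - entropy(c)
            - entropy(x, y, c))


# Toy data (illustrative only): class label c is the XOR of attributes x and y.
x = [0, 0, 1, 1]
y = [0, 1, 0, 1]
c = [a ^ b for a, b in zip(x, y)]

print(mutual_information(x, c))          # attribute x alone vs. class label
print(interaction_information(x, y, c))  # x and y jointly vs. class label
```

In the XOR toy data neither attribute alone carries information about the class label, while the pair determines it completely, which is the kind of attribute cooperation a three-variable measure can expose and a two-variable measure cannot.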