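In data mining and machine learning research, many algorithms operate on discrete values, so continuous attributes often need to be discretized. Because the normal distribution is so widespread, this paper proposes an approximate equal-frequency discretization method based on the normal distribution. The method is simple to implement, has time complexity linear in the size of the dataset, and is suitable for large-scale datasets. Comparative tests against several discretization methods from the literature were carried out on many datasets, and the experimental results show that the proposed unsupervised discretization method is effective and feasible.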
Many data mining and machine learning algorithms require that training examples contain only discrete attributes. To use these algorithms when some attributes are numeric, the numeric attributes must first be converted into discrete ones. Because the normal distribution is so widely applicable, an approximate equal-frequency discretization method based on the normal distribution is presented. The method is simple to implement. Its time complexity is nearly linear in the size of the dataset, so it can be applied to large datasets. Experimental results on real datasets show that the discretization method is effective and practical.
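The abstract does not spell out the procedure, so the following is only a minimal sketch of one plausible reading of "approximate equal-frequency discretization based on the normal distribution": fit a normal distribution to the attribute using its sample mean and standard deviation, then place the cut points at the i/k quantiles of that fitted distribution. The function names `normal_cut_points` and `discretize`, the parameter `k` (number of intervals), and the handling of constant attributes are assumptions made for illustration and are not taken from the paper.

```python
from bisect import bisect_right
from statistics import NormalDist, fmean, pstdev


def normal_cut_points(values, k):
    """Return k-1 cut points at the i/k quantiles of a normal fit to values.

    Assumption: the attribute is roughly normal, so these quantiles split
    the data into k intervals of approximately equal frequency.
    """
    mu = fmean(values)
    sigma = pstdev(values, mu)
    if sigma == 0:
        return []  # constant attribute: keep a single interval
    fitted = NormalDist(mu, sigma)
    return [fitted.inv_cdf(i / k) for i in range(1, k)]


def discretize(values, cuts):
    """Map each value to the index of the interval it falls into."""
    return [bisect_right(cuts, v) for v in values]


if __name__ == "__main__":
    import random

    rng = random.Random(0)
    sample = [rng.gauss(50.0, 10.0) for _ in range(10_000)]
    cuts = normal_cut_points(sample, 4)
    labels = discretize(sample, cuts)
    # Interval counts should be roughly equal when the data are near-normal.
    print(cuts, [labels.count(b) for b in range(4)])
```

Under this reading, the mean and standard deviation take one pass each over the data and every value is assigned by a binary search over k-1 cut points, so no sort of the full attribute is needed. That is consistent with the near-linear cost in the dataset size claimed above, although the paper's actual algorithm may differ in its details.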