Discretizing the continuous attribute values of an information system is important for learning decision rules and decision trees. It improves the system's ability to cluster instances, makes it more robust to noisy data, reduces the time and space costs of machine learning algorithms, and improves their learning accuracy. Rough set theory is an effective tool for data discretization. This paper studies discretization methods based on rough set theory in depth: it analyzes their characteristics, reviews the progress of research on them, and evaluates the performance of several typical heuristic discretization algorithms through simulation experiments. The results can serve as a reference both for developing new discretization techniques and for selecting a suitable algorithm for a specific application.