Existing missing-value filling methods perform poorly on non-linear, noisy datasets. To address this, the paper proposes a random-forest-based missing-value filling algorithm that handles nominal and non-nominal attributes as two separate cases. The algorithm first treats the attribute containing the missing value as the decision attribute and the remaining attribute values as feature attributes, and then uses the random forest algorithm to predict the missing value. Because random forests fit non-linear data well and are robust to noise, the proposed algorithm can effectively improve the accuracy of missing-value filling. Comparative experiments on UCI standard datasets and the ORL face recognition dataset show that the proposed algorithm is more effective than previous filling methods.
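The idea described above can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes scikit-learn's `RandomForestClassifier` and `RandomForestRegressor` as the random forest, assumes missing entries are encoded as NaN, assumes nominal values are stored as integer codes, and assumes every column other than the one being filled is complete. The function name `fill_missing_column` and its parameters are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor


def fill_missing_column(X, col, nominal, n_trees=100, seed=0):
    """Fill NaN entries of column `col` with random forest predictions.

    The column with missing values plays the role of the decision attribute;
    all other columns are used as feature attributes.
    Assumption: the other columns are numeric and contain no NaN.
    """
    X = np.asarray(X, dtype=float).copy()
    missing = np.isnan(X[:, col])
    if not missing.any():
        return X  # nothing to fill

    features = np.delete(X, col, axis=1)  # every attribute except the target
    target = X[~missing, col]             # observed values of the target

    if nominal:
        # Nominal attribute: classification over integer category codes.
        model = RandomForestClassifier(n_estimators=n_trees, random_state=seed)
        target = target.astype(int)
    else:
        # Non-nominal (numeric) attribute: regression.
        model = RandomForestRegressor(n_estimators=n_trees, random_state=seed)

    # Train on rows where the target is observed, predict the rest.
    model.fit(features[~missing], target)
    X[missing, col] = model.predict(features[missing])
    return X


if __name__ == "__main__":
    # Synthetic non-linear, noisy data (illustrative only, not from the paper).
    rng = np.random.default_rng(42)
    A = rng.uniform(-1.0, 1.0, size=(200, 2))
    y = np.sin(3.0 * A[:, 0]) + A[:, 1] ** 2 + rng.normal(0.0, 0.05, size=200)
    data = np.column_stack([A, y])
    data[:20, 2] = np.nan  # remove 20 values from the numeric column
    filled = fill_missing_column(data, col=2, nominal=False)
    print("remaining NaNs:", int(np.isnan(filled).sum()))
```

Splitting on `nominal` mirrors the two cases named in the abstract: a classifier votes on a category for nominal attributes, while a regressor averages tree outputs for numeric ones. How the paper handles rows with several missing attributes is not stated here, so this sketch leaves that case out.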