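Although the extreme learning machine (ELM) has been shown, in both theory and applications, to have good generalization performance and very fast training speed, on imbalanced data it favors the majority class and tends to ignore the minority class. Ensemble learning based on data resampling can help ELM overcome its low classification accuracy on the minority class. We propose a class-wise resampling technique and, building on it, develop an ELM ensemble learning method. The method makes full use of the information in the minority-class samples, and experimental results show that it clearly outperforms a single ELM learning model. Because resampling is one of the core techniques in big-data processing, the method offers general guidance for building learning models on large imbalanced data.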
The extreme learning machine (ELM) has been shown, in both theory and applications, to have good generalization performance and fast training speed. However, ELM cannot handle imbalanced data effectively: it favors the majority class and ignores the minority class, which leads to low classification accuracy on the minority class. Imbalanced data is common in practice, for example in identifying fraudulent credit card transactions, predicting preterm births, and learning word pronunciations. The main strategies for imbalanced data classification are resampling, ensemble learning, and cost-sensitive learning, and the basic resampling methods are under-sampling and over-sampling. The main idea of the proposed method is to combine multiple random under-samplings and, on that basis, to develop an ELM-based ensemble learning algorithm. This effectively alleviates the low classification accuracy on the minority class. To evaluate classification performance on imbalanced data more reasonably, we use the F-measure and G-mean as evaluation criteria in our experiments; the higher these values, the better the classification performance on the minority class. The experimental results show that the proposed method achieves higher F-measure and G-mean values than a single ELM learning model, which implies that ELM ensemble learning based on multiple under-samplings can improve classification performance on the minority class. In addition, the classifiers are independent of one another before voting, so the method can be implemented in parallel: a large data set is first decomposed into many small data sets, and each small data set is then learned by a separate ELM, which improves computing speed. Because resampling is one of the core techniques for processing large data, the method offers general guidance for building learning models that handle large imbalanced data.
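The idea described above (several random under-samplings, one ELM per balanced subset, a majority vote, and evaluation by F-measure and G-mean) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `ELM`, `undersample_ensemble`, `vote`, and `f_measure_and_g_mean` are hypothetical, the minority class is assumed to be labeled 1 and the majority class 0, and the sigmoid activation, hidden-layer size, and number of models are arbitrary choices. The abstract does not specify the exact class-wise resampling scheme, so plain random under-sampling of the majority class is used here.

```python
import numpy as np


class ELM:
    """Basic single-hidden-layer ELM: random hidden weights, least-squares output weights."""

    def __init__(self, n_hidden=20, seed=0):
        self.n_hidden = n_hidden
        self.rng = np.random.default_rng(seed)

    def _hidden(self, X):
        # Sigmoid activation of the random hidden layer (an assumed choice).
        return 1.0 / (1.0 + np.exp(-(X @ self.W + self.b)))

    def fit(self, X, y):
        # Hidden weights are drawn once and never trained.
        self.W = self.rng.normal(size=(X.shape[1], self.n_hidden))
        self.b = self.rng.normal(size=self.n_hidden)
        H = self._hidden(X)
        # Output weights via the Moore-Penrose pseudoinverse; labels {0, 1} mapped to {-1, +1}.
        self.beta = np.linalg.pinv(H) @ (2.0 * y - 1.0)
        return self

    def predict(self, X):
        return (self._hidden(X) @ self.beta > 0).astype(int)


def undersample_ensemble(X, y, n_models=11, n_hidden=20, seed=0):
    """Train one ELM per balanced subset: all minority samples plus an
    equally sized random draw (without replacement) from the majority class."""
    rng = np.random.default_rng(seed)
    minority = np.flatnonzero(y == 1)  # assumption: minority class is labeled 1
    majority = np.flatnonzero(y == 0)
    models = []
    for k in range(n_models):
        drawn = rng.choice(majority, size=len(minority), replace=False)
        idx = np.concatenate([minority, drawn])
        models.append(ELM(n_hidden, seed=seed + k + 1).fit(X[idx], y[idx]))
    return models


def vote(models, X):
    """Majority vote over the independently trained classifiers."""
    share = np.mean([m.predict(X) for m in models], axis=0)
    return (share > 0.5).astype(int)


def f_measure_and_g_mean(y_true, y_pred):
    """F-measure and G-mean, treating label 1 (minority) as the positive class."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    g = float(np.sqrt(recall * specificity))
    return float(f), g
```

Because each member is trained on its own subset and the members interact only at the voting step, the loop in `undersample_ensemble` could be distributed across workers without changing the result. An odd `n_models` avoids ties in the vote.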