To address the difficulty that traditional Support Vector Machine (SVM) algorithms have with large-scale training data, a data-driven Parallel Incremental Adaboost-SVM (PIASVM) learning algorithm based on Hadoop is proposed. Following an ensemble learning strategy, each local classifier processes one partition of the data, and the local classification results are fused into a combined classifier. During incremental learning, weights characterize the spatial distribution properties of the samples, which are iteratively reweighted, and a forgetting factor is used to select newly added samples and discard historical ones. A controller component based on HBase schedules the iterative procedure, persists intermediate results, and reduces the bandwidth pressure that iteration places on the original MapReduce framework. Experimental results on multiple data sets show that the proposed algorithm achieves good speedup, sizeup, and scaleup, and improves the capacity of SVM to process large-scale data while maintaining classification accuracy.
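The ensemble and forgetting steps described above can be illustrated with a minimal single-machine sketch. It is not the PIASVM implementation: the abstract does not give the update rules, so the round-robin partitioning, the standard AdaBoost classifier weight, and the multiplicative decay with a pruning threshold are all assumptions, and every function name here is hypothetical. Local classifiers are modeled as plain callables returning +1 or -1 in place of trained SVMs.

```python
import math
from typing import Callable, Dict, Hashable, List, Sequence

# A local classifier maps a feature vector to a label in {+1, -1}.
Classifier = Callable[[Sequence[float]], int]


def partition(samples: Sequence, n_parts: int) -> List[list]:
    """Round-robin split (assumption), standing in for Hadoop input splits."""
    return [list(samples[i::n_parts]) for i in range(n_parts)]


def adaboost_alpha(error: float, eps: float = 1e-10) -> float:
    """Standard AdaBoost classifier weight: 0.5 * ln((1 - err) / err)."""
    error = min(max(error, eps), 1.0 - eps)  # clip to avoid division by zero
    return 0.5 * math.log((1.0 - error) / error)


def weighted_vote(classifiers: Sequence[Classifier],
                  alphas: Sequence[float],
                  x: Sequence[float]) -> int:
    """Fuse local predictions by the sign of the alpha-weighted sum."""
    score = sum(a * clf(x) for clf, a in zip(classifiers, alphas))
    return 1 if score >= 0 else -1


def decay_and_prune(weights: Dict[Hashable, float],
                    forgetting_factor: float,
                    threshold: float) -> Dict[Hashable, float]:
    """Assumed forgetting rule: scale each historical sample weight by the
    forgetting factor, then drop samples whose weight falls below threshold."""
    decayed = {k: w * forgetting_factor for k, w in weights.items()}
    return {k: w for k, w in decayed.items() if w >= threshold}


if __name__ == "__main__":
    # Two toy local classifiers that disagree on every input.
    always_pos: Classifier = lambda x: 1
    always_neg: Classifier = lambda x: -1
    alphas = [adaboost_alpha(0.1), adaboost_alpha(0.4)]
    print(weighted_vote([always_pos, always_neg], alphas, [0.0]))
    print(decay_and_prune({"old": 0.1, "recent": 1.0}, 0.5, 0.1))
```

In this sketch the more accurate local classifier dominates the vote because its weight grows as its error shrinks, and historical samples fade out after enough incremental rounds unless they are reweighted upward.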