Big data analysis and processing faces many problems, such as diverse data types and low processing efficiency. Extracting useful information and knowledge from the data to guide subsequent decisions is the ultimate goal of machine learning. Because effective training samples accumulate gradually, how to learn a classifier efficiently and incrementally is a valuable problem. Big data analysis requires that distributed mining of large data streams be performed in real time. Designing such a distributed mining system poses distinctive challenges: adapting online to the characteristics of the incoming data; processing large volumes of heterogeneous data online; and operating under limited data access and communication capability among the distributed learners. This paper proposes a basic data mining framework and, based on it, studies an efficient online learning algorithm. The framework consists of one ensemble learner and multiple local learners, each of which can access only a different part of the input data. By exploiting the correlation model learned at the local learners, the proposed algorithm optimizes prediction accuracy while requiring less information exchange and lower computational complexity than existing state-of-the-art learning solutions.
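To make the architecture above concrete, the following is a minimal sketch of one ensemble learner that combines several local learners, each restricted to its own part of the input. It is not the algorithm proposed in this paper, whose details the abstract does not give. As stand-ins, each local learner is assumed to be an online perceptron, and the ensemble learner is assumed to use a Weighted Majority style multiplicative update. The names `LocalLearner`, `EnsembleLearner`, `view`, `beta`, and `run_demo`, as well as the synthetic data stream, are hypothetical and chosen only for illustration.

```python
import random


class LocalLearner:
    """Online perceptron (an assumed stand-in) that uses only the
    feature indices listed in `view`, i.e. its part of the input."""

    def __init__(self, view):
        self.view = list(view)
        self.w = [0.0] * len(self.view)

    def predict(self, x):
        # Only the features in this learner's view are ever read.
        score = sum(w * x[i] for w, i in zip(self.w, self.view))
        return 1 if score >= 0 else -1

    def update(self, x, y):
        # Classic perceptron rule: adjust weights only after a mistake.
        if self.predict(x) != y:
            for k, i in enumerate(self.view):
                self.w[k] += y * x[i]


class EnsembleLearner:
    """Combines local predictions with a Weighted Majority style update
    (assumed here; not taken from the paper)."""

    def __init__(self, local_learners, beta=0.5):
        self.locals = list(local_learners)
        self.beta = beta  # penalty factor applied to a learner that errs
        self.weights = [1.0] * len(self.locals)

    def step(self, x, y):
        # Each local learner contributes a single +1/-1 vote per sample,
        # so the ensemble never needs the raw features.
        votes = [learner.predict(x) for learner in self.locals]
        total = sum(w * v for w, v in zip(self.weights, votes))
        y_hat = 1 if total >= 0 else -1
        # Once the true label arrives, shrink the weight of every local
        # learner that voted wrongly, then let each learner update itself.
        for k, (learner, vote) in enumerate(zip(self.locals, votes)):
            if vote != y:
                self.weights[k] *= self.beta
            learner.update(x, y)
        return y_hat


def run_demo(n=200, seed=0):
    """Synthetic stream: the label depends on feature 0 only, so the
    learner that sees feature 1 receives pure noise."""
    rng = random.Random(seed)
    ensemble = EnsembleLearner([LocalLearner([0]), LocalLearner([1])])
    correct = []
    for _ in range(n):
        x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
        y = 1 if x[0] >= 0 else -1
        correct.append(ensemble.step(x, y) == y)
    return ensemble, correct
```

Calling `run_demo()` returns the trained ensemble together with a per-sample list of correct/incorrect flags. In this toy setting the ensemble weight of the learner that sees the informative feature should end up well above that of the learner that sees only noise, which illustrates how an ensemble can discount uninformative local learners while exchanging only one vote per learner per sample.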