Analysis of Large-Scale Distributed Machine Learning Systems: A Case Study of LDA

ISSN: 1001-9081
Journal: Journal of Computer Applications (《计算机应用》)
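Classification: TP181 [Automation and Computer Technology: Control Science and Engineering; Automation and Computer Technology: Control Theory and Control Engineering]
Author affiliations: [1] National Key Laboratory of Parallel and Distributed Processing, National University of Defense Technology, Changsha 410073; [2] College of Computer, National University of Defense Technology, Changsha 410073
Funding: National Natural Science Foundation of China (61222205).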

Authors: Tang Lizhe (唐黎哲), Feng Dawei (冯大为)[1,2], Li Dongsheng (李东升)[1,2], Li Rongchun (李荣春)[1,2], Liu Feng (刘锋)[1,2]

Keywords: Latent Dirichlet Allocation (LDA), topic model, text clustering, Gibbs sampling, variational Bayes inference, machine learning

Abstract:

Building large-scale machine learning systems raises problems of scalability, algorithm convergence, and runtime efficiency. This paper analyzes the challenges that large-scale samples, large models, and network communication pose to machine learning systems, and surveys how existing systems address them. Taking the Latent Dirichlet Allocation (LDA) model as an example, it compares three open-source distributed LDA systems (Spark LDA, PLDA+, and LightLDA) in terms of system resource consumption, algorithm convergence, and scalability, and analyzes their differences in design, implementation, and performance. The experimental results show that, on small sample sets and models, LightLDA and PLDA+ use about half the memory of Spark LDA and converge 4 to 5 times as fast. On larger sample sets and models, the total network communication volume and convergence time of LightLDA are far smaller than those of PLDA+ and Spark LDA, indicating good scalability. A "data parallelism + model parallelism" architecture can effectively handle large-scale samples and models; the Stale Synchronous Parallel (SSP) parameter synchronization strategy, local model caching, and sparse parameter storage effectively reduce network overhead and improve runtime efficiency.

Project associated with this journal paper

Distributed Computing

Journal papers: 10; Conference papers: 14; Books: 1

Journal papers from the same project

Large-scale Virtual Machines Provisioning in Clouds: Challenges and Approaches

Bulk Construction of Geo-Textual Indices

CSR: Classified Source Routing in DHT-Based Networks

Efficient Multi-tenant Virtual machine Allocation in Cloud Data Centers

Efficient parallel implementation of a density peaks clustering algorithm on graphics processing unit

Journal information

Journal of Computer Applications (《计算机应用》)
Peking University Core Journal (2011 edition)

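Supervising body: Sichuan Association for Science and Technology
Sponsors: Sichuan Computer Federation; Chengdu Branch, Chinese Academy of Sciences
Editor-in-chief: Zhang Jingzhong (张景中)
Address: Institute of Computing, Chengdu Branch of the Chinese Academy of Sciences, No. 9, Section 4, Renmin South Road, Chengdu
Postal code: 610041
Email: xzh@joca.cn
Tel: 028-85224283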

International standard serial number: ISSN 1001-9081
Domestic serial number: CN 51-1307/TP
Postal distribution code: 62-110

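Awards:
First Prize, National Excellent Science and Technology Journal; National Periodical Award nominee; Chinese Periodical Array "Double Award" journal; Chinese Core Journal; Chinese Science and Technology Core Journal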

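Indexed in (domestic and international databases):
Abstracts Journal (Russia), Index Copernicus (Poland), Cambridge Scientific Abstracts (USA), Science Abstracts database (UK), Japan Science and Technology Agency database (Japan), China Science and Technology Core Journals (China), Peking University Core Journals (China; 2000, 2004, 2008, 2011, and 2014 editions)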

Citation count: 53679