Expectation-Maximization Policy Search with Parameter-Based Exploration
  • ISSN: 0254-4156; CN: 11-2109/TP
  • Journal: Acta Automatica Sinica (自动化学报)
  • Pages: 38-45
  • Language: Chinese
  • Classification: TP391 [Automation and Computer Technology / Computer Application Technology; Automation and Computer Technology / Computer Science and Technology]
  • Author affiliation: [1] School of Information and Electrical Engineering, China University of Mining and Technology, Xuzhou 221116
  • Funding: Supported by the National Natural Science Foundation of China (60804022, 60974050, 61072094), the Program for New Century Excellent Talents in University of the Ministry of Education (NCET-08-0836, NCET-10-0765), and the Fok Ying Tung Education Foundation Young Teachers Fund (121066)
  • Related project: Reinforcement learning control for complex continuous systems based on support vector machines
Chinese abstract (translated):

To address the problem that stochastic action exploration tends to cause large variance in gradient estimation, an expectation-maximization (EM) policy search method with parameter-based exploration is proposed. First, the policy is defined as a probability distribution over the controller parameters. Then, samples are collected by sampling repeatedly and directly in the controller parameter space according to this distribution. Because the actions selected during the collection of each episode are deterministic, the variance introduced by sampling is reduced, which in turn reduces the variance of the gradient estimate. Finally, based on the collected samples, the policy parameters are updated iteratively by maximizing a lower bound on the expected return function. To cut sampling time and cost, an importance sampling technique is used to reuse the samples collected during earlier policy updates. Simulation results on two continuous-space control problems show that, compared with policy search reinforcement learning methods based on stochastic action exploration, the proposed method not only learns a better policy but also converges faster, exhibiting good learning performance.
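As a rough illustration of the procedure the abstract describes, the following is a minimal PGPE/EM-style sketch: a Gaussian distribution over controller parameters is sampled, each sampled controller runs a deterministic episode, and the distribution is refit by a reward-weighted update that maximizes a lower bound on the expected return. The toy return function, sample sizes, and variance floor are all illustrative assumptions, not the paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

def episode_return(theta):
    # Hypothetical stand-in for one deterministic episode: the controller
    # parameterized by `theta` earns a higher return near the optimum 2.0.
    return float(np.exp(-0.5 * np.sum((theta - 2.0) ** 2)))

# The policy is a Gaussian distribution over controller parameters,
# not over actions; each rollout itself selects deterministic actions.
mu, sigma = np.zeros(2), np.ones(2)

for _ in range(100):
    # Sample controller parameters directly in parameter space.
    thetas = rng.normal(mu, sigma, size=(20, 2))
    returns = np.array([episode_return(t) for t in thetas])

    # EM-style update: maximizing the lower bound of the expected return
    # reduces to a reward-weighted refit of the sampled parameters.
    w = returns / returns.sum()
    mu = w @ thetas
    sigma = np.maximum(np.sqrt(w @ (thetas - mu) ** 2), 0.05)

# mu has moved toward the optimum [2.0, 2.0]
```

Because every rollout is deterministic once its parameters are drawn, the only stochasticity is the single draw per episode, which is the intuition behind the reduced gradient-estimation variance.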

English abstract:

To reduce the large variance of gradient estimation resulting from stochastic exploration strategies, an expectation-maximization (EM) policy search reinforcement learning method with parameter-based exploration is proposed. First, a probability distribution over the parameters of a controller is used to define the policy. Second, samples are collected by sampling repeatedly and directly in the controller parameter space according to this distribution. During the sample collection of each episode, the selected actions are deterministic, so sampling from the defined policy yields low-variance samples, which reduces the variance of gradient estimation. Finally, based on the collected samples, the policy parameters are updated iteratively by maximizing the lower bound of the expected return function. To reduce the time and cost of sampling, an importance sampling technique is used to reuse samples collected during the policy update process. Simulation results on two continuous-space control problems show that, compared with several policy search reinforcement learning methods with action-based stochastic exploration, the proposed method not only obtains a better policy but also converges faster, and thus achieves better learning performance.
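The sample-reuse step mentioned in the abstract can be sketched with self-normalized importance sampling: samples drawn under an earlier parameter distribution are re-weighted by the density ratio of the new distribution to the old one, so previously collected episodes can feed the reward-weighted update without new rollouts. This is a generic sketch under a diagonal-Gaussian assumption, not the authors' exact estimator; all function and variable names are illustrative.

```python
import numpy as np

def log_gauss(thetas, mu, sigma):
    # Log-density of a diagonal Gaussian over controller parameters.
    return -0.5 * np.sum(((thetas - mu) / sigma) ** 2
                         + np.log(2.0 * np.pi * sigma ** 2), axis=-1)

def reuse_weights(thetas, returns, mu_old, sigma_old, mu_new, sigma_new):
    # Samples were drawn under (mu_old, sigma_old); the importance ratio
    # re-weights them as if drawn under (mu_new, sigma_new), so earlier
    # episodes can be reused instead of collecting fresh rollouts.
    log_ratio = (log_gauss(thetas, mu_new, sigma_new)
                 - log_gauss(thetas, mu_old, sigma_old))
    w = returns * np.exp(log_ratio)
    return w / w.sum()  # self-normalized weights for the EM update
```

When the new and old distributions coincide, the ratio is one and the weights reduce to plain reward weighting, recovering the on-distribution update.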
