Policy Iteration Reinforcement Learning Based on Geodesic Gaussian Bases Defined on State-Action Graphs
  • Journal: 自动化学报 (Acta Automatica Sinica)
  • Date: 0
  • Pages: 44-51
  • Language: Chinese
  • Classification: TP181 [Automation and Computer Technology - Control Science and Engineering; Automation and Computer Technology - Control Theory and Control Engineering]
  • Author affiliation: [1] School of Information and Electrical Engineering, China University of Mining and Technology, Xuzhou 221116
  • Funding: National Natural Science Foundation of China (60804022, 60974050, 61072094); Program for New Century Excellent Talents in University of the Ministry of Education (NCET-08-0836); Fok Ying Tung Education Foundation Fund for Young Teachers (121066); Natural Science Foundation of Jiangsu Province (BK2008126)
  • Related project: Research on Reinforcement Learning Control for Complex Continuous Systems Based on Support Vector Machines
Chinese Abstract (translated):

In policy iteration reinforcement learning, the construction of basis functions is an important factor affecting the approximation accuracy of the action-value function. To provide suitable basis functions for action-value function approximation, a policy iteration reinforcement learning method based on geodesic Gaussian bases defined on a state-action graph is proposed. First, a graph-theoretic state-action description of the Markov decision process is built according to an off-policy method. Then, geodesic Gaussian kernel functions are defined on the state-action graph, and a kernel sparsification method based on approximate linear dependency is used to automatically select the centers of the geodesic Gaussian kernels. Finally, in the policy evaluation stage, the geodesic Gaussian kernels based on the state-action graph are used to approximate the action-value function, and the policy is improved based on the estimated value function. Simulation results on a 10×10 grid world show that, compared with policy iteration reinforcement learning methods based on ordinary Gaussian bases and geodesic Gaussian bases defined on a state graph, the proposed method can approximate an action-value function with both smooth and discontinuous characteristics to high accuracy using fewer basis functions, and thus effectively obtain the optimal policy.
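The geodesic Gaussian kernel referred to in the abstract has a standard form in the literature (the paper's exact parameterization may differ); replacing Euclidean distance with shortest-path distance on the graph gives:

$$k\big((s,a),(s',a')\big)=\exp\!\left(-\frac{\mathrm{SP}\big((s,a),(s',a')\big)^{2}}{2\sigma^{2}}\right)$$

where $\mathrm{SP}(\cdot,\cdot)$ denotes the shortest-path (geodesic) distance between two nodes of the state-action graph and $\sigma$ is a width parameter. Because the distance follows the graph rather than the ambient space, the kernel can remain sharp across discontinuities of the value function (e.g. walls in a grid world) that ordinary Gaussian bases smooth over.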

English Abstract:

For policy iteration reinforcement learning methods, the construction of basis functions is an important factor influencing the accuracy of action-value function approximation. In order to construct appropriate basis functions for action-value function approximation, a policy iteration reinforcement learning method based on geodesic Gaussian bases defined on a state-action graph is proposed. First, a state-action graph for a Markov decision process is constructed according to an off-policy method. Second, geodesic Gaussian kernel functions are defined on the state-action graph, and a kernel sparsification approach based on approximate linear dependency is used to automatically select the centers of the geodesic Gaussian kernels. Finally, the geodesic Gaussian kernels based on the state-action graph are used to approximate the action-value function during policy evaluation, and the policy is then improved based on the estimated action-value function. Simulation results on a 10 × 10 grid world illustrate that, compared with policy iteration reinforcement learning methods based on either ordinary Gaussian bases or geodesic Gaussian bases defined on a state graph, the proposed method can accurately approximate an action-value function with both smoothness and discontinuity properties using fewer basis functions, which helps to obtain an optimal policy effectively.
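The first two steps of the method (building the state-action graph, defining geodesic Gaussian kernels on it, and selecting centers by approximate linear dependency) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the 4×4 grid size, the kernel width `sigma`, and the ALD threshold `nu` are assumed values, and the deterministic transition model is a simplification of the 10×10 benchmark.

```python
import numpy as np
from collections import deque

# Hedged sketch: geodesic Gaussian kernels on a state-action graph with
# ALD-based center selection. Grid size, sigma, and nu are illustrative
# assumptions, not the settings used in the paper.

N = 4                                           # 4x4 grid world (paper uses 10x10)
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]    # up, down, left, right

def step(state, action):
    """Deterministic transition; moves off the grid leave the state unchanged."""
    r, c = divmod(state, N)
    dr, dc = action
    nr, nc = min(max(r + dr, 0), N - 1), min(max(c + dc, 0), N - 1)
    return nr * N + nc

# State-action graph: one node per (state, action) pair. Following the
# off-policy construction idea, (s, a) is connected to (s', a') for every
# action a' available in the successor state s'. Edges are kept undirected
# so that geodesic distances (and the kernel matrix) are symmetric.
nodes = [(s, a) for s in range(N * N) for a in range(len(ACTIONS))]
index = {n: i for i, n in enumerate(nodes)}
adj = [[] for _ in nodes]
for s, a in nodes:
    u = index[(s, a)]
    s2 = step(s, ACTIONS[a])
    for a2 in range(len(ACTIONS)):
        v = index[(s2, a2)]
        adj[u].append(v)
        adj[v].append(u)

def geodesic_distances(src):
    """BFS shortest-path (geodesic) distances on the unweighted graph."""
    dist = np.full(len(nodes), np.inf)
    dist[src] = 0
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if dist[v] == np.inf:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

sigma = 2.0                                     # kernel width (assumed)
D = np.array([geodesic_distances(i) for i in range(len(nodes))])
K_full = np.exp(-D ** 2 / (2 * sigma ** 2))     # geodesic Gaussian kernel matrix

# ALD sparsification: a node becomes a kernel center only if its kernel
# column is not approximately linearly dependent on the kept centers,
# i.e. the projection residual delta exceeds the threshold nu.
nu = 0.1                                        # ALD threshold (assumed)
centers = []
for i in range(len(nodes)):
    if not centers:
        centers.append(i)
        continue
    K = K_full[np.ix_(centers, centers)]        # Gram matrix of kept centers
    k = K_full[centers, i]                      # kernel values vs. candidate
    coeffs = np.linalg.solve(K + 1e-8 * np.eye(len(centers)), k)
    delta = K_full[i, i] - k @ coeffs           # residual of best approximation
    if delta > nu:
        centers.append(i)

print(f"{len(nodes)} state-action nodes -> {len(centers)} kernel centers")
```

The selected centers would then define the basis functions used by least-squares policy evaluation; the key design point is that distances follow the graph, so bases do not leak across transition barriers the way ordinary Gaussian bases do.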
