Research on Average Reward Reinforcement Learning Algorithms
  • Journal: 计算机学报 (Chinese Journal of Computers), 30(8): 1372-1378, 2007. (EI-indexed)
  • Date: 2007
  • Classification: TP181 [Automation and Computer Technology - Control Science and Engineering; Automation and Computer Technology - Control Theory and Control Engineering]
  • Author affiliations: [1] State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210093; [2] Jiangsu Engineering Technology Research Center for Smart Cards, Zhenjiang, Jiangsu 212300
  • Funding: Supported by the National Natural Science Foundation of China (60475026) and the National Science Fund for Distinguished Young Scholars (60325207)
  • Related project: Research and Application of Reinforcement Learning Techniques for Non-Markovian Decision Processes
Chinese abstract (translated):

Sequential decision problems are commonly modeled as Markov decision processes (MDPs). When decision actions extend from discrete time points to continuous time, the classical MDP model generalizes to the semi-Markov decision process (SMDP) model. When the system parameters are unknown, reinforcement learning is used to learn an optimal policy. Based on the theory of performance potentials, this paper proves an approximation theorem for average reward reinforcement learning. By approximating the performance potential value function relative to a reference state, a new average reward reinforcement learning algorithm, G-learning, is developed. G-learning applies to both MDPs and SMDPs. Unlike the classical R-learning algorithm, G-learning replaces the relative value function defined against the average reward with the performance potential value function relative to a reference state. In simulation experiments on customer access control and production inventory, G-learning outperforms both the R-learning and SMART algorithms.
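For context, the performance potentials the abstract refers to satisfy the standard Poisson equation of performance-potential theory. The following is a generic sketch, assuming the usual finite unichain average-reward setting; the notation is ours, not necessarily the paper's:

    % Poisson equation for performance potentials in a finite unichain MDP:
    % transition matrix P, reward r, average reward rho (generic notation,
    % not necessarily the paper's).
    \[
      g(i) = r(i) - \rho + \sum_{j} P(i,j)\, g(j) \quad \text{for all states } i,
    \]
    % g is unique only up to an additive constant; fixing the potential of a
    % chosen reference state s^* removes the ambiguity:
    \[
      g(s^{*}) = 0.
    \]

Pinning g at a reference state is exactly the normalization that a reference-state value function exploits.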

English abstract:

A large class of sequential decision-making problems is often modeled as Markov decision processes (MDPs). Problems whose systems involve sojourn times can often be modeled as semi-Markov decision processes (SMDPs). When the system parameters are unknown in advance, reinforcement learning is used to obtain optimal policies. In this paper, an approximation theorem for average reward reinforcement learning is proven by means of the theory of performance potentials. A novel average reward reinforcement learning algorithm, G-learning, is designed by approximating the value function of performance potentials. G-learning can be applied not only to MDPs but also to SMDPs. Unlike the classical R-learning algorithm, G-learning uses the potential value of a reference state instead of the average performance of the system. The G-learning algorithm is tested on an access-control queuing task and a production inventory task, and the experimental results show that G-learning has better learning performance than R-learning and SMART.
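To make the reference-state idea concrete, here is a minimal tabular sketch in Python. It uses a generic relative-value (RVI-style) Q-learning form consistent with the abstract's description, where the greedy value of a fixed reference state stands in for R-learning's separately learned average-reward estimate. The toy two-state MDP, step sizes, and epsilon-greedy exploration are illustrative assumptions, not the paper's exact G-learning specification.

    import random
    from collections import defaultdict

    # Hedged sketch of average-reward Q-learning anchored by a reference
    # state, in the spirit of the abstract: the value of a fixed reference
    # state, rather than a running average-reward estimate (as in
    # R-learning), normalizes the learned potentials. The environment and
    # hyperparameters below are illustrative assumptions.

    ACTIONS = [0, 1]

    def toy_step(state, action):
        """A hypothetical two-state MDP used only to exercise the update."""
        if state == 0:
            return (1, 1.0) if action == 1 else (0, 0.1)
        return (0, 0.5) if action == 0 else (1, 0.2)

    def run(steps=50_000, alpha=0.05, epsilon=0.1, ref_state=0):
        Q = defaultdict(float)  # Q[(state, action)]: relative action value
        state = 0
        for _ in range(steps):
            # epsilon-greedy action selection
            if random.random() < epsilon:
                action = random.choice(ACTIONS)
            else:
                action = max(ACTIONS, key=lambda a: Q[(state, a)])
            next_state, reward = toy_step(state, action)
            # Reference-state anchor: the greedy value at ref_state stands
            # in for the average reward, pinning Q's additive constant.
            anchor = max(Q[(ref_state, a)] for a in ACTIONS)
            target = reward - anchor + max(Q[(next_state, a)] for a in ACTIONS)
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state = next_state
        return Q

    if __name__ == "__main__":
        Q = run()
        print({sa: round(v, 3) for sa, v in Q.items()})

Because the anchor is read off the table at every step instead of being estimated by a second learning rule, this family of updates avoids R-learning's extra step-size schedule for the average reward.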
