Sequential decision-making problems are commonly modeled as Markov decision processes (MDPs). When decisions extend from discrete time points to continuous sojourn times, the classical MDP model generalizes to the semi-Markov decision process (SMDP). When the system's parameters are unknown in advance, reinforcement learning is used to learn optimal policies. In this paper, an approximation theorem for average-reward reinforcement learning is proved by means of the theory of performance potentials. Based on this theorem, a novel average-reward reinforcement learning algorithm, G-learning, is designed by approximating the value function of performance potentials relative to a reference state. G-learning applies not only to MDPs but also to SMDPs. Unlike the classical R-learning algorithm, which maintains a running estimate of the system's average reward, G-learning uses the performance potential relative to a reference state in place of the relative value function. G-learning is tested on an access-control queuing task and a production-inventory task, and the experimental results show that it achieves better learning performance than both R-learning and SMART.
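To make the contrast with R-learning concrete, the following is a minimal Python sketch of a tabular update in the spirit described above: the greedy value at a fixed reference state stands in for R-learning's running average-reward estimate. This is one plausible reading of the abstract, not the paper's exact G-learning rule; the environment interface (`env.reset()`, `env.step(a)` returning a `(next_state, reward)` pair), the step size `ALPHA`, the exploration rate `EPSILON`, and the choice of `s_ref` are all illustrative assumptions.

```python
import random
from collections import defaultdict

ALPHA = 0.1      # step size (assumed; the paper may use a decaying schedule)
EPSILON = 0.1    # epsilon-greedy exploration probability (assumed)

def g_learning_sketch(env, s_ref, actions, steps=100_000):
    """Illustrative average-reward Q-learning where the potential of a
    reference state s_ref replaces R-learning's average-reward estimate."""
    Q = defaultdict(float)  # Q[(state, action)] -> relative action value
    s = env.reset()
    for _ in range(steps):
        # epsilon-greedy action selection
        if random.random() < EPSILON:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda b: Q[(s, b)])
        s_next, r = env.step(a)  # assumed interface: (next state, reward)
        # greedy value of the reference state stands in for the average reward
        g_ref = max(Q[(s_ref, b)] for b in actions)
        target = r - g_ref + max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += ALPHA * (target - Q[(s, a)])
        s = s_next
    return Q
```

The design point this sketch illustrates is that no separate average-reward variable needs to be learned: pinning the value scale to a reference state keeps the Q-values bounded, which is the role the abstract attributes to the performance-potential formulation.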