The policy-gradient method is an important class of reinforcement learning algorithms, with significant application value for autonomous robot navigation. Building on partially observable Markov decision processes (POMDPs), two finite-memory policy-gradient algorithms were implemented: the model-based GAMP algorithm and the model-free IState-GPOMDP algorithm. Both were applied to a simulation of a robot navigating a maze. Based on an analysis of the simulation results, an observation-based optimization was introduced into both algorithms. It was also found that, under the given reward function, the step-size parameter of the policy-gradient method affects the efficiency of policy optimization to some extent.
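To make the role of the step-size parameter concrete, the following is a minimal sketch of a GPOMDP-style gradient update (the estimator family that IState-GPOMDP extends with internal state), assuming a tabular softmax policy over observations; the environment interface, the hyperparameter values, and all names here are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def gpomdp_update(env, theta, alpha=0.01, beta=0.9, T=1000,
                  rng=np.random.default_rng()):
    """One policy-gradient step; theta has shape (n_obs, n_actions).

    `env` is a hypothetical maze environment with reset() -> obs and
    step(a) -> (obs, reward). `alpha` is the step-size parameter whose
    effect on optimization efficiency the abstract discusses.
    """
    z = np.zeros_like(theta)      # eligibility trace
    delta = np.zeros_like(theta)  # running gradient estimate
    obs = env.reset()
    for t in range(1, T + 1):
        probs = softmax(theta[obs])
        a = rng.choice(len(probs), p=probs)
        # gradient of log pi(a | obs; theta) for a tabular softmax policy
        grad_log = np.zeros_like(theta)
        grad_log[obs] = -probs
        grad_log[obs, a] += 1.0
        z = beta * z + grad_log              # discounted eligibility trace
        obs, r = env.step(a)
        delta += (r * z - delta) / t         # running average of r_t * z_t
    return theta + alpha * delta             # step size scales the update
```

In this sketch the policy is conditioned directly on the current observation; IState-GPOMDP additionally parameterizes transitions over a finite internal memory, which is what allows it to cope with partial observability in the maze.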