This paper studies the performance evaluation of several typical reinforcement learning (RL) algorithms, including Q-learning, least-squares policy iteration (LSPI), and kernel-based least-squares policy iteration (KLSPI), focusing on how the smoothness of the value function in Markov decision processes (MDPs) affects algorithm performance. The algorithms were tested and compared on two benchmark problems: the traveling salesman problem (TSP), a combinatorial optimization problem whose value function is non-smooth, and the Mountain-Car motion control problem, whose value function is smooth. The characteristics of the three algorithms on these different problem types are analyzed, and the experimental comparison shows that approximate policy iteration algorithms, KLSPI in particular, achieve better performance on sequential decision-making problems with smooth value functions. The results further indicate that the degree of smoothness of the MDP value function is an important factor in the performance of approximate policy iteration algorithms.