The Q(λ) learning algorithm is a model-free, multi-step, off-policy reinforcement learning algorithm that combines the ideas of value iteration and stochastic approximation. To address the low execution efficiency and slow convergence of the classical Q(λ) algorithm, the concept of the n-order TD Error is introduced from the perspective of the TD Error and applied to the classical Q(λ) algorithm, yielding a fast Q(λ) learning algorithm based on the second-order TD Error, SOE-FQ(λ). The algorithm corrects the Q-value function with the second-order TD Error and propagates the TD Error over the whole state-action space through eligibility traces, which accelerates convergence. On this basis, the convergence and the convergence rate of the algorithm are analyzed: when only one-step updates are considered, the number of iterations T required by the algorithm depends mainly exponentially on (1-γ)^{-1} and ε^{-1}. The SOE-FQ(λ) algorithm is applied to the Random Walk and Mountain Car problems, and the experimental results show that it achieves fast convergence and good convergence precision.
The Q(λ) algorithm is a classic model-free, multi-step, off-policy reinforcement learning algorithm that combines value iteration and stochastic approximation. To address the low efficiency and slow convergence of the traditional Q(λ) algorithm, the n-order TD Error is defined from the perspective of the TD Error and applied to the traditional Q(λ) algorithm, and a fast Q(λ) algorithm based on the second-order TD Error, SOE-FQ(λ), is presented. The algorithm adjusts the Q value with the second-order TD Error and broadcasts the TD Error to the whole state-action space through eligibility traces, which speeds up the convergence of the algorithm. In addition, the convergence and convergence rate are analyzed; under one-step updates, the number of iterations depends mainly exponentially on (1-γ)^{-1} and ε^{-1}. Finally, the SOE-FQ(λ) algorithm is applied to the Random Walk and Mountain Car problems, and the experimental results show that the algorithm achieves faster convergence and better convergence precision.
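To make the update scheme described above concrete, the following is a minimal Python sketch of a tabular Q(λ) learner in which the one-step TD error is replaced by a second-order combination of successive TD errors and then spread over the visited state-action pairs through eligibility traces. The environment interface (`reset`/`step`), the hyper-parameters, and in particular the exact formula for the second-order TD error are illustrative assumptions; the paper's own SOE-FQ(λ) definition may differ.

```python
import numpy as np

def epsilon_greedy(Q, s, epsilon):
    """Behavior policy: random action with probability epsilon, else greedy."""
    if np.random.rand() < epsilon:
        return np.random.randint(Q.shape[1])
    return int(np.argmax(Q[s]))

def soe_fq_lambda_sketch(env, n_states, n_actions, episodes=200,
                         alpha=0.1, gamma=0.95, lam=0.8, epsilon=0.1):
    """Sketch of a Q(lambda) update driven by a second-order TD error.

    Assumptions: `env.reset()` returns a state index and `env.step(a)`
    returns (next_state, reward, done). The combination of the current and
    previous one-step TD errors below is a placeholder for the paper's
    second-order TD error, not the authors' formula.
    """
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        E = np.zeros_like(Q)          # eligibility traces
        s = env.reset()
        a = epsilon_greedy(Q, s, epsilon)
        prev_delta = 0.0              # previous one-step TD error
        done = False
        while not done:
            s2, r, done = env.step(a)
            a2 = epsilon_greedy(Q, s2, epsilon)
            a_star = int(np.argmax(Q[s2]))        # greedy (off-policy) target action
            # first-order (one-step) TD error
            delta = r + gamma * Q[s2, a_star] * (not done) - Q[s, a]
            # assumed second-order TD error: current error corrected by the
            # previous one, discounted and weighted by the trace decay
            delta2 = delta + gamma * lam * prev_delta
            E[s, a] += 1.0            # accumulating trace
            # broadcast the second-order TD error over all traced pairs
            Q += alpha * delta2 * E
            if a2 == a_star:          # Watkins-style cut when behavior leaves the greedy path
                E *= gamma * lam
            else:
                E[:] = 0.0
            prev_delta = delta
            s, a = s2, a2
    return Q
```

Setting the weight of `prev_delta` to zero recovers an ordinary Watkins Q(λ) update, which makes the role of the second-order correction explicit: it is an extra term layered on the standard trace-based propagation of the TD error.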