Based on performance potential theory, unified parallel implementations of Q-learning were studied for Markov decision processes (MDPs) under both discounted and average criteria. An independent parallel Q-learning algorithm and a state-partition parallel Q-learning algorithm were proposed, with emphasis on the design of two key parameters: the synchronization strategy, i.e., how to choose synchronization points, and the Q-value building strategy, i.e., how to construct new Q-factors from the Q-factors obtained by the parallel learners. A synchronization strategy combining a fixed step size with an offset was provided; in addition, the principles for determining a building strategy were analyzed, and several methods for choosing a building strategy were given. Simulation results illustrate the effectiveness of the proposed parallel algorithms.
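To make the two design choices concrete, the following is a minimal sketch of independent parallel Q-learning on a randomly generated toy MDP, assuming the discounted criterion only. The environment, the element-wise averaging used as the Q-value building strategy, and all names and parameter values (`base_step`, `max_offset`, the learning rate, etc.) are illustrative assumptions, not the paper's actual algorithm or settings.

```python
# Minimal sketch of independent parallel Q-learning on a toy MDP (assumptions only).
import numpy as np

N_STATES, N_ACTIONS = 5, 2
rng = np.random.default_rng(0)
# Random toy MDP: transition probabilities P[s, a] over next states and rewards R[s, a].
P = rng.dirichlet(np.ones(N_STATES), size=(N_STATES, N_ACTIONS))
R = rng.uniform(0.0, 1.0, size=(N_STATES, N_ACTIONS))

def q_learning_steps(Q, s, n_steps, alpha=0.1, gamma=0.95, eps=0.1):
    """Run n_steps of standard discounted Q-learning, updating Q in place."""
    for _ in range(n_steps):
        a = rng.integers(N_ACTIONS) if rng.random() < eps else int(Q[s].argmax())
        s_next = rng.choice(N_STATES, p=P[s, a])
        target = R[s, a] + gamma * Q[s_next].max()
        Q[s, a] += alpha * (target - Q[s, a])
        s = s_next
    return s

def parallel_q_learning(n_learners=4, n_syncs=50, base_step=200, max_offset=50):
    """Independent learners synchronized every base_step (+ per-learner offset) steps;
    at each synchronization point the Q-factors are merged by averaging."""
    Qs = [np.zeros((N_STATES, N_ACTIONS)) for _ in range(n_learners)]
    states = [rng.integers(N_STATES) for _ in range(n_learners)]
    for _ in range(n_syncs):
        for i in range(n_learners):
            # Synchronization point: fixed step length plus an offset.
            steps = base_step + int(rng.integers(max_offset))
            states[i] = q_learning_steps(Qs[i], states[i], steps)
        # One possible Q-value building strategy: element-wise average of the learners' Q-factors.
        Q_merged = np.mean(Qs, axis=0)
        Qs = [Q_merged.copy() for _ in range(n_learners)]
    return Q_merged

if __name__ == "__main__":
    Q = parallel_q_learning()
    print("Greedy policy:", Q.argmax(axis=1))
```

Other building strategies (e.g., weighted averaging or selecting the learner with the best estimated performance) could replace the averaging step, and a state-partition variant would instead assign each learner a disjoint subset of states before merging.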