For large-scale or complex systems modeled by stochastic dynamic programming, hierarchical reinforcement learning (HRL) can exploit their hierarchical structures, or introduce hierarchical control modes, to overcome the curse of dimensionality and the curse of modeling. HRL belongs to the family of sample-data-driven optimization methods: by introducing spatial and temporal abstraction mechanisms, it can effectively accelerate policy learning. The Option framework, which decomposes the overall task of a system into multiple subtasks to be learned and executed, has a clear hierarchical structure and is one of the representative HRL methods. Traditional Option algorithms are built on discrete-time semi-Markov decision processes (SMDPs) with discounted criteria, and therefore cannot be applied directly to continuous-time infinite-horizon tasks. In this paper, working within the framework of continuous-time SMDPs and their performance potential theory, we combine the ideas of existing Option algorithms with the learning formulas of continuous-time SMDPs to establish a unified continuous-time Option HRL model that applies to either the average or the discounted criterion, and we derive the corresponding online learning and optimization algorithm. Finally, a robotic garbage collection system is used as a simulation example to illustrate the effectiveness of the proposed HRL algorithm in solving optimal control problems for continuous-time infinite-horizon tasks. The simulation results also show that, compared with a continuous-time flat Q-learning algorithm based on simulated annealing, the proposed algorithm requires less memory and achieves higher optimization precision and faster learning.
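The abstract invokes the learning formulas of continuous-time SMDPs without reproducing them. As a hedged sketch only, written in our own notation (the symbols f, η, β, and τ are assumptions here, not taken from the paper), a potential-based semi-Markov Q-learning update at the Option level that covers both criteria can be written as:

```latex
Q_{k+1}(s,o) = Q_k(s,o) + \alpha_k \Big[ \big(f(s,o) - \eta_k\big)\,
  \frac{1 - e^{-\beta\tau}}{\beta}
  + e^{-\beta\tau} \max_{o'} Q_k(s',o') - Q_k(s,o) \Big]
```

Here o is the Option executed in state s, τ is its (random) sojourn time, s' is the decision state observed at its termination, f(s,o) is the reward rate, η_k is a running estimate of the average reward rate, and β ≥ 0 is the discount rate. As β → 0, (1 − e^{−βτ})/β → τ and e^{−βτ} → 1, which yields the average-criterion (relative-value) update; with β > 0 (and η_k conventionally fixed at 0 in the purely discounted case) it reduces to the discounted continuous-time update. This limit is the sense in which a single update rule can unify both performance criteria.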
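To make the online procedure concrete, the following is a minimal Python sketch of such a unified Option-level learning loop, assuming a hypothetical environment interface (env.reset, env.run_option) and flat ε-greedy Option selection. It is not the paper's algorithm, only an illustration of the update above in code:

```python
import math
import random

BETA = 0.0      # discount rate: 0 selects the average criterion, > 0 the discounted one
ALPHA = 0.1     # learning step size
EPSILON = 0.1   # exploration probability


def tau_factor(beta, tau):
    """(1 - e^{-beta*tau}) / beta, taking the beta -> 0 limit, which is tau."""
    return tau if beta == 0.0 else (1.0 - math.exp(-beta * tau)) / beta


def learn(env, options, n_decisions=10_000):
    """Online Option-level semi-Markov Q-learning under a unified criterion."""
    Q = {}                                  # Q[(state, option_index)] -> value
    total_reward = total_time = 0.0
    eta = 0.0                               # running average-reward-rate estimate

    def q(s, o):
        return Q.get((s, o), 0.0)

    s = env.reset()
    for _ in range(n_decisions):
        # epsilon-greedy selection among the Options admissible in s
        if random.random() < EPSILON:
            o = random.randrange(len(options))
        else:
            o = max(range(len(options)), key=lambda i: q(s, i))

        # run the Option to termination; the environment reports the next
        # decision state, the reward accrued over the sojourn, and the
        # sojourn time tau (hypothetical interface)
        s_next, reward, tau = env.run_option(options[o], s)
        rate = reward / tau                 # empirical reward rate f(s, o)

        # unified potential-based update; eta is conventionally fixed at 0
        # under pure discounting, while tracking it gives the relative form
        best_next = max(q(s_next, i) for i in range(len(options)))
        target = (rate - eta) * tau_factor(BETA, tau) \
                 + math.exp(-BETA * tau) * best_next
        Q[(s, o)] = q(s, o) + ALPHA * (target - q(s, o))

        total_reward += reward
        total_time += tau
        eta = total_reward / total_time
        s = s_next
    return Q
```

Setting BETA = 0 selects the average criterion and any BETA > 0 the discounted one, so the same loop serves both; this mirrors the unification the abstract claims, and tabulating Q only over (decision state, Option) pairs rather than primitive actions is what yields the memory savings over a flat Q-learning table.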