How to discover correct subgoals online is a key problem in option-based hierarchical reinforcement learning. By analyzing the agent's actions at subgoals, we find that the effective actions at subgoals are restricted, so the problem of finding subgoals can be transformed into finding the most matching action-restricted states along paths. For grid learning environments, we propose the unique-direction value method to represent the restricted-action property of subgoals, together with an automatic option-discovery algorithm based on this method. Experiments show that the options generated by the unique-direction value method can significantly speed up the Q-learning algorithm; we further analyze how the generating time and size of options affect the performance of Q-learning.
Although reinforcement learning (RL) is an effective approach for building autonomous agents that improve their performance with experience, a fundamental problem of standard RL algorithms is that many practical tasks cannot be solved in reasonable time. Hierarchical reinforcement learning (HRL) is a successful solution that decomposes the learning task into simpler subtasks and learns each of them independently. As a promising HRL approach, options are introduced as closed-loop policies over sequences of actions to enable HRL. A key problem for option-based HRL is to discover the correct subgoals online. By analyzing the actions of agents at subgoals, two useful properties are found: (1) subgoals are more likely to be passed through, and (2) the effective actions at subgoals are restricted. Consequently, subgoals can be regarded as the most matching action-restricted states along paths. For grid environments, the concept of unique-direction value is proposed to denote the action-restricted property, and an option-discovery algorithm based on unique-direction values is introduced. The experiments show that the options discovered by the unique-direction value method can speed up primitive Q-learning significantly. Moreover, the experiments further analyze how the size and generating time of options affect the performance of Q-learning.
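The following is a minimal sketch, not the authors' exact algorithm, of the general idea of scoring candidate subgoals in a grid world: states that are traversed often and are left almost exclusively in a single direction (e.g. doorways) are treated as action-restricted and ranked as subgoal candidates. The scoring formula, trajectory format (lists of state-action pairs), and function names are assumptions made for illustration.

```python
import numpy as np
from collections import defaultdict

def unique_direction_scores(trajectories):
    """Illustrative scoring of action-restricted states (hypothetical formula):
    a state scores high when most traversals leave it in one dominant direction
    AND it is visited frequently along the sampled trajectories."""
    direction_counts = defaultdict(lambda: defaultdict(int))
    visits = defaultdict(int)
    for traj in trajectories:              # traj: list of (state, action) pairs
        for state, action in traj:
            direction_counts[state][action] += 1
            visits[state] += 1
    scores = {}
    for state, counts in direction_counts.items():
        total = visits[state]
        restriction = max(counts.values()) / total   # fraction taken in the dominant direction
        scores[state] = restriction * np.log1p(total)  # favour restricted and frequently visited states
    return scores

def discover_subgoals(trajectories, top_k=3):
    """Pick the highest-scoring action-restricted states as candidate subgoals."""
    scores = unique_direction_scores(trajectories)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Example: states are (x, y) grid cells, actions are movement directions.
sample_trajectories = [
    [((0, 0), "right"), ((1, 0), "right"), ((2, 0), "up")],
    [((0, 1), "right"), ((1, 1), "down"), ((1, 0), "right"), ((2, 0), "up")],
]
print(discover_subgoals(sample_trajectories, top_k=2))
```

An option for each discovered subgoal could then be built in the usual way: an initiation set of states near the subgoal, an internal policy trained (e.g. by Q-learning) to reach it, and termination upon arrival, after which the option is added to the agent's action set.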