How to discover correct subgoals online is a key problem in option-based hierarchical reinforcement learning. By analyzing the agent's actions at subgoals, we find that the effective actions at subgoals are restricted, so the problem of finding subgoals can be transformed into finding the most matching action-restricted states along paths. For grid learning environments, we propose the unique-direction value method to represent the restricted-action property of subgoals, together with an automatic option-discovery algorithm based on this method. Experiments show that the options generated by the unique-direction value method can significantly speed up the Q-learning algorithm; we further analyze how the generating time and size of options affect the performance of Q-learning.
Although reinforcement learning (RL) is an effective approach for building autonomous agents that improve their performance with experience, a fundamental problem of standard RL algorithms is that many practical tasks cannot be solved in reasonable time. Hierarchical reinforcement learning (HRL) is a successful solution that decomposes the learning task into simpler subtasks and learns each of them independently. As a promising HRL approach, options are introduced as closed-loop policies over sequences of actions to enable HRL. A key problem for option-based HRL is to discover the correct subgoals online. By analyzing the actions of agents at subgoals, two useful properties are found: (1) subgoals are more likely to be passed through, and (2) the effective actions at subgoals are restricted. Consequently, subgoals can be regarded as the most matching action-restricted states along paths. For grid environments, the concept of unique-direction value is proposed to denote the action-restricted property, and an option-discovery algorithm based on unique-direction values is introduced. The experiments show that the options discovered by the unique-direction value method can speed up primitive Q-learning significantly. Moreover, the experiments further analyze how the size and generating time of options affect the performance of Q-learning.
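The following is a minimal sketch, not the authors' exact algorithm, of the general idea of scoring candidate subgoals in a grid world: states that are traversed often and are left almost exclusively in a single direction (e.g. doorways) are treated as action-restricted and ranked as subgoal candidates. The scoring formula, trajectory format (lists of state-action pairs), and function names are assumptions made for illustration.

```python
import numpy as np
from collections import defaultdict

def unique_direction_scores(trajectories):
    """Illustrative scoring of action-restricted states (hypothetical formula):
    a state scores high when most traversals leave it in one dominant direction
    AND it is visited frequently along the sampled trajectories."""
    direction_counts = defaultdict(lambda: defaultdict(int))
    visits = defaultdict(int)
    for traj in trajectories:              # traj: list of (state, action) pairs
        for state, action in traj:
            direction_counts[state][action] += 1
            visits[state] += 1
    scores = {}
    for state, counts in direction_counts.items():
        total = visits[state]
        restriction = max(counts.values()) / total   # fraction taken in the dominant direction
        scores[state] = restriction * np.log1p(total)  # favour restricted and frequently visited states
    return scores

def discover_subgoals(trajectories, top_k=3):
    """Pick the highest-scoring action-restricted states as candidate subgoals."""
    scores = unique_direction_scores(trajectories)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Example: states are (x, y) grid cells, actions are movement directions.
sample_trajectories = [
    [((0, 0), "right"), ((1, 0), "right"), ((2, 0), "up")],
    [((0, 1), "right"), ((1, 1), "down"), ((1, 0), "right"), ((2, 0), "up")],
]
print(discover_subgoals(sample_trajectories, top_k=2))
```

An option for each discovered subgoal could then be built in the usual way: an initiation set of states near the subgoal, an internal policy trained (e.g. by Q-learning) to reach it, and termination upon arrival, after which the option is added to the agent's action set.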