To address the large variance of gradient estimates caused by stochastic action exploration, an expectation-maximization (EM) policy search reinforcement learning method with parameter-based exploration is proposed. First, the policy is defined as a probability distribution over the parameters of a controller. Then, samples are collected by repeatedly sampling directly in the controller parameter space according to this distribution. Because the actions selected within each episode are deterministic, the variance introduced by sampling is reduced, which in turn reduces the variance of the gradient estimate. Finally, based on the collected samples, the policy parameters are updated iteratively by maximizing a lower bound on the expected return. To cut the time and cost of sampling, an importance sampling technique is used to reuse samples collected during previous policy updates. Simulation results on two continuous-space control problems show that, compared with policy search reinforcement learning methods based on stochastic action exploration, the proposed method not only learns a better policy but also converges faster, and thus achieves better overall learning performance.
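The following is a minimal sketch of the kind of procedure the abstract describes: a Gaussian search distribution over controller parameters, deterministic episode rollouts, an EM-style reward-weighted update of the distribution, and importance weighting to reuse old samples. The Gaussian form of the search distribution, the reward-weighted M-step, and the toy `episode_return` function are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def episode_return(theta):
    """Hypothetical stand-in for running one episode with a deterministic
    controller parameterized by theta and returning a positive return."""
    target = np.array([1.0, -0.5])
    return np.exp(-np.sum((theta - target) ** 2))

def gaussian_pdf(theta, mu, sigma):
    """Density of a factorized Gaussian search distribution over parameters."""
    return np.prod(np.exp(-0.5 * ((theta - mu) / sigma) ** 2)
                   / (np.sqrt(2.0 * np.pi) * sigma))

def em_policy_search(dim=2, n_samples=20, n_iters=50, reuse=True, seed=0):
    rng = np.random.default_rng(seed)
    # The "policy" is a distribution over controller parameters.
    mu, sigma = np.zeros(dim), np.ones(dim)
    history = []  # reused samples: (theta, return, pdf at sampling time)

    for _ in range(n_iters):
        # Sample controller parameters and run deterministic episodes.
        thetas = mu + sigma * rng.standard_normal((n_samples, dim))
        returns = np.array([episode_return(th) for th in thetas])
        pdfs = np.array([gaussian_pdf(th, mu, sigma) for th in thetas])

        if reuse:
            history.extend(zip(thetas, returns, pdfs))
            thetas = np.array([h[0] for h in history])
            returns = np.array([h[1] for h in history])
            # Importance weights: current policy density over the density
            # each sample was originally drawn from.
            iw = np.array([gaussian_pdf(th, mu, sigma) / q
                           for (th, _, q) in history])
            weights = returns * iw
        else:
            weights = returns

        # EM-style M-step: maximizing the lower bound of the expected return
        # reduces, for a Gaussian, to reward-weighted mean/variance estimates.
        w = weights / (weights.sum() + 1e-12)
        mu = (w[:, None] * thetas).sum(axis=0)
        sigma = np.sqrt((w[:, None] * (thetas - mu) ** 2).sum(axis=0) + 1e-12)

    return mu, sigma

if __name__ == "__main__":
    mu, sigma = em_policy_search()
    print("learned parameter mean:", mu, "std:", sigma)
```

Because each rollout uses a fixed parameter vector, the only randomness per episode comes from the single draw in parameter space, which is the source of the variance reduction claimed above; the importance weights let samples from earlier search distributions contribute to later updates without fresh rollouts.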