本文提出测度值马尔可夫决策过程新模型.在此模型下,agent对环境的把握用测度概念来表示,于是agent则根据测度来决定自己的最优行动以得到最优策略,因此本文也提供了测度值马尔可夫决策过程的最优策略算法.该模型是部分可观察马尔可夫决策过程的推广,它反映人类思维的一个重要特征,人们在把握全部状态可能性(即对状态空间进行权衡度量)的态势下,思考问题并选择自己的最优行动.部分可观察马尔可夫决策过程只是它的一种特例.
This paper presents a model called measure-valued Markov decision processes (MVMDPs) and within this model the understanding of the agent to the environment is denoted by the mathematical notion of measure. The agent decides his own optimal action according to this measure and then acquires his optimal policy. So we present an algorithm of finding optimal policy under MVMDP, which can also be considered as the approximate optimal policy algorithm of partially observed Markov decision processes (POMDPs). This model is a generalization of a partially observed Markov decision process, that is, partially observed Markov decision process is a particular case of the measure-valued Markov decision process. Be that as it may, it is essentially different from all other papers about POMDPs. Firstly, the main spirit of general POMDPs is to transform partially observable Markov decision problems off a physical state space into a regular Markov decision problem (MDP) on the corresponding belief state space, and such researches all identify the belief state as a probability distribution over the state space. So most of the POMDP models based on this spirit pay more attention to algorithm Of various kinds for finding the optimal policy and to novel refinements of existing techniques. However, our work is not based on the transformation between the POMDP on a physical state space and the MDP on a belief state space. On the contrary we take the measure, a more general notion than belief state, on the state space as a new studying object. Then the Markov decision problem we will discuss is taking place on the space composed of these measures. In this way, we have a measure-valued Markov decision process. Secondly, MVMDP, based on the latest theory of measur-valued branching processes in modern probability, reflects an important characteristic of human mind: that people think about problems and choose their own optimal actions in contexts where all the possible states are caught (i. e. , they are able to appropria