
Policies and Value Functions

Since the current observation does not fully reveal the identity of the current state, the agent needs to consider all previous observations and actions when choosing an action. Information about the current state contained in the current observation, previous observations, and previous actions can be summarized by a probability distribution over the state space (Åström 1965). The probability distribution is sometimes called a belief state and is denoted by $b$. For any possible state $s$, $b(s)$ is the probability that the current state is $s$. The set of all possible belief states is called the belief space. We denote it by $\mathcal{B}$.
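As an illustration only (not part of the paper), a belief state over a finite state space can be stored as a probability vector indexed by state; the example below is a minimal sketch with hypothetical values.

import numpy as np

# A belief state over a 3-state POMDP: belief[s] is the probability
# that the current state is s.
belief = np.array([0.2, 0.5, 0.3])

# A valid belief state is non-negative and sums to one,
# i.e. it is a point in the belief space (the probability simplex).
assert np.all(belief >= 0) and np.isclose(belief.sum(), 1.0)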

A policy prescribes an action for each possible belief state. In other words, it is a mapping from $\mathcal{B}$ to $\mathcal{A}$. Associated with a policy $\pi$ is its value function $V^\pi$. For each belief state $b$, $V^\pi(b)$ is the expected total discounted reward that the agent receives by following the policy starting from $b$, that is

\[ V^\pi(b) = \sum_{t=0}^{\infty} \lambda^t \, E_\pi[\, r_t \mid b \,], \]

where $r_t$ is the reward received at time $t$ and $\lambda$ ($0 \le \lambda < 1$) is the discount factor. It is known that there exists a policy $\pi^*$ such that $V^{\pi^*}(b) \ge V^\pi(b)$ for any other policy $\pi$ and any belief state $b$ (Puterman 1990). Such a policy is called an optimal policy. The value function of an optimal policy is called the optimal value function. We denote it by $V^*$. For any positive number $\epsilon$, a policy $\pi$ is $\epsilon$-optimal if

\[ V^\pi(b) \ge V^*(b) - \epsilon \quad \mbox{for all belief states } b \in \mathcal{B}. \]
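Because $V^\pi(b)$ is defined as an expected discounted sum, it can be approximated by simulation. The following Python sketch (not from the paper) estimates $V^\pi(b)$ by Monte Carlo, assuming the caller supplies hypothetical helpers step, update_belief, sample_initial_state, and a policy that maps belief states to actions; the infinite sum is truncated at a finite horizon, which is justified because the discount factor is strictly less than one.

import numpy as np

def estimate_value(belief, policy, sample_initial_state, step, update_belief,
                   discount=0.95, horizon=200, episodes=1000, rng=None):
    """Monte Carlo estimate of V^pi(b): the expected total discounted reward
    obtained by following `policy` when the initial belief state is `belief`.

    Hypothetical helpers assumed to be provided by the caller:
      policy(belief)                      -> action         (pi: B -> A)
      sample_initial_state(belief, rng)   -> state
      step(state, action, rng)            -> (next_state, observation, reward)
      update_belief(belief, action, obs)  -> next belief
    """
    rng = rng or np.random.default_rng()
    returns = []
    for _ in range(episodes):
        b = belief.copy()
        state = sample_initial_state(b, rng)
        total, weight = 0.0, 1.0
        for _ in range(horizon):
            action = policy(b)                     # choose action from current belief
            state, obs, reward = step(state, action, rng)
            total += weight * reward               # accumulate lambda^t * r_t
            weight *= discount
            b = update_belief(b, action, obs)      # fold the new observation into b
        returns.append(total)
    return float(np.mean(returns))

Truncating at the horizon introduces an error of at most $\lambda^{\mathrm{horizon}}$ times the largest achievable discounted tail, so the horizon and number of episodes control the accuracy of the estimate.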


