Since the current observation does not fully reveal the identity
of the current state, the agent
needs to consider all previous observations
and actions when choosing an action.
Information about the current state contained in the current
observation, previous observations, and previous actions
can be summarized by a probability distribution over the
state space (Åström 1965). The probability distribution is sometimes
called a belief state and denoted by $b$.
For any possible state $s$, $b(s)$ is the probability that
the current state is $s$.
The set of all possible belief states is called
the belief space. We denote it by $\mathcal{B}$.
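As a concrete, purely illustrative sketch, a belief state over a small discrete state space can be represented as a probability vector whose entry for each state $s$ plays the role of $b(s)$; the state names and numbers below are assumptions, not from the text.

```python
import numpy as np

# Hypothetical three-state example; states and probabilities are illustrative.
states = ["s0", "s1", "s2"]

# A belief state b: b[i] is the probability that the current state is states[i].
b = np.array([0.5, 0.3, 0.2])
assert b.min() >= 0 and np.isclose(b.sum(), 1.0)  # a valid point in the belief space

# b(s) for a particular state, e.g. s = "s1".
print(b[states.index("s1")])  # 0.3
```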
A policy prescribes an action for each
possible belief state.
In other words, it is a mapping from $\mathcal{B}$ to $\mathcal{A}$.
Associated with a policy $\pi$ is its value function $V^{\pi}$.
For each belief state $b$, $V^{\pi}(b)$ is the expected total discounted
reward that the agent receives by following the policy starting
from $b$, that is
$$V^{\pi}(b) = E_{\pi, b}\!\left[\sum_{t=0}^{\infty} \lambda^{t} r_{t}\right],$$
where $r_{t}$ is the reward received at
time $t$ and $\lambda$ ($0 \leq \lambda < 1$) is the discount factor.
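To illustrate the definition of $V^{\pi}(b)$, the following sketch estimates the expected discounted return by Monte Carlo simulation for a small, entirely hypothetical POMDP (its transition, observation, and reward matrices are invented for the example). It tracks the belief with the standard Bayes-rule update, which is not spelled out in the text above, and truncates the infinite sum at a finite horizon.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical POMDP with 2 states, 2 actions, 2 observations; all matrices
# below are illustrative assumptions, not from the text.
n_states, n_actions, n_obs = 2, 2, 2
# T[a, s, s'] = P(s' | s, a)
T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.5, 0.5]]])
# Z[a, s', o] = P(o | s', a)
Z = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.6, 0.4], [0.4, 0.6]]])
# R[a, s] = immediate reward for taking action a in state s
R = np.array([[1.0, 0.0],
              [0.2, 0.8]])
lam = 0.95  # discount factor, 0 <= lam < 1


def update_belief(b, a, o):
    """Bayes-rule update of the belief after taking action a and observing o."""
    b_next = Z[a, :, o] * (T[a].T @ b)
    return b_next / b_next.sum()


def estimate_value(policy, b0, n_episodes=2000, horizon=100):
    """Monte Carlo estimate of V^pi(b0) = E[sum_t lam^t r_t], truncated at horizon."""
    total = 0.0
    for _ in range(n_episodes):
        b, s, ret, discount = b0.copy(), rng.choice(n_states, p=b0), 0.0, 1.0
        for _ in range(horizon):
            a = policy(b)                      # policy maps a belief to an action
            ret += discount * R[a, s]          # accumulate lam^t * r_t
            s = rng.choice(n_states, p=T[a, s])  # sample next state
            o = rng.choice(n_obs, p=Z[a, s])     # sample observation
            b = update_belief(b, a, o)           # track the belief
            discount *= lam
        total += ret
    return total / n_episodes


# A simple, hypothetical policy: pick action 0 when state 0 seems likely.
policy = lambda b: 0 if b[0] >= 0.5 else 1
print(estimate_value(policy, np.array([0.5, 0.5])))
```

Any mapping from beliefs to actions can be plugged in as `policy`; the estimate approaches $V^{\pi}(b_0)$ as the number of episodes and the horizon grow.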
It is known that there exists a policy $\pi^{*}$ such
that $V^{\pi^{*}}(b) \geq V^{\pi}(b)$ for any other policy $\pi$ and
any belief state $b$ (Puterman 1990).
Such a policy is called an optimal policy.
The value function of an optimal policy is called the optimal
value function. We denote it by $V^{*}$.
For any positive number $\epsilon$, a policy $\pi$ is $\epsilon$-optimal
if $V^{\pi}(b) + \epsilon \geq V^{*}(b)$ for all belief states $b$.
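Given value estimates for $\pi$ and for an optimal policy at a finite set of belief points, the $\epsilon$-optimality condition can be checked directly. The function below is a hedged sketch with hypothetical inputs; in practice only finitely many beliefs, or an upper bound on $V^{*}$, would be available.

```python
import numpy as np

def is_epsilon_optimal(v_pi, v_star, eps):
    """Check V^pi(b) + eps >= V*(b) at each sampled belief point.

    v_pi, v_star: value estimates at the same belief points (hypothetical
    inputs; only a finite sample of the belief space can be checked this way).
    """
    return bool(np.all(np.asarray(v_pi) + eps >= np.asarray(v_star)))

# Illustrative numbers only.
print(is_epsilon_optimal([9.7, 8.1, 6.4], [9.8, 8.15, 6.45], eps=0.2))  # True
```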