To explain value iteration, we need to consider
how the belief state evolves over time.
Let $b$ be the current belief state.
The belief state at the next point in time
is determined by the current belief state,
the current action $a$, and the next
observation $z$.
We denote it by $\tau(b, a, z)$.
For any state $s'$, $\tau(b, a, z)(s')$ is given by
\[
\tau(b, a, z)(s') = \frac{P(z|s', a)\sum_{s} P(s'|s, a)\, b(s)}{P(z|b, a)},
\]
where
\[
P(z|b, a) = \sum_{s'} P(z|s', a) \sum_{s} P(s'|s, a)\, b(s)
\]
is the renormalization constant. As the notation suggests, the constant
$P(z|b, a)$ can also be interpreted as the probability of
observing $z$ after taking action $a$ in belief state $b$.
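As a concrete illustration of the belief update, here is a minimal numerical sketch. The array layout for the transition and observation probabilities is an assumption made for this example, not something specified in the text.

```python
import numpy as np

def belief_update(b, a, z, P_trans, P_obs):
    """Next belief tau(b, a, z) and the normalizer P(z|b, a).

    Assumed (hypothetical) array layout:
      b        -- current belief over states, shape (S,)
      P_trans  -- P_trans[a, s, s2] = P(s2 | s, a), shape (A, S, S)
      P_obs    -- P_obs[a, s2, z] = P(z | s2, a),   shape (A, S, Z)
    """
    # Predicted distribution over next states: sum_s P(s'|s, a) b(s)
    predicted = b @ P_trans[a]                    # shape (S,)
    # Weight each next state s' by the observation likelihood P(z|s', a)
    unnormalized = P_obs[a, :, z] * predicted     # shape (S,)
    # Renormalization constant P(z|b, a)
    p_z = unnormalized.sum()
    if p_z == 0.0:
        return b, 0.0   # z cannot occur under (b, a); caller should skip this branch
    return unnormalized / p_z, p_z
```

Returning the normalizer alongside the updated belief is a convenience for the sketches that follow, which need $P(z|b, a)$ as a weight.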
Define an operator $T$ that takes a value function $V$ and returns another value function $TV$ as follows:
\[
TV(b) = \max_{a} \Big[ r(b, a) + \lambda \sum_{z} P(z|b, a)\, V(\tau(b, a, z)) \Big] \quad \text{for all } b,
\]
where $r(b, a) = \sum_{s} r(s, a)\, b(s)$ is the
expected immediate reward for taking action $a$ in
belief state $b$ and $\lambda \in (0, 1)$ is the discount factor.
For a given value function $V$,
a policy $\pi$
is said to be $V$-improving
if
\[
\pi(b) \in \arg\max_{a} \Big[ r(b, a) + \lambda \sum_{z} P(z|b, a)\, V(\tau(b, a, z)) \Big] \quad \text{for all } b.
\]
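The sketch below evaluates the bracketed expression above at a single belief $b$, reusing `belief_update` from the previous sketch: the maximum over actions is $TV(b)$, and a maximizing action is what a $V$-improving policy would choose at $b$. Passing the value function in as a callable is an assumption made for illustration only; the paper's finite representation of value functions comes later.

```python
def backup(b, V, P_trans, P_obs, R, discount):
    """Evaluate max_a [ r(b,a) + discount * sum_z P(z|b,a) V(tau(b,a,z)) ] at one belief b.

    Assumed (hypothetical) inputs:
      V        -- a callable returning the current value of a belief vector
      R        -- R[a, s] = r(s, a), shape (A, S)
      discount -- the discount factor lambda in (0, 1)
    Returns TV(b) and a maximizing (V-improving) action.
    """
    best_value, best_action = -np.inf, None
    for a in range(P_trans.shape[0]):
        # Expected immediate reward r(b, a) = sum_s r(s, a) b(s)
        value = R[a] @ b
        for z in range(P_obs.shape[2]):
            next_b, p_z = belief_update(b, a, z, P_trans, P_obs)
            if p_z > 0.0:
                value += discount * p_z * V(next_b)
        if value > best_value:
            best_value, best_action = value, a
    return best_value, best_action
```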
Value iteration is an algorithm for finding
$\epsilon$-optimal policies. It starts with an initial value
function $V_0$
and iterates using
the following formula:
\[
V_n = T V_{n-1}.
\]
It is known (e.g., Puterman 1990, Theorem 6.9) that
$V_n$ converges to the optimal value function $V^*$
as $n$ goes to infinity.
Value iteration terminates when the Bellman residual
$\max_{b} |V_n(b) - V_{n-1}(b)|$
falls below
$\epsilon(1-\lambda)/2\lambda$.
When it does,
a
$V_n$-improving policy is
$\epsilon$-optimal
(e.g., Puterman 1990).
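Putting the pieces together, the toy sketch below runs $V_n = TV_{n-1}$ with the stopping rule above on a fixed finite set of belief points, reusing the earlier sketches. Approximating $V$ by nearest-grid-point lookup is a crude stand-in introduced only for this illustration; it is not how value functions are represented in the algorithm discussed next.

```python
def value_iteration_on_grid(beliefs, P_trans, P_obs, R, discount, epsilon):
    """Toy run of V_n = T V_{n-1} with the Bellman-residual stopping rule.

    'beliefs' is a fixed (N, S) array of belief points chosen for the
    illustration; V is approximated by nearest-grid-point lookup, which is
    NOT the finite vector representation developed in the paper.
    """
    beliefs = np.asarray(beliefs, dtype=float)
    values = np.zeros(len(beliefs))          # V_0 = 0 at every grid point

    def V(b):
        # Crude stand-in: value of the nearest grid point
        return values[np.argmin(np.linalg.norm(beliefs - b, axis=1))]

    threshold = epsilon * (1 - discount) / (2 * discount)
    while True:
        # V_n = T V_{n-1}, evaluated on the grid via the backup sketch above
        new_values = np.array([backup(b, V, P_trans, P_obs, R, discount)[0]
                               for b in beliefs])
        residual = np.max(np.abs(new_values - values))   # Bellman residual on the grid
        values = new_values
        if residual < threshold:
            return values
```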
Since there are infinitely many belief states, value functions cannot be explicitly represented. Fortunately, the value functions that one encounters in the process of value iteration admit implicit finite representations. Before explaining why, we first introduce several technical concepts and notations.