To explain value iteration, we need to consider
how the belief state evolves over time.
Let $b$ be the current belief state.
The belief state at the next point in time
is determined by the current belief state,
the current action $a$, and the next
observation $z$.
We denote it by $\tau(b, a, z)$.
For any state $s'$, $\tau(b, a, z)(s')$ is given by
\[
\tau(b, a, z)(s') = \frac{P(z|s', a)\sum_{s} P(s'|s, a)\, b(s)}{P(z|b, a)},
\]
where
\[
P(z|b, a) = \sum_{s'} P(z|s', a) \sum_{s} P(s'|s, a)\, b(s)
\]
is the renormalization constant. As the notation suggests, the constant
$P(z|b, a)$ can also be interpreted as the probability of
observing $z$ after taking action $a$ in belief state $b$.
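As a concrete illustration of the belief update, here is a minimal numerical sketch. The array layout for the transition and observation probabilities is an assumption made for this example, not something specified in the text.

```python
import numpy as np

def belief_update(b, a, z, P_trans, P_obs):
    """Next belief tau(b, a, z) and the normalizer P(z|b, a).

    Assumed (hypothetical) array layout:
      b        -- current belief over states, shape (S,)
      P_trans  -- P_trans[a, s, s2] = P(s2 | s, a), shape (A, S, S)
      P_obs    -- P_obs[a, s2, z] = P(z | s2, a),   shape (A, S, Z)
    """
    # Predicted distribution over next states: sum_s P(s'|s, a) b(s)
    predicted = b @ P_trans[a]                    # shape (S,)
    # Weight each next state s' by the observation likelihood P(z|s', a)
    unnormalized = P_obs[a, :, z] * predicted     # shape (S,)
    # Renormalization constant P(z|b, a)
    p_z = unnormalized.sum()
    if p_z == 0.0:
        return b, 0.0   # z cannot occur under (b, a); caller should skip this branch
    return unnormalized / p_z, p_z
```

Returning the normalizer alongside the updated belief is a convenience for the sketches that follow, which need $P(z|b, a)$ as a weight.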
Define an operator $T$ that takes a value function $V$ and returns another value function $TV$ as follows:
\[
TV(b) = \max_{a} \Big[ r(b, a) + \lambda \sum_{z} P(z|b, a)\, V(\tau(b, a, z)) \Big] \quad \text{for all } b,
\]
where $r(b, a) = \sum_{s} r(s, a)\, b(s)$ is the
expected immediate reward for taking action $a$ in
belief state $b$ and $\lambda \in (0, 1)$ is the discount factor.
For a given value function $V$,
a policy $\pi$
is said to be $V$-improving
if
\[
\pi(b) \in \arg\max_{a} \Big[ r(b, a) + \lambda \sum_{z} P(z|b, a)\, V(\tau(b, a, z)) \Big] \quad \text{for all } b.
\]
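The sketch below evaluates the bracketed expression above at a single belief $b$, reusing `belief_update` from the previous sketch: the maximum over actions is $TV(b)$, and a maximizing action is what a $V$-improving policy would choose at $b$. Passing the value function in as a callable is an assumption made for illustration only; the paper's finite representation of value functions comes later.

```python
def backup(b, V, P_trans, P_obs, R, discount):
    """Evaluate max_a [ r(b,a) + discount * sum_z P(z|b,a) V(tau(b,a,z)) ] at one belief b.

    Assumed (hypothetical) inputs:
      V        -- a callable returning the current value of a belief vector
      R        -- R[a, s] = r(s, a), shape (A, S)
      discount -- the discount factor lambda in (0, 1)
    Returns TV(b) and a maximizing (V-improving) action.
    """
    best_value, best_action = -np.inf, None
    for a in range(P_trans.shape[0]):
        # Expected immediate reward r(b, a) = sum_s r(s, a) b(s)
        value = R[a] @ b
        for z in range(P_obs.shape[2]):
            next_b, p_z = belief_update(b, a, z, P_trans, P_obs)
            if p_z > 0.0:
                value += discount * p_z * V(next_b)
        if value > best_value:
            best_value, best_action = value, a
    return best_value, best_action
```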
Value iteration is an algorithm for finding
$\epsilon$-optimal policies. It starts with an initial value
function $V_0$
and iterates using
the following formula:
\[
V_n = T V_{n-1}.
\]
It is known (e.g., Puterman 1990, Theorem 6.9) that
$V_n$ converges to the optimal value function $V^*$
as $n$ goes to infinity.
Value iteration terminates when the Bellman residual
$\max_{b} |V_n(b) - V_{n-1}(b)|$
falls below
$\epsilon(1-\lambda)/2\lambda$.
When it does,
a
$V_n$-improving policy is
$\epsilon$-optimal
(e.g., Puterman 1990).
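Putting the pieces together, the toy sketch below runs $V_n = TV_{n-1}$ with the stopping rule above on a fixed finite set of belief points, reusing the earlier sketches. Approximating $V$ by nearest-grid-point lookup is a crude stand-in introduced only for this illustration; it is not how value functions are represented in the algorithm discussed next.

```python
def value_iteration_on_grid(beliefs, P_trans, P_obs, R, discount, epsilon):
    """Toy run of V_n = T V_{n-1} with the Bellman-residual stopping rule.

    'beliefs' is a fixed (N, S) array of belief points chosen for the
    illustration; V is approximated by nearest-grid-point lookup, which is
    NOT the finite vector representation developed in the paper.
    """
    beliefs = np.asarray(beliefs, dtype=float)
    values = np.zeros(len(beliefs))          # V_0 = 0 at every grid point

    def V(b):
        # Crude stand-in: value of the nearest grid point
        return values[np.argmin(np.linalg.norm(beliefs - b, axis=1))]

    threshold = epsilon * (1 - discount) / (2 * discount)
    while True:
        # V_n = T V_{n-1}, evaluated on the grid via the backup sketch above
        new_values = np.array([backup(b, V, P_trans, P_obs, R, discount)[0]
                               for b in beliefs])
        residual = np.max(np.abs(new_values - values))   # Bellman residual on the grid
        values = new_values
        if residual < threshold:
            return values
```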
Since there are infinitely many belief states, value functions cannot be explicitly represented. Fortunately, the value functions that one encounters in the process of value iteration admit implicit finite representations. Before explaining why, we first introduce several technical concepts and notations.