As indicated in the introduction,
we propose to perform point-based DP update a number of times
in between two standard DP updates.
To be more specific, we propose to modify value iteration
in the way as shown in Figure 3.
Note that the only change is at line 5.
Instead of assigning directly to
,
we pass it to a subroutine POINT-BASED-VI and
assign the output of the subroutine to
.
The subroutine
functions in the same way as value iteration,
except that it performs
point-based DP updates rather than standard DP updates.
Hence we call it
point-based value iteration.
Figure 4 illustrates the basic idea behind modified
value iteration in contrast to value iteration.
When the initial value function is properly
selected, the sequence of
value functions produced by value iteration converges
monotonically to the optimal value function.
Convergence usually takes a long time partially because
standard DP updates, indicated by fat upward arrows,
are computationally expensive. Modified value iteration
interleaves standard DP updates with point-based DP updates, which are
indicated by the thin upward arrows.
Point-based DP update does not improve a value function as
much as standard DP update. However, its complexity is
much lower. As a consequence, modified value iteration can
hopefully converge in less time.
The idea of interleaving standard DP updates with approximate updates that back up at a finite number of belief points is due to Cheng (1988). Our work differs from Cheng's method mainly in the way we select the belief points. A detailed discussion of the differences will be given in Section 8.
Figure 4: Illustration of the Basic Idea behind Modified Value Iteration.
The modified value iteration algorithm raises three issues. First, what stopping criterion do we use for point-based value iteration? Second, how can we guarantee the stopping criterion can eventually be satisfied? Third, how do we guarantee the convergence of the modified value iteration algorithm itself? To address those issues, we introduce the concept of uniformly improvable value functions.