Consider the do-while loop of POINT-BASED-VI (Figure 2). Starting from an initial set of vectors, it generates a sequence of sets. If the initial set is uniformly improvable, then the value functions represented by the sets are monotonically increasing and are upper bounded by the optimal value function. As such, they converge to a value function (which is not necessarily the optimal value function). The question is when to stop the do-while loop.
A straightforward method would be to compute the distance
between two consecutive sets
and
and stop when the distance falls below a threshold.
To compute the distance, one needs to solve
linear programs, which is time consuming.
We use a metric that is less expensive to compute.
To be more specific, we stop the do-while loop when
In words, we calculate the maximum difference between
and
at the witness points of vectors in
and
stop the do-while loop when this quantity is no larger than
. Here
is the threshold on the Bellman
residual for terminating value iteration and
is a number
between 0 and 1. In our experiments, we set it at 0.1.