A New Approach

The first contribution of this paper is to provide a language and a translation adapted to another class of solution methods which have proven quite effective in dealing with large MDPs, namely anytime state-based heuristic search methods such as LAO* [28], LRTDP [9], and their ancestors [8,17,45]. These methods typically start with a compact representation of the MDP based on probabilistic planning operators, and search forward from an initial state, constructing new states by expanding the envelope of the policy as time permits. They may produce an approximate and even incomplete policy, but explicitly construct and explore only a fraction of the MDP. Neither of the two previous proposals is well suited to such solution methods: the first because the cost of the translation (most of which is performed prior to the solution phase) annihilates the benefits of anytime algorithms, and the second because the size of the MDP obtained is an obstacle to the applicability of state-based methods. Since both the cost of the translation and the size of the resulting MDP severely impact the quality of the policy obtainable by the deadline, we need an appropriate resolution of the tradeoff between the two.

Our approach has the following main features. The translation is entirely embedded in the anytime solution method, which retains full control over which parts of the MDP are explicitly constructed and explored. While the MDP obtained is not minimal, it is of the minimal size achievable without stepping outside the anytime framework, i.e., without enumerating parts of the state space that the solution method would not necessarily explore. We formalise this relaxed notion of minimality, which we call blind minimality in reference to the fact that it does not require any lookahead (beyond the fringe). This is appropriate in the context of anytime state-based solution methods, where we want the minimal MDP achievable without expensive pre-processing.

When the rewarding behaviours are specified in PLTL, there does not appear to be a way of achieving a relaxed notion of minimality as powerful as blind minimality without a prohibitive translation. Therefore, instead of PLTL, we adopt a variant of future linear temporal logic (FLTL) as our specification language, which we extend to handle rewards. While this language has a more complex semantics than PLTL, it enables a natural translation into a blind-minimal MDP by simple progression of the reward formulae. Moreover, search control knowledge expressed in FLTL [5] fits particularly nicely into this framework, and can be used to dramatically reduce the fraction of the search space explored by the solution method.
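To give a concrete, if simplified, picture of the progression step just mentioned, the Python sketch below shows how a future-time temporal formula can be rewritten ("progressed") through the current planning state: an expanded MDP state is then simply a pair of a planning state and the obligation carried so far, and each successor inherits the progressed formula, which is why no lookahead beyond the fringe is required. The fragment of the logic, the tuple representation, and the function names are illustrative assumptions only; in particular, the reward-bearing construct of our extended FLTL and its semantics are omitted here.

    # Illustrative sketch only, not the paper's implementation.
    # Formulas are nested tuples, e.g. ('until', ('atom', 'loaded'), ('atom', 'delivered')).
    # A planning state is a set of true propositions.

    def and_(a, b):
        if a is False or b is False: return False
        if a is True: return b
        if b is True: return a
        return ('and', a, b)

    def or_(a, b):
        if a is True or b is True: return True
        if a is False: return b
        if b is False: return a
        return ('or', a, b)

    def not_(a):
        if a in (True, False): return not a
        return ('not', a)

    def progress(formula, state):
        """Rewrite `formula` through `state`; the result is the obligation
        that successor states must satisfy (True/False when resolved now)."""
        if formula in (True, False):
            return formula
        kind = formula[0]
        if kind == 'atom':                      # atomic proposition: check it now
            return formula[1] in state
        if kind == 'not':
            return not_(progress(formula[1], state))
        if kind == 'and':
            return and_(progress(formula[1], state), progress(formula[2], state))
        if kind == 'or':
            return or_(progress(formula[1], state), progress(formula[2], state))
        if kind == 'next':                      # X f: defer f to the successor
            return formula[1]
        if kind == 'until':                     # f1 U f2 = f2 or (f1 and X(f1 U f2))
            f1, f2 = formula[1], formula[2]
            return or_(progress(f2, state),
                       and_(progress(f1, state), formula))
        raise ValueError(kind)

    # Expanded MDP states pair a planning state with an obligation; successors
    # of (s, f) under an action are the pairs (s', progress(f, s)) for each outcome s'.
    state = frozenset({'loaded'})
    goal = ('until', ('atom', 'loaded'), ('atom', 'delivered'))
    print(progress(goal, state))   # -> the same 'until' obligation, carried forward

Because the obligation attached to a successor depends only on the current state and formula, the solution method can construct exactly the pairs it chooses to visit, which is the intuition behind blind minimality.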