Having defined a language to represent behaviours to be rewarded, we
now turn to the problem of computing, given a reward formula, a
minimum allocation of rewards to states actually encountered in an
execution sequence, in such a way as to satisfy the formula. Because
we ultimately wish to use anytime solution methods which generate
state sequences incrementally via forward search, this computation is
best done on the fly, while the sequence is being generated. We
therefore devise an incremental algorithm based on a model-checking
technique normally used to check whether a state sequence is a model
of an FLTL formula [4]. This technique is known
as formula progression because it `progresses' or `pushes' the
formula through the sequence.
Our progression technique is shown in Algorithm 1.
In essence, it computes the modelling relation $\models_B$ given in
Section 3.2. However, unlike the definition of
$\models_B$, it is designed to be useful when states in the sequence
become available one at a time, in that it defers the evaluation of
the part of the formula that refers to the future to the point where
the next state becomes available. Let $s$ be a state, say $\Gamma_i$,
the last state of the sequence prefix $\Gamma(i)$ that has been
generated so far, and let $b$ be a boolean which is true iff $\Gamma(i)$ is in
the behaviour $B$ to be rewarded. Let the
formula $f$ describe the allocation of rewards over all possible futures. Then the
progression of $f$ through $s$ given $b$, written $\mbox{Prog}(b,s,f)$, is a
new formula which describes the allocation of rewards over all
possible futures of the next state, given that we have just
passed through $s$. Crucially, the function $\mbox{Prog}$ is Markovian,
depending only on the current state $s$ and the single boolean value $b$.
Note that $\mbox{Prog}(b,s,f)$ is computable in linear time in the length of $f$,
and that for $-free formulae it collapses to FLTL formula
progression [4], regardless of the value of $b$.
We assume that $\mbox{Prog}$ incorporates the usual simplifications for the
sentential constants $\top$ and $\bot$: $f \wedge \bot$ simplifies to $\bot$,
$f \vee \bot$ simplifies to $f$, etc.
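To make the progression rules concrete, the following Python sketch shows one way $\mbox{Prog}$ can be realised (an illustration under our own tuple-based formula encoding, not the paper's Algorithm 1), including the constant simplifications just mentioned.
\begin{verbatim}
# Illustrative sketch of Prog.  A formula is True / False (the constants),
# the string '$', a proposition name, or a tuple ('not', f), ('and', f, g),
# ('or', f, g), ('next', f), ('until', f, g).

def simplify(f):
    """Apply the usual simplifications for the sentential constants."""
    if not isinstance(f, tuple):
        return f
    op, args = f[0], [simplify(a) for a in f[1:]]
    if op == 'not' and isinstance(args[0], bool):
        return not args[0]
    if op == 'and':
        if False in args: return False
        args = [a for a in args if a is not True]
        if not args: return True
        if len(args) == 1: return args[0]
        return ('and', args[0], args[1])
    if op == 'or':
        if True in args: return True
        args = [a for a in args if a is not False]
        if not args: return False
        if len(args) == 1: return args[0]
        return ('or', args[0], args[1])
    return tuple([op] + args)

def prog(b, state, f):
    """Progress f through `state` (a set of true propositions); the boolean
    b says whether the prefix ending in `state` is rewarded."""
    if isinstance(f, bool):
        return f
    if f == '$':                 # reward constant: decided by the boolean flag
        return b
    if isinstance(f, str):       # atomic proposition: decided by the state
        return f in state
    op = f[0]
    if op == 'not':
        return simplify(('not', prog(b, state, f[1])))
    if op in ('and', 'or'):
        return simplify((op, prog(b, state, f[1]), prog(b, state, f[2])))
    if op == 'next':             # O g progresses to g itself
        return f[1]
    if op == 'until':            # g U h: h now, or (g now and g U h next)
        g, h = f[1], f[2]
        return simplify(('or', prog(b, state, h),
                               ('and', prog(b, state, g), f)))
    raise ValueError('unknown operator: %r' % (op,))
\end{verbatim}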
The fundamental property of $\mbox{Prog}$ is the following. Where $b$ stands for $\Gamma(i) \in B$:

$(\Gamma, i) \models_B f$ iff $(\Gamma, i+1) \models_B \mbox{Prog}(b, \Gamma_i, f)$

Proof:
See Appendix B.
Like $\models_B$, the function $\mbox{Prog}$ seems to require $B$ (or at
least $b$) as input, but of course when progression is applied in
practice we only have $f$ and one new state at a time of $\Gamma$, and
what we really want to do is compute the appropriate $B$, namely
that represented by $f$. So, much as in
Section 3.2, we now turn to the second step, which
is to use $\mbox{Prog}$ to decide on the fly whether a newly generated
sequence prefix $\Gamma(i)$ is in $B_f$ and so should be allocated a
reward. This is the purpose of the functions $\mbox{\$Prog}$
and $\mbox{Rew}$, also given in Algorithm 1. Given $\Gamma$
and $f$, the function $\mbox{\$Prog}$ in
Algorithm 1 defines an infinite sequence of
formulae $\langle f_0, f_1, f_2, \ldots\rangle$ in the obvious way:
$f_0 = f$ and $f_{i+1} = \mbox{\$Prog}(\Gamma_i, f_i)$.
To decide whether a prefix $\Gamma(i)$ of $\Gamma$ is to be rewarded,
$\mbox{Rew}$ first tries progressing the formula $f_i$ through $\Gamma_i$
with the boolean flag set to `false'. If that gives a consistent
result, we need not reward the prefix and we continue without
rewarding $\Gamma(i)$, but if the result is $\bot$ then we know that
$\Gamma(i)$ must be rewarded in order for $\Gamma$ to satisfy $f$. In
that case, to obtain $f_{i+1}$ we must progress $f_i$ through $\Gamma_i$
again, this time with the boolean flag set to the value
`true'. To sum up, the behaviour corresponding to $f$ is
$\{\Gamma(i) \mid \Gamma \in S^{\omega},\ \mbox{Rew}(\Gamma_i, f_i)\}$.
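Continuing the sketch above, $\mbox{Rew}$ and $\mbox{\$Prog}$ then become thin wrappers around prog; again, these are illustrative definitions rather than the paper's code (dollar_prog is our name for $\mbox{\$Prog}$).
\begin{verbatim}
def rew(state, f):
    """Reward the current prefix iff progressing with the flag set to
    false is inconsistent (i.e. yields bottom)."""
    return prog(False, state, f) is False

def dollar_prog(state, f):
    """Progress the reward formula, rewarding the prefix only if forced to."""
    return prog(rew(state, f), state, f)
\end{verbatim}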
To illustrate the behaviour of $FLTL progression, consider the
formula $f = \neg p \,{\rm U}\, (p \wedge \$)$, stating that a reward
will be received the first time $p$ is true. Let $\Gamma_0$ be a state in
which $p$ holds; then $\mbox{Prog}(\mbox{false}, \Gamma_0, f) = \bot$. Therefore,
since the formula has progressed to $\bot$,
$\mbox{Rew}(\Gamma_0, f)$ is true and a reward is received.
$\mbox{\$Prog}(\Gamma_0, f) = \mbox{Prog}(\mbox{true}, \Gamma_0, f) = \top$, so the reward formula
fades away and will not affect subsequent progression steps. If, on the
other hand, $p$ is false in $\Gamma_0$, then
$\mbox{Prog}(\mbox{false}, \Gamma_0, f) = f$. Therefore, since the formula has not
progressed to $\bot$, $\mbox{Rew}(\Gamma_0, f)$ is false and no reward is received.
$\mbox{\$Prog}(\Gamma_0, f) = f$, so the
reward formula persists as is for subsequent progression steps.
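Running the sketch above on this example (with our tuple encoding of $\neg p \,{\rm U}\, (p \wedge \$)$) reproduces both cases:
\begin{verbatim}
f = ('until', ('not', 'p'), ('and', 'p', '$'))   # ~p U (p & $)

s0 = {'p'}                        # p holds in the first state
print(rew(s0, f))                 # True:  the first prefix is rewarded
print(dollar_prog(s0, f))         # True:  the formula fades away (top)

s0 = set()                        # p does not hold in the first state
print(rew(s0, f))                 # False: no reward yet
print(dollar_prog(s0, f) == f)    # True:  the formula persists unchanged
\end{verbatim}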
The following theorem states that under weak assumptions, rewards are
correctly allocated by progression:
Theorem 1
Let $f_0$ be reward-normal, and let $\langle f_1, f_2, \ldots\rangle$
be the result of progressing it through the successive states of a
sequence $\Gamma$ using the function $\mbox{\$Prog}$.
Then, provided no $f_i$ is $\bot$, for all $i$,
$\mbox{Rew}(\Gamma_i, f_i)$ iff $\Gamma(i) \in B_{f_0}$.
Proof: See Appendix B.
The premise of the theorem is that $f$ never progresses to
$\bot$. Indeed, if $f_{i+1} = \bot$ for some $i$, it means that even
rewarding $\Gamma(i)$ does not suffice to make $f_i$ true, so something
must have gone wrong: at some earlier stage, the boolean $b$ was
made false where it should have been made true. The usual explanation
is that the original $f_0$
was not reward-normal. For instance $\bigcirc p \rightarrow \$$, which is reward-unstable, progresses to $\bot$ in
the next state if $p$ is true there: regardless of $\Gamma_0$ and $b$,
$\mbox{Rew}(\Gamma_0, \bigcirc p \rightarrow \$)$ is false,
$\mbox{\$Prog}(\Gamma_0, \bigcirc p \rightarrow \$) = \neg p$,
and $\mbox{Prog}(b, \Gamma_1, \neg p) = \bot$ whenever $p \in \Gamma_1$, so if $p$ is true in $\Gamma_1$
then $f_2 = \bot$. However, other (admittedly
bizarre) possibilities exist: for example, although $\bigcirc p \rightarrow \$$
is reward-unstable, a substitution instance of it such as
$\bigcirc(p \vee \neg p) \rightarrow \$$, which also progresses to $\bot$
in a few steps, is logically equivalent to $\$$ and is
reward-normal.
If the progression method were to deliver the correct minimal behaviour
in all cases (even in all reward-normal cases) it would have to
backtrack on the choice of values for the boolean flags. In the
interest of efficiency, we choose not to allow backtracking. Instead,
our algorithm raises an exception whenever a reward formula progresses
to $\bot$, and informs the user of the sequence which caused the
problem. The onus is thus placed on the domain modeller to select
sensible reward formulae so as to avoid possible progression to
$\bot$.
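A driver loop along the following lines (again only a sketch on top of the earlier definitions) shows where such an exception would arise; the reward-unstable formula $\bigcirc p \rightarrow \$$ discussed above, encoded as $\neg \bigcirc p \vee \$$, trips it as soon as $p$ becomes true in the second state.
\begin{verbatim}
def progress_sequence(f, states):
    """Yield, for each new state, whether the prefix ending there is
    rewarded; raise if the reward formula ever progresses to bottom."""
    for i, s in enumerate(states):
        r = rew(s, f)
        f = dollar_prog(s, f)
        if f is False:
            raise ValueError('reward formula progressed to bottom at step '
                             '%d; is the original formula reward-normal?' % i)
        yield r

g = ('or', ('not', ('next', 'p')), '$')       # O p -> $  (reward-unstable)
list(progress_sequence(g, [set(), {'p'}]))    # raises at step 1
\end{verbatim}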
It should be noted that, in the worst case, detecting reward-normality
cannot be easier than the decision problem for $FLTL, so a simple
syntactic criterion for reward-normality is not to be expected.
In practice, however, commonsense precautions such as avoiding making
rewards depend explicitly on future tense expressions suffice to keep
things normal in all routine cases. For a generous class of
syntactically recognisable reward-normal formulae, see Appendix
A.