Figure 2: Average number of interactions per conversation.
The results of this experiment generally supported our hypothesis
with respect to efficiency. We provide figures that show average
values over all users in a particular group, with error bars showing
the 95% confidence intervals. The x axis always shows the
progression of users' interactions with the system over time: each
point corresponds to the nth conversation, which ended either when the
user found an acceptable restaurant or when the user quit.
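The error bars described above can be computed in the usual way from the per-user values at each conversation index. The paper does not give its exact procedure; a minimal sketch using the t distribution (function name and data are illustrative, not from the study) is:

```python
import numpy as np
from scipy import stats

def mean_ci95(values):
    """Return the mean and the half-width of the 95% confidence
    interval for a sample, using the t distribution."""
    values = np.asarray(values, dtype=float)
    n = len(values)
    mean = values.mean()
    sem = values.std(ddof=1) / np.sqrt(n)      # standard error of the mean
    half = stats.t.ppf(0.975, df=n - 1) * sem  # two-sided 95% half-width
    return mean, half
```

Applied to one group's interaction counts for a given conversation index, `mean_ci95` yields the point plotted and the extent of its error bar.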
Figure 2 shows that, for the modeling group, the average
number of interactions required to find an acceptable restaurant
decreased from 8.7 to 5.5, whereas for the control group this quantity
actually increased from 7.6 to 10.3. We used linear regression to
characterize the trend for each group and compared the resulting
lines. The slope for the modeling line differed significantly
(p=0.017) from that for the control line, with the former smaller
than the latter, as expected.
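The paper does not specify how the two regression slopes were compared; one standard approach is a t-test on the difference between two independently fitted slopes. The sketch below uses this method with synthetic data shaped like the reported trends (the function name and data are illustrative assumptions):

```python
import numpy as np
from scipy import stats

def compare_slopes(x1, y1, x2, y2):
    """Two-sided t-test for equality of two independent regression slopes."""
    r1 = stats.linregress(x1, y1)
    r2 = stats.linregress(x2, y2)
    se_diff = np.hypot(r1.stderr, r2.stderr)   # SE of the slope difference
    df = len(x1) + len(x2) - 4                 # two slopes, two intercepts
    t = (r1.slope - r2.slope) / se_diff
    p = 2 * stats.t.sf(abs(t), df)
    return r1.slope, r2.slope, p

# Illustrative data mimicking the reported trends over eight conversations:
x = list(range(1, 9))
y_modeling = [8.7, 8.2, 7.6, 7.2, 6.5, 6.1, 5.8, 5.5]   # decreasing
y_control = [7.6, 8.0, 8.5, 8.9, 9.4, 9.7, 10.0, 10.3]  # increasing
s1, s2, p = compare_slopes(x, y_modeling, x, y_control)
```

On data like this, the modeling slope is negative, the control slope positive, and the test rejects equality of the slopes.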
The difference in interaction times (Figure 3)
was even more dramatic. For the
modeling group, this quantity started at 181 seconds and ended
at 96 seconds, whereas for the control group, it started at 132 seconds
and ended at 152 seconds. We again used linear regression to
characterize each group's trend over time and again found
a significant difference (p=0.011) between the two regression lines,
with the slope for the modeling subjects smaller than that for the
control subjects.
We should also note that these measures include some time for
system initialization (which could account for up to 10% of the total
dialogue time). Had we instead measured from the first system
utterance of each dialogue, the difference between the two conditions
would be even clearer.
Figure 3: Average time per conversation.
The speech recognizer rejected 28 percent of the interactions in our
study. Rejections slow down the conversation but do not introduce
errors. The misrecognition rate was much lower: errors occurred in only
7 percent of the interactions in our experiment. We feel both of
these rates are acceptable, but expanding the number of supported
utterances could reduce the first number further, while
potentially increasing the second. In the most common recognition
error, the ADAPTIVE PLACE ADVISOR inserted extra constraints that the
user did not intend.
The results for effectiveness were more
ambiguous. Figure 4 plots the rejection rate as a
function of the number of sessions. A decrease in rejection rate over
time would mean that, as the system gains experience with the user, it
asks about fewer features irrelevant to that user. However, for this
dependent variable we found no significant difference (p=0.515)
between the regression slopes for the two conditions and, indeed,
neither group's rejection rate appears to decrease with
experience. These negative results may be due to the rarity of
rejection speech acts in the experiment. Six people never rejected a
constraint, and on average each person used only 0.53
REJECT speech acts per conversation following an
ATTEMPT-CONSTRAIN (standard deviation = 0.61).
Figure 5 shows the results for hit rate, which
indicate that suggestion accuracy stayed stable over time for the
modeling group but decreased for the control group. One explanation
for this unexpected decline is that control users became
less satisfied with the PLACE ADVISOR's suggestions over time
and thus explored more at item presentation time.
However, we are more concerned here with the difference between the
two groups. Unfortunately, the slopes for the two regression lines
were not significantly different (p=0.1354) in this case.
Figure 4: Rejection rate for modeling and control groups.
We also analyzed the questionnaire presented to subjects after the
experiment. The first six questions (see Appendix A)
used check boxes, to which we assigned
numerical values; none of these responses revealed a significant
difference between the two groups. The second part of the questionnaire contained
more open-ended questions about the user's experience with the ADAPTIVE
PLACE ADVISOR. In general, most subjects in both groups liked the system
and said they would use it fairly often if given the opportunity.
Cindi Thompson
2004-03-29