Varying the Predictor

In the first set of experiments, we attempted to determine how the quality of ATTac-2001's hotel price predictions affects its performance. To this end, we devised seven price prediction schemes, varying considerably in sophistication and inspired by approaches taken by other TAC competitors, and incorporated these schemes into our agent. We then played these seven agents against one another repeatedly, with regular retraining as described below.

Following are the seven hotel prediction schemes that we used, in decreasing order of sophistication:

In our experiments, we added an eighth agent, EarlyBidder, inspired by the livingagents agent. EarlyBidder used $\mbox{\sf {SimpleMean}}_{ev}$ to predict closing prices and determined an optimal set of purchases. Right after the first flight quotes, it placed bids for these goods at prices high enough to ensure that they would be purchased ($1001 for all hotel rooms, just as livingagents did in TAC-01). It never revised these bids.

Each of these agents requires training, i.e., data from previously played games. However, we are faced with a sort of ``chicken and egg'' problem: to run the agents, we first need to train them using data from games in which they were involved, but to get this kind of data, we first need to run the agents. To get around this problem, we ran the agents in phases. In Phase I, which consisted of 126 games, we used training data from the seeding, semifinals and finals rounds of TAC-01. In Phase II, lasting 157 games, we retrained the agents once every six hours using all of the data from the seeding, semifinals and finals rounds as well as all of the games played in Phase II. Finally, in Phase III, lasting 622 games, we continued to retrain the agents once every six hours, but now using only data from games played during Phases I and II, and not including data from the seeding, semifinals and finals rounds.
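The phase-dependent choice of training data can be summarized in a small sketch. The `Game` record, the source labels, and the `training_pool` function are hypothetical names introduced only for illustration; the six-hour retraining cadence is not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Game:
    # Hypothetical label: "TAC-01" for the seeding, semifinals and finals
    # rounds, or "I", "II", "III" for games played in that phase.
    source: str


# Which game sources feed training in each phase, as described in the text.
TRAINING_SOURCES = {
    "I": {"TAC-01"},
    "II": {"TAC-01", "II"},
    "III": {"I", "II"},
}


def training_pool(phase, games):
    """Return the games that would be used for training in the given phase."""
    allowed = TRAINING_SOURCES[phase]
    return [g for g in games if g.source in allowed]


if __name__ == "__main__":
    games = [Game("TAC-01"), Game("I"), Game("II"), Game("III")]
    for phase in ("I", "II", "III"):
        print(phase, [g.source for g in training_pool(phase, games)])
```

Under this reading, Phase III is the only phase whose training data come entirely from games among the experimental agents themselves.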

Table 12: The average relative scores ($\pm$ standard deviation) for eight agents in the three phases of our controlled experiment in which the hotel prediction algorithm was varied. The relative score of an agent is its score minus the average score of all agents in that game. The agent's rank within each phase is shown in parentheses.

Agent	Relative Score
	Phase I	Phase II	Phase III
$\mbox{\sf {ATTac-2001}}_{ev}$	$105.2 \pm 49.5$ (2)	$131.6 \pm 47.7$ (2)	$166.2 \pm 20.8$ (1)
$\mbox{\sf {ATTac-2001}}_{s}$	$27.8 \pm 42.1$ (3)	$86.1 \pm 44.7$ (3)	$122.3 \pm 19.4$ (2)
EarlyBidder	$140.3 \pm 38.6$ (1)	$152.8 \pm 43.4$ (1)	$117.0 \pm 18.0$ (3)
$\mbox{\sf {SimpleMean}}_{ev}$	$-28.8 \pm 45.1$ (5)	$-53.9 \pm 40.1$ (5)	$-11.5 \pm 21.7$ (4)
$\mbox{\sf {SimpleMean}}_{s}$	$-72.0 \pm 47.5$ (7)	$-71.6 \pm 42.8$ (6)	$-44.1 \pm 18.2$ (5)
$\mbox{\sf {Cond'lMean}}_{ev}$	$8.6 \pm 41.2$ (4)	$3.5 \pm 37.5$ (4)	$-60.1 \pm 19.7$ (6)
$\mbox{\sf {Cond'lMean}}_{s}$	$-147.5 \pm 35.6$ (8)	$-91.4 \pm 41.9$ (7)	$-91.1 \pm 17.6$ (7)
CurrentBid	$-33.7 \pm 52.4$ (6)	$-157.1 \pm 54.8$ (8)	$-198.8 \pm 26.0$ (8)
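As a minimal sketch of the relative-score definition in the caption of Table 12, the following computes each agent's score minus the per-game average, then aggregates over games. The function names and the example scores are illustrative assumptions, not data from the experiment, and the use of the sample standard deviation is likewise an assumption, since the text does not specify how the $\pm$ values were computed.

```python
from statistics import mean, stdev


def relative_scores(game_scores):
    """Map each agent to its score minus the average score in that game."""
    avg = mean(game_scores.values())
    return {agent: score - avg for agent, score in game_scores.items()}


def summarize(games):
    """Return {agent: (mean relative score, sample std dev)} over all games.

    Requires at least two games per agent, since stdev needs two points.
    """
    per_agent = {}
    for game in games:
        for agent, rel in relative_scores(game).items():
            per_agent.setdefault(agent, []).append(rel)
    return {agent: (mean(v), stdev(v)) for agent, v in per_agent.items()}


if __name__ == "__main__":
    # Made-up scores for three hypothetical agents in two games.
    games = [
        {"A": 3000, "B": 2000, "C": 1000},
        {"A": 2600, "B": 2200, "C": 1800},
    ]
    for agent, (m, s) in summarize(games).items():
        print(f"{agent}: {m:.1f} +/- {s:.1f}")
```

By construction, the relative scores within any single game sum to zero, which is why the entries in each column of the table roughly cancel.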

Table 12 shows how the agents performed in each of these phases. Much of what we observe in this table is consistent with our expectations. The more sophisticated boosting-based agents ($\mbox{\sf {ATTac-2001}}_{s}$ and $\mbox{\sf {ATTac-2001}}_{ev}$) clearly dominated the agents based on simpler prediction schemes. Moreover, with continued training, these agents improved markedly relative to EarlyBidder. We also see the performance of the simplest agent, CurrentBid, which does not employ any kind of training, decline significantly relative to the data-driven agents.

On the other hand, there are some phenomena in this table that were very surprising to us. Most surprising was the failure of sampling to help. Our strategy relies heavily not only on estimating hotel prices, but also on taking samples from the distribution of hotel prices. Yet these results indicate that using the expected hotel price, rather than price samples, consistently performs better. We speculate that this may be because too few samples are being used (due to computational limitations), so that the numbers derived from these samples have too high a variance. Another possibility is that the method of using samples to estimate scores consistently overestimates the expected score because it assumes the agent can behave with perfect knowledge for each individual sample, a property of our approximation scheme. Finally, as our algorithm uses sampling at several different points (computing hotel expected values, deciding when to buy flights, pricing entertainment tickets, etc.), it is quite possible that sampling is beneficial for some decisions while detrimental for others. For example, when directly comparing versions of the algorithm with sampling used at only subsets of the decision points, the data suggest that sampling for the hotel decisions is most beneficial, while sampling for the flights and entertainment tickets is neutral at best, and possibly detrimental. This result is not surprising given that the sampling approach is motivated primarily by the task of bidding for hotels.
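The overestimation effect mentioned above can be seen in a toy example. This is not the approximation scheme used by ATTac-2001; the two plans, the payoff function, and the price samples are hypothetical and chosen only to show that averaging the best outcome per sample is never lower than the best average outcome of a single committed plan.

```python
PLANS = ("stay", "skip")


def value(plan, price):
    # Hypothetical payoffs: "stay" earns 200 minus the hotel price,
    # while "skip" earns a flat 80 regardless of the price.
    return 200 - price if plan == "stay" else 80


def per_sample_estimate(prices):
    """Average of the best plan's value for each sample separately.

    This assumes the agent knows the price before choosing a plan.
    """
    return sum(max(value(p, price) for p in PLANS) for price in prices) / len(prices)


def committed_estimate(prices):
    """Best average value achievable by committing to one plan up front."""
    return max(sum(value(p, price) for price in prices) / len(prices) for p in PLANS)


if __name__ == "__main__":
    prices = [50, 150]  # two equally likely price samples (illustrative)
    print(per_sample_estimate(prices))  # 115.0
    print(committed_estimate(prices))   # 100.0
```

With these numbers, the per-sample estimate (115) exceeds what any single plan can actually achieve on average (100), because it credits the agent with choosing "stay" when the price is low and "skip" when it is high.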

We were also surprised that $\mbox{\sf {Cond'lMean}}_{s}$ and $\mbox{\sf {Cond'lMean}}_{ev}$ eventually performed worse than the less sophisticated $\mbox{\sf {SimpleMean}}_{s}$ and $\mbox{\sf {SimpleMean}}_{ev}$. One possible explanation is that the simpler model happens to give predictions that are just as good as the more complicated model, perhaps because closing time is not terribly informative, or perhaps because the adjustment to price based on current price is more significant. Other things being equal, the simpler model has the advantage that its statistics are based on all of the price data, regardless of closing time, whereas the conditional model makes each prediction based on only an eighth of the data (since there are eight possible closing times, each equally likely).
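The data-splitting argument can be made concrete with a simplified sketch. It assumes a history of (closing time, closing price) pairs and omits the adjustment based on current price mentioned above; the function names and the synthetic history are hypothetical and are not the predictors used in the experiments.

```python
from statistics import mean


def simple_mean(history):
    """Predict from all past closing prices; returns (prediction, sample count)."""
    prices = [price for _, price in history]
    return mean(prices), len(prices)


def conditional_mean(history, closing_time):
    """Predict from past prices with the same closing time only."""
    prices = [price for t, price in history if t == closing_time]
    return mean(prices), len(prices)


if __name__ == "__main__":
    # Synthetic history: two records for each of eight closing times.
    history = [(t, 100 + 5 * t + d) for t in range(8) for d in (-10, 10)]
    print(simple_mean(history))          # uses all 16 records
    print(conditional_mean(history, 3))  # uses only 2 of them
```

Each conditional estimate rests on roughly one eighth as many observations, so it is noisier unless closing time carries enough information to compensate.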

In addition to agent performance, it is possible to measure the inaccuracy of the eventual predictions, at least for the non-sampling agents. For these agents, we measured the root mean squared error of the predictions made in Phase III. These were: 56.0 for $\mbox{\sf {ATTac-2001}}_{ev}$, 66.6 for $\mbox{\sf {SimpleMean}}_{ev}$, 69.8 for CurrentBid and 71.3 for $\mbox{\sf {Cond'lMean}}_{ev}$. Thus, we see that the lower the error of the predictions (according to this measure), the higher the score (correlation