Active Learning
As indicated in the previous sections, we have built an integrated
system for language acquisition that is flexible and useful.
However, a major difficulty remains: the construction
of training corpora. Though annotating sentences is still arguably
less work than building an entire system by hand, the annotation task
is also time-consuming and error-prone. Further, the training pairs often
contain redundant information. We would like to minimize the amount
of annotation required while still maintaining good generalization
accuracy.
To do this, we turned to methods in active learning. Active
learning is a research area in machine learning that
features systems that automatically select the most informative
examples for annotation and training
[Angluin, 1988; Seung et al., 1992], rather than relying on a benevolent
teacher or random sampling. The primary
goal of active learning is to reduce the number of examples that the
system is trained on, while maintaining the accuracy of the acquired
information. Active learning systems may construct their own
examples, request certain types of examples, or determine which of a
set of unsupervised examples are most usefully labeled. The last
approach, selective sampling [Cohn et al., 1994], is particularly
attractive in natural language learning, since there is an abundance
of text, and we would like to annotate only the most informative
sentences. For many language learning tasks, annotation is
particularly time-consuming since it requires specifying a complex
output rather than just a category label, so reducing the number of
training examples required can greatly increase the utility of
learning.
In this section, we explore the use of active learning, specifically
selective sampling, for lexicon acquisition, and demonstrate that with
active learning, fewer examples are required to achieve the same
accuracy obtained by training on randomly chosen examples.
The basic algorithm for selective sampling
is relatively simple. Learning begins with a small pool of
annotated examples and a large pool of unannotated examples, and the
learner attempts to choose the most informative additional examples
for annotation. Existing work in the area has emphasized two
approaches: certainty-based methods [Lewis & Catlett, 1994] and
committee-based methods
[McCallum & Nigam, 1998; Freund et al., 1997; Liere & Tadepalli, 1997; Dagan & Engelson, 1995; Cohn et al., 1994];
we focus here on the former.
In the certainty-based paradigm, a system is trained on a small number
of annotated examples to learn an initial classifier. Next, the
system examines unannotated examples, and attaches certainties to the
predicted annotation of those examples. The k examples with the
lowest certainties are then presented to the user for annotation and
retraining. Many methods for attaching certainties have been used,
but they typically attempt to estimate the probability that a
classifier consistent with the prior training data will classify a new
example correctly.
Figure 20: Selective Sampling Algorithm
Figure 20 presents abstract pseudocode for
certainty-based selective sampling. Ideally, the batch size, k, would
be set to one so that each subsequent selection is as well informed as
possible, but because retraining a batch learning algorithm after every
example is expensive, k is frequently set higher in practice. Results on a number
of classification tasks have demonstrated that this general approach
is effective in reducing the need for labeled examples (see citations
above).
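As a concrete illustration, the following Python sketch follows the general
certainty-based loop of Figure 20; the train, certainty, and annotate
functions, along with the default parameter values, are hypothetical
placeholders for the learner-specific pieces rather than part of WOLFIE or
CHILL.

    # A minimal sketch of certainty-based selective sampling (cf. Figure 20).
    # `train`, `certainty`, and `annotate` are hypothetical placeholders for
    # the learner-specific steps; they are not part of WOLFIE or CHILL.
    def selective_sampling(annotated, unannotated, train, certainty, annotate,
                           k=10, rounds=10):
        """Repeatedly train, score the unannotated pool, and request labels
        for the k least certain examples."""
        model = train(annotated)
        for _ in range(rounds):
            if not unannotated:
                break
            # Score every unannotated example with the current model and
            # send the k lowest-certainty examples to the annotator.
            scored = sorted(unannotated, key=lambda ex: certainty(model, ex))
            batch, unannotated = scored[:k], scored[k:]
            annotated.extend(annotate(ex) for ex in batch)
            model = train(annotated)
        return model, annotated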
Applying certainty-based sample selection to WOLFIE requires
determining the certainty of a complete annotation of a potential new
training example, despite the fact that individual learned lexical
entries and parsing operators perform only part of the overall annotation task.
Therefore, our general approach is to compute certainties for pieces
of an example, in our case, phrases, and combine these to obtain an
overall certainty for an example. Since lexicon entries contain no
explicit uncertainty parameters, we used WOLFIE's
heuristic measure to estimate uncertainty.
To choose the sentences to be annotated in each round, we first
bootstrapped an initial lexicon from a small corpus, keeping track of
the heuristic values of the learned items. Then, for each unannotated
sentence, we took an average of the heuristic values of the lexicon
entries learned for phrases in that sentence, giving a value of zero
to unknown words but eliminating from consideration any words that we
assume are known in advance, such as database constants. Thus, longer
sentences with only a few known phrases would have a lower certainty
than shorter sentences with the same number of known phrases; this is
desirable since longer sentences will be more informative from a
lexicon learning point of view.
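A minimal sketch of this sentence-scoring step follows, assuming the learned
lexicon is a mapping from phrases to their WOLFIE heuristic values and that
a set of words known in advance (such as database constants) is available;
for simplicity, the sketch treats each word as a candidate phrase.

    # A sketch of the sentence-certainty measure described above. `lexicon`
    # is assumed to map learned phrases to their heuristic values and `known`
    # holds words assumed known in advance (e.g., database constants); both
    # the data structures and the word-level segmentation are assumptions.
    def sentence_certainty(sentence, lexicon, known):
        scores = []
        for word in sentence.split():
            if word in known:
                continue                           # skip words known in advance
            scores.append(lexicon.get(word, 0.0))  # unknown words count as zero
        # Average over the remaining words; longer sentences with few known
        # phrases therefore receive lower certainty.
        return sum(scores) / len(scores) if scores else 0.0

This scoring function would play the role of the certainty estimate in the
earlier selective sampling sketch.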
The sentences with the lowest values were chosen for annotation, added to
the bootstrap corpus, and a new lexicon was learned from the expanded set.
Our technique is summarized in Figure 21.
Figure 21: Active Learning for WOLFIE
To evaluate our technique, we compared active learning to learning from
randomly selected examples, again measuring the effectiveness of the learned
lexicons as background knowledge for CHILL. As in the original WOLFIE tests,
we used the (smaller) U.S. Geography corpus, with the learned lexicons
serving as background knowledge during parser acquisition (and with the same
examples used for both lexicon and parser acquisition).
For each trial in the following experiments, we first randomly divide the
data into a training set and a test set. Then, n=25 bootstrap examples are
randomly selected from the training data, and in each step of active
learning, the k=10 least certain of the remaining training examples are
selected and added to the annotated training set. The result of learning on
this set is evaluated after each step.
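Under the same assumptions, one trial of this protocol has roughly the
following shape; the bootstrap size of 25 and batch size of 10 come from the
text, while the train/test split ratio and the helper functions reuse the
hypothetical sketches above, and per-step evaluation on the test set is
omitted for brevity.

    # Rough shape of one experimental trial, reusing the hypothetical helpers
    # sketched above. The bootstrap size (25) and batch size (10) follow the
    # text; the split ratio is an assumption.
    import random

    def run_trial(examples, train, annotate, known, test_fraction=0.25):
        random.shuffle(examples)
        split = int(len(examples) * (1 - test_fraction))
        training, test = examples[:split], examples[split:]

        annotated = [annotate(s) for s in training[:25]]  # n = 25 bootstrap examples
        pool = training[25:]

        # The learned lexicon serves as the "model"; its heuristic values
        # drive the certainty estimate from the previous sketch.
        certainty = lambda lexicon, s: sentence_certainty(s, lexicon, known)
        lexicon, annotated = selective_sampling(annotated, pool, train,
                                                certainty, annotate, k=10)
        return lexicon, annotated, test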
The accuracy of the resulting parsers was compared to that of parsers
learned using randomly chosen examples for both lexicon and parser learning,
as in Section 5; in other words, the random baseline can be viewed as
choosing the k examples in each round at random.
Figure 22 shows the accuracy on unseen data of parsers learned using the
lexicons acquired by WOLFIE when examples are chosen randomly versus
actively.
Figure 22: Using Lexicon Certainty for Active Learning
Active learning yields an annotation savings of around 50 examples: the
maximum accuracy is reached after 175 examples, versus 225 with randomly
chosen examples. The advantage of active learning is apparent from the
beginning, though the differences between the two curves are statistically
significant only at 175 training examples. Since we are learning both
lexicons and parsers, but choosing examples based only on WOLFIE's certainty
measures, the savings could likely be increased further if CHILL also had a
say in the examples chosen. See Thompson (1999) for a description of active
learning for CHILL.