Optimizing in our Setting

Next: Explicit Solution in the Up: Calculating Rule Vectors using Previous: Optimizing in the Setting

Optimizing in our Setting

As previously argued in Theorem 4, minimizing in the setting of Schapire $\&$ Singer (1998) can be done optimally, but at the expense of complex optimization procedures, with large complexities. One can wonder whether such procedures, to optimize only the computation of $\vec{v}$ (a small part of WIDC), are really well worth the adaptation to our setting, in which more values are authorized. We are now going to show that a much simpler combinatorial procedure, with comparatively very low complexity, can bring optimal results in fairly general situations. The most simple way to describe most of these situations is to make the following assumption on the examples:

( ${\cal A}$ ): Each example used to compute $\vec{v}$ has only one ``1'' in its class vector.

A careful reading of assumption ( ${\cal A}$ ) reveals that it implies that each example belongs to exactly one class, but it does not prevent an observation to be element of more than one class, as long as different examples sharing the same observation have different classes (the ``1'' of the class vectors is in different positions among these examples). Therefore, even if it does not integrate the most general features of the ranking loss setting, our assumption still authorizes to consider problems with non zero Bayes optimum. This is really interesting, as many commonly used datasets fall into the category of our assumption, as for example many datasets of the UCI repository of Machine Learning database [Blake et al.1998]. Finally, even if the assumption does not hold, we show that in many of the remaining (interesting) cases, our approximation algorithm is asymptotically optimal, that is, finds solutions closer to the minimal value of

increases.
Suppose for now that ( ${\cal A}$ ) holds. Our objective is to calculate the vector $\vec{v}$ of some monomial

. We use the shorthands $W_0^+, W_1^+, ..., W_{c-1}^+$ to denote the sum of weights of the examples satisfying

and belonging respectively to classes

. We want to minimize

as proposed in eq. (3). Suppose without loss of generality that

$\begin{eqnarray*}W_0^+\leq W_1^+ & \leq & ...\leq W_{c-1}^+ \:\:, \end{eqnarray*}$

otherwise, reorder the classes so that they verify this assertion. Given only three possible values for each component of $\vec{v}$ , the testing of all

possibilities for $\vec{v}$ is exponential and time-consuming. But we can propose a very fast approach. We have indeed

Lemma 1 $\forall 1\leq j<k\leq c, \vec{v}[j]\leq \vec{v}[k]$ .

Proof: See the Appendix. $\hbox{\vrule width 0.8pt \vbox to6pt{\hrule depth 0.8pt width 5.2pt \vfill\hrule depth 0.8pt}\vrule width 0.8pt}$
Thus, the optimal $\vec{v}$ does not belong to a set of cardinality

, but to a set of cardinality ${\cal O}(c^2)$ . Our algorithm is then straightforward: simply explore this set of ${\cal O}(c^2)$ elements, and keep the vector having the lowest value of

. Note that this combinatorial algorithm has the advantage to be adaptable to more general settings in which

particular values are authorized for the components of $\vec{v}$ , for any fixed

not necessarily equal to 3. In that case, the complexity is larger, but limited to ${\cal O}(c^{l-1})$ .
There are slightly more general settings in which our algorithm remains optimal, in particular when we can certify $\forall j, k$ , $( (W^+_{j} > W^+_{k}) \Leftrightarrow (\forall i \neq j, k: W_{j,i} > W_{k,i}))... ... (W^+_{j} < W^+_{k}) \Leftrightarrow (\forall i \neq j, k: W_{j,i} > W_{k,i}) )$ . Here, $W^+_{x}$ denotes the sum of weights of the examples belonging at least to class

, and $W_{x,y}$ denotes the sum of weights of the examples belonging at least to class

, and not belonging at least to class

. This shows that even for some particular multilabel cases, our approximation algorithm can remain optimal. One can wonder if the optimality is preserved in the unrestricted multilabel framework. We now show that, if optimality is not preserved, we can still prove the quality of our algorithm for general multilabel cases, showing asymptotic optimality as

increases.
Our approximation algorithm is run in the multilabel case by transforming the examples as follows: each example $(o,\vec{c}_o)$ for which $1_{\vec{c}_o}>1$ is transformed into $1_{\vec{c}_o}$ examples, having the same description

, and only one ``1'' in their vector, in such a way that we span the $1_{\vec{c}_o}>1$ ``1'' of the original example. Their weight is the one of the original example, divided by $1_{\vec{c}_o}$ . We then run our algorithm on this new set of examples satisfying assumption ( ${\cal A}$ ).
Now, suppose that for any example $(o,\vec{c}_o)$ , we have $1_{\vec{c}_o}\leq k$ for some

. There are two interesting vectors we use. The first one is $\vec{v}^*$ , the optimal vector (or an optimal vector) minimizing

over the original set of examples, the second one is $\vec{v}$ , the vector we find minimizing

over the transformed set of examples. What we want is to estimate the quality of $\vec{v}$ with respect to the optimal value of

over the original set of examples, $Z(\vec{v}^*)$ using our notation. The following theorem gives an answer to this problem, by quantifying its convergence towards $Z(\vec{v}^*)$ .

Theorem 6 $Z(\vec{v}) < Z(\vec{v}^*) \left(1+\left(\frac{e}{c-k}\right)\right)$ .

Proof: See the Appendix. $\hbox{\vrule width 0.8pt \vbox to6pt{\hrule depth 0.8pt width 5.2pt \vfill\hrule depth 0.8pt}\vrule width 0.8pt}$
Therefore, in the set of all problems for which for some $\beta<1$ , $k\leq \beta c$ , we obtain $Z(\vec{v})=(1+o(1)) Z(\vec{v}^*)$ , and our bound converges to the optimum as

increases in this class of problems. By means of words, our simple approximation algorithm is quite efficient for problems with large number of classes. Note that using a slightly more involved proof, we could have reduced the constant ``

'' factor in Theorem 6 to the slightly smaller ``

''. Now, to fix the ideas, the following subsection displays the explicit (and simple) solution when there are only two classes.

Next: Explicit Solution in the Up: Calculating Rule Vectors using Previous: Optimizing in the Setting