mySVM/db is an implementation of the Support Vector Machine introduced by V. Vapnik (see [Vapnik/98a]). It is a further development of mySVM and is designed to run directly inside a relational database using an internal Java engine. The underlying algorithm is based on the optimization algorithm of SVMlight as described in [Joachims/99a]. mySVM/db provides code for both pattern recognition and regression analysis.
The goal of mySVM/db is to provide users of data-mining applications with an easy-to-use SVM implementation that takes advantage of the features modern database systems offer for handling large amounts of data, such as usability, manageability, security, and standardized interfaces. On the downside, please note that the direct connection of the program to a database server has a serious impact on runtime. So, if performance is your top goal, you should export your data from the database and run some SVM software (e.g. mySVM) on a dedicated machine.
Please note that, for now, mySVM/db should still be regarded as beta software. Do not run it on production servers unless you have tested it thoroughly for your purposes. In particular, mySVM/db was designed such that it only needs read access to the input data, so you should run it with minimal access rights. Finally, as the license says: The author is not responsible for implications from the use of this software.
This software is free only for non-commercial use. It must not be distributed without prior permission of the author. The author is not responsible for implications from the use of this software.
Download the newest version of mySVM/db: mysvmDB.0.9.1.zip (Version 0.9.1, October 16th, 2002).
mySVM/db was developed and tested on an Oracle 8.1.6 database. It should run on any newer Oracle database without a problem. With small modifications it should also run on any other database offering a JDBC interface. All you need to do is write a Java class for your database system, corresponding to mySVMdb/Container/OracleDatabaseContainer.java, that instantiates the correct database driver.
If you are using mySVM/db on any database other than Oracle 8.1.6, please e-mail your experiences to rueping@ls8.cs.uni-dortmund.de.
When mySVM/db is called, it needs to be given the name of a parameter table. It then reads all necessary parameters, including the names for the input and output table, from the parameter table. For the optimization process, mySVM/db reads the examples from the input table. It also creates a temporary table to store some intermediate results. After the optimization is finished, the SVM model is stored in a new table. Also, a new view on the input table is created that shows the SVM predictions for each example.
WARNING: The spelling of all parameters is case-sensitive. All mySVM/db parameters have lower-case names.
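For illustration, a parameter table might be set up as sketched below. Note that the exact layout (column names, name/value structure) expected by mySVM/db is an assumption here; check the examples shipped with the distribution before copying this verbatim. The table and parameter values are hypothetical.

```sql
-- Hypothetical parameter table: one name/value pair per row.
CREATE TABLE svm_parameters (
  name  VARCHAR2(64),
  value VARCHAR2(256)
);

-- Remember: parameter names are case-sensitive and lower-case.
INSERT INTO svm_parameters (name, value) VALUES ('svm_type', 'pattern');
INSERT INTO svm_parameters (name, value) VALUES ('trainset', 'IRIS_DATA');
INSERT INTO svm_parameters (name, value) VALUES ('C', '1');
```

The name of this table is what you pass to mySVM/db when it is called.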
The name of the input table (which actually can be a view as well) is given by the parameter trainset. The input table itself must consist of three parts: a column giving a unique key for each example, a column for the target value and columns for the attribute values. There may be additional columns in the table.
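A minimal input table matching this description could look as follows. All table and column names are illustrative, not required by mySVM/db:

```sql
-- Sketch of an input table: a unique key column, a target column Y,
-- and numeric attribute columns. Additional columns are allowed.
CREATE TABLE iris_data (
  id           VARCHAR2(32) PRIMARY KEY,  -- optional; by default ROWID is used
  y            VARCHAR2(16),              -- target value (class label)
  petal_width  NUMBER,
  petal_length NUMBER
);
```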
The key column is specified by the parameter key_column. If no such parameter is given, the Oracle pseudo-column ROWID is used, which is the recommended setting. If you use another key column, make sure you have an index on this column.
The column of target values is given by the parameter y_column with a default name of Y. For regression analysis, this column must consist of numerical values. For pattern recognition, there are two possibilities: internally, the SVM uses values of -1 for negative examples and +1 for positive examples. If your data is already in this format, you don't have to do anything. In all other cases, you need to supply a parameter target_concept. All examples whose target value equals the value of target_concept are regarded as positive examples, all other examples are negative examples. Note that in the final prediction, confidence values for the target concept are given, not the concept itself.
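For example, a pattern recognition setup with string-valued class labels could declare the target concept as sketched below (the name/value layout of the parameter table and the value 'setosa' are hypothetical):

```sql
-- Treat all rows whose target value equals 'setosa' as positive examples;
-- all other rows become negative examples.
INSERT INTO svm_parameters (name, value) VALUES ('svm_type', 'pattern');
INSERT INTO svm_parameters (name, value) VALUES ('target_concept', 'setosa');
```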
The attributes can be specified by the parameter x_column. Each column in the attribute vector needs its own x_column entry in the parameter table. The columns must be of a type that can be converted into a real number. If no x_column parameter is specified, all columns of the table of matching types (excluding y_column and key_column) are used.
By default, all attribute columns are scaled to mean 0 and variance 1 before training. This is recommended to avoid numerical problems in the optimization process. If your attribute values are already sensibly scaled, you can avoid the additional scaling by supplying a parameter scale with an empty value.
If you don't want to train on all examples, you can provide a list of the keys of the training set by setting up the output table (a table consisting of a VARCHAR column called KEY and a NUMBER column called ALPHA) and inserting the keys into this table. Also, enter a parameter read_keys_from_model with the value true. If this parameter is not given, or there are no keys in the output table, all examples in the input table will be used.
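This setup could be sketched as follows. The column names KEY and ALPHA are as described above; the table name, column sizes, and the selection condition are illustrative:

```sql
-- Output/model table: a VARCHAR column KEY and a NUMBER column ALPHA.
CREATE TABLE svm_model (
  key   VARCHAR2(32),
  alpha NUMBER
);

-- Restrict training to a subset of examples by inserting their keys
-- (the WHERE condition is a placeholder for your own selection):
INSERT INTO svm_model (key, alpha)
  SELECT id, 0 FROM iris_data WHERE id LIKE 'train%';

-- Tell mySVM/db to read the training keys from the output table:
INSERT INTO svm_parameters (name, value)
  VALUES ('read_keys_from_model', 'true');
```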
The output consists of two parts: the SVM model, i.e. the Lagrangian multipliers of the support vectors, and a view joining this table with the input table that gives the SVM prediction for each example. For pattern recognition problems, the prediction is a confidence value for the target concept, i.e. the example is predicted to belong to the concept if the value is greater than 0.
The name of the model table is given by the parameter model_name, the name of the view is given by view_name. The default model name is SVM_MODEL. If no parameter view_name exists, no view will be created.
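After training, the predictions can be read directly from the view. In the sketch below, the view name SVM_PRED and the name of the prediction column are assumptions; the actual column name is determined by mySVM/db:

```sql
-- Hypothetical query: for pattern recognition, a prediction greater
-- than 0 means the example is predicted to belong to the target concept.
SELECT * FROM svm_pred WHERE prediction > 0;
```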
If you create the output table before the training, please note that the table may be deleted and re-created during the training process.
If you start mySVM/db from a terminal, there will also be output on your screen. The parameter verbosity controls the amount of output that is generated. The default is 3; the higher the value, the more output.
Parameter | Possible values | Default value | Description |
---|---|---|---|
svm_type | regression, pattern | regression | The type of the SVM. |
C | real values > 0 | 1 | Capacity constant. |
Cpos | real values > 0 | C | Capacity constant for positive examples. Overrides C. |
Cneg | real values > 0 | C | Capacity constant for negative examples. Overrides C. |
balance_cost | true, false | false | Multiplies Cpos by the fraction of positive examples in the data and Cneg by the fraction of negative examples in the data. |
quadraticLossPos | true, false | false | Uses a quadratic loss function on the positive examples. |
quadraticLossNeg | true, false | false | Uses a quadratic loss function on the negative examples. |
epsilon | real values >= 0 | 0 | Allowed error for regression estimation. |
epsilon_pos | real values >= 0 | 0 | Allowed error for positive examples for regression estimation. |
epsilon_neg | real values >= 0 | 0 | Allowed error for negative examples for regression estimation. |
Parameter | Possible values | Default value | Description |
---|---|---|---|
kernel:type | dot, radial | dot | Kernel type. |
kernel:gamma | real values >= 0 | 1 | Parameter gamma of RBF kernel: K(x,y) = exp(-gamma*||x-y||^2). |
These parameters control the optimization process of the SVM. Usually the default values do not need to be changed. Don't touch them unless you know what you're doing!
Parameter | Possible values | Default value | Description |
---|---|---|---|
working_set_size | integer > 1 | 10 | Size of the working set. |
is_zero | real value > 0 | 1e-10 | Numerical precision of the optimizer. |
convergence_epsilon | real value > 0 | 1e-3 | Allowed error on the KKT conditions. |
shrink_const | integer > 1 | 50 | Number of iterations before an example at bound is considered for shrinking. |
descend | real value > 0 | 1e-15 | Minimum decrease of the objective function that the optimizer must achieve. |
max_iterations | integer >= 0 | 30000 | Maximum number of working set iterations. |
Joachims/99a | Joachims, Thorsten (1999). Making Large-Scale SVM Learning Practical. In Advances in Kernel Methods - Support Vector Learning, chapter 11. MIT Press. |
Vapnik/98a | V. Vapnik (1998). Statistical Learning Theory. Wiley. |