Datasink is a customizable pipeline for generating diverse ensembles of heterogeneous classifiers, as well as the accompanying metadata needed for ensemble learning approaches utilizing ensemble diversity for improved performance. It also fairly evaluates the performance of several ensemble learning methods including greedy selection, enhanced selection [Caruana2004], and stacked generalization (stacking) [Wolpert1992]. Though other tools exist, we are unaware of a similarly modular, scalable pipeline designed for large-scale ensemble learning. Datasink was developed to support research by Sean Whalen and Gaurav Pandey (see [Whalen2013]) with the support of the Icahn Institute for Genomics and Multiscale Biology at Mount Sinai.
Datasink is designed for generating extremely large ensembles (taking days or weeks to generate) and thus consists of an initial data generation phase tuned for multicore and distributed computing environments. The output is a set of compressed CSV files containing the class distribution produced by each classifier that serves as input to a later ensemble learning phase.
Data is generated by a customized pipeline built around the Java-based Weka machine learning package [Hall2009]. For simplicity and extensibility, the pipeline uses an interpreted variant of Java called Groovy that calls compiled Weka code without a performance penalty. The data generation prerequisites are thus a Java runtime, Groovy, and the Weka JAR.
Ensemble learning is implemented in Python using the popular pandas/scikit-learn analytics stack [McKinney2012,Pedregosa2011], so the ensemble learning phase additionally requires Python along with NumPy, SciPy, Cython, pandas, and scikit-learn.
Older versions of these packages may work if current versions are not available.
There is no installer for datasink. However, the installation of the prerequisites and their dependencies can usually be handled by the package manager for your operating system. We assume comfort with command line execution and provide setup instructions for Ubuntu Linux and OS X below.
This README details the setup and use of datasink via several examples but is not intended as a general tutorial on ensemble learning, version control, or particular libraries.
Ubuntu and other Debian-based Linux distributions use the `apt-get` command for installing packages and their dependencies. See the APT howto or run `man apt-get` for more details.
To install the prerequisites for datasink, run:
sudo apt-get -y install groovy cython python-numpy python-scipy python-pip
sudo pip install -U pandas scikit-learn
A suitable version of Weka is unfortunately not bundled with Ubuntu, so run the following:
sudo apt-get -y install curl unzip
curl -O -L http://prdownloads.sourceforge.net/weka/weka-3-7-10.zip
unzip weka-3-7-10.zip
sudo cp weka-3-7-10/weka.jar /usr/share/java
This option downloads and runs Ubuntu 13.04 64-bit under the VirtualBox virtual machine, incurring some performance penalty but allowing you to evaluate datasink in a completely self-contained, pre-configured environment. Skip this section if you aren't familiar with virtual machines.
First install VirtualBox and Vagrant (available from their respective websites), then run:
mkdir dvm; cd dvm
vagrant init
vagrant box add base http://cloud-images.ubuntu.com/vagrant/raring/current/raring-server-cloudimg-amd64-vagrant-disk1.box
vagrant up
vagrant ssh
This will download a fresh Ubuntu disk image and start up the virtual machine, taking several minutes to complete and leaving you with a login prompt inside the virtual machine. Proceed with the instructions from Option 1 to install datasink inside this virtual machine, and type `exit` to return to your host OS when desired. The virtual machine can be brought down using `vagrant halt` from the host command line.
Due to the performance penalty of VMs, extended use of this option is not recommended; it is provided primarily for self-contained evaluation purposes. Performance can be improved substantially by increasing the number of CPU cores and RAM granted to the VM. See the Vagrant documentation for details.
Thanks to Olivier Grisel for the original document these instructions are based on.
There are several options for installing the prerequisites under OS X. Pre-built Python distributions such as Enthought contain the necessary Python components, and OS X comes bundled with a suitable version of Java. Advanced users can simply install binary versions of Groovy and Weka from their respective websites, place the Weka JAR file in their `CLASSPATH`, and begin generating ensembles.
Other users may wish to use the MacPorts project to install the prerequisites and their dependencies in a self-contained directory that can easily be upgraded or removed later if desired. This option requires Apple's free Xcode developer tools, the optional Xcode command line tools installable from the developer tools GUI, and the MacPorts software for your version of OS X.
MacPorts downloads the required packages and their dependencies, but must compile from source if binaries are not available for your system; this can take hours for a fresh MacPorts installation as there are several dozen large packages to compile. Run the following to update MacPorts and install the prerequisites:
sudo port selfupdate
sudo port install groovy py27-cython py27-pandas py27-scikit-learn
sudo port select --set python python27
A suitable version of Weka is unfortunately not bundled with MacPorts, so run the following:
curl -O -L http://prdownloads.sourceforge.net/weka/weka-3-7-10.zip
unzip weka-3-7-10.zip
sudo cp weka-3-7-10/weka.jar /opt/local/share/java
The latest source code can be obtained by cloning the public git repository using the following from the command line:
git clone https://github.com/shwhalen/datasink.git
This will create a `datasink` subdirectory in your working directory containing the source code. The `git` program comes bundled with recent versions of OS X; it can be installed under Ubuntu using `sudo apt-get -y install git`. Updates can be obtained by running `git pull` from the `datasink` subdirectory.
Several functions are accelerated by Cython and must first be compiled by running `make` from the git repository directory.
Java must be told where Weka is located and how much RAM to use by modifying the `CLASSPATH` and `JAVA_OPTS` environment variables. A simple way to set these variables is to add the following to your shell's login script for Ubuntu:
export CLASSPATH=$CLASSPATH:/usr/share/java/weka.jar
export JAVA_OPTS="-Xmx4g"
or for OS X:
export CLASSPATH=$CLASSPATH:/opt/local/share/java/weka.jar
export JAVA_OPTS="-Xmx4g"
The above is Bash syntax and allows Weka to use up to 4 GB of RAM; adjust accordingly for your setup.
The `groovy` executable must also be somewhere in your search path (and already is if using the Ubuntu or MacPorts instructions above). If Groovy was manually installed, for example under `$HOME/groovy` on a cluster, add the following to your login script:
export PATH=$PATH:$HOME/groovy/bin
You're finally ready to set up and construct an ensemble!
Ensemble generation requires 3 files, ideally inside a self-contained project directory:
- Training data in ARFF format
- A file listing the classifiers to train
- A `weka.properties` file pointing to the above files and configuring other pipeline settings
To begin, we create a project directory in our home directory and download an example dataset from the command line:
mkdir ~/diabetes; cd ~/diabetes
curl -O http://repository.seasr.org/Datasets/UCI/arff/diabetes.arff
Next we create a `weka.properties` file to configure our pipeline:
cat > weka.properties << EOF
classifiersFilename = classifiers.txt
inputFilename = diabetes.arff
classAttribute = class
predictClassValue = tested_positive
balanceTraining = false
foldCount = 10
nestedFoldCount = 10
bagCount = 10
EOF
Finally, we create `classifiers.txt` containing the Weka classifiers and associated parameters we want included in the ensemble:
cat > classifiers.txt << EOF
weka.classifiers.bayes.NaiveBayes -D
weka.classifiers.functions.SGD -F 1
#weka.classifiers.meta.LogitBoost
EOF
Note that `weka.classifiers.meta.LogitBoost` is preceded by a comment marker (#); such lines are skipped and thus excluded from ensemble generation. We'll leave LogitBoost commented out for now, and later see how its inclusion changes ensemble performance.
As specified in the `weka.properties` file, the data is first divided into 10 folds of independent training and test splits for cross validation. Each training split is resampled with replacement 10 times (a process called bagging [Breiman1996]), and nested cross validation is performed on each of these resampled training splits to produce the data necessary for ensemble techniques.
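For readers who want a concrete picture of this scheme, the following is a rough conceptual sketch of one fold/bag iteration using scikit-learn in place of the Groovy/Weka pipeline. It is purely illustrative and is not the actual generate.py code; the dataset, classifier, and variable names are placeholders.

```python
# Illustrative sketch of the fold/bag/nested-fold structure (NOT generate.py);
# scikit-learn stands in for the Groovy/Weka pipeline used by datasink.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import GaussianNB
from sklearn.utils import resample

X, y = load_breast_cancer(return_X_y = True)    # placeholder dataset
outer = StratifiedKFold(n_splits = 10, shuffle = True, random_state = 0)

for train, test in outer.split(X, y):
    for bag in range(10):
        # Bagging: resample the outer training split with replacement.
        X_bag, y_bag = resample(X[train], y[train], random_state = bag)

        # Outer predictions: class distributions for the held-out test split,
        # later aggregated and scored by the ensemble methods.
        clf = GaussianNB().fit(X_bag, y_bag)
        test_probs = clf.predict_proba(X[test])[:, 1]

        # Nested cross validation on the resampled training split produces
        # out-of-fold predictions that stacking and selection can learn from
        # without ever seeing the outer test split.
        inner = StratifiedKFold(n_splits = 10, shuffle = True, random_state = 0)
        for inner_train, inner_val in inner.split(X_bag, y_bag):
            nested = GaussianNB().fit(X_bag[inner_train], y_bag[inner_train])
            val_probs = nested.predict_proba(X_bag[inner_val])[:, 1]
        # datasink writes these class distributions to compressed CSV files.
```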
Before generating the ensemble, we first examine the non-ensemble performance of each base classifier using 10-fold cross validation in Weka. Several performance metrics are produced by Weka, but datasink focuses on the area under the receiver operating characteristic (ROC) curve (AUC) since it is well-suited to imbalanced class distributions that often occur with real data. However, any metric can be computed using the CSV files generated by the analysis scripts.
weka.classifiers.bayes.NaiveBayes -D
TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class
0.816 0.392 0.795 0.816 0.806 0.429 0.806 0.882 tested_negative
0.608 0.184 0.639 0.608 0.623 0.429 0.806 0.676 tested_positive
weka.classifiers.functions.SGD -F 1
TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class
0.888 0.444 0.789 0.888 0.835 0.478 0.832 0.892 tested_negative
0.556 0.112 0.727 0.556 0.630 0.478 0.832 0.713 tested_positive
Next we construct an ensemble of 20 Naive Bayes (NB) and Logistic Regression (LR, trained using Stochastic Gradient Descent) base classifiers; recall each classifier type is bagged 10 times. This takes roughly 3-4 minutes on a modern 4-core system with 8 GB of RAM and should decrease linearly with the number of cores:
cd ~/datasink
python generate.py ~/diabetes
Because the code is architected for multicore and distributed environments, many processes are spawned and each writes its output to a unique file. These files must first be merged:
python combine.py ~/diabetes
Ensemble methods are then applied:
python mean.py ~/diabetes
0.836
python stacking.py ~/diabetes standard
0.837 20
python selection.py ~/diabetes greedy
0.841 15
python selection.py ~/diabetes enhanced
0.839 16
The output after each script gives the AUC calculated over all cross validation folds as well as the average size of the ensemble when applicable. The performance of these methods can vary greatly depending on the dataset and in particular the number of training examples, with simpler methods typically performing better for smaller datasets.
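To make the selection methods less of a black box, here is a minimal sketch of greedy forward selection in the spirit of [Caruana2004]. It is not the actual selection.py code (which, among other differences, also implements the enhanced variant), and the function and variable names are illustrative.

```python
# Minimal sketch of greedy ensemble selection [Caruana2004]; not selection.py.
import numpy as np
from sklearn.metrics import roc_auc_score

def greedy_selection(val_preds, val_labels, max_size = 20):
    """val_preds: (n_examples, n_classifiers) validation-set probabilities."""
    selected, best_auc = [], 0.0
    ensemble_sum = np.zeros(val_preds.shape[0])
    for _ in range(max_size):
        # Score every candidate when averaged into the current ensemble
        # (selection with replacement: a classifier may be picked repeatedly).
        scores = [roc_auc_score(val_labels,
                                (ensemble_sum + val_preds[:, j]) / (len(selected) + 1))
                  for j in range(val_preds.shape[1])]
        best_j = int(np.argmax(scores))
        if scores[best_j] <= best_auc:
            break    # stop once no candidate improves the validation AUC
        best_auc = scores[best_j]
        selected.append(best_j)
        ensemble_sum += val_preds[:, best_j]
    return selected, best_auc
```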
Let's add LogitBoost to the ensemble, looking first at its base performance in Weka:
weka.classifiers.meta.LogitBoost
TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class
0.846 0.455 0.776 0.846 0.810 0.410 0.810 0.891 tested_negative
0.545 0.154 0.655 0.545 0.595 0.410 0.810 0.668 tested_positive
We add this classifier to the ensemble by editing `classifiers.txt` and using comment markers (#) as discussed above to exclude the previous classifiers and include LogitBoost, or execute the following command as a shortcut:
cat > ~/diabetes/classifiers.txt << EOF
#weka.classifiers.bayes.NaiveBayes -D
#weka.classifiers.functions.SGD -F 1
weka.classifiers.meta.LogitBoost
EOF
Alternatively, leave all lines uncommented to see how the ensemble generation script only produces output for LogitBoost, as it recognizes that NB and LR have already been generated. Now create the LogitBoost classifiers (~2 minutes), combine them with the previous NB and LR output, and run the ensemble methods:
python generate.py ~/diabetes
python combine.py ~/diabetes
python mean.py ~/diabetes
0.836
python stacking.py ~/diabetes standard
0.841 30
python selection.py ~/diabetes greedy
0.843 17
python selection.py ~/diabetes enhanced
0.840 24
Note that the performance of mean-aggregated predictions remains unchanged, while stacking and selection methods get a small boost.
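Stacking learns how to combine the base classifiers rather than weighting them equally: a second-level model is trained on their nested cross validation predictions and then applied to their predictions for the held-out test split. Below is a minimal sketch of the idea, assuming the prediction matrices are already in hand; it is not the actual stacking.py code (which, for example, uses a RandomForest stacker in the liver example later in this README).

```python
# Minimal sketch of stacked generalization [Wolpert1992]; not stacking.py.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def stack(train_preds, train_labels, test_preds, test_labels):
    """*_preds: (n_examples, n_classifiers) base classifier probabilities."""
    stacker = LogisticRegression().fit(train_preds, train_labels)
    return roc_auc_score(test_labels, stacker.predict_proba(test_preds)[:, 1])
```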
Compare the performance of the Naive Bayes base classifier (0.806) to its bagged performance (0.8254) using mean aggregation: bagging provides a non-trivial boost. Run `cd ~/diabetes` and try the following under `ipython` to see how this simple aggregation method works. We first create a pandas DataFrame object indexed by a unique ID and class label for each example:
from glob import glob
from pandas import concat, read_csv
from sklearn.metrics import roc_auc_score
df = concat([read_csv(_, compression = 'gzip', index_col = [0, 1])
             for _ in glob('predictions-*.csv.gz')])
The probability assigned to the positive class by each resampled classifier is stored across columns, shown here with a number appended to the classifier name for each bagged version:
print df.columns
Index([NaiveBayes.0, NaiveBayes.1, NaiveBayes.2, NaiveBayes.3, NaiveBayes.4, NaiveBayes.5, NaiveBayes.6, NaiveBayes.7, NaiveBayes.8, NaiveBayes.9, SGD.0, SGD.1, SGD.2, SGD.3, SGD.4, SGD.5, SGD.6, SGD.7, SGD.8, SGD.9, LogitBoost.0, LogitBoost.1, LogitBoost.2, LogitBoost.3, LogitBoost.4, LogitBoost.5, LogitBoost.6, LogitBoost.7, LogitBoost.8, LogitBoost.9], dtype=object)
Here we grab the class labels from the index, take the row mean of the first 10 columns corresponding to Naive Bayes, and calculate the AUC:
labels = df.index.get_level_values(1).values
roc_auc_score(labels, df.iloc[:, :10].mean(axis = 1))
0.8254
A similar increase compared to the base classifier is observed for LogitBoost:
roc_auc_score(labels, df.iloc[:, 20:30].mean(axis = 1))
0.8293
but a small dip occurs for Logistic Regression:
roc_auc_score(labels, df.iloc[:, 10:20].mean(axis = 1))
0.8283
Recall that the base LR performance is 0.832, which is already quite close to the ensemble's performance. Resampling the training data with replacement to create diversity leaves out roughly one third of the training instances in each bag by chance. For LR, the loss of these training examples hurts more than ensembling helps, hence the dip. However, the dip is relatively small, and resampling creates the diversity necessary to increase the overall performance of the ensemble when other classifiers are included.
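Continuing in the same ipython session, averaging across all 30 columns is a sketch of the mean aggregation applied by mean.py (not its actual code); the pooled AUC should land near the 0.836 reported above, though it can differ slightly if mean.py averages per-fold AUCs instead of pooling predictions. The same DataFrame can also be used to compute any other metric, as mentioned earlier:

```python
# Average the positive-class probability across all 30 bagged classifiers;
# a sketch of mean aggregation, not the actual mean.py code.
roc_auc_score(labels, df.mean(axis = 1))

# Any other metric can be computed from the same DataFrame, for example area
# under the precision-recall curve (assuming 0/1-encoded labels):
from sklearn.metrics import average_precision_score
average_precision_score(labels, df.mean(axis = 1))
```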
The advantage of using heterogeneous ensembles is clear when we compare their performance to state-of-the-art homogeneous ensemble techniques such as Random Forests. To maximize the performance of the forest, we increase the number of trees to 500 (increasing this parameter does not cause overfitting [Breiman2001]) and reduce the maximum tree depth to prevent overfitting on this small dataset.
weka.classifiers.trees.RandomForest -I 500 -depth 5
TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class
0.864 0.429 0.790 0.864 0.825 0.458 0.833 0.904 tested_negative
0.571 0.136 0.692 0.571 0.626 0.458 0.833 0.700 tested_positive
Compare the 0.833 AUC of this homogeneous ensemble to the 0.843 achieved above using greedy selection with only 3 classifier types.
Let's see if these trends hold for another dataset:
mkdir ~/liver; cd ~/liver
curl -O http://repository.seasr.org/Datasets/UCI/arff/liver-disorders.arff
cat > weka.properties << EOF
classifiersFilename = classifiers.txt
inputFilename = liver-disorders.arff
classAttribute = selector
predictClassValue = 2
balanceTraining = false
foldCount = 10
nestedFoldCount = 10
bagCount = 10
EOF
cat > classifiers.txt << EOF
weka.classifiers.functions.MultilayerPerceptron
weka.classifiers.lazy.IBk
weka.classifiers.meta.AdaBoostM1
weka.classifiers.rules.JRip
EOF
cd ~/datasink
python generate.py ~/liver
python combine.py ~/liver
python mean.py ~/liver
python selection.py ~/liver greedy
python selection.py ~/liver enhanced
python stacking.py ~/liver standard
Method | AUC | Notes |
---|---|---|
AdaBoostM1 | 0.684 | DecisionStump base learner |
IBk | 0.637 | |
JRip | 0.653 | |
MultilayerPerceptron | 0.742 | |
RandomForest | 0.768 | 200 trees, excluded from ensemble |
mean | 0.772 | |
greedy | 0.772 | |
enhanced | 0.764 | |
stacking | 0.775 | RandomForest stacker, max_depth = 5 |
This time we use a different set of classifiers and give the performance of a random forest for reference. Note that 40 bagged heterogeneous classifiers outperform a random forest of 200 trees for three out of four aggregation methods. Though enhanced ensemble selection has not performed as well as simpler methods in these examples, it tends to perform similarly to stacking for larger datasets, where greedy selection begins to fall behind. It is important to emphasize that differences in performance should be evaluated for statistical significance; see [Demšar2006] for a review of non-parametric comparison methods. In a paper currently under review, we find statistically significant differences between heterogeneous ensembles and the best base classifier (homogeneous ensembles) for several complex, real-world datasets.
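As a rough illustration of the non-parametric comparisons Demšar recommends, the Friedman test checks whether the methods' rankings differ significantly across datasets. The sketch below is not part of datasink; it borrows a few AUC values from the UCI table in the next section purely as an example, and a meaningful comparison would span many more datasets.

```python
# Sketch of a Friedman test over per-dataset AUCs, in the spirit of
# [Demšar2006]; not part of datasink. Values are a few rows (diabetes,
# liver-disorders, sonar, ionosphere) from the UCI table below.
from scipy.stats import friedmanchisquare

mean_aucs     = [0.836, 0.742, 0.873, 0.967]
stacking_aucs = [0.840, 0.768, 0.895, 0.960]
greedy_aucs   = [0.842, 0.780, 0.876, 0.960]

friedmanchisquare(mean_aucs, stacking_aucs, greedy_aucs)
# A small p-value suggests at least one method ranks differently; a post-hoc
# test such as Nemenyi then identifies which pairs differ.
```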
The UCI machine learning repository provides datasets for benchmarking machine learning algorithms. Below is the AUC for several UCI binary classification datasets using the 3 classifier types discussed above: NB, LR, and LogitBoost. These numbers are provided only for verification purposes: this small ensemble will likely be outperformed by a single well-tuned classifier (often a Random Forest or gradient boosted trees), and for many datasets the best classifier will be outperformed by a larger heterogeneous ensemble. To get more experience with datasink, try adding new classifiers until the ensemble beats the best classifier for some of these datasets.
The highest AUC is bolded for each dataset, and ties are broken by preferring the simplest method. Again, one must perform tests for statistical significance such as those presented in [Demšar2006] to draw sound conclusions about performance differences, and more complex methods often require similarly complex, large, real-world datasets to demonstrate their utility.
Dataset | Instances | Mean | Stacking | Greedy | Enhanced |
---|---|---|---|---|---|
breast-cancer | 286 | 0.683 | 0.67 | 0.704 (3) | 0.691 (19) |
breast-w | 699 | 0.993 | 0.992 | 0.993 (5) | 0.993 (17) |
colic | 368 | 0.872 | 0.883 | 0.874 (9) | 0.875 (25) |
credit-a | 690 | 0.934 | 0.933 | 0.933 (9) | 0.935 (25) |
credit-g | 1000 | 0.785 | 0.793 | 0.795 (7) | 0.794 (25) |
diabetes | 768 | 0.836 | 0.84 | 0.842 (18) | 0.841 (25) |
haberman | 306 | 0.662 | 0.662 | 0.672 (5) | 0.676 (24) |
heart-statlog | 270 | 0.905 | 0.908 | 0.906 (6) | 0.908 (24) |
ionosphere | 351 | 0.967 | 0.96 | 0.960 (9) | 0.971 (21) |
kr-vs-kp | 3196 | 0.993 | 0.996 | 0.996 (7) | 0.996 (26) |
labor | 57 | 0.979 | 1.000 | 0.980 (4) | 0.980 (6) |
liver-disorders | 345 | 0.742 | 0.768 | 0.780 (7) | 0.758 (21) |
molecular-biology_promoters | 106 | 0.977 | 0.97 | 0.970 (11) | 0.956 (18) |
mushroom | 8124 | 1.000 | 1.000 | 1.000 (2) | 1.000 (2) |
sick | 3772 | 0.973 | 0.963 | 0.977 (11) | 0.978 (27) |
sonar | 208 | 0.873 | 0.895 | 0.876 (5) | 0.905 (23) |
spambase | 4601 | 0.976 | 0.972 | 0.978 (15) | 0.978 (27) |
tic-tac-toe | 958 | 0.982 | 0.996 | 0.997 (5) | 0.997 (24) |
vote | 435 | 0.991 | 0.992 | 0.991 (9) | 0.992 (23) |
If a particular class is extremely uncommon, the bagging process may (by chance) produce training splits that do not contain that class due to sampling with replacement. Bagging may not be appropriate in these scenarios and can be disabled in the properties file.
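As a back-of-the-envelope check (not part of datasink), the chance that a single bootstrap resample of n training instances misses all k instances of a minority class is ((n - k)/n)^n, roughly e^(-k), so bags lacking a very rare class occur more often than one might expect:

```python
# Probability that a bootstrap resample of n training instances contains none
# of the k minority class instances; approximately exp(-k) for large n.
from math import exp

def prob_minority_absent(n, k):
    return ((n - k) / float(n)) ** n

prob_minority_absent(700, 3), exp(-3)    # both ~0.05
# Across 10 folds x 10 bags, at least one such degenerate resample is likely.
```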
Property | Type | Default | Description |
---|---|---|---|
classifiersFilename | String | Required | File containing a list of full Java classnames and parameters, one per line, of classifiers to include in the ensemble. Lines beginning with a hash (#) are skipped. |
inputFilename | String | Required | A Weka-formatted ARFF file containing features and class labels. |
workingDir | String | Current directory | Location to store classifier outputs. |
classAttribute | String | Required | Name of the ARFF attribute containing class labels. Often the last attribute. |
predictClassValue | String | Required | Value of the positive class for classAttribute. For example, this could be 1 for instances with 0/1 class labels, or tested_positive for the walkthrough dataset. |
balanceTraining | Boolean | true | Balance the class distribution of the training set inside each cross validation fold after any resampling, using Weka's SpreadSubsample filter with -M 1 . |
balanceTest | Boolean | false | Identical to balanceTraining for the test set. Note that best practice for non-uniform class distributions is to balance the training set only, then evaluate against the natural class distribution of the test set [Weiss2003,Tan2013]. |
foldCount | Integer | Required* | Number of cross validation folds to use. This or foldAttribute must be specified. |
foldAttribute | String | Required* | Name of the ARFF attribute containing values for leave-one-value-out cross validation. This or foldCount must be specified. |
nestedFoldCount | Integer | Required | Number of nested cross validation folds to use for each cross validated training set. Greatly increases execution time. |
bagCount | Integer | Required | Number of resampled versions of each base classifier to generate. Greatly increases execution time. A value of 0 disables resampling. |
useCluster | Boolean | false | Submit jobs to a distributed computing cluster (using qsub , for example) instead of spawning processes on the local machine. |
writeModels | Boolean | false | Save compressed, serialized models for each classifier/fold/bag combination to disk. Substantially increases disk usage. |
- Breiman, L. (1996). Bagging Predictors. Machine Learning, 24(2), 123–140. doi:10.1023/A:1018054314350
- Breiman, L. (2001). Random Forests. Machine Learning, 45(1), 5–32. doi:10.1023/A:1010933404324
- Caruana, R., Niculescu-Mizil, A., Crew, G., & Ksikes, A. (2004). Ensemble Selection from Libraries of Models. In Proceedings of the 21st International Conference on Machine Learning (pp. 18–26). doi:10.1145/1015330.1015432
- Demšar, J. (2006). Statistical Comparisons of Classifiers over Multiple Data Sets. Journal of Machine Learning Research, 7(Jan), 1–30.
- Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., & Witten, I. H. (2009). The WEKA Data Mining Software: An Update. SIGKDD Explorations, 11(1), 10–18. doi:10.1145/1656274.1656278
- McKinney, W. (2012). Python for Data Analysis. O’Reilly.
- Pedregosa, F., Varoquaux, G., et al. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12(Oct), 2825–2830.
- Tan, P.-N., Steinbach, M., & Kumar, V. (2013). Introduction to Data Mining (2nd ed.). Addison-Wesley.
- Weiss, G. M., & Provost, F. J. (2003). Learning When Training Data are Costly: The Effect of Class Distribution on Tree Induction. Journal of Artificial Intelligence Research, 19(1), 315–354.
- Whalen, S., & Pandey, G. (2013). A Comparative Analysis of Ensemble Classifiers: Case Studies in Genomics. In Proceedings of the 13th International Conference on Data Mining (pp. 807–816). doi:10.1109/ICDM.2013.21
- Wolpert, D. H. (1992). Stacked Generalization. Neural Networks, 5(2), 241–259. doi:10.1016/S0893-6080(05)80023-1