
datasink: A Pipeline for Large-Scale Heterogeneous Ensemble Learning

Datasink is a customizable pipeline for generating diverse ensembles of heterogeneous classifiers, along with the metadata needed by ensemble learning approaches that exploit ensemble diversity for improved performance. It also fairly evaluates the performance of several ensemble learning methods, including greedy selection, enhanced selection [Caruana2004], and stacked generalization (stacking) [Wolpert1992]. Though other tools exist, we are unaware of a similarly modular, scalable pipeline designed for large-scale ensemble learning. Datasink was developed to support research by Sean Whalen and Gaurav Pandey (see [Whalen2013]) with the support of the Icahn Institute for Genomics and Multiscale Biology at Mount Sinai.

Datasink is designed for extremely large ensembles (which can take days or weeks to generate) and thus consists of an initial data generation phase tuned for multicore and distributed computing environments. The output is a set of compressed CSV files, containing the class distribution produced by each classifier, that serves as input to a later ensemble learning phase.

Data is generated by a customized pipeline built around the Java-based Weka machine learning package [Hall2009]. For simplicity and extensibility, the pipeline uses Groovy, an interpreted variant of Java that calls compiled Weka code without performance penalty. The data generation prerequisites are therefore:

  • Java
  • Groovy
  • Weka (the instructions below use version 3.7.10)

Ensemble learning is implemented in Python using the popular pandas/scikit-learn analytics stack [McKinney2012,Pedregosa2011]:

  • Python 2.7 with NumPy and SciPy
  • Cython
  • pandas
  • scikit-learn

Older versions may work for some packages if current versions are not available.

There is no installer for datasink. However, the installation of the prerequisites and their dependencies can usually be handled by the package manager for your operating system. We assume comfort with command line execution and provide setup instructions for Ubuntu Linux and OS X below.

This README details the setup and use of datasink via several examples but is not intended as a general tutorial on ensemble learning, version control, or particular libraries.

Setup option 1: Ubuntu Linux

Ubuntu and other Debian-based Linux distributions use the apt-get command for installing packages and their dependencies. See the howto or run man apt-get for more details.

To install the prerequisites for datasink, run:

sudo apt-get -y install groovy cython python-numpy python-scipy python-pip
sudo pip install -U pandas scikit-learn

A suitable version of Weka is unfortunately not bundled with Ubuntu, so run the following:

sudo apt-get -y install curl unzip
curl -O -L http://prdownloads.sourceforge.net/weka/weka-3-7-10.zip
unzip weka-3-7-10.zip
sudo cp weka-3-7-10/weka.jar /usr/share/java

Setup option 2: Ubuntu virtual machine

This option downloads and runs Ubuntu 13.04 64-bit under the VirtualBox virtual machine, incurring some performance penalty but allowing you to evaluate datasink in a completely self-contained, pre-configured environment. Skip this section if you aren't familiar with virtual machines.

First install the following:

  • VirtualBox
  • Vagrant

then run:

mkdir dvm; cd dvm
vagrant init
vagrant box add base http://cloud-images.ubuntu.com/vagrant/raring/current/raring-server-cloudimg-amd64-vagrant-disk1.box
vagrant up
vagrant ssh

This will download a fresh Ubuntu disk image and start up the virtual machine, taking several minutes to complete and leaving you with a login prompt inside the virtual machine. Proceed with the instructions from Option 1 to install datasink inside this virtual machine, and type exit to return to your host OS when desired. The virtual machine can be brought down using vagrant halt from the host command line.

Due to the performance penalty of VMs, extended use of this option is not recommended; it is provided primarily for self-contained evaluation purposes. Performance can be improved substantially by increasing the number of CPU cores and RAM granted to the VM. See the Vagrant documentation for details.

Thanks to Olivier Grisel for the original document these instructions are based on.

Setup option 3: OS X

There are several options for installing the prerequisites under OS X. Pre-built Python distributions such as Enthought contain the necessary Python components and OS X comes bundled with a suitable version of Java. Advanced users can simply install a binary version of Groovy and Weka from their respective websites, place the Weka JAR file in their CLASSPATH, and begin generating ensembles.

Other users may wish to use the MacPorts project to install the prerequisites and their dependencies in a self-contained directory that can easily be upgraded or removed later if desired. This option requires Apple's free Xcode developer tools, the optional Xcode command line tools installable from the developer tools GUI, and the MacPorts software for your version of OS X.

MacPorts downloads the required packages and their dependencies, but must compile from source if binaries are not available for your system; this can take hours for a fresh MacPorts installation as there are several dozen large packages to compile. Run the following to update MacPorts and install the prerequisites:

sudo port selfupdate
sudo port install groovy py27-cython py27-pandas py27-scikit-learn
sudo port select --set python python27

A suitable version of Weka is unfortunately not bundled with MacPorts, so run the following:

curl -O -L http://prdownloads.sourceforge.net/weka/weka-3-7-10.zip
unzip weka-3-7-10.zip
sudo cp weka-3-7-10/weka.jar /opt/local/share/java

Obtaining the source

The latest source code can be obtained by cloning the public git repository using the following from the command line:

git clone https://github.com/shwhalen/datasink.git

This will create a datasink subdirectory in your working directory containing the source code. The git program comes bundled with recent versions of OS X; it can be installed under Ubuntu using sudo apt-get -y install git. Updates can be obtained by running git pull from the datasink subdirectory.

Compiling the Cython module

Several functions are accelerated by Cython and must first be compiled by running make from the git repository directory.

Setting environment variables

Java must be told where Weka is located and how much RAM to use by modifying the CLASSPATH and JAVA_OPTS environment variables. A simple way to set these variables is to add the following to your shell's login script for Ubuntu:

export CLASSPATH=$CLASSPATH:/usr/share/java/weka.jar
export JAVA_OPTS="-Xmx4g"

or for OS X:

export CLASSPATH=$CLASSPATH:/opt/local/share/java/weka.jar
export JAVA_OPTS="-Xmx4g"

The above is Bash syntax and allows Weka to use up to 4 gigs of RAM; adjust accordingly for your setup.

The groovy executable must also be somewhere in your search path (and already is if using the Ubuntu or MacPorts instructions above). If groovy was manually installed, for example under $HOME/groovy on a cluster, add the following to your login script:

export PATH=$PATH:$HOME/groovy/bin

You're finally ready to set up and construct an ensemble!

Walkthrough: Building an ensemble

Ensemble generation requires 3 files, ideally inside a self-contained project directory:

  • Training data in ARFF format
  • A file listing the classifiers to train
  • A weka.properties file pointing to the above files and configuring other pipeline settings

To begin, we create a project directory in our home directory and download an example dataset from the command line:

mkdir ~/diabetes; cd ~/diabetes
curl -O http://repository.seasr.org/Datasets/UCI/arff/diabetes.arff

Next we create a weka.properties file to configure our pipeline:

cat > weka.properties << EOF
classifiersFilename = classifiers.txt
inputFilename = diabetes.arff
classAttribute = class
predictClassValue = tested_positive
balanceTraining = false
foldCount = 10
nestedFoldCount = 10
bagCount = 10
EOF

Finally, we create classifiers.txt containing the Weka classifiers and associated parameters we want included in the ensemble:

cat > classifiers.txt << EOF
weka.classifiers.bayes.NaiveBayes -D
weka.classifiers.functions.SGD -F 1
#weka.classifiers.meta.LogitBoost
EOF

Note that weka.classifiers.meta.LogitBoost is preceded by a comment marker (#); such lines are skipped and thus excluded from ensemble generation. We'll leave LogitBoost commented out for now, and later see how its inclusion changes ensemble performance.

As specified in the weka.properties file, the data is first divided into 10 folds of independent training and test splits for cross validation. Each training split is resampled with replacement 10 times (a process called bagging [Breiman1996]), and nested cross validation is performed on each of these resampled training splits to produce the data necessary for ensemble techniques.
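The actual data generation happens in the Groovy/Weka pipeline, but the fold/bag/nested-fold structure described above can be sketched in Python with scikit-learn. This is a simplified illustration only: the classifier, fold counts, and synthetic dataset are placeholders, not datasink's code.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import GaussianNB
from sklearn.utils import resample

# Placeholder dataset standing in for the ARFF input
X, y = make_classification(n_samples=200, random_state=0)

outer = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)  # foldCount
for train_idx, test_idx in outer.split(X, y):
    for bag in range(10):  # bagCount
        # Resample the training split with replacement (bagging)
        Xb, yb = resample(X[train_idx], y[train_idx], random_state=bag)
        # Nested CV on the resampled split produces the out-of-sample
        # predictions later consumed by stacking and selection
        inner = StratifiedKFold(n_splits=10, shuffle=True, random_state=bag)
        for inner_train, inner_test in inner.split(Xb, yb):
            clf = GaussianNB().fit(Xb[inner_train], yb[inner_train])
            inner_probs = clf.predict_proba(Xb[inner_test])[:, 1]
        # A model fit on the whole resampled split predicts the outer test fold
        probs = GaussianNB().fit(Xb, yb).predict_proba(X[test_idx])[:, 1]
```

Each (fold, bag) combination thus contributes one column of class probabilities to the pipeline's output files.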

Before generating the ensemble, we first examine the non-ensemble performance of each base classifier using 10-fold cross validation in Weka. Several performance metrics are produced by Weka, but datasink focuses on the area under the receiver operating characteristic (ROC) curve (AUC) since it is well-suited to imbalanced class distributions that often occur with real data. However, any metric can be computed using the CSV files generated by the analysis scripts.

weka.classifiers.bayes.NaiveBayes -D

TP Rate  FP Rate  Precision  Recall   F-Measure  MCC      ROC Area  PRC Area  Class
0.816    0.392    0.795      0.816    0.806      0.429    0.806     0.882     tested_negative
0.608    0.184    0.639      0.608    0.623      0.429    0.806     0.676     tested_positive

weka.classifiers.functions.SGD -F 1

TP Rate  FP Rate  Precision  Recall   F-Measure  MCC      ROC Area  PRC Area  Class
0.888    0.444    0.789      0.888    0.835      0.478    0.832     0.892     tested_negative
0.556    0.112    0.727      0.556    0.630      0.478    0.832     0.713     tested_positive

Next we construct an ensemble of 20 Naive Bayes (NB) and Logistic Regression (LR, trained using Stochastic Gradient Descent) base classifiers; recall each classifier type is bagged 10 times. This takes ~3-4 minutes on a modern 4 core system with 8 gigs of RAM and should decrease linearly with the number of cores:

cd ~/datasink
python generate.py ~/diabetes

Because the code is architected for multicore and distributed environments, many processes are spawned and each writes its output to a unique file. These files must first be merged:

python combine.py ~/diabetes

Ensemble methods are then applied:

python mean.py ~/diabetes
0.836
python stacking.py ~/diabetes standard
0.837 20
python selection.py ~/diabetes greedy
0.841 15
python selection.py ~/diabetes enhanced
0.839 16

The output after each script gives the AUC calculated over all cross validation folds as well as the average size of the ensemble when applicable. The performance of these methods can vary greatly depending on the dataset and in particular the number of training examples, with simpler methods typically performing better for smaller datasets.
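To make the greedy method concrete, its core loop can be sketched as follows. This is a simplified illustration of the procedure from [Caruana2004], not datasink's selection.py: it selects with replacement on a single prediction matrix and omits refinements such as the held-out selection split.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def greedy_selection(preds, labels, max_size=20):
    """Greedy forward selection with replacement: repeatedly add the
    classifier whose inclusion maximizes the AUC of the mean-aggregated
    ensemble. `preds` has one column of class probabilities per classifier."""
    chosen, running_sum, best_auc = [], np.zeros(len(labels)), 0.0
    while len(chosen) < max_size:
        scores = [roc_auc_score(labels, (running_sum + preds[:, j]) / (len(chosen) + 1))
                  for j in range(preds.shape[1])]
        j = int(np.argmax(scores))
        if chosen and scores[j] <= best_auc:
            break  # no candidate improves the current ensemble
        chosen.append(j)
        running_sum += preds[:, j]
        best_auc = scores[j]
    return chosen, best_auc

# Toy demonstration: one informative classifier among three noisy ones
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, 200)
informative = labels + rng.normal(0, 0.5, 200)
preds = np.column_stack([informative, rng.normal(0, 1, (200, 3))])
chosen, auc = greedy_selection(preds, labels)
```

On this toy input the informative column is selected first and the noise columns are ignored, mirroring how selection prunes weak ensemble members.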

Growing the ensemble

Let's add LogitBoost to the ensemble, looking first at its base performance in Weka:

weka.classifiers.meta.LogitBoost

TP Rate  FP Rate  Precision  Recall   F-Measure  MCC      ROC Area  PRC Area  Class
0.846    0.455    0.776      0.846    0.810      0.410    0.810     0.891     tested_negative
0.545    0.154    0.655      0.545    0.595      0.410    0.810     0.668     tested_positive

We add this classifier to the ensemble by editing classifiers.txt and using comment markers (#) as discussed above to exclude the previous classifiers and include LogitBoost, or execute the following command as a shortcut:

cat > ~/diabetes/classifiers.txt << EOF
#weka.classifiers.bayes.NaiveBayes -D
#weka.classifiers.functions.SGD -F 1
weka.classifiers.meta.LogitBoost
EOF

Alternatively, leave all lines uncommented to see how the ensemble generation script produces output only for LogitBoost, recognizing that the NB and LR predictions already exist. Now create the LogitBoost classifiers (~2 mins), combine with the previous NB and LR output, and run the ensemble methods:

python generate.py ~/diabetes
python combine.py ~/diabetes
python mean.py ~/diabetes
0.836
python stacking.py ~/diabetes standard
0.841 30
python selection.py ~/diabetes greedy
0.843 17
python selection.py ~/diabetes enhanced
0.840 24

Note that the performance of mean-aggregated predictions remains unchanged, while stacking and selection methods get a small boost.

The importance of bagging

Compare the performance of the Naive Bayes base classifier (0.806) to its bagged performance (0.8254) using mean aggregation: Bagging provides a non-trivial boost. Run cd ~/diabetes and try the following under ipython to see how this simple aggregation method works. We first create a pandas DataFrame object indexed by a unique ID and class label for each example:

from glob import glob
from pandas import concat, read_csv
from sklearn.metrics import roc_auc_score

# Merge the per-process prediction files into a single DataFrame
# indexed by (example ID, class label)
df = concat([read_csv(f, compression = 'gzip', index_col = [0, 1])
             for f in glob('predictions-*.csv.gz')])

The probability assigned to the positive class by each resampled classifier is stored across columns, shown here with a number appended to the classifier name for each bagged version:

print df.columns
Index([NaiveBayes.0, NaiveBayes.1, NaiveBayes.2, NaiveBayes.3, NaiveBayes.4, NaiveBayes.5, NaiveBayes.6, NaiveBayes.7, NaiveBayes.8, NaiveBayes.9, SGD.0, SGD.1, SGD.2, SGD.3, SGD.4, SGD.5, SGD.6, SGD.7, SGD.8, SGD.9, LogitBoost.0, LogitBoost.1, LogitBoost.2, LogitBoost.3, LogitBoost.4, LogitBoost.5, LogitBoost.6, LogitBoost.7, LogitBoost.8, LogitBoost.9], dtype=object)

Here we grab the class labels from the index, take the row mean of the first 10 columns corresponding to Naive Bayes, and calculate the AUC:

labels = df.index.get_level_values(1).values
roc_auc_score(labels, df.iloc[:, :10].mean(axis = 1))
0.8254

A similar increase compared to the base classifier is observed for LogitBoost:

roc_auc_score(labels, df.iloc[:, 20:30].mean(axis = 1))
0.8293

but a small dip occurs for Logistic Regression:

roc_auc_score(labels, df.iloc[:, 10:20].mean(axis = 1))
0.8283

Recall the base LR performance is 0.832, already quite close to the bagged ensemble's 0.8283. Resampling the training data with replacement to create diversity excludes approximately one third of the training instances by chance. For LR, the loss of these training examples evidently outweighs what it gains from aggregation, hence the small dip. The dip is relatively small, however, and resampling creates the diversity needed to increase the overall performance of the ensemble once other classifiers are included.
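The "approximately one third" figure follows directly from the bootstrap: each instance is absent from a size-n resample with probability (1 - 1/n)^n, which converges to e⁻¹ ≈ 0.368. A quick check, analytically and by simulation:

```python
import math
import numpy as np

n = 768  # instances in the diabetes dataset
analytic = (1 - 1 / n) ** n
print(round(analytic, 3), round(math.exp(-1), 3))  # both print 0.368

# Empirical check: average fraction of instances missing from a resample
rng = np.random.default_rng(0)
missing = np.mean([
    1 - len(np.unique(rng.integers(0, n, n))) / n
    for _ in range(1000)
])
```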

Heterogeneous vs. homogeneous ensembles

The advantage of using heterogeneous ensembles is clear when we compare their performance to state-of-the-art homogeneous ensemble techniques such as Random Forests. To maximize performance of the forest, we increase the number of trees to 500 (a parameter that doesn't overfit [Breiman2001]) and reduce the maximum tree depth to prevent overfitting on this small dataset.

weka.classifiers.trees.RandomForest -I 500 -depth 5

TP Rate  FP Rate  Precision  Recall   F-Measure  MCC      ROC Area  PRC Area  Class
0.864    0.429    0.790      0.864    0.825      0.458    0.833     0.904     tested_negative
0.571    0.136    0.692      0.571    0.626      0.458    0.833     0.700     tested_positive

Compare the 0.833 AUC of this homogeneous ensemble to the 0.843 achieved above using greedy selection with only 3 classifier types.
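For readers working in Python rather than Weka, a roughly comparable homogeneous ensemble can be built with scikit-learn's RandomForestClassifier. The snippet below mirrors the Weka settings (500 trees, maximum depth 5) but uses a synthetic stand-in for the diabetes data, so its AUC will not match Weka's output.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic stand-in for the diabetes features and labels
X, y = make_classification(n_samples=768, n_features=8, random_state=0)

# Mirror the Weka call: -I 500 -depth 5
forest = RandomForestClassifier(n_estimators=500, max_depth=5, random_state=0)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# Out-of-fold class probabilities from 10-fold cross validation
probs = cross_val_predict(forest, X, y, cv=cv, method='predict_proba')[:, 1]
auc = roc_auc_score(y, probs)
```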

Another (condensed) example

Let's see if these trends hold for another dataset:

mkdir ~/liver; cd ~/liver
curl -O http://repository.seasr.org/Datasets/UCI/arff/liver-disorders.arff

cat > weka.properties << EOF
classifiersFilename = classifiers.txt
inputFilename = liver-disorders.arff
classAttribute = selector
predictClassValue = 2
balanceTraining = false
foldCount = 10
nestedFoldCount = 10
bagCount = 10
EOF

cat > classifiers.txt << EOF
weka.classifiers.functions.MultilayerPerceptron
weka.classifiers.lazy.IBk
weka.classifiers.meta.AdaBoostM1
weka.classifiers.rules.JRip
EOF

cd ~/datasink
python generate.py ~/liver
python combine.py ~/liver
python mean.py ~/liver
python selection.py ~/liver greedy
python selection.py ~/liver enhanced
python stacking.py ~/liver standard
Method                AUC    Notes
AdaBoostM1            0.684  DecisionStump base learner
IBk                   0.637
JRip                  0.653
MultilayerPerceptron  0.742
RandomForest          0.768  200 trees, excluded from ensemble
mean                  0.772
greedy                0.772
enhanced              0.764
stacking              0.775  RandomForest stacker, max_depth = 5

This time we use a different set of classifiers and give the performance of a random forest for reference. Note that 40 bagged heterogeneous classifiers outperform a random forest of 200 trees for three out of four aggregation methods. Though enhanced ensemble selection has not performed as well as simpler methods in these examples, it tends to perform similarly to stacking for larger datasets where greedy selection begins to fall behind. It is important to emphasize that differences in performance should be evaluated for statistical significance; see [Demšar2006] for a review of non-parametric comparison methods. In a paper currently under review, we find statistically significant differences between heterogeneous ensembles and the best base classifier (homogeneous ensembles) for several complex, real-world datasets.
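As a concrete example of such a test, [Demšar2006] recommends the Wilcoxon signed-rank test for comparing two classifiers over multiple datasets. Here it is applied to the Mean and Greedy columns of the UCI benchmark table below; whether the difference reaches significance depends on the test and any corrections applied, so treat this only as a template.

```python
from scipy.stats import wilcoxon

# Per-dataset AUCs for mean aggregation vs. greedy selection,
# taken from the UCI benchmark table in this README
mean_auc = [0.683, 0.993, 0.872, 0.934, 0.785, 0.836, 0.662, 0.905, 0.967,
            0.993, 0.979, 0.742, 0.977, 1.000, 0.973, 0.873, 0.976, 0.982, 0.991]
greedy_auc = [0.704, 0.993, 0.874, 0.933, 0.795, 0.842, 0.672, 0.906, 0.960,
              0.996, 0.980, 0.780, 0.970, 1.000, 0.977, 0.876, 0.978, 0.997, 0.991]

# Ties (zero differences) are dropped by the default zero_method
stat, p = wilcoxon(mean_auc, greedy_auc)
```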

UCI benchmarks

The UCI machine learning repository provides datasets for benchmarking machine learning algorithms. Below is the AUC for several UCI binary classification datasets using the 3 classifier types discussed above: NB, LR, and LogitBoost. These numbers are provided only for verification purposes: this small ensemble will likely be outperformed by a single well-tuned classifier (often a Random Forest or gradient boosted trees), and for many datasets the best classifier will be outperformed by a larger heterogeneous ensemble. To get more experience with datasink, try adding new classifiers until the ensemble beats the best classifier for some of these datasets.

The highest AUC is bolded for each dataset, and ties are broken by preferring the simplest method. Again, one must perform tests for statistical significance such as those presented in [Demšar2006] to draw sound conclusions about performance differences, and more complex methods often require similarly complex, large, real-world datasets to demonstrate their utility.

Dataset                      Instances  Mean   Stacking  Greedy      Enhanced
breast-cancer                286        0.683  0.670     0.704 (3)   0.691 (19)
breast-w                     699        0.993  0.992     0.993 (5)   0.993 (17)
colic                        368        0.872  0.883     0.874 (9)   0.875 (25)
credit-a                     690        0.934  0.933     0.933 (9)   0.935 (25)
credit-g                     1000       0.785  0.793     0.795 (7)   0.794 (25)
diabetes                     768        0.836  0.840     0.842 (18)  0.841 (25)
haberman                     306        0.662  0.662     0.672 (5)   0.676 (24)
heart-statlog                270        0.905  0.908     0.906 (6)   0.908 (24)
ionosphere                   351        0.967  0.960     0.960 (9)   0.971 (21)
kr-vs-kp                     3196       0.993  0.996     0.996 (7)   0.996 (26)
labor                        57         0.979  1.000     0.980 (4)   0.980 (6)
liver-disorders              345        0.742  0.768     0.780 (7)   0.758 (21)
molecular-biology_promoters  106        0.977  0.970     0.970 (11)  0.956 (18)
mushroom                     8124       1.000  1.000     1.000 (2)   1.000 (2)
sick                         3772       0.973  0.963     0.977 (11)  0.978 (27)
sonar                        208        0.873  0.895     0.876 (5)   0.905 (23)
spambase                     4601       0.976  0.972     0.978 (15)  0.978 (27)
tic-tac-toe                  958        0.982  0.996     0.997 (5)   0.997 (24)
vote                         435        0.991  0.992     0.991 (9)   0.992 (23)

Notes

If a particular class is extremely uncommon, the bagging process may (by chance) produce training splits that do not contain that class due to sampling with replacement. Bagging may not be appropriate in these scenarios and can be disabled in the properties file.
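The chance of this happening can be quantified: if k of the n training instances belong to the rare class, a single bootstrap resample of size n misses that class entirely with probability (1 - k/n)^n. A short sketch with hypothetical counts (the function and its numbers are illustrative, not part of datasink):

```python
def p_missing_class(n, k, bags=1):
    """Probability that at least one of `bags` bootstrap resamples of
    size n contains none of the k rare-class instances."""
    per_bag = (1 - k / n) ** n            # one resample misses the class
    return 1 - (1 - per_bag) ** bags      # at least one of `bags` misses it

# Hypothetical: 5 rare-class instances among 768, 10 bags per fold
print(round(p_missing_class(768, 5, bags=10), 3))  # ~0.064
```

Even with only 10 bags, a class this rare has a noticeable chance of vanishing from at least one resampled training split, which is why disabling bagging can be the safer choice.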

Reference: weka.properties

Property             Type     Default            Description
classifiersFilename  String   Required           File containing a list of full Java classnames and parameters, one per line, of classifiers to include in the ensemble. Lines beginning with a hash (#) are skipped.
inputFilename        String   Required           A Weka-formatted ARFF file containing features and class labels.
workingDir           String   Current directory  Location to store classifier outputs.
classAttribute       String   Required           Name of the ARFF attribute containing class labels. Often the last attribute.
predictClassValue    String   Required           Value of the positive class for classAttribute. For example, this could be 1 for instances with 0/1 class labels, or tested_positive for the walkthrough dataset.
balanceTraining      Boolean  true               Balance the class distribution of the training set inside each cross validation fold after any resampling, using Weka's SpreadSubsample filter with -M 1.
balanceTest          Boolean  false              Identical to balanceTraining for the test set. Note that best practice for non-uniform class distributions is to balance the training set only, then evaluate against the natural class distribution of the test set [Weiss2003,Tan2013].
foldCount            Integer  Required*          Number of cross validation folds to use. This or foldAttribute must be specified.
foldAttribute        String   Required*          Name of the ARFF attribute containing values for leave-one-value-out cross validation. This or foldCount must be specified.
nestedFoldCount      Integer  Required           Number of nested cross validation folds to use for each cross validated training set. Greatly increases execution time.
bagCount             Integer  Required           Number of resampled versions of each base classifier to generate. Greatly increases execution time. A value of 0 disables resampling.
useCluster           Boolean  false              Submit jobs to a distributed computing cluster (using qsub, for example) instead of spawning processes on the local machine.
writeModels          Boolean  false              Save compressed, serialized models for each classifier/fold/bag combination to disk. Substantially increases disk usage.

Bibliography

  • Breiman, L. (1996). Bagging Predictors. Machine Learning, 24(2), 123–140. doi:10.1023/A:1018054314350
  • Breiman, L. (2001). Random Forests. Machine Learning, 45(1), 5–32. doi:10.1023/A:1010933404324
  • Caruana, R., Niculescu-Mizil, A., Crew, G., & Ksikes, A. (2004). Ensemble Selection from Libraries of Models. In Proceedings of the 21st International Conference on Machine Learning (pp. 18–26). doi:10.1145/1015330.1015432
  • Demšar, J. (2006). Statistical Comparisons of Classifiers over Multiple Data Sets. Journal of Machine Learning Research, 7(Jan), 1–30.
  • Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., & Witten, I. H. (2009). The WEKA Data Mining Software: An Update. SIGKDD Explorations, 11(1), 10–18. doi:10.1145/1656274.1656278
  • McKinney, W. (2012). Python for Data Analysis. O’Reilly.
  • Pedregosa, F., Varoquaux, G., et al. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12(Oct), 2825–2830.
  • Tan, P.-N., Steinbach, M., & Kumar, V. (2013). Introduction to Data Mining (2nd ed.). Addison-Wesley.
  • Weiss, G. M., & Provost, F. J. (2003). Learning When Training Data are Costly: The Effect of Class Distribution on Tree Induction. Journal of Artificial Intelligence Research, 19(1), 315–354.
  • Whalen, S., & Pandey, G. (2013). A Comparative Analysis of Ensemble Classifiers: Case Studies in Genomics. In Proceedings of the 13th International Conference on Data Mining (pp. 807–816). doi:10.1109/ICDM.2013.21
  • Wolpert, D. H. (1992). Stacked Generalization. Neural Networks, 5(2), 241–259. doi:10.1016/S0893-6080(05)80023-1
