SBLAJ -- Sparse Basic Linear Algebra for the JVM -- is a library that provides standard operations on sparse matrices, with a focus on the operations required for machine learning. The core goals are:
- Extremely memory-efficient for large, sparse matrices
- Optimized storage for boolean data
- Handles billions of entries
- Clean API for common matrix operations
- Cache-friendly storage
- Fast -- competitive with application-specific code
- Compact data formats -- small on disk and fast to load
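To make these storage goals concrete, here is a minimal sketch, under assumed names (not SBLAJ's actual classes), of a CSR-style layout specialized for binary data: since every stored value is 1, only column ids are kept, with no value array at all, and each row is a contiguous slice of one flat array, which is what makes scans cache-friendly. A real implementation at the billions-of-entries scale would also need Long offsets or chunked arrays, since a single JVM array is capped at roughly 2^31 elements.

```scala
// Hypothetical sketch, not SBLAJ's actual API: a CSR-style sparse *binary*
// matrix. Every stored entry is 1, so only column ids are kept -- no value
// array -- which roughly halves the memory of a general sparse format.
class BinaryCsrMatrix(
    val nRows: Int,
    val nCols: Int,
    val colIds: Array[Int],    // column ids of the non-zeros, row by row
    val rowStarts: Array[Int]  // row r occupies colIds(rowStarts(r)) until colIds(rowStarts(r + 1))
) {
  /** Apply f to every non-zero column id in row r (a contiguous, cache-friendly scan). */
  def foreachInRow(r: Int)(f: Int => Unit): Unit = {
    var i = rowStarts(r)
    val end = rowStarts(r + 1)
    while (i < end) { f(colIds(i)); i += 1 }
  }

  /** Total number of non-zero entries. Note: Array[Int] caps this sketch at ~2 billion. */
  def nnz: Int = colIds.length
}
```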
The extended packages also provide:
- Implementations of standard machine learning algorithms designed for large, sparse data, including:
- Naive Bayes
- Stochastic Gradient Descent
- LDA (Latent Dirichlet Allocation)
- Utilities to get data into the Sparse Matrix Formats
- Compatibility with Hadoop and Spark, for distributed computing
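As a taste of how such algorithms exploit the sparse binary format, here is a sketch of Naive Bayes prediction, with assumed names and signatures (not SBLAJ's actual API): because only the non-zero column ids of a row are stored, scoring a row costs time proportional to the number of set features, not the full column dimension.

```scala
// Hypothetical sketch of multinomial Naive Bayes scoring over one sparse
// binary row (assumed names; SBLAJ's real API may differ). Both training and
// prediction only ever touch the non-zero column ids, so the total cost is
// proportional to nnz rather than nRows * nCols.
object NaiveBayesSketch {
  /**
   * classLogPrior(c) = log P(c); featureLogProb(c)(j) = log P(feature j | class c).
   * Returns the class with the highest posterior score for this row.
   */
  def predict(
      rowColIds: Array[Int],  // non-zero column ids of one row
      classLogPrior: Array[Double],
      featureLogProb: Array[Array[Double]]
  ): Int = {
    var best = 0
    var bestScore = Double.NegativeInfinity
    var c = 0
    while (c < classLogPrior.length) {
      var score = classLogPrior(c)
      var i = 0
      while (i < rowColIds.length) { score += featureLogProb(c)(rowColIds(i)); i += 1 }
      if (score > bestScore) { bestScore = score; best = c }
      c += 1
    }
    best
  }
}
```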
SBLAJ can be built using sbt. Simply check out the source and run

    sbt package

to create a usable jar. You can try out the code by running the unit tests from within sbt, either with

    test

to run all of the tests, or, e.g.,

    ~test-only org.sblaj.BaseSparseBinaryVectorTest

to repeatedly run one set of tests as you modify the code.
SBLAJ is still pre-0.1. Some basic ideas are in place, but the API is still in flux. It is actively developed, and the intention is for it to be a well-tested, robust library, not simply a toy demonstrating one idea. Contributions and ideas from others would be greatly appreciated!
Already in place:
- SparseBinaryRowMatrix -- iterators & helper functions
- NaiveBayes
- Matrix Creation Utils
  - Clean API to featurize, creating the header, matrix, and dictionary
  - Transform "hashed" features to "enumerated" features (sketched below)
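The hashed-to-enumerated transform deserves a quick illustration. Here is a minimal sketch, with assumed names (not SBLAJ's actual utils): features are first hashed into a huge, sparse id space; once we know which hashed ids actually occur, they can be renumbered into a dense 0..n-1 range, which shrinks the column dimension and allows plain-array lookups.

```scala
import scala.collection.mutable

// Hypothetical sketch of the "hashed" -> "enumerated" transform (assumed
// names; not SBLAJ's actual utils). Each distinct hashed feature id gets a
// dense id in order of first appearance; the resulting map doubles as the
// dictionary from hashed ids to enumerated column ids.
object EnumerateSketch {
  def buildEnumeration(hashedIds: Iterator[Long]): mutable.LinkedHashMap[Long, Int] = {
    val mapping = mutable.LinkedHashMap[Long, Int]()
    // getOrElseUpdate only assigns a new dense id the first time a hashed id is seen
    hashedIds.foreach { h => mapping.getOrElseUpdate(h, mapping.size) }
    mapping
  }
}
```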
Still to do:
- SGD
- Hadoop reader / writer
- Matrix Market import & export
- Spark API
- Timing comparisons
- LDA
- Integration with Scalala
- Standard filters & subsets, e.g. min occurrence, min correlation, etc.
- Additional matrix types
  - Float, Column
- Varint-encoded matrices (see the sketch after this list)
- Views -- API to index sub-portions of a matrix, with lazy & forced versions
- Mixed-mode matrices
  - dense float + sparse binary (e.g., time series)
- HMM implementation
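To illustrate the varint-encoded matrices item above, here is a hypothetical sketch (an assumed encoding, not a committed SBLAJ format). Since the column ids within a row are sorted, storing successive deltas keeps the numbers small, and a variable-length integer encoding then spends 1 byte on values under 128 instead of a fixed 4 bytes per Int.

```scala
import java.io.{ByteArrayOutputStream, DataOutputStream}

// Hypothetical illustration of the "Varint-encoded matrices" roadmap item
// (assumed encoding, not a committed SBLAJ format): delta + LEB128 varint
// encoding of one row's sorted column ids.
object VarintSketch {
  /** Write v as an unsigned LEB128 varint: 7 bits per byte, high bit = continuation. */
  def writeVarint(out: DataOutputStream, value: Int): Unit = {
    var v = value
    while ((v & ~0x7f) != 0) {         // while more than 7 bits remain
      out.writeByte((v & 0x7f) | 0x80) // low 7 bits, continuation bit set
      v >>>= 7
    }
    out.writeByte(v)
  }

  /** Delta + varint encode one row's sorted, non-negative column ids. */
  def encodeRow(sortedColIds: Array[Int]): Array[Byte] = {
    val bytes = new ByteArrayOutputStream()
    val out = new DataOutputStream(bytes)
    var prev = 0
    for (c <- sortedColIds) { writeVarint(out, c - prev); prev = c }
    out.flush()
    bytes.toByteArray
  }
}
```

Decoding is the reverse: read varints and keep a running sum to recover the original column ids.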