Skip to content
forked from PasaLab/marlin

A distributed matrix operations library built on top of Spark

Notifications You must be signed in to change notification settings

tyzhang1993/saury

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

20 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Saury

A distributed matrix operations library build on top of Spark. Now, the master branch is in version 0.1-SNAPSHOT.

##Prerequisites As Saury is built on top of Spark, you need to get the Spark installed first. If you are not clear how to setup Spark, please refer to the guidelines here. Currently, Saury is developed on the APIs of Spark 1.0.x version.

##Compile Saury We have offered a default build.sbt file, make sure you have installed sbt, and you can just type sbt assembly to get a assembly jar.

Note: In build.sbt file, the default Spark Version is 1.0.1, and the default Hadoop version is 2.3.0, you can modify the build.sbt file to fit your environment.

##Run Saury We have already offered some examples in edu.nju.pasalab.examples to show how to use the APIs in the project. For example, if you want to run two large matrices multiplication, use spark-submit method, and type in command

$.bin/spark-submit \
 --class edu.nju.pasalab.examples.MatrixMultiply
 --master <master-url> \
 --executor-memory <memory> \
 saury-assembly-0.1-SNAPSHOT.jar \
 <input file path A> <input file path B> <output file path> <block num>

Notice: Because the pre-built Spark-assembly jar doesn't have any files about netlib-java native compontent, which means you cannot use the native linear algebra library(e.g BLAS)to accelerate the computing, but have to use pure java to perform the small block matrix multiply in every worker. We have done some experiments and find it has a significant performance difference between the native BLAS computing and the pure java one, here you can find more info about the performance comparison and how to load native library.

Note:<input file path A> is the file path contains the text-file format matrix. We recommand you put it in the hdfs, and in directory data we offer two matrix files, in which every row of matrix likes: 7:1,2,5,2.0,3.19,0,... the 7 before : means this is the 8th row of this matrix (the row index starts from 0), and the numbers after : splited by , represent each column element in the row.

Note: <block num> is the split nums of sub-matries, if you set it as 10, which means you split every original large matrix into 10*10=100 blocks. In fact, this parameter is the degree of parallelism. The smaller this argument is, the bigger submatrix every worker will get. When doing experiments, we multiply two 20000 by 20000 matrix together, we set it as 10.

##Martix Operations API in Saury Currently, we have finished below APIs:

Operation API
Matrix-Matrix addition add(B: IndexMatrix)
Matrix-Matrix minus minus(B: IndexMatrix)
Matrix-Matrix multiplication multiply(B: IndexMatrix, blkNum: Int)
Elementwise addition elemWiseAdd(b: Double)
Elementwise minus elemWiseMinus(b: Double) / elemWiseMinusBy(b: Double)
Elementwise multiplication elemWiseMult(b: Double)
Elementwise division elemWiseDivide(b: Double) / elemWiseDivideBy(b: Double)
Get submatrix according to row sliceByRow(startRow: Long, endRow: Long)
Get submatrix according to column sliceByColumn(startCol: Int, endCol: Int)
Get submatrix getSubMatrix(startRow: Long, endRow: Long ,startCol: Int, endCol: Int)
LU decomposition LUDecompose(mode: String = "auto")

##Algorithms and Performance Evaluation ###Algorithms Currently, we implement the matrix manipulation on Spark with block matrix parallel algorithms to distribute large scale matrix computation among cluster nodes. The details of the matrix multiplication algorithm is here.

###Performance Evaluation We have done some performance evaluation of Saury. It can be seen here. We wiil update the wiki page when more results are carried out.

##The relationship between Saury and MLlib Matrix MLlib contains quite a lot of general representations of Matrix. Saury extends some of them and provides the distributed manipulation for the matrices.

We override class IndexedRow in MLlib,it is still a (Long, Vector) wraper, usage is the same as IndexedRow .

We override class IndexedRowMatrix in MLlib,from RDD[IndexedRow] to RDD[IndexRow], usage is the same as IndexedRowMatrix . Moreover, you can transfom a local Array[Array[Double]] to a IndexMatrix by using new IndexMatrix(sc, array) method.

The MTUtils can load file-format matrix from hdfs and tachyon, with loadMatrixFile(sc: SparkContext, path: String, minPatition: Int = 3) method

About

A distributed matrix operations library built on top of Spark

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published