Sparse matrix vector product benchmark (in comparison with spark and dask) #68
The numbers for the 3Mx3M test matrix built from https://snap.stanford.edu/data/com-Orkut.html look like this:
scipy.sparse single threaded: 1.2s
halo (1 node, 4 workers): 600ms
halo (2 nodes, 4 workers each): 430ms
dask (4 workers): 1.0s
dask.distributed (1 node, 4 workers): 11s
dask.distributed (2 nodes, 4 workers each): 8.1s
Distributed Dask presumably performs poorly because it has no object store where the sparse matrix blocks can be kept, so the blocks have to be serialized and shipped to the workers. The single-node version of Dask avoids serialization entirely, but is limited by the Python GIL.
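For reference, here is a minimal sketch of how a benchmark like this can be run locally. It is not the exact script used for the numbers above; the file name, block count, and timing approach are assumptions. It loads the com-Orkut edge list into a scipy.sparse CSR matrix, times the single-threaded matvec, and then times a blocked variant where row blocks are multiplied in parallel threads via dask.delayed:

```python
import time

import dask
import numpy as np
import scipy.sparse as sp

# Load the com-Orkut edge list (comment lines starting with '#' are skipped)
# and build a CSR adjacency matrix. "com-orkut.ungraph.txt" is the file name
# from the SNAP download; loading ~117M edges this way is slow but simple.
edges = np.loadtxt("com-orkut.ungraph.txt", dtype=np.int64)
n = int(edges.max()) + 1
A = sp.csr_matrix((np.ones(len(edges)), (edges[:, 0], edges[:, 1])), shape=(n, n))
x = np.random.rand(n)

# Single-threaded scipy.sparse baseline.
start = time.time()
y = A.dot(x)
print("scipy.sparse:", time.time() - start)

# Blocked variant: split A into row blocks, multiply each block with the full
# vector in parallel, and concatenate the partial results.
num_blocks = 4
bounds = np.linspace(0, n, num_blocks + 1, dtype=int)
blocks = [A[bounds[i]:bounds[i + 1]] for i in range(num_blocks)]

partials = [dask.delayed(blk.dot)(x) for blk in blocks]
start = time.time()
y_blocked = np.concatenate(dask.compute(*partials, scheduler="threads"))
print("dask (threads):", time.time() - start)
```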
For PySpark, the full 3Mx3M matrix triggered a serialization error; using a 2Mx2M matrix gives:
scipy.sparse single threaded: 0.76s
spark (1 node, 4 workers): 1.41s
spark (2 nodes, 4 workers each): 1.56s
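For comparison, a hedged sketch of the PySpark variant under the same blocked scheme. Here `sc` (an existing SparkContext), `blocks`, and `x` are assumed to come from the sketch above, and `spark_matvec` is a name made up for illustration; shipping each scipy.sparse block through Spark's serializer is also where the full 3Mx3M matrix runs into trouble:

```python
import numpy as np

def spark_matvec(sc, blocks, x):
    """Blocked sparse matvec on an existing SparkContext `sc` (sketch)."""
    # Broadcast the dense vector once so every task reuses the same copy.
    x_b = sc.broadcast(x)
    # One partition per scipy.sparse row block; each task does a local matvec.
    rdd = sc.parallelize(blocks, numSlices=len(blocks))
    partials = rdd.map(lambda blk: blk.dot(x_b.value)).collect()
    return np.concatenate(partials)

# y = spark_matvec(sc, blocks, x)
```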
Before this is merged, we should check with the author of Dask whether there is a more efficient way to implement these operations.