Block matrices in BLAS methods #224

Open · ndryden opened this issue Mar 21, 2017 · 7 comments

@ndryden
Contributor

ndryden commented Mar 21, 2017

It looks like several of the BLAS methods currently only support element-wise distributed matrices as arguments, and do not support block-wise distributed matrices.

Functions I particularly care about are:

  • Hadamard
  • Dot (which implies support in HilbertSchmidt)
  • ColumnTwoNorms
  • ColumnMaxNorms

I haven't checked exhaustively, but I suspect there are more that lack block support; I'm just not using them right now.
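
For concreteness, here is a minimal sketch of the calls in question with block-wise distributed arguments (this is illustrative, not code from the issue; the matrix sizes and the [MC,MR] block distribution are assumptions, and at the time of this issue these overloads only accepted element-wise distributions):

```cpp
#include <El.hpp>

int main( int argc, char* argv[] )
{
    El::Initialize( argc, argv );
    {
        El::Grid grid( El::mpi::COMM_WORLD );

        // The fourth template argument, El::BLOCK, selects the block-cyclic
        // wrapping instead of the default element-wise one.
        El::DistMatrix<double,El::MC,El::MR,El::BLOCK> A(grid), B(grid), C(grid);
        El::Uniform( A, 1000, 1000 );
        El::Uniform( B, 1000, 1000 );

        El::Hadamard( A, B, C );           // entrywise product
        const double d = El::Dot( A, B );  // what HilbertSchmidt builds on
        // ColumnTwoNorms and ColumnMaxNorms would be exercised analogously.
        (void)d;
    }
    El::Finalize();
    return 0;
}
```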

@poulson
Member

poulson commented Mar 24, 2017

Many of these don't exist for legacy reasons (when they were originally written, El::DistMatrix<T,colDist,rowDist,El::BLOCK> did not yet exist), but adding support for each should only require a few lines of code.

Might I ask why you prefer the blocked distribution? It was primarily meant for implementing the distributed Hessenberg QR algorithm and for legacy interfaces to ScaLAPACK.

@ndryden
Contributor Author

ndryden commented Mar 24, 2017

Yes, it didn't look too hard to make the changes. I'll try to put together a pull request adding support for these methods.

The use case is for better distributed GEMM performance in code where the matrix multiplies are a major bottleneck (training neural networks).

@poulson
Member

poulson commented Mar 24, 2017

But why use block distributions for the distributed GEMM? The element-wise distributions will have better performance in Elemental. There are a few legitimate reasons to demand block distributions (e.g., for bulge-chasing algorithms like the Hessenberg QR algorithm), but element-wise distributions are much simpler, and block distributions are not quite first-class in the library (as you are finding).

@ndryden
Contributor Author

ndryden commented Mar 24, 2017

I benchmarked this some time ago and reached the opposite conclusion: block matrices ended up quite a bit faster. I didn't think that was unusual, since it agrees with what I recall from theory. However, I just rewrote and reran a quick benchmark, and the results agree with you: the element-wise distribution is faster (by roughly 60-75%). That is confusing, but perhaps there was an issue with the prior benchmark or with how I ran it.

Is the performance difference due to something inherent in the element-wise distribution, or just Elemental's distributed GEMM implementations?

@poulson
Member

poulson commented Mar 24, 2017

Despite popular opinion, there is nothing about element-wise distributions that affects the performance of GEMM relative to a block distribution, as each process stores its portion of the matrices locally and can make use of BLAS 3. Further, nothing prevents implicitly permuting the rows and columns so that one could run the same algorithm as in the blocked case on the permuted matrix (though this does not hold for factorizations because, for example, triangular structure is not preserved under such permutations).

The only change in communication pattern relative to the usual blocked algorithms is the use of MPI_Allgather rather than MPI_Bcast, but their performance is essentially the same.

The current implementation of El::Gemm for distributed dense matrices uses El::ReadProxy and El::ReadWriteProxy to redistribute into an El::DistMatrix<T,El::MC,El::MR> distribution, so if you pass in block distributed matrices, there are four extra communication steps (A, B, and C into elemental form, and C back into block form). This should account for the performance differences you are seeing (and, as a result, you should see the relative performance converge as the matrix dimensions grow, since these are quadratic costs in a cubic-cost algorithm).
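
To make those four steps concrete, here is a rough sketch of the equivalent explicit redistribution, written with El::Copy instead of the internal proxy classes (the helper name and the decision to round-trip C are illustrative assumptions, not Elemental's actual El::Gemm code):

```cpp
#include <El.hpp>

// Sketch: emulate what happens when a GEMM is requested on block-distributed
// arguments by explicitly redistributing into the elemental [MC,MR]
// distribution, multiplying there, and copying the result back.
void BlockGemmViaElemental
( const El::DistMatrix<double,El::MC,El::MR,El::BLOCK>& ABlock,
  const El::DistMatrix<double,El::MC,El::MR,El::BLOCK>& BBlock,
        El::DistMatrix<double,El::MC,El::MR,El::BLOCK>& CBlock )
{
    const El::Grid& grid = ABlock.Grid();
    El::DistMatrix<double,El::MC,El::MR> A(grid), B(grid), C(grid);
    El::Copy( ABlock, A );  // redistribution 1 (quadratic cost)
    El::Copy( BBlock, B );  // redistribution 2
    El::Copy( CBlock, C );  // redistribution 3
    El::Gemm( El::NORMAL, El::NORMAL, 1., A, B, 0., C );  // cubic-cost multiply
    El::Copy( C, CBlock );  // redistribution 4: back into block form
}
```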

@ndryden
Contributor Author

ndryden commented Mar 24, 2017

Okay, that's good to know, and it explains the performance I'm seeing with my latest benchmark quite well. (The performance difference between block and elemental distributions is larger for a GEMM on 2^14 x 2^14 square matrices than for one on 2^15 x 2^15 square matrices.)
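
As a back-of-envelope check (a rough model, not a measurement from this thread): each extra redistribution of an n x n matrix moves on the order of $n^2$ entries, while the multiply performs roughly $2n^3$ flops, so

$$
\frac{T_{\text{redistribute}}}{T_{\text{GEMM}}} \;\sim\; \frac{c_1\,n^2}{c_2\,n^3} \;=\; \frac{c_1}{c_2}\cdot\frac{1}{n},
$$

where $c_1$ and $c_2$ are machine-dependent constants. Doubling $n$ from $2^{14}$ to $2^{15}$ roughly halves the relative overhead, which matches the gap being larger at $2^{14}$ than at $2^{15}$.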

I already have code to support block distributions in the operations I mentioned above, and I don't think there's any sense in not contributing it. I'll make a pull request after I have another pair of eyes look over it for any issues.

@poulson
Member

poulson commented Mar 24, 2017

Great, I'll be looking forward to the PR!
