Block matrices in BLAS methods #224
Many of these don't exist for legacy reasons (when they were originally written, […]). Might I ask why you are preferring the blocked distribution? It was primarily meant for implementing the distributed Hessenberg QR algorithm and for legacy interfaces to ScaLAPACK.
Yes, it didn't look too hard to make the changes. I'll try and put together a pull request adding support for these methods. The use case is for better distributed GEMM performance in code where the matrix multiplies are a major bottleneck (training neural networks).
But why use block distributions for the distributed GEMM? The element-wise distributions will have better performance in Elemental. There are a few legitimate reasons to demand block distributions (e.g., for bulge-chasing algorithms like the Hessenberg QR algorithm), but element-wise distributions are much simpler, and block distributions are not quite first-class in the library (as you are finding).
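For readers following along: a minimal sketch of how the two distribution families are selected in Elemental's C++ API (the grid setup and matrix size here are illustrative assumptions, not from the thread):

```cpp
#include <El.hpp>

int main( int argc, char* argv[] )
{
    El::Environment env( argc, argv );    // RAII init/finalize of Elemental/MPI
    El::Grid grid( El::mpi::COMM_WORLD ); // process grid over all ranks

    const El::Int n = 4096; // illustrative size

    // Element-wise (cyclic) [MC,MR] distribution: the library default
    El::DistMatrix<double> AElem( grid );
    El::Uniform( AElem, n, n );

    // Block [MC,MR] distribution, e.g., for ScaLAPACK-style interfaces
    El::DistMatrix<double,El::MC,El::MR,El::BLOCK> ABlock( grid );
    El::Uniform( ABlock, n, n );

    return 0;
}
```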
I benchmarked this some time ago and found the opposite conclusion: that block matrices ended up quite a bit faster, and I didn't think it unusual (since it agrees with what I recall from theory). However, I just rewrote and ran a quick benchmark, and the results agree with you: the element-wise distribution is faster (roughly 60-75% faster). This is confusing, but perhaps there was an issue with the prior benchmark or how I ran it. Is the performance difference due to something inherent in the element-wise distribution, or just Elemental's distributed GEMM implementations?
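A benchmark of the sort being compared here might be structured as below; `TimeGemm` is a hypothetical helper name, and the barrier-plus-`El::mpi::Time` pattern is one common way to measure the walltime of a collective operation:

```cpp
#include <El.hpp>

// Time one distributed GEMM, C := A*B; MatType can be an element-wise
// or a block DistMatrix, assuming El::Gemm accepts both.
template<class MatType>
double TimeGemm( const MatType& A, const MatType& B, MatType& C )
{
    El::mpi::Barrier( El::mpi::COMM_WORLD );
    const double start = El::mpi::Time();
    El::Gemm( El::NORMAL, El::NORMAL, 1., A, B, 0., C );
    El::mpi::Barrier( El::mpi::COMM_WORLD );
    return El::mpi::Time() - start;
}
```

Instantiating this once with `El::DistMatrix<double>` and once with the `El::BLOCK` variant on the same grid and sizes would give the like-for-like comparison described above.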
Despite popular opinion, there is nothing about element-wise distributions that affects the performance of GEMM relative to a block distribution, as each process stores its portion of the matrices locally and can make use of BLAS 3. Further, nothing would prevent an implicit permutation of the rows and columns so that one could run the same algorithm as in the blocked case on an implicit permutation (though this doesn't hold for factorizations because, for example, triangular structure is not preserved under such permutations). The only change in communication pattern relative to the usual blocked algorithms is the usage of […]. The current implementation of […]
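To spell out the permutation argument (my reconstruction; $P$, $Q$, $R$ are hypothetical permutation matrices relating the block layout to the element-wise one):

$$ (P A R^\top)(R B Q^\top) = P (A B) Q^\top = P C Q^\top \quad\text{since } R^\top R = I, $$

so running the blocked GEMM algorithm on consistently permuted inputs yields exactly the permuted product. By contrast, $P L P^\top$ for a lower-triangular $L$ is generally not triangular, which is why factorizations do not enjoy the same freedom.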
Okay, that's good to know, and it explains quite well the performance I'm seeing with my latest benchmark. (The performance difference between block and elemental distributions is larger when comparing a GEMM on 2^14 x 2^14 square matrices than with 2^15 x 2^15 square matrices.) I already have code to support block distributions in the operations I mentioned above, and I don't think there's any sense in not contributing it. I'll make a pull request after I have another pair of eyes look over it for any issues.
Great, I'll be looking forward to the PR!
It looks like several of the BLAS methods currently only support element-wise matrices as arguments, and do not support block-wise matrices.
Functions I particularly care about are:
- `Hadamard`
- `Dot` (which implies support in `HilbertSchmidt`)
- `ColumnTwoNorms`
- `ColumnMaxNorms`
I haven't looked, but I suspect there are more that don't support it that I'm not using right now.
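For concreteness, here is a sketch of the calls in question (signatures are my best reading of the headers and may be abbreviated; the commented-out block-distributed calls are the overloads this issue requests):

```cpp
#include <El.hpp>

void Example( const El::Grid& grid )
{
    const El::Int n = 1024; // illustrative size

    // Element-wise distributions: these calls are supported today.
    El::DistMatrix<double> A( grid ), B( grid ), C( grid );
    El::Uniform( A, n, n );
    El::Uniform( B, n, n );

    El::Hadamard( A, B, C );                     // C = A .* B
    const double d = El::Dot( A, B );            // trace(A^H B)
    const double h = El::HilbertSchmidt( A, B ); // same inner product
    El::DistMatrix<double,El::MR,El::STAR> norms( grid );
    El::ColumnTwoNorms( A, norms );
    El::ColumnMaxNorms( A, norms );
    (void)d; (void)h;

    // Block distributions: the analogous overloads are what is missing.
    El::DistMatrix<double,El::MC,El::MR,El::BLOCK> Ab( grid ), Bb( grid ), Cb( grid );
    El::Uniform( Ab, n, n );
    El::Uniform( Bb, n, n );
    // El::Hadamard( Ab, Bb, Cb ); // no block-distribution overload yet
    // El::Dot( Ab, Bb );          // likewise
}
```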