added triangular matrix multiplication kernel #214
Merged
Companion to #213. This adds a file dedicated to the development of this matmul.
It also shows the intermediate kernels on the way towards an efficient version.
To give readers a break from all the maths and indexing in the code, the development of these kernels is told as a story.
Some of the metaphors are stretched quite a bit, so feel free to make adjustments, but I hope that overall, this might be easier to follow than just "indexing with this formula to achieve coalesced access".
Currently, the reads in the inner loop still cause 2-way bank conflicts, so there is room for improvement.
Timings on my machine:
Given that we're doing only half the work, that still leaves us 20% less efficient than cuBLAS.