added triangular matrix multiplication kernel #214

ngc92 · 2024-04-22T11:54:40Z

Companion to #213, adding a file specifically for the development of this matmul.
Also shows different intermediate kernels on the way towards efficiency.

To give a break from all the maths and indexing in the code, the development of these is described as a story.
Some of the metaphors are stretched quite a bit, so feel free to make adjustments, but I hope that overall, this might be easier to follow than just "indexing with this formula to achieve coalesced access".

Currently, the reads in the inner loop still cause 2-way bank conflicts, so there is still room for improvement.
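For readers unfamiliar with the term, here is a minimal host-side sketch of what an n-way bank conflict means. It assumes the usual layout of 32 shared memory banks, each 4 bytes wide, and a warp of 32 threads. The strided access pattern and the helper name `conflict_degree` are hypothetical illustrations, not the indexing this kernel actually uses.

```c
#include <assert.h>

#define NUM_BANKS 32  /* assumed: 32 banks, each 4 bytes wide */
#define WARP_SIZE 32  /* threads that issue a shared memory read together */

/* Worst-case number of threads in one warp that land in the same bank
 * when thread t reads the 4-byte word at index t * stride_words.
 * The strided pattern is a hypothetical example, not the kernel's indexing. */
int conflict_degree(int stride_words) {
    assert(stride_words > 0);  /* distinct addresses per thread, so no broadcast */
    int hits[NUM_BANKS] = {0};
    int worst = 0;
    for (int t = 0; t < WARP_SIZE; ++t) {
        int bank = (t * stride_words) % NUM_BANKS;
        hits[bank]++;
        if (hits[bank] > worst) {
            worst = hits[bank];
        }
    }
    return worst;
}
```

Under these assumptions, a stride of 1 word gives a degree of 1 (conflict-free), while a stride of 2 words maps threads `t` and `t + 16` to the same bank, which is a 2-way conflict: the hardware serves that read in two passes instead of one.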

Timings on my machine:

1.40 ms for this kernel vs 2.37 ms for cuBLAS

Given that we're doing only half the work, that still leaves us about 20% less efficient than cuBLAS.
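The arithmetic behind that comparison can be sketched as follows. The function names are hypothetical, and the sketch assumes the triangular product computes the entries on and below the diagonal of a square `n x n` output with inner dimension `k`.

```c
#include <assert.h>

/* Multiply-accumulates for the full n x n output with inner dimension k. */
long long full_macs(long long n, long long k) {
    assert(n > 0 && k > 0);
    return n * n * k;
}

/* Multiply-accumulates when only entries with col <= row are computed
 * (assumption: the diagonal is included). */
long long lower_triangle_macs(long long n, long long k) {
    assert(n > 0 && k > 0);
    return n * (n + 1) / 2 * k;
}

/* Ratio of the triangular kernel's time to half of the full-matrix time.
 * 1.0 would mean the kernel is exactly as efficient per unit of work. */
double time_vs_half_of_full(double tri_ms, double full_ms) {
    assert(tri_ms > 0.0 && full_ms > 0.0);
    return tri_ms / (0.5 * full_ms);
}
```

With the timings above, half of the cuBLAS time is about 1.19 ms, so `time_vs_half_of_full(1.40, 2.37)` comes out at roughly 1.18: the kernel takes about 18% longer than a perfectly scaled half-size workload would, which matches the "about 20%" estimate.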

karpathy · 2024-04-22T14:56:30Z

Wow, you really had a lot of fun with the TriMatlon 😂 😂 😂
The most incredible fusion of art and engineering I've seen yet :D

ngc92 added 2 commits April 22, 2024 14:45

added triangular matrix multiplication kernel

caa69a5

added NaN-based masks for reference checks

732a8b4

karpathy merged commit 7830cf6 into karpathy:master Apr 22, 2024
3 checks passed

ngc92 deleted the trimul branch April 28, 2024 08:46
