how to overlap the share2register and computing process? #14
Comments
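This is a bit roundabout to explain, so please bear with me.
Ping-pong requires two mutually independent agents. In computer architecture, arithmetic runs on the ALU and data movement runs on the MMU, so these are two independent agents: one computes while the other moves data.
Nominally, issuing a command takes only 1 cycle, while the data movement itself takes 100 cycles.
A concrete analogy: you have a helper. You order the helper to sell ice, and you make the ice yourself. In pseudocode, your work is:
make_ice(0) // 100 cycles
sell_ice(helper, ptr_ice0) // 1 cycle to issue the command
make_ice(1) // 100 cycles
sell_ice(helper, ptr_ice1) // 1 cycle to issue the command
The helper's work is:
recv_sell_ice_cmd(ptr_ice0)
do_sell_ice(ptr_ice0) // 100 cycles
recv_sell_ice_cmd(ptr_ice1)
do_sell_ice(ptr_ice1) // 100 cycles
The work is now parallelized: the task finishes after 302 (202 + 100) cycles, although the two agents together did 402 cycles of work.
Back to the original question: part1 and part2 are sequential in the code, but they are executed by different hardware units.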
By "ice" I mean bingfen (冰粉, ice jelly), a Chengdu specialty.
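The cycle counts in the analogy can be checked with a small host-side model. This is only a sketch: `Timeline` and `simulate` are hypothetical names, and the model assumes the helper starts an item as soon as the command has been issued and the helper is idle.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical model of the ice analogy (not code from the repository).
struct Timeline {
    long finish;      // cycle at which the helper completes the last item
    long total_work;  // busy cycles summed over both agents
};

// The producer makes an item (make_cycles), then issues a command
// (issue_cycles); the helper needs sell_cycles per item.
Timeline simulate(int items, long make_cycles, long issue_cycles, long sell_cycles) {
    assert(items >= 0);
    long boss_clock = 0;    // cycle at which the producer is next free
    long helper_clock = 0;  // cycle at which the helper is next free
    for (int i = 0; i < items; ++i) {
        boss_clock += make_cycles;   // make item i
        boss_clock += issue_cycles;  // issue "sell item i"
        // The helper starts once the command has arrived and it is idle.
        long start = std::max(boss_clock, helper_clock);
        helper_clock = start + sell_cycles;
    }
    long total = items * (make_cycles + issue_cycles + sell_cycles);
    return {std::max(boss_clock, helper_clock), total};
}
```

With the numbers from the analogy, `simulate(2, 100, 1, 100)` gives a finish time of 302 cycles and 402 cycles of total work, matching the figures quoted in the comment.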
Thank you for your reply. There is no sync between part1 and part2, so I assumed they run sequentially. I asked my colleague, who said that part1 and part2 run in parallel in the hardware, and that it is the register dependency that ensures s2r (the shared-to-register load) has finished before the computation uses the data. His explanation matches yours.
I asked him using CUTLASS code, which has the same pipeline as your code. I also want to know why you use PTX. What is the advantage of asm code?
PTX on CUDA is not as powerful as
You can just use C code; the GFLOPS should be the same.
I have another question about MMult_cuda_12.cu
Honestly, I don't know how the shared-memory-to-register load and the computation overlap. Is it the asm (PTX) that makes them run in parallel? The instructions are issued sequentially, so how can these two parts of the code hide each other's latency?
part1: loading shared-memory to panel
lds128(panelA[pp][0], panelA[pp][1], panelA[pp][2], panelA[pp][3],
aptr_base + ((subk + 1) % 8) * SMEM_LDA * sizeof(float));
lds128(panelA[pp][4], panelA[pp][5], panelA[pp][6], panelA[pp][7],
aptr_base + (((subk + 1) % 8) * SMEM_LDA + 64) * sizeof(float));
lds128(panelB[pp][0], panelB[pp][1], panelB[pp][2], panelB[pp][3],
bptr_base + ((subk + 1) % 8) * SMEM_LDB * sizeof(float));
lds128(panelB[pp][4], panelB[pp][5], panelB[pp][6], panelB[pp][7],
bptr_base + (((subk + 1) % 8) * SMEM_LDB + 64) * sizeof(float));
part2: computing the result of panel-data
#pragma unroll
for (int i = 0; i < 8; ++i) {
    #pragma unroll
    for (int j = 0; j < 8; ++j) {
        sum[i][j] += panelA[subk % 2][i] * panelB[subk % 2][j];
    }
}
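The buffer indexing in part1 and part2 can be emulated on the host to see why the two parts are independent. This is a simplified sketch, not the kernel itself: `pingpong_outer_products` is a hypothetical name, `pp` is assumed to be `(subk + 1) % 2` (its definition is not shown in the excerpt), and plain array copies stand in for `lds128`. On the GPU, a load instruction is issued and the thread keeps executing; it only stalls when a later instruction reads a register whose load is still pending. Because part2 reads buffer `subk % 2` while part1 writes buffer `pp`, the multiply-adds do not have to wait for the loads.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side emulation of the ping-pong register panels (hypothetical helper).
// a, b: K slices of 8 floats each, stored contiguously.
// sum:  8 x 8 accumulator, row-major, zero-initialized by the caller.
void pingpong_outer_products(const std::vector<float>& a,
                             const std::vector<float>& b,
                             int K, std::vector<float>& sum) {
    assert(K >= 0 && a.size() == static_cast<std::size_t>(K) * 8);
    assert(b.size() == a.size() && sum.size() == 64);
    if (K == 0) return;
    float panelA[2][8], panelB[2][8];
    // Prologue: fill buffer 0 before the loop starts.
    for (int i = 0; i < 8; ++i) { panelA[0][i] = a[i]; panelB[0][i] = b[i]; }
    for (int subk = 0; subk < K; ++subk) {
        const int pp = (subk + 1) % 2;  // buffer being filled (assumed definition)
        // part1: fetch the NEXT slice into the buffer part2 is not reading.
        if (subk + 1 < K) {
            for (int i = 0; i < 8; ++i) {
                panelA[pp][i] = a[(subk + 1) * 8 + i];
                panelB[pp][i] = b[(subk + 1) * 8 + i];
            }
        }
        // part2: outer product on the CURRENT slice, loaded one iteration ago.
        for (int i = 0; i < 8; ++i)
            for (int j = 0; j < 8; ++j)
                sum[i * 8 + j] += panelA[subk % 2][i] * panelB[subk % 2][j];
    }
}
```

On the CPU this runs strictly in order, so it shows only the data dependencies: part2 never reads a value that part1 wrote in the same iteration, which is what lets the hardware run the two in parallel.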