hcc::__activelaneid_u32() doesn't seem to work in HCC 2.0 #1011

Closed
dragontamer opened this issue Jan 29, 2019 · 6 comments

This was originally reported in the ROCm issue tracker. (ROCm/ROCm#688) They asked me to report it here instead.


Summary of your hardware: Threadripper 1950x + Vega64

PCIe Information: PCIe 3.0.

Here's my test code.

#include <iostream>
#include <stdint.h>

#include <hc.hpp>

int main(){
	const int N = (1 << 30) / sizeof(uint32_t); // 1 GiB worth of uint32_t elements
	uint32_t* stuff = new uint32_t[N];
	uint32_t* output = new uint32_t[N];

	for(int i=0; i<N; i++){
		stuff[i] = i;
	}

	hc::array_view<uint32_t, 1> av_in(N, stuff);
	hc::array_view<uint32_t, 1> av_out(N, output);

	// Each work-item should record its active-lane id in the output buffer.
	hc::parallel_for_each(hc::extent<1>(N), [=](hc::index<1> i) [[hc]] {
		av_out[i[0]] = hc::__activelaneid_u32();
	});

	for(int i=0; i< 100; i++){
		std::cout << i << "    " << av_out[i] << "\n"; 
	}
}

When I compile...

hcc `hcc-config --cxxflags --ldflags` test.cpp -o test
ld.lld: error: relocation R_AMDGPU_REL32_LO cannot be used against symbol __activelaneid_u32; recompile with -fPIC
>>> defined in /tmp/tmp.hO38AglmSw/kernel-gfx900.hsaco.isabin
>>> referenced by /tmp/tmp.hO38AglmSw/kernel-gfx900.hsaco.isabin:(main::$_0::__cxxamp_trampoline(unsigned int*, int, int, int, int, int, int, int))

ld.lld: error: relocation R_AMDGPU_REL32_HI cannot be used against symbol __activelaneid_u32; recompile with -fPIC
>>> defined in /tmp/tmp.hO38AglmSw/kernel-gfx900.hsaco.isabin
>>> referenced by /tmp/tmp.hO38AglmSw/kernel-gfx900.hsaco.isabin:(main::$_0::__cxxamp_trampoline(unsigned int*, int, int, int, int, int, int, int))
Generating AMD GCN kernel failed in ld.lld for target: gfx900
clang-8: error: linker command failed with exit code 1 (use -v to see invocation)
Makefile:2: recipe for target 'test' failed
make: *** [test] Error 1

Version information:

hcc --version
HCC clang version 8.0.0 (ssh://gerritgit/compute/ec/hcc-tot/clang 6ec3c61e09fbb60373eaf5a40021eb862363ba2c) (ssh://gerritgit/lightning/ec/llvm ab3b88ffc2ae50f55361a49aec89f6e95d9d0ec4) (based on HCC 1.3.18482-757fb49-6ec3c61-ab3b88f )
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/rocm/bin

For now, my code is using the following as a workaround.

int active_laneid(void) [[hc]] {
        int toReturn;
        // mbcnt counts the set bits of the mask below the current lane;
        // using exec as the mask gives this lane's rank among the
        // currently-active lanes (0, 1, 2, ... within the wavefront).
        asm volatile(
                "v_mbcnt_lo_u32_b32 %0, exec_lo, 0 \n"
                "v_mbcnt_hi_u32_b32 %0, exec_hi, %0 \n"
                : "=v" (toReturn)
           );
        return toReturn;
}
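
For reference, the workaround drops straight into the repro kernel above (an untested sketch, same av_out array view as before):

	hc::parallel_for_each(hc::extent<1>(N), [=](hc::index<1> i) [[hc]] {
		av_out[i[0]] = active_laneid(); // instead of hc::__activelaneid_u32()
	});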

ex-rzr commented Feb 1, 2019

FYI, there is hc::__lane_id which does the same but without asm.
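
For example (an untested one-liner), the repro kernel from the original report could just do:

	av_out[i[0]] = hc::__lane_id();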


dragontamer commented Feb 1, 2019

FYI, there is hc::__lane_id which does the same but without asm.

Not quite: __lane_id passes -1 as the mask to mbcnt, while active_laneid() passes exec_lo and exec_hi as the mask.
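
In mbcnt terms, __lane_id boils down to roughly the following (an untested sketch in the same inline-asm style as my workaround above; plain_laneid is just an illustrative name):

int plain_laneid(void) [[hc]] {
        int toReturn;
        // All-ones mask: count every lane below this one, active or not,
        // so each lane always gets its fixed position 0..63 in the wavefront.
        asm volatile(
                "v_mbcnt_lo_u32_b32 %0, -1, 0 \n"
                "v_mbcnt_hi_u32_b32 %0, -1, %0 \n"
                : "=v" (toReturn)
           );
        return toReturn;
}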

  // Untested conceptual example; active[] and laneid[] are output arrays
  // (e.g. captured hc::array_views) of the same size as the extent.
  hc::parallel_for_each(hc::extent<1>(N)
                      , [=](hc::index<1> i) [[hc]] {
    if(i[0] % 5 == 0){ // 1/5th of lanes are active
      active[i[0]] = active_laneid();
      laneid[i[0]] = hc::__lane_id();
    } else {
      active[i[0]] = -1;
      laneid[i[0]] = -1;
    }
  });

The active[] array will be {0, -1, -1, -1, -1, 1, -1, -1, -1, -1, 2, -1, -1, -1, -1, 3, ...}

While the laneid[] array will be {0, -1, -1, -1, -1, 5, -1, -1, -1, -1, 10, -1, -1, -1, -1, 15, ...}

In effect, __lane_id() always returns the same number for a given lane regardless of the execution mask. active_laneid() depends on the execution mask, so it can be used for reading from and writing to a queue without the need for atomics (once combined with this functionality I requested).

  // Another conceptual example, untested code
  // Assume single wavefront. I haven't figured out barriers yet for larger workgroups
  hc::parallel_for_each(hc::tiled_extent<1>(64) 
                      , [=](hc::tiled_index<1> t_idx) [[hc]] {
    tile_static int queue_head = 0; // Yeah, I know that's not how it works. But for simplicity's sake...
    tile_static int queue[5000];

    if( someComplexCondition()){ // We don't know the lane access pattern
      queue[queue_head + active_laneid()] = fooBar();
      __builtin_amdgcn_ds_append(&queue_head);
    }
});

The above code (when __builtin_amdgcn_ds_append is finally added to HCC) should add data efficiently to the tile_static queue.


ex-rzr commented Feb 1, 2019

Right, my mistake. I didn't notice exec in your code.

However, I'm not sure that your code with ds_append is correct. Have you checked it with asm?

Add (count_bits(exec_mask)) to the value stored in DS memory at (M0.base + instr_offset). Return the pre-operation value to VGPRs.

So it should be like this:

    tile_static int queue_head = 0; // Yeah, I know that's not how it works. But for simplicity's sake...
    tile_static int queue[5000];
    ... barrier

    if( someComplexCondition()){ // We don't know the lane access pattern
        // ds_append reserves the slots first: it returns the pre-increment
        // queue_head, and each active lane then writes at its own offset.
        const int p = __builtin_amdgcn_ds_append(&queue_head);
        queue[p + active_laneid()] = fooBar();
    }


dragontamer commented Feb 1, 2019

However, I'm not sure that your code with ds_append is correct. Have you checked it with asm?

Not quite yet. Just pointing out why I'm interested in __activelaneid_u32(). I haven't tested (or written) any code like that yet; it's just conceptual.

Overall, I'm trying to figure out a cheap and easy pattern for load-balancing (or work-stealing? Work-sharing??) on a GPU. This queue thing seems like it could be a useful pattern to build off of.

david-salinas self-assigned this Feb 1, 2019
david-salinas (Collaborator) commented:

Thanks. We're looking into this.

david-salinas (Collaborator) commented:

PR #1018 was merged last week. Closing this Issue.
