hcc::__activelaneid_u32() doesn't seem to work in HCC 2.0 #1011

Closed
dragontamer opened this issue Jan 29, 2019 · 6 comments

This was originally reported in the ROCm issue tracker. (ROCm/ROCm#688) They asked me to report it here instead.


Summary of your hardware: Threadripper 1950x + Vega64

PCIe Information: PCIe 3.0.

Here's my test code.

#include <iostream>
#include <stdint.h>

#include <hc.hpp>

int main(){
	const int N = (1 << 30) / sizeof(uint32_t); // 1 GiB worth of uint32_t elements
	uint32_t* stuff = new uint32_t[N];
	uint32_t* output = new uint32_t[N];

	for(int i=0; i<N; i++){
		stuff[i] = i;
	}

	hc::array_view<uint32_t, 1> av_in(N, stuff);
	hc::array_view<uint32_t, 1> av_out(N, output);

	// Each work-item should record its active-lane id in the output buffer.
	hc::parallel_for_each(hc::extent<1>(N), [=](hc::index<1> i) [[hc]] {
		av_out[i[0]] = hc::__activelaneid_u32();
	});

	for(int i=0; i< 100; i++){
		std::cout << i << "    " << av_out[i] << "\n"; 
	}
}

When I compile...

hcc `hcc-config --cxxflags --ldflags` test.cpp -o test
ld.lld: error: relocation R_AMDGPU_REL32_LO cannot be used against symbol __activelaneid_u32; recompile with -fPIC
>>> defined in /tmp/tmp.hO38AglmSw/kernel-gfx900.hsaco.isabin
>>> referenced by /tmp/tmp.hO38AglmSw/kernel-gfx900.hsaco.isabin:(main::$_0::__cxxamp_trampoline(unsigned int*, int, int, int, int, int, int, int))

ld.lld: error: relocation R_AMDGPU_REL32_HI cannot be used against symbol __activelaneid_u32; recompile with -fPIC
>>> defined in /tmp/tmp.hO38AglmSw/kernel-gfx900.hsaco.isabin
>>> referenced by /tmp/tmp.hO38AglmSw/kernel-gfx900.hsaco.isabin:(main::$_0::__cxxamp_trampoline(unsigned int*, int, int, int, int, int, int, int))
Generating AMD GCN kernel failed in ld.lld for target: gfx900
clang-8: error: linker command failed with exit code 1 (use -v to see invocation)
Makefile:2: recipe for target 'test' failed
make: *** [test] Error 1

Version information:

hcc --version
HCC clang version 8.0.0 (ssh://gerritgit/compute/ec/hcc-tot/clang 6ec3c61e09fbb60373eaf5a40021eb862363ba2c) (ssh://gerritgit/lightning/ec/llvm ab3b88ffc2ae50f55361a49aec89f6e95d9d0ec4) (based on HCC 1.3.18482-757fb49-6ec3c61-ab3b88f )
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/rocm/bin

For now, my code is using the following as a workaround.

int active_laneid(void) [[hc]] {
        int toReturn;
        // mbcnt counts the set bits of the mask below the current lane;
        // using exec as the mask gives this lane's rank among the
        // currently-active lanes (0, 1, 2, ... within the wavefront).
        asm volatile(
                "v_mbcnt_lo_u32_b32 %0, exec_lo, 0 \n"
                "v_mbcnt_hi_u32_b32 %0, exec_hi, %0 \n"
                : "=v" (toReturn)
           );
        return toReturn;
}
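
For reference, the workaround drops straight into the repro kernel above (an untested sketch, same av_out array view as before):

	hc::parallel_for_each(hc::extent<1>(N), [=](hc::index<1> i) [[hc]] {
		av_out[i[0]] = active_laneid(); // instead of hc::__activelaneid_u32()
	});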

ex-rzr commented Feb 1, 2019

FYI, there is hc::__lane_id which does the same but without asm.
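
For example (an untested one-liner), the repro kernel from the original report could just do:

	av_out[i[0]] = hc::__lane_id();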


dragontamer commented Feb 1, 2019

FYI, there is hc::__lane_id which does the same but without asm.

Not quite: __lane_id passes -1 as the mask to mbcnt, while active_laneid() passes exec_lo and exec_hi as the mask.
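
In mbcnt terms, __lane_id boils down to roughly the following (an untested sketch in the same inline-asm style as my workaround above; plain_laneid is just an illustrative name):

int plain_laneid(void) [[hc]] {
        int toReturn;
        // All-ones mask: count every lane below this one, active or not,
        // so each lane always gets its fixed position 0..63 in the wavefront.
        asm volatile(
                "v_mbcnt_lo_u32_b32 %0, -1, 0 \n"
                "v_mbcnt_hi_u32_b32 %0, -1, %0 \n"
                : "=v" (toReturn)
           );
        return toReturn;
}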

  // Untested conceptual example; active[] and laneid[] are output arrays
  // (e.g. captured hc::array_views) of the same size as the extent.
  hc::parallel_for_each(hc::extent<1>(N)
                      , [=](hc::index<1> i) [[hc]] {
    if(i[0] % 5 == 0){ // 1/5th of lanes are active
      active[i[0]] = active_laneid();
      laneid[i[0]] = hc::__lane_id();
    } else {
      active[i[0]] = -1;
      laneid[i[0]] = -1;
    }
  });

The active[] array will be {0, -1, -1, -1, -1, 1, -1, -1, -1, -1, 2, -1, -1, -1, -1, 3, ...}

While the laneid[] array will be {0, -1, -1, -1, -1, 5, -1, -1, -1, -1, 10, -1, -1, -1, -1, 15, ...}

In effect, __lane_id() always returns the same number for a given lane regardless of the execution mask. active_laneid() depends on the execution mask, so it can be used for reading from and writing to a queue without the need for atomics (once combined with this functionality I requested).

  // Another conceptual example, untested code
  // Assume single wavefront. I haven't figured out barriers yet for larger workgroups
  hc::parallel_for_each(hc::tiled_extent<1>(64) 
                      , [=](hc::tiled_index<1> t_idx) [[hc]] {
    tile_static int queue_head = 0; // Yeah, I know that's not how it works. But for simplicity's sake...
    tile_static int queue[5000];

    if( someComplexCondition()){ // We don't know the lane access pattern
      queue[queue_head + active_laneid()] = fooBar();
      __builtin_amdgcn_ds_append(&queue_head);
    }
});

The above code (when __builtin_amdgcn_ds_append is finally added to HCC) should add data efficiently to the tile_static queue.


ex-rzr commented Feb 1, 2019

Right, my mistake. I didn't notice exec in your code.

However, I'm not sure that your code with ds_append is correct. Have you checked it with asm?

Add (count_bits(exec_mask)) to the value stored in DS memory at (M0.base + instr_offset). Return the pre-operation value to VGPRs.

So it should be like this:

    tile_static int queue_head = 0; // Yeah, I know that's not how it works. But for simplicity's sake...
    tile_static int queue[5000];
    ... barrier

    if( someComplexCondition()){ // We don't know the lane access pattern
        // ds_append reserves the slots first: it returns the pre-increment
        // queue_head, and each active lane then writes at its own offset.
        const int p = __builtin_amdgcn_ds_append(&queue_head);
        queue[p + active_laneid()] = fooBar();
    }


dragontamer commented Feb 1, 2019

However, I'm not sure that your code with ds_append is correct. Have you checked it with asm?

Not quite yet. Just pointing out why I'm interested in __activelaneid_u32(). I haven't tested (or written) any code like that yet; it's just conceptual.

Overall, I'm trying to figure out a cheap and easy pattern for load-balancing (or work-stealing? Work-sharing??) on a GPU. This queue thing seems like it could be a useful pattern to build off of.

david-salinas self-assigned this Feb 1, 2019
david-salinas (Collaborator) commented:

Thanks. We're looking into this.

david-salinas (Collaborator) commented:

PR #1018 was merged last week. Closing this Issue.
