[E2E][HIP] Several E2E tests failed on HIP in SYCL Nightly testing #12997

Closed
uditagarwal97 opened this issue Mar 12, 2024 · 11 comments
Labels
bug Something isn't working hip Issues related to execution on HIP backend.

@uditagarwal97
Contributor

Describe the bug

The following E2E tests failed on HIP during SYCL Nightly testing:
https://github.com/intel/llvm/actions/runs/8242960746/job/22543076923

********************
Failed Tests (15):
  SYCL :: Basic/large-range.cpp
  SYCL :: USM/memops2d/copy2d_device_to_dhost.cpp
  SYCL :: USM/memops2d/copy2d_dhost_to_device.cpp
  SYCL :: USM/memops2d/copy2d_dhost_to_dhost.cpp
  SYCL :: USM/memops2d/copy2d_dhost_to_host.cpp
  SYCL :: USM/memops2d/copy2d_host_to_dhost.cpp
  SYCL :: USM/memops2d/fill2d.cpp
  SYCL :: USM/memops2d/memcpy2d_device_to_dhost.cpp
  SYCL :: USM/memops2d/memcpy2d_dhost_to_device.cpp
  SYCL :: USM/memops2d/memcpy2d_dhost_to_dhost.cpp
  SYCL :: USM/memops2d/memcpy2d_dhost_to_host.cpp
  SYCL :: USM/memops2d/memcpy2d_host_to_dhost.cpp
  SYCL :: USM/memops2d/memset2d.cpp
  SYCL :: syclcompat/memory/memory_management_test3.cpp
  SYCL :: syclcompat/util/util_matrix_mem_copy_test.cpp

Basic/large-range.cpp test failed with the following error message:

******************** TEST 'SYCL :: Basic/large-range.cpp' FAILED ********************
Exit Code: -8

Command Output (stdout):
--
# RUN: at line 2
/__w/llvm/llvm/toolchain/bin//clang++  -Xsycl-target-backend=amdgcn-amd-amdhsa --offload-arch=gfx1031 -fsycl -fsycl-targets=amdgcn-amd-amdhsa /__w/llvm/llvm/llvm/sycl/test-e2e/Basic/large-range.cpp -fno-sycl-id-queries-fit-in-int -O2 -o /__w/llvm/llvm/build-e2e/Basic/Output/large-range.cpp.tmp.out
# executed command: /__w/llvm/llvm/toolchain/bin//clang++ -Xsycl-target-backend=amdgcn-amd-amdhsa --offload-arch=gfx1031 -fsycl -fsycl-targets=amdgcn-amd-amdhsa /__w/llvm/llvm/llvm/sycl/test-e2e/Basic/large-range.cpp -fno-sycl-id-queries-fit-in-int -O2 -o /__w/llvm/llvm/build-e2e/Basic/Output/large-range.cpp.tmp.out
# note: command had no output on stdout or stderr
# RUN: at line 3
env SYCL_PARALLEL_FOR_RANGE_ROUNDING_TRACE=1 env ONEAPI_DEVICE_SELECTOR=hip:gpu  /__w/llvm/llvm/build-e2e/Basic/Output/large-range.cpp.tmp.out
# executed command: env SYCL_PARALLEL_FOR_RANGE_ROUNDING_TRACE=1 env ONEAPI_DEVICE_SELECTOR=hip:gpu /__w/llvm/llvm/build-e2e/Basic/Output/large-range.cpp.tmp.out
# .---command stdout------------
# | parallel_for range adjusted at dim 0 from 4294967311 to 4294967328
# | parallel_for range adjusted at dim 0 from 4294967328 to 4294967264
# | regular range<1> pass
# | parallel_for range adjusted at dim 1 from 4294967311 to 4294967264
# | regular range<2> pass
# | parallel_for range adjusted at dim 2 from 4294967311 to 4294967264
# | regular range<3> pass
# | parallel_for range adjusted at dim 0 from 4294967311 to 4294967328
# | parallel_for range adjusted at dim 0 from 4294967328 to 4294967264
# | spec constant range<1> pass
# | parallel_for range adjusted at dim 0 from 4294967311 to 4294967328
# | parallel_for range adjusted at dim 0 from 4294967328 to 97152
# `-----------------------------
# .---command stderr------------
# | 
# | UR HIP ERROR:
# | 	Value:           1
# | 	Name:            hipErrorInvalidValue
# | 	Description:     invalid argument
# | 	Function:        urEnqueueKernelLaunch
# | 	Source Location: /__w/llvm/llvm/build/_deps/unified-runtime-src/source/adapters/hip/enqueue.cpp:9
# | 
# `-----------------------------
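
The numbers in the trace above are consistent with a two-step adjustment: the range is first rounded up to a multiple of 32, then clamped to the largest multiple of 32 that still fits in 32 bits. The following is a minimal sketch of that arithmetic only. The factor 32 and the 32-bit limit are inferred from the log, not taken from the runtime source, and the function names are hypothetical.

```cpp
#include <cstdint>

// Round `n` up to the next multiple of `m` (assumes m > 0).
constexpr std::uint64_t round_up(std::uint64_t n, std::uint64_t m) {
  return ((n + m - 1) / m) * m;
}

// Largest multiple of `m` that does not exceed `limit` (assumes m > 0).
constexpr std::uint64_t largest_multiple_at_most(std::uint64_t limit,
                                                 std::uint64_t m) {
  return (limit / m) * m;
}

// Hypothetical helper: round up to a multiple of `m`, then clamp so the
// result never exceeds `limit`. This mirrors the numbers in the trace;
// it is not the SYCL runtime's implementation.
constexpr std::uint64_t adjust_range(std::uint64_t n, std::uint64_t m,
                                     std::uint64_t limit) {
  const std::uint64_t rounded = round_up(n, m);
  const std::uint64_t cap = largest_multiple_at_most(limit, m);
  return rounded > cap ? cap : rounded;
}
```

With these assumptions, `round_up(4294967311ULL, 32)` gives 4294967328 and `adjust_range(4294967311ULL, 32, UINT32_MAX)` gives 4294967264, matching the first two adjustments logged for dim 0. The final adjustment in the trace (to 97152) does not follow this pattern and is not explained by the sketch.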

USM/* and syclcompat/* tests failed with the following error message:

# .---command stderr------------
# | terminate called after throwing an instance of 'sycl::_V1::runtime_error'
# |   what():  get_pointer_type() API failed with error: -38 (PI_ERROR_INVALID_MEM_OBJECT) -38 (PI_ERROR_INVALID_MEM_OBJECT)
# `-----------------------------
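
For context, the failing USM/memops2d tests exercise 2D (pitched) copy, fill and memset operations. Below is a host-only sketch of what a pitched 2D copy does, assuming the usual convention of a row stride (pitch) in bytes plus a width in bytes and a height in rows. The function name is hypothetical, and the code does not call SYCL or HIP or reproduce the `get_pointer_type()` failure; it only illustrates the operation under test.

```cpp
#include <cstddef>
#include <cstring>
#include <stdexcept>

// Host-only reference: copy `height` rows of `width` bytes each from `src`
// (row stride `src_pitch` bytes) to `dst` (row stride `dst_pitch` bytes).
// Bytes between `width` and the pitch in each destination row are untouched.
inline void copy_2d_pitched(void* dst, std::size_t dst_pitch,
                            const void* src, std::size_t src_pitch,
                            std::size_t width, std::size_t height) {
  if (width > dst_pitch || width > src_pitch)
    throw std::invalid_argument("width must not exceed either pitch");
  auto* d = static_cast<unsigned char*>(dst);
  const auto* s = static_cast<const unsigned char*>(src);
  for (std::size_t row = 0; row < height; ++row)
    std::memcpy(d + row * dst_pitch, s + row * src_pitch, width);
}
```

A device implementation additionally has to determine whether each pointer refers to host, device or shared memory before choosing a copy path, which is the step that reported the error above.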

To reproduce

intel/llvm commit id: ad6085c

Environment

sycl-ls --verbose output:

sycl-ls --verbose

[hip:gpu][hip:0] AMD HIP BACKEND, AMD Radeon RX 6700 XT gfx1031 [HIP 60032.83]

Platforms: 1
Platform [#1]:
    Version  : HIP 60032.83
    Name     : AMD HIP BACKEND
    Vendor   : AMD Corporation
    Devices  : 1
        Device [#0]:
        Type       : gpu
        Version    : gfx1031
        Name       : AMD Radeon RX 6700 XT
        Vendor     : AMD Corporation
        Driver     : HIP 60032.83
        Aspects    : gpu fp16 fp64 online_compiler online_linker queue_profiling usm_device_allocations usm_host_allocations ext_intel_pci_address usm_atomic_host_allocations atomic64 ext_intel_device_info_uuid ext_oneapi_native_assert ext_intel_free_memory ext_intel_device_id ext_intel_memory_clock_rate ext_intel_memory_bus_width ext_intel_legacy_image pi_ext_intel_devicelib_assert ur_exp_command_buffer  cl_khr_fp64 cl_khr_fp16  ext_oneapi_graph
        info::device::sub_group_sizes: 32
default_selector()      : gpu, AMD HIP BACKEND, AMD Radeon RX 6700 XT gfx1031 [HIP 60032.83]
accelerator_selector()  : No device of requested type available. -1 (PI_ERRO...
cpu_selector()          : No device of requested type available. -1 (PI_ERRO...
gpu_selector()          : gpu, AMD HIP BACKEND, AMD Radeon RX 6700 XT gfx1031 [HIP 60032.83]
custom_selector(gpu)    : gpu, AMD HIP BACKEND, AMD Radeon RX 6700 XT gfx1031 [HIP 60032.83]
custom_selector(cpu)    : No device of requested type available. -1 (PI_ERRO..

Additional context

No response

@uditagarwal97 uditagarwal97 added bug Something isn't working hip Issues related to execution on HIP backend. labels Mar 12, 2024
@GeorgeWeb GeorgeWeb self-assigned this Mar 14, 2024
@JackAKirk
Contributor

JackAKirk commented Mar 18, 2024

The memcpy2d issues are a ROCm bug that was fixed and then apparently broken again in a later ROCm version.

Reminder that it would be a good idea to stop using a card for CI that is officially unsupported by ROCm, especially an RDNA2 one. It has no matrix cores and no good double-precision support, so in GPGPU it is only useful for a very limited set of applications, like Blender, that use single-precision floats. Hence AMD is rightly not going to prioritize maintaining support for it or fixing bugs for it in new ROCm versions.

If you want a more economical card, then a small RDNA3 or later card (which has matrix cores) would be smarter. The CDNA family cards (CDNA2/3 are currently the most relevant) are the most relevant for GPGPU, but AMD doesn't have economy variants of these cards.
The list of officially supported cards for latest rocm on linux is here: https://rocm.docs.amd.com/projects/install-on-linux/en/docs-6.0.2/reference/system-requirements.html#supported-gpus

@uditagarwal97
Contributor Author

@bader @stdale-intel can we upgrade the AMD CI machine to use the latest GPU card?

@JackAKirk
Contributor

#12955 marks as XFAIL the 2D memory tests that fail due to a ROCm driver bug, even on gfx90a.

@uditagarwal97
Contributor Author

Thanks!

@uditagarwal97
Contributor Author

Looks like we still have these failures in SYCL Nightly: https://github.com/intel/llvm/actions/runs/8516817039/job/23352774112
@JackAKirk, is there something blocking #12955? I am wondering if we can temporarily mark the failing tests as unsupported for HIP.
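
For reference, the two ways of excluding a test on one backend discussed in this thread are expressed as lit directives in the test source. The fragment below is a hypothetical sketch, not copied from #12955; it assumes the E2E suite exposes a lit feature named `hip` for this backend.

```cpp
// Hypothetical test header; a real test would use only one of the two
// directives below.

// Expected failure: the test still runs on HIP, and lit reports an
// unexpected pass once the underlying bug is fixed.
// XFAIL: hip

// Unsupported: the test is skipped on HIP entirely, so a later fix
// would not be noticed automatically.
// UNSUPPORTED: hip
```

XFAIL keeps the tests running, so a fix shows up as an unexpected pass, whereas UNSUPPORTED silences them until someone re-enables them by hand.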

@GeorgeWeb
Contributor

Hi @uditagarwal97. oneapi-src/unified-runtime#1455 (tested with #13059) should fix all of the following:

  SYCL :: USM/memops2d/copy2d_device_to_dhost.cpp
  SYCL :: USM/memops2d/copy2d_dhost_to_device.cpp
  SYCL :: USM/memops2d/copy2d_dhost_to_dhost.cpp
  SYCL :: USM/memops2d/copy2d_dhost_to_host.cpp
  SYCL :: USM/memops2d/copy2d_host_to_dhost.cpp
  SYCL :: USM/memops2d/fill2d.cpp
  SYCL :: USM/memops2d/memcpy2d_device_to_dhost.cpp
  SYCL :: USM/memops2d/memcpy2d_dhost_to_device.cpp
  SYCL :: USM/memops2d/memcpy2d_dhost_to_dhost.cpp
  SYCL :: USM/memops2d/memcpy2d_dhost_to_host.cpp
  SYCL :: USM/memops2d/memcpy2d_host_to_dhost.cpp
  SYCL :: USM/memops2d/memset2d.cpp
  SYCL :: syclcompat/memory/memory_management_test3.cpp

There will probably be a separate future fix for the joint_matrix failure, which could get XFAIL'd for now.

@GeorgeWeb
Contributor

GeorgeWeb commented Apr 2, 2024

Seems like it may be possible to merge this one unified-runtime/pull/1455 very soon as it will be affecting the next release. Not sure about the exact timeframe but it is marked as needed asap and the UR team is aware.

@JackAKirk
Contributor

JackAKirk commented Apr 2, 2024

Seems like it may be possible to merge this one unified-runtime/pull/1455 very soon as it will be affecting the next release. Not sure about the exact timeframe but it is marked as needed asap and the UR team is aware.

@uditagarwal97 is it OK to wait for this fix? Once it is merged if anything is still failing, then let me know and I can update the XFAIL PR.
Or if you need the XFAILs merged today then let me know and I can sort it immediately.

@uditagarwal97
Contributor Author

Seems like it may be possible to merge this one unified-runtime/pull/1455 very soon as it will be affecting the next release. Not sure about the exact timeframe but it is marked as needed asap and the UR team is aware.

@uditagarwal97 is it OK to wait for this fix? Once it is merged if anything is still failing, then let me know and I can update the XFAIL PR. Or if you need the XFAILs merged today then let me know and I can sort it immediately.

I think we can wait for UR fix.

@GeorgeWeb
Contributor

GeorgeWeb commented Apr 11, 2024

Hi @uditagarwal97. unified-runtime/pull/1455 was merged yesterday, so these tests should now pass. Anything remaining that may still fail should get XFAILed by #12955.

@uditagarwal97
Contributor Author

Thanks! I no longer see the failures in SYCL Nightly: https://github.com/intel/llvm/actions/runs/8640903467
