[GPU_X] Unit tests failing with "cudaErrorInvalidDeviceFunction: invalid device function" #46864

Open
iarspider opened this issue Dec 4, 2024 · 16 comments

@iarspider
Contributor

Two unit tests, HeterogeneousTest/CUDAKernel/testCudaDeviceAdditionKernel and HeterogeneousTest/CUDAWrapper/testCudaDeviceAdditionWrapper, have been failing in the GPU_X IBs since at least CMSSW_15_0_GPU_X_2024-11-27-2300:

  REQUIRE_NOTHROW( cms::cudatest::wrapper_add_vectors_f(in1_d, in2_d, out_d, size) )
due to unexpected exception with message:
  
src/HeterogeneousTest/CUDAWrapper/src/DeviceAdditionWrapper.cu, line 17:
  cudaCheck(cudaGetLastError());
  cudaErrorInvalidDeviceFunction: invalid device function
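
For reference, the failing check is the usual launch-then-check pattern. A minimal stand-alone sketch of how cudaErrorInvalidDeviceFunction surfaces (illustrative only, not the actual DeviceAdditionWrapper.cu; the kernel body and launch configuration are assumed):

// Sketch only: the kernel launch itself does not report the error, it surfaces when
// the next runtime API call is checked. cudaErrorInvalidDeviceFunction typically means
// the fatbinary contains no SASS or PTX usable on the current GPU, i.e. the code was
// built for a different compute capability than the device it runs on.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void add_vectors_f(const float* in1, const float* in2, float* out, unsigned int size) {
  unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < size)
    out[i] = in1[i] + in2[i];
}

void wrapper_add_vectors_f(const float* in1, const float* in2, float* out, unsigned int size) {
  add_vectors_f<<<(size + 255) / 256, 256>>>(in1, in2, out, size);
  cudaError_t err = cudaGetLastError();  // this is the check that fails in the test
  if (err != cudaSuccess)
    std::printf("CUDA error: %s\n", cudaGetErrorString(err));
}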
@iarspider
Contributor Author

assign heterogeneous

@cmsbuild
Contributor

cmsbuild commented Dec 4, 2024

New categories assigned: heterogeneous

@fwyzard, @makortel you have been requested to review this pull request/issue and eventually sign. Thanks

@cmsbuild
Contributor

cmsbuild commented Dec 4, 2024

cms-bot internal usage

@cmsbuild
Contributor

cmsbuild commented Dec 4, 2024

A new Issue was created by @iarspider.

@Dr15Jones, @antoniovilela, @makortel, @mandrenguyen, @rappoccio, @sextonkennedy, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@fwyzard
Contributor

fwyzard commented Dec 4, 2024

On what machines are the tests running?

@iarspider
Contributor Author

Grid node with an NVIDIA GPU:

+ nvidia-smi
Wed Dec  4 00:52:08 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla V100-PCIE-32GB           Off |   00000000:07:00.0 Off |                    0 |
| N/A   35C    P0             26W /  250W |       3MiB /  32768MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
+ cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module  550.54.15  Tue Mar  5 22:23:56 UTC 2024
GCC version:  gcc version 11.4.1 20231218 (Red Hat 11.4.1-3) (GCC) 

@fwyzard
Contributor

fwyzard commented Dec 4, 2024

Could you also run cudaComputeCapabilities?
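
(cudaComputeCapabilities is a small utility shipped with CMSSW; a rough stand-alone equivalent, assuming only the plain CUDA runtime API, is sketched below.)

// Approximate stand-alone sketch of what cudaComputeCapabilities reports (not the real tool):
// print one line per device with its index, compute capability and name.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
  int count = 0;
  if (cudaGetDeviceCount(&count) != cudaSuccess)
    return 1;
  for (int i = 0; i < count; ++i) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, i) == cudaSuccess)
      std::printf("%4d %8d.%d    %s\n", i, prop.major, prop.minor, prop.name);
  }
  return 0;
}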

@makortel
Contributor

makortel commented Dec 4, 2024

FWIW, the test has succeeded in 14_2_X (at least between 11-27-2300 and 12-03-2300).

@iarspider
Contributor Author

@fwyzard

+ nvidia-smi
Fri Dec  6 07:34:37 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla V100S-PCIE-32GB          Off |   00000000:07:00.0 Off |                    0 |
| N/A   40C    P0             25W /  250W |       3MiB /  32768MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
+ cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module  550.54.15  Tue Mar  5 22:23:56 UTC 2024
GCC version:  gcc version 11.4.1 20231218 (Red Hat 11.4.1-3) (GCC) 
+ cudaComputeCapabilities
   0     7.0    Tesla V100S-PCIE-32GB

@fwyzard
Contributor

fwyzard commented Dec 6, 2024

What's very curious is that the alpaka-based tests all pass in the IBs 🤔

===== Test "testAlpakaDeviceAdditionKernelCudaAsync" ====
===============================================================================
All tests passed (1048577 assertions in 1 test case)


---> test testAlpakaDeviceAdditionKernelCudaAsync succeeded
TestTime:0
^^^^ End Test testAlpakaDeviceAdditionKernelCudaAsync ^^^^
>> Tests for package HeterogeneousTest/AlpakaKernel ran.
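
A quick probe for this (a debugging sketch, under the assumption that the non-alpaka library is missing device code for sm_70 rather than hitting a driver problem; the kernel name is illustrative): compile a trivial kernel with the same nvcc arch flags as the failing library and query its attributes. cudaFuncGetAttributes returns cudaErrorInvalidDeviceFunction when the embedded fatbinary holds nothing usable on the current GPU.

// Debugging sketch (hypothetical kernel; build with the same -gencode flags as the failing library)
#include <cstdio>
#include <cuda_runtime.h>

__global__ void probe_kernel() {}

int main() {
  cudaFuncAttributes attr{};
  cudaError_t err = cudaFuncGetAttributes(&attr, probe_kernel);
  if (err == cudaSuccess)
    std::printf("device code found: binaryVersion %d, ptxVersion %d\n", attr.binaryVersion, attr.ptxVersion);
  else
    std::printf("cudaFuncGetAttributes: %s\n", cudaGetErrorString(err));
  return err == cudaSuccess ? 0 : 1;
}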

@fwyzard
Contributor

fwyzard commented Dec 6, 2024

Is there a way to log in interactively on a node where the test fails?
It's hard to debug otherwise :(

@smuzaffar
Contributor

@fwyzard, you can do the following to log in to the grid GPU node (where a dummy job is running to hold the node):

ssh lxplus
~cmsbuild/public/lxplus
export _CONDOR_SCHEDD_HOST=bigbird21.cern.ch
export _CONDOR_CREDD_HOST=bigbird21.cern.ch
condor_ssh_to_job -auto-retry 487779.0

The node is available for the next 20 hours. Once you log out of this node, it will be deallocated automatically.

@fwyzard
Contributor

fwyzard commented Dec 6, 2024

Mhm, it didn't like me; I got kicked out immediately:

lxplus962:~> export _CONDOR_SCHEDD_HOST=bigbird21.cern.ch
lxplus962:~> export _CONDOR_CREDD_HOST=bigbird21.cern.ch
lxplus962:~> condor_ssh_to_job -auto-retry 487779.0
Welcome to [email protected]!
Your condor job is running with pid(s) 3240265 3241835.
b9g47n2106:dir_3240263> Connection to condor-job.b9g47n2106.cern.ch closed by remote host.
Connection to condor-job.b9g47n2106.cern.ch closed.

Can I request a similar slot myself?

@smuzaffar
Contributor

Yes, just use condor to request a GPU resource.

@smuzaffar
Contributor

Add the following to the condor job description to get a GPU:

request_GPUs = 1
Requirements = (TARGET.OpSysAndVer =?= "AlmaLinux9")

@fwyzard
Contributor

fwyzard commented Dec 6, 2024

OK, I can reproduce the problem.
