[GPU_X] Unit tests failing with "cudaErrorInvalidDeviceFunction: invalid device function" #46864

Open
iarspider opened this issue Dec 4, 2024 · 16 comments

@iarspider
Contributor

Two unit tests, HeterogeneousTest/CUDAKernel/testCudaDeviceAdditionKernel and HeterogeneousTest/CUDAWrapper/testCudaDeviceAdditionWrapper, have been failing in the GPU_X IBs since at least CMSSW_15_0_GPU_X_2024-11-27-2300:

  REQUIRE_NOTHROW( cms::cudatest::wrapper_add_vectors_f(in1_d, in2_d, out_d, size) )
due to unexpected exception with message:
  
src/HeterogeneousTest/CUDAWrapper/src/DeviceAdditionWrapper.cu, line 17:
  cudaCheck(cudaGetLastError());
  cudaErrorInvalidDeviceFunction: invalid device function
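
For reference, the failing check is the usual launch-then-check pattern. A minimal stand-alone sketch of how cudaErrorInvalidDeviceFunction surfaces (illustrative only, not the actual DeviceAdditionWrapper.cu; the kernel body and launch configuration are assumed):

// Sketch only: the kernel launch itself does not report the error, it surfaces when
// the next runtime API call is checked. cudaErrorInvalidDeviceFunction typically means
// the fatbinary contains no SASS or PTX usable on the current GPU, i.e. the code was
// built for a different compute capability than the device it runs on.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void add_vectors_f(const float* in1, const float* in2, float* out, unsigned int size) {
  unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < size)
    out[i] = in1[i] + in2[i];
}

void wrapper_add_vectors_f(const float* in1, const float* in2, float* out, unsigned int size) {
  add_vectors_f<<<(size + 255) / 256, 256>>>(in1, in2, out, size);
  cudaError_t err = cudaGetLastError();  // this is the check that fails in the test
  if (err != cudaSuccess)
    std::printf("CUDA error: %s\n", cudaGetErrorString(err));
}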
@iarspider
Contributor Author

assign heterogeneous

@cmsbuild
Contributor

cmsbuild commented Dec 4, 2024

New categories assigned: heterogeneous

@fwyzard, @makortel you have been requested to review this pull request/issue and eventually sign. Thanks

@cmsbuild
Contributor

cmsbuild commented Dec 4, 2024

cms-bot internal usage

@cmsbuild
Contributor

cmsbuild commented Dec 4, 2024

A new Issue was created by @iarspider.

@Dr15Jones, @antoniovilela, @makortel, @mandrenguyen, @rappoccio, @sextonkennedy, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@fwyzard
Contributor

fwyzard commented Dec 4, 2024

On what machines are the tests running?

@iarspider
Contributor Author

Grid node with an NVIDIA GPU:

+ nvidia-smi
Wed Dec  4 00:52:08 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla V100-PCIE-32GB           Off |   00000000:07:00.0 Off |                    0 |
| N/A   35C    P0             26W /  250W |       3MiB /  32768MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
+ cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module  550.54.15  Tue Mar  5 22:23:56 UTC 2024
GCC version:  gcc version 11.4.1 20231218 (Red Hat 11.4.1-3) (GCC) 

@fwyzard
Contributor

fwyzard commented Dec 4, 2024

Could you also run cudaComputeCapabilities?
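
(cudaComputeCapabilities is a small utility shipped with CMSSW; a rough stand-alone equivalent, assuming only the plain CUDA runtime API, is sketched below.)

// Approximate stand-alone sketch of what cudaComputeCapabilities reports (not the real tool):
// print one line per device with its index, compute capability and name.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
  int count = 0;
  if (cudaGetDeviceCount(&count) != cudaSuccess)
    return 1;
  for (int i = 0; i < count; ++i) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, i) == cudaSuccess)
      std::printf("%4d %8d.%d    %s\n", i, prop.major, prop.minor, prop.name);
  }
  return 0;
}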

@makortel
Contributor

makortel commented Dec 4, 2024

FWIW, the test has succeeded in 14_2_X (at least between 11-27-2300 and 12-03-2300).

@iarspider
Contributor Author

@fwyzard

+ nvidia-smi
Fri Dec  6 07:34:37 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla V100S-PCIE-32GB          Off |   00000000:07:00.0 Off |                    0 |
| N/A   40C    P0             25W /  250W |       3MiB /  32768MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
+ cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module  550.54.15  Tue Mar  5 22:23:56 UTC 2024
GCC version:  gcc version 11.4.1 20231218 (Red Hat 11.4.1-3) (GCC) 
+ cudaComputeCapabilities
   0     7.0    Tesla V100S-PCIE-32GB

@fwyzard
Contributor

fwyzard commented Dec 6, 2024

What's very curious is that the alpaka-based tests all pass in the IBs 🤔

===== Test "testAlpakaDeviceAdditionKernelCudaAsync" ====
===============================================================================
All tests passed (1048577 assertions in 1 test case)


---> test testAlpakaDeviceAdditionKernelCudaAsync succeeded
TestTime:0
^^^^ End Test testAlpakaDeviceAdditionKernelCudaAsync ^^^^
>> Tests for package HeterogeneousTest/AlpakaKernel ran.
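
A quick probe for this (a debugging sketch, under the assumption that the non-alpaka library is missing device code for sm_70 rather than hitting a driver problem; the kernel name is illustrative): compile a trivial kernel with the same nvcc arch flags as the failing library and query its attributes. cudaFuncGetAttributes returns cudaErrorInvalidDeviceFunction when the embedded fatbinary holds nothing usable on the current GPU.

// Debugging sketch (hypothetical kernel; build with the same -gencode flags as the failing library)
#include <cstdio>
#include <cuda_runtime.h>

__global__ void probe_kernel() {}

int main() {
  cudaFuncAttributes attr{};
  cudaError_t err = cudaFuncGetAttributes(&attr, probe_kernel);
  if (err == cudaSuccess)
    std::printf("device code found: binaryVersion %d, ptxVersion %d\n", attr.binaryVersion, attr.ptxVersion);
  else
    std::printf("cudaFuncGetAttributes: %s\n", cudaGetErrorString(err));
  return err == cudaSuccess ? 0 : 1;
}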

@fwyzard
Contributor

fwyzard commented Dec 6, 2024

Is there a way to log in interactively on a node where the test fails?
It's hard to debug otherwise :(

@smuzaffar
Contributor

@fwyzard, you can do the following to log in to the grid GPU node (where a dummy job is running to hold the node):

ssh lxplus
~cmsbuild/public/lxplus
export _CONDOR_SCHEDD_HOST=bigbird21.cern.ch
export _CONDOR_CREDD_HOST=bigbird21.cern.ch
condor_ssh_to_job -auto-retry 487779.0

The node is available for the next 20 hours. Once you log out of this node, it will be deallocated automatically.

@fwyzard
Contributor

fwyzard commented Dec 6, 2024

Mhm, it didn't like me; I got kicked out immediately:

lxplus962:~> export _CONDOR_SCHEDD_HOST=bigbird21.cern.ch
lxplus962:~> export _CONDOR_CREDD_HOST=bigbird21.cern.ch
lxplus962:~> condor_ssh_to_job -auto-retry 487779.0
Welcome to [email protected]!
Your condor job is running with pid(s) 3240265 3241835.
b9g47n2106:dir_3240263> Connection to condor-job.b9g47n2106.cern.ch closed by remote host.
Connection to condor-job.b9g47n2106.cern.ch closed.

Can I request a similar slot myself?

@smuzaffar
Contributor

Yes, just use condor to request a GPU resource.

@smuzaffar
Contributor

Add the following to the condor job description to get a GPU:

request_GPUs = 1
Requirements = (TARGET.OpSysAndVer =?= "AlmaLinux9")

@fwyzard
Contributor

fwyzard commented Dec 6, 2024

OK, I can reproduce the problem.
