CUDA_ERROR_SYSTEM_NOT_READY: system not yet initialized #24866

Open
carlosgmartin opened this issue Nov 12, 2024 · 1 comment
Labels
bug Something isn't working

Description

I used

python3 -m pip install --upgrade "jax[cuda12]"

to install JAX on a GPU node, but am getting a CUDA_ERROR_SYSTEM_NOT_READY error:

(base) $ python3 -c "import jax; jax.numpy.array(0)"
2024-11-12 15:37:25.005059: E external/xla/xla/stream_executor/cuda/cuda_driver.cc:152] failed call to cuInit: INTERNAL: CUDA error: Failed call to cuInit: CUDA_ERROR_SYSTEM_NOT_READY: system not yet initialized
Traceback (most recent call last):
  File "/marvel/home/cgmartin/miniforge3/lib/python3.11/site-packages/jax/_src/xla_bridge.py", line 896, in backends
    backend = _init_backend(platform)
              ^^^^^^^^^^^^^^^^^^^^^^^
  File "/marvel/home/cgmartin/miniforge3/lib/python3.11/site-packages/jax/_src/xla_bridge.py", line 982, in _init_backend
    backend = registration.factory()
              ^^^^^^^^^^^^^^^^^^^^^^
  File "/marvel/home/cgmartin/miniforge3/lib/python3.11/site-packages/jax/_src/xla_bridge.py", line 674, in factory
    return xla_client.make_c_api_client(plugin_name, updated_options, None)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/marvel/home/cgmartin/miniforge3/lib/python3.11/site-packages/jaxlib/xla_client.py", line 200, in make_c_api_client
    return _xla.get_c_api_client(plugin_name, options, distributed_client)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
jaxlib.xla_extension.XlaRuntimeError: FAILED_PRECONDITION: No visible GPU devices.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/marvel/home/cgmartin/miniforge3/lib/python3.11/site-packages/jax/_src/numpy/lax_numpy.py", line 5426, in array
    out_array: Array = lax_internal._convert_element_type(
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/marvel/home/cgmartin/miniforge3/lib/python3.11/site-packages/jax/_src/lax/lax.py", line 587, in _convert_element_type
    return convert_element_type_p.bind(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/marvel/home/cgmartin/miniforge3/lib/python3.11/site-packages/jax/_src/lax/lax.py", line 2981, in _convert_element_type_bind
    operand = core.Primitive.bind(convert_element_type_p, operand,
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/marvel/home/cgmartin/miniforge3/lib/python3.11/site-packages/jax/_src/core.py", line 438, in bind
    return self.bind_with_trace(find_top_trace(args), args, params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/marvel/home/cgmartin/miniforge3/lib/python3.11/site-packages/jax/_src/core.py", line 442, in bind_with_trace
    out = trace.process_primitive(self, map(trace.full_raise, args), params)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/marvel/home/cgmartin/miniforge3/lib/python3.11/site-packages/jax/_src/core.py", line 955, in process_primitive
    return primitive.impl(*tracers, **params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/marvel/home/cgmartin/miniforge3/lib/python3.11/site-packages/jax/_src/dispatch.py", line 91, in apply_primitive
    outs = fun(*args)
           ^^^^^^^^^^
RuntimeError: Unable to initialize backend 'cuda': FAILED_PRECONDITION: No visible GPU devices. (you may need to uninstall the failing plugin package, or set JAX_PLATFORMS=cpu to skip this backend.)
--------------------
For simplicity, JAX has removed its internal frames from the traceback of the following exception. Set JAX_TRACEBACK_FILTERING=off to include these.
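
The error message above suggests setting JAX_PLATFORMS=cpu to skip the failing backend. A minimal sketch of that workaround follows, assuming the variable is set before `jax` is imported. It only bypasses the CUDA backend so that code can run on the CPU; it does not fix the underlying `cuInit` failure.

```python
import os

# Assumption: the variable must be in place before JAX is imported,
# so set it at the very top of the script.
os.environ["JAX_PLATFORMS"] = "cpu"

try:
    import jax
except ImportError:
    jax = None  # JAX is not installed in this environment

if jax is not None:
    # With the CUDA backend skipped, only CPU devices should be listed.
    print(jax.devices())
    print(jax.numpy.array(0))
```

The same effect can be had from the shell by prefixing the command with `JAX_PLATFORMS=cpu`.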

Here's some additional output:

(base) $ echo $CUDA_VISIBLE_DEVICES
0
(base) $ echo $LD_LIBRARY_PATH

(base) $ nvidia-smi
Tue Nov 12 15:38:53 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 565.57.01              Driver Version: 565.57.01      CUDA Version: 12.7     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-SXM4-40GB          Off |   00000000:07:00.0 Off |                    0 |
| N/A   25C    P0             43W /  400W |       1MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
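
Since `nvidia-smi` sees the GPU while `cuInit` fails, it may help to call `cuInit` directly, outside JAX, to check whether the failure comes from the driver rather than from the JAX installation. The sketch below does that with `ctypes`. The helper name `cu_init_status` and the default library name `libcuda.so.1` are assumptions for illustration; `cuInit` and `cuGetErrorName` are CUDA driver API functions. The driver API documentation describes CUDA_ERROR_SYSTEM_NOT_READY as a sign that the system configuration is not in a valid state or that a required driver daemon is not running. On NVSwitch-based machines one such daemon is NVIDIA Fabric Manager; whether that applies to this node is an assumption, not something the output above confirms.

```python
import ctypes

CUDA_SUCCESS = 0


def cu_init_status(lib_name="libcuda.so.1"):
    """Call cuInit(0) directly and return (code, name).

    Returns (None, reason) if the driver library cannot be loaded.
    `lib_name` is an assumption; adjust it for the local system.
    """
    try:
        libcuda = ctypes.CDLL(lib_name)
    except OSError as exc:
        return None, f"could not load {lib_name}: {exc}"

    # CUresult cuInit(unsigned int Flags)
    libcuda.cuInit.argtypes = [ctypes.c_uint]
    libcuda.cuInit.restype = ctypes.c_int
    code = libcuda.cuInit(0)

    # CUresult cuGetErrorName(CUresult error, const char** pStr)
    libcuda.cuGetErrorName.argtypes = [ctypes.c_int, ctypes.POINTER(ctypes.c_char_p)]
    libcuda.cuGetErrorName.restype = ctypes.c_int
    name_ptr = ctypes.c_char_p()
    status = libcuda.cuGetErrorName(code, ctypes.byref(name_ptr))
    if status == CUDA_SUCCESS and name_ptr.value:
        return code, name_ptr.value.decode()
    return code, "unknown"


if __name__ == "__main__":
    print(cu_init_status())
```

If this reports the same CUDA_ERROR_SYSTEM_NOT_READY code, the problem is reproducible without JAX and reinstalling the Python packages is unlikely to change it.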

System info (python version, jaxlib version, accelerator, etc.)

2024-11-12 15:36:48.401160: E external/xla/xla/stream_executor/cuda/cuda_driver.cc:152] failed call to cuInit: INTERNAL: CUDA error: Failed call to cuInit: CUDA_ERROR_SYSTEM_NOT_READY: system not yet initialized
Traceback (most recent call last):
  File "/marvel/home/cgmartin/miniforge3/lib/python3.11/site-packages/jax/_src/xla_bridge.py", line 896, in backends
    backend = _init_backend(platform)
              ^^^^^^^^^^^^^^^^^^^^^^^
  File "/marvel/home/cgmartin/miniforge3/lib/python3.11/site-packages/jax/_src/xla_bridge.py", line 982, in _init_backend
    backend = registration.factory()
              ^^^^^^^^^^^^^^^^^^^^^^
  File "/marvel/home/cgmartin/miniforge3/lib/python3.11/site-packages/jax/_src/xla_bridge.py", line 674, in factory
    return xla_client.make_c_api_client(plugin_name, updated_options, None)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/marvel/home/cgmartin/miniforge3/lib/python3.11/site-packages/jaxlib/xla_client.py", line 200, in make_c_api_client
    return _xla.get_c_api_client(plugin_name, options, distributed_client)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
jaxlib.xla_extension.XlaRuntimeError: FAILED_PRECONDITION: No visible GPU devices.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/marvel/home/cgmartin/miniforge3/lib/python3.11/site-packages/jax/_src/environment_info.py", line 49, in print_environment_info
    device info: {xb.devices()[0].device_kind}-{xb.device_count()}, {xb.local_device_count()} local devices"
                  ^^^^^^^^^^^^
  File "/marvel/home/cgmartin/miniforge3/lib/python3.11/site-packages/jax/_src/xla_bridge.py", line 1094, in devices
    return get_backend(backend).devices()
           ^^^^^^^^^^^^^^^^^^^^
  File "/marvel/home/cgmartin/miniforge3/lib/python3.11/site-packages/jax/_src/xla_bridge.py", line 1028, in get_backend
    return _get_backend_uncached(platform)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/marvel/home/cgmartin/miniforge3/lib/python3.11/site-packages/jax/_src/xla_bridge.py", line 1007, in _get_backend_uncached
    bs = backends()
         ^^^^^^^^^^
  File "/marvel/home/cgmartin/miniforge3/lib/python3.11/site-packages/jax/_src/xla_bridge.py", line 912, in backends
    raise RuntimeError(err_msg)
RuntimeError: Unable to initialize backend 'cuda': FAILED_PRECONDITION: No visible GPU devices. (you may need to uninstall the failing plugin package, or set JAX_PLATFORMS=cpu to skip this backend.)
carlosgmartin added the bug label Nov 12, 2024

carlosgmartin commented Nov 13, 2024

I also tried uninstalling all existing nvidia* and jax-cuda* packages before re-installing JAX:

$ python3 -m pip freeze --all | grep -e nvidia -e jax-cuda | xargs python3 -m pip uninstall -y jax jaxlib
...
$ conda list | awk '{ print $1 }' | grep -e nvidia -e jax-cuda | xargs conda remove -y jax jaxlib
...
$ mamba list | awk '{ print $1 }' | grep -e nvidia -e jax-cuda | xargs mamba remove -y jax jaxlib
...
$ python3 -m pip install --upgrade "jax[cuda12]"
...
$ echo $CUDA_VISIBLE_DEVICES
0,1,2,3,4,5,6,7
$ echo $LD_LIBRARY_PATH

$ python3 -c "import jax; jax.numpy.array(0)"
RuntimeError: Unable to initialize backend 'cuda': FAILED_PRECONDITION: No visible GPU devices. (you may need to uninstall the failing plugin package, or set JAX_PLATFORMS=cpu to skip this backend.)

but still get the same error message.
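
To confirm what the reinstall actually left in the environment, the installed distributions can be listed from Python. This is a minimal sketch using the standard-library `importlib.metadata`; the helper name `installed_matching` and the default prefixes are assumptions chosen to match the `nvidia` and `jax-cuda` patterns in the commands above.

```python
from importlib import metadata


def installed_matching(prefixes=("jax", "nvidia-")):
    """Return {name: version} for installed distributions whose
    name starts with one of `prefixes` (case-insensitive)."""
    found = {}
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        # Skip distributions with missing or broken metadata.
        if name and name.lower().startswith(tuple(prefixes)):
            found[name] = dist.version
    return dict(sorted(found.items()))


if __name__ == "__main__":
    for name, version in installed_matching().items():
        print(f"{name}=={version}")
```

This only reports what is visible to the current interpreter, so it should be run with the same `python3` used for the failing command.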
