Primary information
Username: calvogon
Cluster: baobab
Description
The gpu:0 of gpu022 seems to have a hardware issue.
Steps to Reproduce
nvidia-smi shows ERR! in the power usage column for GPU 0, while the other seven GPUs report normal values:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 575.51.03              Driver Version: 575.51.03      CUDA Version: 12.9     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-PCIE-40GB          On  |   00000000:01:00.0 Off |                    0 |
| N/A   39C    P0            ERR! /  250W |       0MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A100-PCIE-40GB          On  |   00000000:22:00.0 Off |                    0 |
| N/A   33C    P0             39W /  250W |       0MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA A100-PCIE-40GB          On  |   00000000:41:00.0 Off |                    0 |
| N/A   34C    P0             34W /  250W |       0MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA A100-PCIE-40GB          On  |   00000000:61:00.0 Off |                    0 |
| N/A   34C    P0             37W /  250W |       0MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA A100-PCIE-40GB          On  |   00000000:81:00.0 Off |                    0 |
| N/A   33C    P0             35W /  250W |       0MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA A100-PCIE-40GB          On  |   00000000:A1:00.0 Off |                    0 |
| N/A   33C    P0             35W /  250W |       0MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA A100-PCIE-40GB          On  |   00000000:C1:00.0 Off |                    0 |
| N/A   32C    P0             32W /  250W |       0MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA A100-PCIE-40GB          On  |   00000000:E1:00.0 Off |                    0 |
| N/A   32C    P0             34W /  250W |       0MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
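For anyone who wants to check this without reading the table by eye, here is a minimal sketch that asks nvidia-smi for the per-GPU power draw in CSV form and flags every GPU whose reading is not a number. The function names are my own (hypothetical), and I am not assuming what exact placeholder text nvidia-smi prints for a failed sensor: anything that does not parse as a float is treated as suspect.

```python
import subprocess


def parse_power_readings(csv_text):
    """Map GPU index -> power draw in watts, or None if the reading is not numeric."""
    readings = {}
    for line in csv_text.strip().splitlines():
        index_field, power_field = (field.strip() for field in line.split(",", 1))
        index = int(index_field)
        try:
            readings[index] = float(power_field)
        except ValueError:
            # Non-numeric reading (whatever placeholder the driver prints): mark as suspect.
            readings[index] = None
    return readings


def query_power_readings():
    """Run nvidia-smi on the current node and parse its per-GPU power draw."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,power.draw", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_power_readings(result.stdout)


def faulty_gpus(readings):
    """Return the sorted indices of GPUs without a numeric power reading."""
    return sorted(index for index, watts in readings.items() if watts is None)
```

On the node itself, `faulty_gpus(query_power_readings())` would list the suspect GPU indices. A missing power reading is only a symptom, so this does not by itself prove a hardware fault.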
When I try to use the GPUs from JAX, the CUDA backend fails to initialize: XLA reports that it is unable to create a StreamExecutor for CUDA:0 (CUDA_ERROR_UNKNOWN), and JAX then raises a RuntimeError.
Python 3.13.5 (main, Jul 1 2025, 18:37:36) [Clang 20.1.4 ] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import jax
>>> jax.device_count()
2025-09-01 18:38:59.204949: W external/xla/xla/service/platform_util.cc:220] unable to create StreamExecutor for CUDA:0: CUDA error: : CUDA_ERROR_UNKNOWN: unknown error
Traceback (most recent call last):
  File "/home/users/c/calvogon/dino-jax/.venv/lib/python3.13/site-packages/jax/_src/xla_bridge.py", line 820, in backends
    backend = _init_backend(platform)
  File "/home/users/c/calvogon/dino-jax/.venv/lib/python3.13/site-packages/jax/_src/xla_bridge.py", line 904, in _init_backend
    backend = registration.factory()
  File "/home/users/c/calvogon/dino-jax/.venv/lib/python3.13/site-packages/jax/_src/xla_bridge.py", line 576, in factory
    return xla_client.make_c_api_client(plugin_name, updated_options, None)
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/users/c/calvogon/dino-jax/.venv/lib/python3.13/site-packages/jaxlib/xla_client.py", line 156, in make_c_api_client
    return _xla.get_c_api_client(
           ~~~~~~~~~~~~~~~~~~~~~^
        plugin_name,
        ^^^^^^^^^^^^
    ...<2 lines>...
        transfer_server_factory,
        ^^^^^^^^^^^^^^^^^^^^^^^^
    )
    ^
jaxlib._jax.XlaRuntimeError: INTERNAL: no supported devices found for platform CUDA

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<python-input-1>", line 1, in <module>
    jax.device_count()
    ~~~~~~~~~~~~~~~~^^
  File "/home/users/c/calvogon/dino-jax/.venv/lib/python3.13/site-packages/jax/_src/xla_bridge.py", line 983, in device_count
    return int(get_backend(backend).device_count())
               ~~~~~~~~~~~^^^^^^^^^
  File "/home/users/c/calvogon/dino-jax/.venv/lib/python3.13/site-packages/jax/_src/util.py", line 298, in wrapper
    return cached(config.trace_context() if trace_context_in_key else _ignore(),
                  *args, **kwargs)
  File "/home/users/c/calvogon/dino-jax/.venv/lib/python3.13/site-packages/jax/_src/util.py", line 292, in cached
    return f(*args, **kwargs)
  File "/home/users/c/calvogon/dino-jax/.venv/lib/python3.13/site-packages/jax/_src/xla_bridge.py", line 952, in get_backend
    return _get_backend_uncached(platform)
  File "/home/users/c/calvogon/dino-jax/.venv/lib/python3.13/site-packages/jax/_src/xla_bridge.py", line 931, in _get_backend_uncached
    bs = backends()
  File "/home/users/c/calvogon/dino-jax/.venv/lib/python3.13/site-packages/jax/_src/xla_bridge.py", line 836, in backends
    raise RuntimeError(err_msg)
RuntimeError: Unable to initialize backend 'cuda': INTERNAL: no supported devices found for platform CUDA (you may need to uninstall the failing plugin package, or set JAX_PLATFORMS=cpu to skip this backend.)
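
One way to check whether only GPU 0 is affected is to hide it from CUDA and retry. This is an untested sketch, not a confirmed fix: it assumes all eight GPUs are visible to the process with the same indices as in the nvidia-smi output, and that setting `CUDA_VISIBLE_DEVICES` in the process is allowed on this node. The helper names are hypothetical.

```python
import os


def visible_devices_excluding(faulty_index, total_gpus):
    """Build a CUDA_VISIBLE_DEVICES value that lists every GPU except the faulty one."""
    return ",".join(str(i) for i in range(total_gpus) if i != faulty_index)


def count_devices_without(faulty_index, total_gpus):
    """Hide one GPU from CUDA, then let JAX initialize its backend and count devices.

    Must be called in a fresh interpreter, before JAX has initialized any backend,
    because the variable is only read when the CUDA backend starts up.
    """
    os.environ["CUDA_VISIBLE_DEVICES"] = visible_devices_excluding(faulty_index, total_gpus)
    import jax  # imported only after the environment variable is set

    return jax.device_count()
```

If `count_devices_without(0, 8)` returns 7 in a fresh Python session on gpu022, that would support the idea that the failure is limited to GPU 0. If it still raises the same RuntimeError, the problem is probably not confined to that one device.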
