Hello,
I am trying to use PyTorch with a GPU on Baobab. My lab has access to a private partition (private-teodoro-gpu) with two Baobab nodes: gpu034 and gpu035. I believe both nodes have 8 GPUs, all of them NVIDIA GeForce RTX 3090.
I tried to follow the tutorial for using PyTorch with a GPU. It didn’t work as expected, so to understand why, I slightly modified the scripts.
First, I specified which node I was using in the sbatch file:
#!/bin/sh
#SBATCH --time=00:01
#SBATCH --partition=private-teodoro-gpu # shared-gpu produces the same behaviour
#SBATCH --nodelist=gpu034 # this is the added line! NOTE: I tried both gpu034 and gpu035
#SBATCH --cpus-per-task=2
#SBATCH --gpus=1

REGISTRY=/opt/cluster/registry
SIF=pytorch_23.05-py3.sif
IMAGE=${REGISTRY}/${SIF}
SCRIPT=pytorch_tensors.py

srun apptainer run --nv ${IMAGE} python ${SCRIPT}
Second, I changed the Python script so that it simply prints the nvcc and nvidia-smi information and checks whether CUDA is available to PyTorch:
import os
import torch

os.system('nvcc --version')
os.system('nvidia-smi')

device_count = torch.cuda.device_count()
print("torch.cuda.device_count:", device_count)
device = "cuda" if torch.cuda.is_available() else "cpu"
torch.set_default_device(device)

print('\n\n\n#################')
print('Using %s device' % device)
print('#################\n\n\n')
Then I ran this script through the Slurm file on each node in turn (gpu034 and gpu035), and here is the problem. On gpu035, PyTorch uses the GPU successfully. On gpu034, the container reports that CUDA failed to initialize (unknown error 999) and PyTorch falls back to the CPU, even though torch.cuda.device_count still reports 1. The surprising part is that both nodes show the same GPU device (RTX 3090), the same driver version in nvidia-smi (530.30.02), and the same CUDA version (12.1). Below are the output logs I obtain in both cases.
I don’t understand why one node works, and the other doesn’t. Any help? Thanks!
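To narrow this down, here is a minimal sketch of a slightly more detailed check that could be run inside the container on both nodes. It records the hostname, the GPU-related environment variables that Slurm may set, the /dev/nvidia* device nodes visible to the job, and the exception raised by torch.cuda.init() if there is one. The function name gpu_environment_report is made up for illustration. The idea that a missing device node or a differing environment variable is related to the failure is only an assumption; the logs do not confirm it.

```python
import glob
import os
import socket


def gpu_environment_report(dev_pattern="/dev/nvidia*"):
    """Collect node-level facts that may differ between two similar nodes."""
    report = {
        "hostname": socket.gethostname(),
        # These are typically set by Slurm for GPU jobs; None means "not set".
        "CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES"),
        "SLURM_JOB_GPUS": os.environ.get("SLURM_JOB_GPUS"),
        # Device nodes visible from inside the container.
        "device_nodes": sorted(glob.glob(dev_pattern)),
    }
    try:
        import torch
    except ImportError:
        # Allows the script to run even where PyTorch is not installed.
        report["torch"] = None
        return report

    report["torch"] = torch.__version__
    report["cuda_available"] = torch.cuda.is_available()
    try:
        # Force CUDA initialization so that any error is raised here
        # instead of being downgraded to a warning.
        torch.cuda.init()
        report["cuda_init_error"] = None
    except Exception as exc:
        report["cuda_init_error"] = repr(exc)
    return report


if __name__ == "__main__":
    for key, value in gpu_environment_report().items():
        print(f"{key}: {value}")
```

Running the same file on gpu034 and gpu035 and comparing the two reports line by line would show whether the nodes differ in something other than the driver and CUDA versions.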
Here are the logs using node gpu034 (not working):
INFO: underlay of /etc/localtime required more than 50 (94) bind mounts
INFO: underlay of /usr/bin/nvidia-smi required more than 50 (476) bind mounts
13:4: not a valid test operator: (
13:4: not a valid test operator: 530.30.02
=============
== PyTorch ==
=============
NVIDIA Release 23.05 (build 60708168)
PyTorch Version 2.0.0
Container image Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Copyright (c) 2014-2023 Facebook Inc.
Copyright (c) 2011-2014 Idiap Research Institute (Ronan Collobert)
Copyright (c) 2012-2014 Deepmind Technologies (Koray Kavukcuoglu)
Copyright (c) 2011-2012 NEC Laboratories America (Koray Kavukcuoglu)
Copyright (c) 2011-2013 NYU (Clement Farabet)
Copyright (c) 2006-2010 NEC Laboratories America (Ronan Collobert, Leon Bottou, Iain Melvin, Jason Weston)
Copyright (c) 2006 Idiap Research Institute (Samy Bengio)
Copyright (c) 2001-2004 Idiap Research Institute (Ronan Collobert, Samy Bengio, Johnny Mariethoz)
Copyright (c) 2015 Google Inc.
Copyright (c) 2015 Yangqing Jia
Copyright (c) 2013-2016 The Caffe contributors
All rights reserved.
Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This container image and its contents are governed by the NVIDIA Deep Learning Container License.
ERROR: The NVIDIA Driver is present, but CUDA failed to initialize. GPU functionality will not be available.
[[ Unknown error (error 999) ]]
NOTE: Mellanox network driver detected, but NVIDIA peer memory driver not
detected. Multi-node communication performance may be reduced.
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0
Tue Jun 20 16:34:11 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3090         On | 00000000:01:00.0 Off |                  N/A |
|  0%   26C    P8               26W / 370W|      1MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:115: UserWarning: CUDA initialization: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero. (Triggered internally at /opt/pytorch/pytorch/c10/cuda/CUDAFunctions.cpp:108.)
return torch._C._cuda_getDeviceCount() > 0
torch.cuda.device_count: 1
#################
Using cpu device
#################
Here are the logs using node gpu035 (working):
INFO: underlay of /etc/localtime required more than 50 (94) bind mounts
INFO: underlay of /usr/bin/nvidia-smi required more than 50 (476) bind mounts
13:4: not a valid test operator: (
13:4: not a valid test operator: 530.30.02
=============
== PyTorch ==
=============
NVIDIA Release 23.05 (build 60708168)
PyTorch Version 2.0.0
Container image Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Copyright (c) 2014-2023 Facebook Inc.
Copyright (c) 2011-2014 Idiap Research Institute (Ronan Collobert)
Copyright (c) 2012-2014 Deepmind Technologies (Koray Kavukcuoglu)
Copyright (c) 2011-2012 NEC Laboratories America (Koray Kavukcuoglu)
Copyright (c) 2011-2013 NYU (Clement Farabet)
Copyright (c) 2006-2010 NEC Laboratories America (Ronan Collobert, Leon Bottou, Iain Melvin, Jason Weston)
Copyright (c) 2006 Idiap Research Institute (Samy Bengio)
Copyright (c) 2001-2004 Idiap Research Institute (Ronan Collobert, Samy Bengio, Johnny Mariethoz)
Copyright (c) 2015 Google Inc.
Copyright (c) 2015 Yangqing Jia
Copyright (c) 2013-2016 The Caffe contributors
All rights reserved.
Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This container image and its contents are governed by the NVIDIA Deep Learning Container License.
NOTE: Mellanox network driver detected, but NVIDIA peer memory driver not
detected. Multi-node communication performance may be reduced.
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0
Tue Jun 20 16:34:18 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3090         On | 00000000:01:00.0 Off |                  N/A |
|  0%   27C    P5               39W / 370W|      1MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
torch.cuda.device_count: 1
#################
Using cuda device
#################
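Since the two logs are long and nearly identical, a small helper can reduce them to the lines that actually differ. This is only a sketch using Python's standard difflib module. The function name differing_lines and the log file names in the usage comment are hypothetical, and the timestamp pattern assumes the date format printed by nvidia-smi in the logs above.

```python
import difflib
import re

# Matches the nvidia-smi timestamp line, e.g. "Tue Jun 20 16:34:11 2023".
TIMESTAMP = re.compile(
    r"^[A-Z][a-z]{2} [A-Z][a-z]{2} [ \d]\d \d{2}:\d{2}:\d{2} \d{4}$"
)


def differing_lines(log_a, log_b):
    """Return the lines that differ between two logs.

    Blank lines and timestamp lines are ignored. Lines found only in
    log_a are prefixed with "- ", lines found only in log_b with "+ ".
    """
    def clean(text):
        stripped = (line.strip() for line in text.splitlines())
        return [s for s in stripped if s and not TIMESTAMP.match(s)]

    diff = difflib.ndiff(clean(log_a), clean(log_b))
    # Keep only real differences; drop unchanged lines and "? " hint lines.
    return [line for line in diff if line.startswith(("- ", "+ "))]


# Usage with hypothetical file names:
#   with open("gpu034.log") as fa, open("gpu035.log") as fb:
#       print("\n".join(differing_lines(fa.read(), fb.read())))
```

Applied to the two logs above, this would leave the CUDA initialization error and warning, the temperature and power row of nvidia-smi, and the final "Using ... device" line, which makes it easier to see that everything else is identical.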
P.S.: I added an image because the nvidia-smi outputs look bad in the quotes.
Edit 2: I had mixed up the node indices; this is now corrected.
Edit 3: I finally found how to paste pre-formatted text in my post!!