PyTorch 2.0 tutorial - unexpected behaviour on specific Baobab node

Alban.Bornet · June 20, 2023, 2:50pm

Hello,

I am trying to use PyTorch with a GPU on Baobab. My lab has access to a private partition (private-teodoro-gpu) with two Baobab nodes: gpu034 and gpu035. I believe both nodes have 8 GPUs, all of them NVIDIA GeForce RTX 3090.

I tried to follow the tutorial for using PyTorch with a GPU. It didn’t work as expected, so to understand why, I slightly modified the scripts.

First, I specified which node I was using in the sbatch file:

#!/bin/sh

#SBATCH --time=00:01
#SBATCH --partition=private-teodoro-gpu # shared-gpu produces the same behaviour
#SBATCH --nodelist=gpu034 # this is the added line! NOTE: I tried both gpu034 and gpu035
#SBATCH --cpus-per-task=2
#SBATCH --gpus=1

REGISTRY=/opt/cluster/registry
SIF=pytorch_23.05-py3.sif
IMAGE=${REGISTRY}/${SIF}
SCRIPT=pytorch_tensors.py

srun apptainer run --nv ${IMAGE} python ${SCRIPT}
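Since the only change between the two runs is the node name, the same file can be submitted once per node without editing it, because options given on the sbatch command line take precedence over the #SBATCH directives in the file. Below is a minimal sketch of that idea. The file name sbatch.sh and the per-node output file pattern are assumptions chosen for illustration, not names taken from the tutorial.

```python
import shutil
import subprocess


def build_sbatch_commands(nodes, script="sbatch.sh"):
    """Build one sbatch command line per node, without running anything.

    `script` is a hypothetical file name for the batch file shown above.
    """
    return [
        # --nodelist on the command line overrides the #SBATCH --nodelist line;
        # %j in --output is replaced by Slurm with the job ID.
        ["sbatch", f"--nodelist={node}", f"--output=slurm-{node}-%j.out", script]
        for node in nodes
    ]


if __name__ == "__main__":
    for command in build_sbatch_commands(["gpu034", "gpu035"]):
        print(" ".join(command))
        # Only submit when sbatch is actually on PATH (i.e. on the cluster).
        if shutil.which("sbatch"):
            subprocess.run(command, check=True)
```

Writing each job's output to a file that contains the node name makes it easier to compare the two logs afterwards.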

Second, I changed the Python script so that it simply prints the nvidia-smi and CUDA information, and checks whether CUDA is available to PyTorch:

import os
import torch

os.system('nvcc --version')
os.system('nvidia-smi')

device_count = torch.cuda.device_count()
print("torch.cuda.device_count:", device_count)
device = "cuda" if torch.cuda.is_available() else "cpu"
torch.set_default_device(device)

print('\n\n\n#################')
print('Using %s device' % device)
print('#################\n\n\n')

Then I ran this script through the Slurm file on either “gpu034” or “gpu035”, and here is the problem: one node (gpu035) uses the GPU successfully, whereas the other (gpu034) doesn’t. The surprising part is that both have the same GPU device (RTX 3090), NVIDIA driver version (530.30.02), and CUDA version (12.1). Below are the output logs I obtain in both cases.

I don’t understand why one node works, and the other doesn’t. Any help? Thanks!
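One detail worth noting from the failing run: torch.cuda.device_count() still printed 1 even though the script fell back to the CPU, so the device count alone does not show that the GPU is usable. A slightly stricter check, sketched below, also records the hostname and CUDA_VISIBLE_DEVICES and then tries a tiny allocation on the GPU. The function name cuda_health_report and the report keys are hypothetical, and the torch import is wrapped so the script still produces a report where PyTorch is missing.

```python
import os
import socket


def cuda_health_report():
    """Collect a few facts that help compare a working and a failing node."""
    report = {
        "hostname": socket.gethostname(),
        # None means the variable is not set in this environment.
        "CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES"),
        "torch_imported": False,
        "cuda_available": False,
        "cuda_usable": False,
        "error": None,
    }
    try:
        import torch
    except (ImportError, OSError) as exc:
        report["error"] = f"torch import failed: {exc}"
        return report

    report["torch_imported"] = True
    report["cuda_available"] = torch.cuda.is_available()
    if report["cuda_available"]:
        try:
            # A tiny allocation plus synchronize forces real CUDA work.
            x = torch.ones(1, device="cuda")
            torch.cuda.synchronize()
            report["cuda_usable"] = bool(x.item() == 1.0)
        except RuntimeError as exc:
            report["error"] = str(exc)
    return report


if __name__ == "__main__":
    for key, value in cuda_health_report().items():
        print(f"{key}: {value}")
```

Running this at the start of a job on each node gives a short, directly comparable summary instead of relying on the container banner.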

Here are the logs using node gpu034 (not working):

INFO:    underlay of /etc/localtime required more than 50 (94) bind mounts
INFO:    underlay of /usr/bin/nvidia-smi required more than 50 (476) bind mounts
13:4: not a valid test operator: (
13:4: not a valid test operator: 530.30.02

=============
== PyTorch ==
=============

NVIDIA Release 23.05 (build 60708168)
PyTorch Version 2.0.0

Container image Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

Copyright (c) 2014-2023 Facebook Inc.
Copyright (c) 2011-2014 Idiap Research Institute (Ronan Collobert)
Copyright (c) 2012-2014 Deepmind Technologies    (Koray Kavukcuoglu)
Copyright (c) 2011-2012 NEC Laboratories America (Koray Kavukcuoglu)
Copyright (c) 2011-2013 NYU                      (Clement Farabet)
Copyright (c) 2006-2010 NEC Laboratories America (Ronan Collobert, Leon Bottou, Iain Melvin, Jason Weston)
Copyright (c) 2006      Idiap Research Institute (Samy Bengio)
Copyright (c) 2001-2004 Idiap Research Institute (Ronan Collobert, Samy Bengio, Johnny Mariethoz)
Copyright (c) 2015      Google Inc.
Copyright (c) 2015      Yangqing Jia
Copyright (c) 2013-2016 The Caffe contributors
All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.

ERROR: The NVIDIA Driver is present, but CUDA failed to initialize.  GPU functionality will not be available.
   [[ Unknown error (error 999) ]]

NOTE: Mellanox network driver detected, but NVIDIA peer memory driver not
      detected.  Multi-node communication performance may be reduced.

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0
Tue Jun 20 16:34:11 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3090         On | 00000000:01:00.0 Off |                  N/A |
|  0%   26C    P8               26W / 370W|      1MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:115: UserWarning: CUDA initialization: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero. (Triggered internally at /opt/pytorch/pytorch/c10/cuda/CUDAFunctions.cpp:108.)
  return torch._C._cuda_getDeviceCount() > 0
torch.cuda.device_count: 1



#################
Using cpu device
#################

Here are the logs using node gpu035 (working):

INFO:    underlay of /etc/localtime required more than 50 (94) bind mounts
INFO:    underlay of /usr/bin/nvidia-smi required more than 50 (476) bind mounts
13:4: not a valid test operator: (
13:4: not a valid test operator: 530.30.02

=============
== PyTorch ==
=============

NVIDIA Release 23.05 (build 60708168)
PyTorch Version 2.0.0

Container image Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

Copyright (c) 2014-2023 Facebook Inc.
Copyright (c) 2011-2014 Idiap Research Institute (Ronan Collobert)
Copyright (c) 2012-2014 Deepmind Technologies    (Koray Kavukcuoglu)
Copyright (c) 2011-2012 NEC Laboratories America (Koray Kavukcuoglu)
Copyright (c) 2011-2013 NYU                      (Clement Farabet)
Copyright (c) 2006-2010 NEC Laboratories America (Ronan Collobert, Leon Bottou, Iain Melvin, Jason Weston)
Copyright (c) 2006      Idiap Research Institute (Samy Bengio)
Copyright (c) 2001-2004 Idiap Research Institute (Ronan Collobert, Samy Bengio, Johnny Mariethoz)
Copyright (c) 2015      Google Inc.
Copyright (c) 2015      Yangqing Jia
Copyright (c) 2013-2016 The Caffe contributors
All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.

NOTE: Mellanox network driver detected, but NVIDIA peer memory driver not
      detected.  Multi-node communication performance may be reduced.

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0
Tue Jun 20 16:34:18 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3090         On | 00000000:01:00.0 Off |                  N/A |
|  0%   27C    P5               39W / 370W|      1MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
torch.cuda.device_count: 1



#################
Using cuda device
#################

P.S.: I added an image because the nvidia-smi outputs look bad in the quotes.
Edit 2: I had mixed up the node indices; this is now corrected.
Edit 3: I finally found how to paste pre-formatted text in my post!!
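Comparing the two logs above, the gpu034 run differs from the gpu035 run in three substantive places: the "CUDA failed to initialize" error printed by the container, the "CUDA initialization:" UserWarning from PyTorch, and the final "Using cpu device" line. A small sketch for spotting those lines in Slurm output files is given below. The marker strings are copied from the failing log, while the function name and the example file names are assumptions.

```python
# Substrings taken from the failing log above; extend the tuple as needed.
FAILURE_MARKERS = (
    "CUDA failed to initialize",
    "CUDA initialization:",
    "Using cpu device",
)


def find_cuda_failures(log_text):
    """Return (line_number, line) pairs for lines containing a failure marker."""
    hits = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        if any(marker in line for marker in FAILURE_MARKERS):
            hits.append((number, line.strip()))
    return hits


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file names): python scan_logs.py slurm-1.out slurm-2.out
    for path in sys.argv[1:]:
        with open(path, encoding="utf-8", errors="replace") as handle:
            hits = find_cuda_failures(handle.read())
        status = "FAILED" if hits else "ok"
        print(f"{path}: {status}")
        for number, line in hits:
            print(f"  line {number}: {line}")
```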

Alban.Bornet · June 21, 2023, 2:29pm

I tried again this afternoon with exactly the same code, and now both nodes work perfectly. Not sure what happened!

Adrien.Albert · June 21, 2023, 2:39pm

Hi @Alban.Bornet,

I have rebooted the node and tested it, and it seems to be working now. Could you confirm that everything is OK for you?

(baobab)-[alberta@login2 pytorch]$ sac -j 3863339
          JobID    JobName    Account      User        NodeList   NTasks               Start                 End      State 
--------------- ---------- ---------- --------- --------------- -------- ------------------- ------------------- ---------- 
        3863339  sbatch.sh      burgi   alberta          gpu034          2023-06-21T16:35:42 2023-06-21T16:35:54  COMPLETED

cat slurm-3863339.out
INFO:    underlay of /etc/localtime required more than 50 (94) bind mounts
INFO:    underlay of /usr/bin/nvidia-smi required more than 50 (476) bind mounts
13:4: not a valid test operator: (
13:4: not a valid test operator: 530.30.02

=============
== PyTorch ==
=============

NVIDIA Release 23.05 (build 60708168)
PyTorch Version 2.0.0

Container image Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

Copyright (c) 2014-2023 Facebook Inc.
Copyright (c) 2011-2014 Idiap Research Institute (Ronan Collobert)
Copyright (c) 2012-2014 Deepmind Technologies    (Koray Kavukcuoglu)
Copyright (c) 2011-2012 NEC Laboratories America (Koray Kavukcuoglu)
Copyright (c) 2011-2013 NYU                      (Clement Farabet)
Copyright (c) 2006-2010 NEC Laboratories America (Ronan Collobert, Leon Bottou, Iain Melvin, Jason Weston)
Copyright (c) 2006      Idiap Research Institute (Samy Bengio)
Copyright (c) 2001-2004 Idiap Research Institute (Ronan Collobert, Samy Bengio, Johnny Mariethoz)
Copyright (c) 2015      Google Inc.
Copyright (c) 2015      Yangqing Jia
Copyright (c) 2013-2016 The Caffe contributors
All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

NOTE: Mellanox network driver detected, but NVIDIA peer memory driver not
      detected.  Multi-node communication performance may be reduced.

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0
Wed Jun 21 16:35:50 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3090         On | 00000000:01:00.0 Off |                  N/A |
|  0%   26C    P8               26W / 370W|      1MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
torch.cuda.device_count: 1



#################
Using cuda device
#################

Best regards,

Alban.Bornet · June 22, 2023, 8:57am

Hello,
I confirm that both nodes are now working properly and using the GPU as expected.
Thank you very much.
Alban