Can't manage to get only one GPU in an interactive session

Primary information

Username: mongin
Cluster: baobab

Description

I would like to perform tests with a specific GPU on baobab.
Thanks to @Yann.Sagon (How to call a specific GPU - #5 by Denis.Mongin), I manage to request a specific GPU model, but my problem is that I don’t get the number of GPUs I ask for: I get too many.

Here is the command I use:

salloc --gpus=1 --partition=shared-gpu  --time=01:00:00 --constraint=COMPUTE_MODEL_RTX_A6000_48G  --mem=20G --ntasks=1 --cpus-per-task=1

Here I ask for 1 GPU.
But when I check the memory:

>>> get_all_mem_info()
2025-06-25 10h-35m-47s : GPU NVIDIA RTX A6000
2025-06-25 10h-35m-47s : memory info: used 7.69G; total 51.16G; free 43.47G
2025-06-25 10h-35m-47s : GPU NVIDIA RTX A6000
2025-06-25 10h-35m-47s : memory info: used 9.06G; total 51.16G; free 42.1G
2025-06-25 10h-35m-47s : GPU NVIDIA RTX A6000
2025-06-25 10h-35m-47s : memory info: used 9.06G; total 51.16G; free 42.1G
2025-06-25 10h-35m-47s : GPU NVIDIA RTX A6000
2025-06-25 10h-35m-47s : memory info: used 9.06G; total 51.16G; free 42.1G
2025-06-25 10h-35m-47s : GPU NVIDIA RTX A6000
2025-06-25 10h-35m-47s : memory info: used 9.06G; total 51.16G; free 42.1G
2025-06-25 10h-35m-47s : GPU NVIDIA RTX A6000
2025-06-25 10h-35m-47s : memory info: used 9.06G; total 51.16G; free 42.1G
2025-06-25 10h-35m-47s : GPU NVIDIA RTX A6000
2025-06-25 10h-35m-47s : memory info: used 9.06G; total 51.16G; free 42.1G
2025-06-25 10h-35m-47s : GPU NVIDIA RTX A6000
2025-06-25 10h-35m-47s : memory info: used 5.74G; total 51.16G; free 45.42G

I get all 8 GPUs of the node.
How do I restrict this so that I only get one, to mimic a situation where I only have access to a single GPU?

Hi,

Can you please share the Python code you use to produce this output, so that I can reproduce the issue?

import datetime

import numpy as np
import torch


def log_mess(message):
    # Print a message prefixed with a timestamp.
    now = datetime.datetime.now()
    print(now.strftime("%Y-%m-%d %Hh-%Mm-%Ss") + " : " + message)


def get_mem_info(i):
    # Query free/total memory (in bytes) of CUDA device i and log it in GB.
    free_bytes, total_bytes = torch.cuda.mem_get_info(i)
    free = np.round(free_bytes / 1e9, 2)
    total = np.round(total_bytes / 1e9, 2)
    used = np.round(total - free, 2)
    log_mess("memory info: used " + str(used) + "G; total " + str(total) + "G; free " + str(free) + "G")


def get_all_mem_info():
    # Print the name and memory usage of every CUDA device visible to torch.
    for i in range(torch.cuda.device_count()):
        log_mess("GPU " + torch.cuda.get_device_name(i))
        get_mem_info(i)
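
For reference, the check is then just a call to get_all_mem_info(). Inside a job step with proper GPU binding, torch.cuda.device_count() should only report the device(s) actually granted by Slurm, so for the request above it should list a single GPU:

if __name__ == "__main__":
    # With correct GPU binding, this lists exactly the GPU(s) allocated
    # by Slurm (a single RTX A6000 for the salloc request above).
    get_all_mem_info()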

Hi, thanks for the code. In the meantime we figured out what the issue is, and we are fixing it right now.


I just tested using sbatch, and I don’t have the problem in that situation.

Something like:


#SBATCH --job-name simple_load
#SBATCH --time=05:00:00
#SBATCH --gpus-per-task=1
#SBATCH --cpus-per-task=1
#SBATCH --ntasks=1
#SBATCH --partition=shared-gpu
#SBATCH --constraint=COMPUTE_MODEL_RTX_A6000_48G
#SBATCH --mem=20G
#SBATCH --array=1

gives only a single RTX A6000, using the same Python function.

The issue arises when you request the resources and then connect to the compute node using ssh.
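
A quick way to check whether a given shell is actually GPU-constrained might be to look at the environment variables Slurm normally sets for job steps (exact names can vary with the Slurm version and configuration); a minimal sketch:

import os

# CUDA_VISIBLE_DEVICES is normally set by Slurm for steps launched with
# srun/sbatch; in a plain ssh login to the node it is typically absent,
# in which case CUDA applications see every GPU of the node.
for var in ("CUDA_VISIBLE_DEVICES", "SLURM_JOB_ID", "SLURM_JOB_GPUS"):
    print(var, "=", os.environ.get(var, "<not set>"))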

Thanks for the quick response, you are the best!

Yes, exactly: when I use an interactive session over ssh, it looks like I have access to all the GPUs of the node, not only to the resources I asked for.

The same happens with the nvidia-smi command:


(baobab)-[mongin@gpu048 ~]$ nvidia-smi
Wed Jun 25 12:48:57 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 575.51.03              Driver Version: 575.51.03      CUDA Version: 12.9     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA RTX A6000               On  |   00000000:01:00.0 Off |                  Off |
| 30%   29C    P8             22W /  300W |       1MiB /  49140MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA RTX A6000               On  |   00000000:22:00.0 Off |                  Off |
| 30%   27C    P8             28W /  300W |       1MiB /  49140MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA RTX A6000               On  |   00000000:41:00.0 Off |                  Off |
| 31%   60C    P2            191W /  300W |   40738MiB /  49140MiB |     51%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA RTX A6000               On  |   00000000:61:00.0 Off |                  Off |
| 30%   27C    P8             18W /  300W |       1MiB /  49140MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA RTX A6000               On  |   00000000:81:00.0 Off |                  Off |
| 30%   28C    P8             19W /  300W |       1MiB /  49140MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA RTX A6000               On  |   00000000:A1:00.0 Off |                  Off |
| 30%   26C    P8             24W /  300W |       1MiB /  49140MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA RTX A6000               On  |   00000000:C1:00.0 Off |                  Off |
| 30%   28C    P8             24W /  300W |       1MiB /  49140MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA RTX A6000               On  |   00000000:E1:00.0 Off |                  Off |
| 30%   27C    P8             23W /  300W |    1108MiB /  49140MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    2   N/A  N/A         1699106      C   ...ab_python_env_LLM3/bin/python      40730MiB |
|    7   N/A  N/A         1672598      C   python3                                1098MiB |
+-----------------------------------------------------------------------------------------+

This has been fixed on all three clusters. BUT if a user requested a GPU before the fix, and connected to the compute node using SSH, and is using a different GPU from the one allocated (a lot of “and”s), this could cause an issue until their job finishes. Unfortunately, it is not easy to detect this situation.
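
For users still in a pre-fix session, one possible application-level workaround is to hide the extra devices before any CUDA initialisation, which mimics single-GPU access without changing the Slurm allocation; a minimal sketch (assuming the GPU to keep is physical device 0, which may not be the one Slurm actually allocated):

import os

# Hide all but one physical GPU. This must happen before torch (or any
# other CUDA library) initialises CUDA; it only affects this process and
# does not change what Slurm actually allocated.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch

print(torch.cuda.device_count())  # should now report 1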
