Jobs on gpu022 are extremely slow

Hi,

Username: micheliv
Cluster: baobab
Subject: jobs on gpu022 are extremely slow
jobid: 6138374, 6175633, 6175632, 6175629, 6175630

For the past week, my jobs on gpu022 have been extremely slow, often 4x slower than before or than on our own machines. Are you aware of this issue? Any idea what is causing it?

Thank you for your time.

Best,
Vincent

Dear @Vincent.Micheli

I had a look at gpu022 while some of your jobs were running. This is what I saw:

Your processes on GPUs:

|    4   N/A  N/A   1083615      C   python3                                   28498MiB |
|    5   N/A  N/A   1119331      C   python3                                   28532MiB |
|    6   N/A  N/A   1453366      C   python                                    25358MiB |
|    7   N/A  N/A   1248514      C   python3                                   28498MiB |

GPUs used by your jobs:

+-----------------------------------------+----------------------+----------------------+
|   4  NVIDIA A100-PCIE-40GB          On  | 00000000:81:00.0 Off |                    0 |
| N/A   48C    P0             274W / 250W |  28507MiB / 40960MiB |     68%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   5  NVIDIA A100-PCIE-40GB          On  | 00000000:A1:00.0 Off |                    0 |
| N/A   39C    P0              59W / 250W |  28541MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   6  NVIDIA A100-PCIE-40GB          On  | 00000000:C1:00.0 Off |                    0 |
| N/A   38C    P0              75W / 250W |  25367MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   7  NVIDIA A100-PCIE-40GB          On  | 00000000:E1:00.0 Off |                    0 |
| N/A   40C    P0             179W / 250W |  28507MiB / 40960MiB |     38%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

It seems that two of the GPUs are doing nothing, and the other two are loaded but not fully. I'm not sure whether your jobs were just finishing at that point, because when I checked again they were done. The CPUs were almost idle.
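For reference, this kind of snapshot can be reproduced while a job is running, assuming direct SSH to the compute node is allowed for users with an active job there:

squeue -u micheliv                     # the NODELIST column shows where each job runs
ssh gpu022 nvidia-smi                  # GPU utilisation, power and memory at this instant
ssh gpu022 top -b -n 1 | head -n 20    # quick look at CPU load on the node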

Please share your sbatch script here. We are also interested in the input and output files (number of files, size per file, location on the filesystem, etc.).
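For example, something along these lines would give us those numbers; the path is a placeholder, replace it with wherever your data actually lives:

DATA_DIR=/path/to/your/dataset                 # placeholder
find "$DATA_DIR" -type f | wc -l               # number of files
du -sh "$DATA_DIR"                             # total size
du -ah "$DATA_DIR" | sort -rh | head -n 20     # largest files and directories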

Do you have the same issue on other GPU nodes as well, or only on gpu022.baobab?

Best

Thanks for your message.

Indeed I launched some new jobs this afternoon.

I do not have this issue on two different machines with the same hardware as gpu022.baobab.

Here is the command I use for my interactive sessions:

srun --partition=private-mlg-gpu --gpus=1 --cpus-per-task=14 --mem=64000 --time=7-0 --pty $SHELL

And this is the script I run in the irisV2 folder, using the ‘mbrl’ conda env:

python3 src/main.py common.device=cuda:0 env.train.id=myenvid wandb.name=myrunname
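For completeness, a batch-script equivalent of that interactive session would look roughly like this (the job name, log file and conda activation line are assumptions on my side, not what I actually submit):

#!/bin/bash
#SBATCH --partition=private-mlg-gpu
#SBATCH --gpus=1
#SBATCH --cpus-per-task=14
#SBATCH --mem=64000
#SBATCH --time=7-0
#SBATCH --job-name=iris        # placeholder
#SBATCH --output=iris-%j.out   # placeholder

# activate the 'mbrl' conda env; the exact activation line depends on how conda is set up
conda activate mbrl

cd ~/irisV2   # assuming irisV2 sits in my home directory
python3 src/main.py common.device=cuda:0 env.train.id=myenvid wandb.name=myrunname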

Looking at the GPU power usage and utilization graphs, it looks like on gpu022.baobab I am getting about a third of what I get on the other machines, and this was not the case a week or so ago.
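Concretely, I can log utilization and power on each machine with something like this and compare the traces (the 5-second interval is arbitrary):

nvidia-smi --query-gpu=timestamp,index,utilization.gpu,power.draw,memory.used --format=csv -l 5 > gpu_usage.csv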

I/O from other jobs or from my own jobs might be the issue, but on my side I am not doing anything I/O-intensive, or at least it has never been a problem before.
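If I/O were the bottleneck, a crude comparison of read throughput from the shared filesystem versus the node-local scratch should show it; a rough sketch with placeholder paths:

# read a large input file from the shared filesystem
dd if=/path/to/large_input_file of=/dev/null bs=1M status=progress

# same file after copying it to the node-local scratch
cp /path/to/large_input_file /scratch/
dd if=/scratch/large_input_file of=/dev/null bs=1M status=progress
# note: the second read may be served from the page cache, so a file larger than RAM gives a fairer comparison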

Best,
Vincent

After further investigation, it looks like I was encountering the same issue as in this post, and moving everything to /scratch on the compute node, as you suggested, solved it.
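For anyone landing on the same problem: the fix amounts to staging the data onto the node-local /scratch at the start of the job, running from there, and copying results back at the end. A rough sketch with placeholder paths:

# stage the working directory onto the node-local scratch
mkdir -p /scratch/$USER
rsync -a ~/irisV2/ /scratch/$USER/irisV2/

# run from local scratch so training I/O stays on the node
cd /scratch/$USER/irisV2
python3 src/main.py common.device=cuda:0 env.train.id=myenvid wandb.name=myrunname

# copy the results back to the shared filesystem afterwards (output directory is a placeholder)
rsync -a /scratch/$USER/irisV2/outputs/ ~/irisV2/outputs/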

Excellent! The scratch server thanks you :stuck_out_tongue_winking_eye: