Hi,
Username: micheliv
Cluster: baobab
Subject: jobs on gpu022 are extremely slow
jobid: 6138374, 6175633, 6175632, 6175629, 6175630
For the past week, my jobs on gpu022 have been extremely slow, often 4x slower than before or than on our own machines. Are you aware of this issue? Any idea what is causing it?
Thank you for your time.
Best,
Vincent
Dear @Vincent.Micheli
I had a look at gpu022 while some of your jobs were running. This is what I saw:
Your processes on GPUs:
| 4 N/A N/A 1083615 C python3 28498MiB |
| 5 N/A N/A 1119331 C python3 28532MiB |
| 6 N/A N/A 1453366 C python 25358MiB |
| 7 N/A N/A 1248514 C python3 28498MiB |
GPUs used by your jobs:
+-----------------------------------------+----------------------+----------------------+
| 4 NVIDIA A100-PCIE-40GB On | 00000000:81:00.0 Off | 0 |
| N/A 48C P0 274W / 250W | 28507MiB / 40960MiB | 68% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 5 NVIDIA A100-PCIE-40GB On | 00000000:A1:00.0 Off | 0 |
| N/A 39C P0 59W / 250W | 28541MiB / 40960MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 6 NVIDIA A100-PCIE-40GB On | 00000000:C1:00.0 Off | 0 |
| N/A 38C P0 75W / 250W | 25367MiB / 40960MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 7 NVIDIA A100-PCIE-40GB On | 00000000:E1:00.0 Off | 0 |
| N/A 40C P0 179W / 250W | 28507MiB / 40960MiB | 38% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
It seems that two of the GPUs are doing nothing and the other two are loaded, but not fully. I'm not sure whether your jobs were just finishing at that point, because when I checked again they were done. The CPUs were almost idle.
Please share your sbatch script here. We are also interested in the input and output files (number of files, size per file, location on the filesystem, etc.).
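For the file counts and sizes, something like this is enough (the path is a placeholder, point it at wherever your data actually lives):

```bash
DATADIR=$HOME/path/to/your/data   # placeholder, replace with your real data directory
find "$DATADIR" -type f | wc -l   # number of files
du -sh "$DATADIR"                 # total size
df -h "$DATADIR"                  # which filesystem it sits on (home, scratch, ...)
```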
Do you have the same issue on other GPUs too or only on gpu022.baobab?
Best
Thanks for your message.
Indeed I launched some new jobs this afternoon.
I do not have this issue on two different machines with the same hardware as gpu022.baobab.
Here is the command I use for my interactive sessions: srun --partition=private-mlg-gpu --gpus=1 --cpus-per-task=14 --mem=64000 --time=7-0 --pty $SHELL
And this is the command I run from the irisV2 folder, with the 'mbrl' conda env activated: python3 src/main.py common.device=cuda:0 env.train.id=myenvid wandb.name=myrunname
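For completeness, a batch version of that same request would look roughly like the sketch below (I normally work interactively; job name, log file and repo location are placeholders):

```bash
#!/bin/bash
#SBATCH --partition=private-mlg-gpu
#SBATCH --gpus=1
#SBATCH --cpus-per-task=14
#SBATCH --mem=64000
#SBATCH --time=7-0
#SBATCH --job-name=iris-train          # placeholder name
#SBATCH --output=iris-train-%j.out     # placeholder log path

conda activate mbrl                    # assuming conda is initialised in the batch shell
cd ~/irisV2                            # assuming the repo sits in my home directory

python3 src/main.py common.device=cuda:0 env.train.id=myenvid wandb.name=myrunname
```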
Looking at the GPU power usage and utilization graphs, it looks like on gpu022.baobab I am getting about a third of what I get on the other machines, and this was not the case a week or so ago.
I/O from other jobs or from my own jobs might be the issue, but on my side I am not doing anything I/O-intensive, or at least it has never been a problem before.
Best,
Vincent
After further investigation, it looks like I was running into the same issue as in this post, and moving everything to /scratch on the compute node, as you suggested, solved the problem.
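For anyone landing on this thread later, the staging pattern I ended up with looks roughly like this (paths and the output directory are illustrative, adjust to your own layout):

```bash
# Copy the working set to the node-local scratch, run from there,
# then copy the results back to a persistent location before the job ends.
JOBSCRATCH=/scratch/$USER/$SLURM_JOB_ID        # assumed node-local scratch layout
mkdir -p "$JOBSCRATCH"
cp -r ~/irisV2 "$JOBSCRATCH/"                  # project + data, copy only what you need

cd "$JOBSCRATCH/irisV2"
python3 src/main.py common.device=cuda:0 env.train.id=myenvid wandb.name=myrunname

# Copy results back; the 'outputs' directory name is an assumption
cp -r outputs ~/irisV2/outputs-$SLURM_JOB_ID
```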
Excellent! The scratch server thanks you.