Hi, I’ve been trying to test the large memory GPU on bamboo using
srun -n1 -c32 -p public-gpu --mem-per-gpu=80G --gpus=1 --gres=gpu:1 -t 1:00:00 python ...
and while it works, I keep getting a resource-exhausted error from JAX when it tries to allocate ~15 GB. How can I make sure I am using the right node, and perhaps check my total memory usage? I’ve checked and the job usually lands on the gpu001 node, which does not have the large-memory GPUs. Is there any way I can just test whether my code will run there (i.e., whether the memory is enough for my problem)?
Thanks in advance
Hi Daniel,
When you start a job with Slurm, it gets saved in the slurmdbd. You can check the status or details of your job using commands like sacct or squeue:

- squeue: Shows information about jobs in the Slurm scheduling queue (see the example below).
- sacct: Displays detailed accounting data for jobs in the Slurm job accounting log or database.
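For example, to see where your running jobs have been placed, something along these lines should work (the output format string is just one possible choice; %R prints the node list):

squeue -u $USER -o "%.10i %.9P %.15j %.8T %.10M %.6D %R"

The last column tells you immediately whether the job landed on gpu003 or on gpu[001-002].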
For instance, to get detailed information about your job (59075), you can use:
sacct -X -o Jobid%15,jobname,account,user,nodelist,ReqCPUS,ReqMem,ntasks,start,end,Elapsed,state -j 59075
JobID JobName Account User NodeList ReqCPUS ReqMem NTasks Start End Elapsed State
--------------- ---------- ---------- --------- --------------- -------- ---------- -------- ------------------- ------------------- ---------- ----------
59075 python pepe dforeros gpu003 10 30000M 2024-08-14T14:07:02 2024-08-14T14:10:13 00:03:11 COMPLETED
You can also use the seff command, which gives a summary of the job’s efficiency:
(bamboo)-[root@admin1 ~]$ seff 59075
Job ID: 59075
Cluster: bamboo
User/Group: dforeros/hpc_users
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 10
CPU Utilized: 00:01:08
CPU Efficiency: 3.56% of 00:31:50 core-walltime
Job Wall-clock time: 00:03:11
Memory Utilized: 25.31 GB
Memory Efficiency: 86.40% of 29.30 GB
Just a heads-up, though: seff reports job stats after the job completes, so it gives a summary rather than real-time data.
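If you need numbers while the job is still running, sstat is the usual tool; a minimal sketch, assuming job 59075 is still active (for batch jobs you may have to target the .batch step explicitly):

sstat -j 59075 --format=JobID,MaxRSS,MaxVMSize,AveCPU

MaxRSS is the peak system memory (RAM) used so far; for GPU memory (VRAM) you would still have to check nvidia-smi on the node itself.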
What does “large memory GPU” (*) mean?
Here is the GPU specification on Bamboo:
Node | Type | Model | GPU memory (VRAM) | GPU count |
---|---|---|---|---|
gpu[001-002] | Ampere | NVIDIA GeForce RTX 3090 | 24 GB | 8 |
gpu003 | Ampere | NVIDIA A100 80GB PCIe (*) | 80 GB | 4 |
It’s also important to differentiate between GPU memory (VRAM) and system memory (RAM).
If you specify --gres=gpu:1,VramPerGpu:80G, you will exclude the RTX 3090 nodes (gpu[001-002]) and the job can only land on gpu003.
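To double-check that a request really lands on the GPU you expect, you can run a short throwaway job with the same resource flags and simply query the device. One way, reusing the VramPerGpu specification from above and assuming nvidia-smi is available on the GPU nodes (it normally is):

srun -p public-gpu --gres=gpu:1,VramPerGpu:80G -t 0:05:00 nvidia-smi --query-gpu=name,memory.total --format=csv

If the output reports an A100 80GB, you are on gpu003; if it reports a GeForce RTX 3090, you are not.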
As you can see, there are different types of GPU, including the “large memory” (*) A100.
But before opting for it, it’s important to ask yourself whether you really need it. The A100 is about 20 times more expensive than the RTX 3090, and there are fewer of them in the cluster. So, you might end up blocking a valuable resource without gaining much in performance, depending on your workload.
For example, in job 59075, seff shows that you’re using almost 25 GB of system memory. You might find that running the job on gpu[001-002] (with the RTX 3090) won’t significantly impact performance. It’s something you could test:
srun --ntasks=1 --cpus-per-task=1 --nodes=1 --exclude=gpu003 --partition=public-gpu --mem=25G --gres=gpu:1,VramPerGpu:24G python ...
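One JAX-specific detail that is relevant to the “resource exhausted” message: by default JAX preallocates a large fraction of the GPU’s VRAM at startup, so the size it fails to allocate is not always your real footprint. For such a test you can change that behaviour through the standard JAX/XLA environment variables (nothing Bamboo-specific here):

export XLA_PYTHON_CLIENT_PREALLOCATE=false   # allocate VRAM on demand instead of grabbing most of it up front
# or keep preallocation but cap it at a fraction of the card’s memory:
export XLA_PYTHON_CLIENT_MEM_FRACTION=.90

With preallocation disabled, the allocation sizes JAX reports are much closer to what your problem actually needs, which makes the “will it fit in 24 GB?” question easier to answer.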
That said, it’s clear that we’d rather have an operational server than one sitting on standby for the few jobs that genuinely require an A100. As I don’t have a precise idea of how many jobs fall into that category, I can’t give a firm rule; I’d say it depends on your patience and the urgency of your project.
(I have to admit that my answer is nuanced and raises other questions.)