Primary information
Username: hentsche
Cluster: baobab
Jobid: 837908
Description
I requested a GPU and 80GB of VRAM and got allocated an NVIDIA A100 80GB.
Another user’s process was already running on the same GPU, so my process failed because it could not allocate enough VRAM.
Steps to Reproduce
Request a tunnel with the following sbatch command (line breaks added here for legibility) and connect to it with VS Code:
sbatch \
  --job-name=codeTunnel \
  --time=08:00:00 \
  --error=/home/users/h/hentsche/slurm/%j/stderr \
  --output=/home/users/h/hentsche/slurm/%j/stdout \
  --mail-type=ALL \
  --mem=300000 \
  --cpus-per-task=32 \
  --gres=gpu:1,VramPerGpu:80G \
  --partition=shared-gpu \
  /home/users/h/hentsche/slurm_tunnel/tunnel.sh
In the created session, run a command that requires <80GB of VRAM.
Expected Result
The command runs and has enough VRAM.
Actual Result
My process failed because it ran out of VRAM at ca. 62GB of usage.
Below are two screenshots of the VRAM usage: one taken while my process was running and one taken after it failed. The other process belongs to another user.
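The same information could also be captured as text from inside the job. The following is only a minimal sketch, not what was actually run here: the helper name `sum_used_mib` is made up for illustration, and it assumes that `nvidia-smi` on the node reports per-process memory for other users' processes (on some setups it shows `[N/A]` instead, which this helper counts as 0).

```shell
#!/usr/bin/env bash
# Sum the VRAM (in MiB) used by compute processes, given nvidia-smi CSV rows
# on stdin. Expected row format: "<pid>, <used_memory_in_MiB>" (no header, no units).
# sum_used_mib is a hypothetical helper name, not part of any cluster tooling.
sum_used_mib() {
    awk -F', *' '{ total += $2 } END { print total + 0 }'
}

# Only query the GPU when nvidia-smi is available (i.e. on a GPU compute node).
if command -v nvidia-smi >/dev/null 2>&1; then
    # Overall memory state of the allocated GPU.
    nvidia-smi --query-gpu=name,memory.total,memory.used --format=csv
    # Total VRAM held by all compute processes visible on the GPU.
    nvidia-smi --query-compute-apps=pid,used_memory \
        --format=csv,noheader,nounits | sum_used_mib
fi
```

If the total is already well above zero before my own process starts, the GPU is being shared with someone else's job.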

