Primary information
Username: hentsche
Cluster: baobab
Jobid: 837908
Description
I requested a GPU with 80GB of VRAM and was allocated an NVIDIA A100 80GB.
Another user's process was already running on that GPU, so my process failed because it could not allocate enough VRAM.
Steps to Reproduce
Request a tunnel using the following sbatch command (line breaks added here for legibility), and connect to it with VS Code:
sbatch \
  --job-name=codeTunnel \
  --time=08:00:00 \
  --error=/home/users/h/hentsche/slurm/%j/stderr \
  --output=/home/users/h/hentsche/slurm/%j/stdout \
  --mail-type=ALL \
  --mem=300000 \
  --cpus-per-task=32 \
  --gres=gpu:1,VramPerGpu:80G \
  --partition=shared-gpu \
  /home/users/h/hentsche/slurm_tunnel/tunnel.sh
In the created session, run a command that requires <80GB of VRAM.
Expected Result
The command runs and has enough VRAM.
Actual Result
My process failed because it ran out of VRAM at ca. 62GB of usage.
Below are two screenshots of the VRAM usage: one taken while my process was running and one taken after it failed. The other process belongs to another user.
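The same state can also be captured as text from inside the session, which may be easier to attach to a report than a screenshot. The following is a minimal sketch, not part of the original job: the helper name vram_free_mib is hypothetical, and it assumes nvidia-smi is on the PATH of the compute node and that other users' processes are visible from within the job.

```shell
#!/bin/sh
# Hypothetical helper: reads lines of "index, total, used" (MiB), as printed by
#   nvidia-smi --query-gpu=index,memory.total,memory.used --format=csv,noheader,nounits
# and prints the free VRAM for each GPU.
vram_free_mib() {
    awk -F', *' '{ printf "GPU %d: %d MiB free of %d MiB\n", $1, $2 - $3, $2 }'
}

# Only query the driver when nvidia-smi is available (assumption: it is on PATH).
if command -v nvidia-smi >/dev/null 2>&1; then
    # Free memory on each GPU visible to the job.
    nvidia-smi --query-gpu=index,memory.total,memory.used \
               --format=csv,noheader,nounits | vram_free_mib
    # Compute processes currently holding memory on those GPUs.
    nvidia-smi --query-compute-apps=pid,used_memory --format=csv
fi
```

Running this right after the session starts, before launching the actual workload, would show whether the allocated GPU already has memory in use by another process.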