Second user on GPU

Primary information

Username: hentsche
Cluster: baobab
Jobid: 837908

Description

I requested one GPU with 80 GB of VRAM and was allocated an NVIDIA A100 80GB.
Another user’s process was already running on that GPU, so my process failed because it could not allocate enough VRAM.

Steps to Reproduce

Request a tunnel using the following sbatch command (line breaks added here for legibility), and connect to it with VS Code:

sbatch \
  --job-name=codeTunnel \
  --time=08:00:00 \
  --error=/home/users/h/hentsche/slurm/%j/stderr \
  --output=/home/users/h/hentsche/slurm/%j/stdout \
  --mail-type=ALL \
  --mem=300000 \
  --cpus-per-task=32 \
  --gres=gpu:1,VramPerGpu:80G \
  --partition=shared-gpu \
  /home/users/h/hentsche/slurm_tunnel/tunnel.sh

In the created session, run a command that requires less than 80 GB of VRAM.

Expected Result

The command runs and has enough VRAM.

Actual Result

My process failed because it ran out of VRAM at approximately 62 GB of usage.
Below are two screenshots of the VRAM usage: one taken while my process was running and one taken after it failed. The other process belongs to another user.
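For anyone who wants to capture the same information as text rather than as a screenshot, here is a minimal sketch. It assumes nvidia-smi is on the PATH inside the job and that processes from other users are visible from within the allocation, which may not hold on every setup. The helper name free_mib is my own and purely illustrative.

```shell
#!/bin/sh
# Sketch: print free VRAM per visible GPU and list compute processes.
# Assumption: nvidia-smi is available on the compute node.

# free_mib (hypothetical helper): takes one "total, used" CSV line in MiB,
# as printed by nvidia-smi with --format=csv,noheader,nounits,
# and prints the free amount in MiB.
free_mib() {
  fm_total=${1%%,*}   # text before the first comma
  fm_used=${1##*,}    # text after the last comma
  echo $(( $fm_total - $fm_used ))
}

if command -v nvidia-smi >/dev/null 2>&1; then
  # Total and used memory for each GPU the job can see.
  nvidia-smi --query-gpu=memory.total,memory.used \
             --format=csv,noheader,nounits |
    while read -r line; do
      echo "free VRAM: $(free_mib "$line") MiB"
    done

  # Processes currently holding memory on the GPU, with their usage.
  nvidia-smi --query-compute-apps=pid,used_memory --format=csv
fi
```

Running this at the start of the session, before launching the actual workload, would show whether the GPU is already partly occupied.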

Dear @Manuel.Hentschel, thanks for the notification. This is now solved.
