Srun fatal: SLURM_MEM_PER_CPU, SLURM_MEM_PER_GPU, and SLURM_MEM_PER_NODE are mutually exclusive

Hi,

I’m trying to submit GPU jobs to the dpnc-gpu-EL7 partition and I’m running into an error I’ve never seen before in the log files:
srun: fatal: SLURM_MEM_PER_CPU, SLURM_MEM_PER_GPU, and SLURM_MEM_PER_NODE are mutually exclusive.

The batch options of my script are:
#!/bin/env bash
#SBATCH --time=11:00:00
#SBATCH --partition=dpnc-gpu-EL7
#SBATCH --gres=gpu:1
#SBATCH --constraint="V3|V4|V5|V6"
#SBATCH --mem=15G
#SBATCH --output=logs/train-%j.out
#SBATCH --job-name='DNN_train'
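For what it’s worth, as far as I understand the three variables named in the error correspond to the --mem-per-cpu, --mem-per-gpu and --mem options, so next time it happens I could add a quick diagnostic at the top of the script (just an idea, nothing cluster-specific):

# show whichever of the mutually exclusive memory variables ended up in the job environment
env | grep -E '^SLURM_MEM_PER_(CPU|GPU|NODE)='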

What’s interesting is that a couple of days ago I ran a completely identical script, with the exact same options, on the shared-gpu-EL7 partition, and those jobs didn’t crash.

Is there something off with the dpnc-gpu-EL7 partition?

Relevant paths:
“Faulty” submission script: /home/drozd/analysis/runs/run_07Feb20_addSTK/runTraining_faulty.sh
“Faulty” logs: /home/drozd/analysis/runs/run_07Feb20_addSTK/logs/train-29797626.out
/home/drozd/analysis/runs/run_07Feb20_addSTK/logs/train-29797630.out
“Good” submission script: /home/drozd/analysis/runs/run_06Jan20_multiVars/runTraining.sh
“Good” log: /home/drozd/analysis/runs/run_06Jan20_multiVars/logs/train-29798226.out

The workaround looks to be as simple as using the shared-gpu queue instead of the DPNC one, but I’m curious about this issue…

Cheers

Hi there,

Not that we are aware of, especially since the private partitions are simply a subset of the common partitions, so the node configuration does not change.

Even when logged in as your account, I was not able to reproduce your error on the dpnc-gpu-EL7 partition with the following command…

srun -p dpnc-gpu-EL7 --nodelist=gpu002 --time=11:00:00 --gres=gpu:1 --constraint="V3|V4|V5|V6" --mem=15G -n 1 -c 1 --pty $SHELL

…nor with a modified version of your sbatch script above (JobId 30059441). Is the error still present?
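Even if it does not reproduce now, a command along these lines should show which memory specification the failed jobs actually recorded (field names may vary slightly between Slurm versions):

sacct -j 29797626,29797630 --format=JobID,Partition,ReqMem,ReqTRES%40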

Two weeks ago we had another report of a similar error, but again I was not able to reproduce it, and that user has not provided feedback yet.

Thx, bye,
Luca

Hi Luca,

I can’t reproduce it either, using the same script as before. Strange.

Hi there,

OK, we’ll consider the issue temporary (and resolved) for the moment; feel free to come back if it happens again.

Thx, bye,
Luca

Hi,
Just to say that this happens to me from time to time when I specify mem-per-cpu, as below:

#SBATCH --job-name=S1_1bit_tuscany  # create a name for your job
#SBATCH --partition=public-cpu,public-bigmem
#SBATCH --ntasks=67              # total number of tasks
#SBATCH --cpus-per-task=1        # cpu-cores per task
#SBATCH --mem-per-cpu=12G         # memory per cpu-core
#SBATCH --time=2-00:00:00          # total run time limit (D-HH:MM:SS)
#SBATCH --output="outslurm/slurm-%j-%x.out"

If I replace #SBATCH --mem-per-cpu= with #SBATCH --mem=, it fixes the issue. However, this is annoying, because specifying memory per node is more inconvenient than per CPU, as I don’t care how many nodes the tasks run on.
It’s a pretty mysterious error that comes and goes… Some days it doesn’t complain about mem-per-cpu, and other days it does.
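For reference, the replacement I mean looks roughly like this; the per-node value is only illustrative and would have to be sized by hand depending on how many of the 67 tasks land on each node, which is exactly the annoying part:

##SBATCH --mem-per-cpu=12G        # the per-core request that sometimes triggers the error
#SBATCH --mem=48G                 # per-node request instead; value is illustrative only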