Hi,
I’m trying to submit GPU jobs to the dpnc-gpu-EL7 partition and I’m encountering an error I had never seen before in the log files:
srun: fatal: SLURM_MEM_PER_CPU, SLURM_MEM_PER_GPU, and SLURM_MEM_PER_NODE are mutually exclusive.
The batch options of my script are:
#!/bin/env bash
#SBATCH --time=11:00:00
#SBATCH --partition=dpnc-gpu-EL7
#SBATCH --gres=gpu:1
#SBATCH --constraint="V3|V4|V5|V6"
#SBATCH --mem=15G
#SBATCH --output=logs/train-%j.out
#SBATCH --job-name='DNN_train'
What’s interesting is that a couple days ago I ran a script which is completely identical, but on the shared-gpu-EL7 partition with the exact same options. These jobs didn’t crash.
Is there something off with the dpnc-gpu-EL7 partition?
Relevant paths:
“Faulty” submission script: /home/drozd/analysis/runs/run_07Feb20_addSTK/runTraining_faulty.sh
“Faulty” logs: /home/drozd/analysis/runs/run_07Feb20_addSTK/logs/train-29797626.out
/home/drozd/analysis/runs/run_07Feb20_addSTK/logs/train-29797630.out
“Good” submission script: /home/drozd/analysis/runs/run_06Jan20_multiVars/runTraining.sh
“Good” log: /home/drozd/analysis/runs/run_06Jan20_multiVars/logs/train-29798226.out
The workaround looks to be as simple as using the shared-gpu queue instead of the DPNC one, but I’m curious about this issue…
Cheers
Hi there,
Not that we are aware of. The private partitions are simply a subset of the common partitions, so the node configuration does not change.
Even when connected as your account, I was not able to reproduce your error on the dpnc-gpu-EL7 partition with the following command…
srun -p dpnc-gpu-EL7 --nodelist=gpu002 --time=11:00:00 --gres=gpu:1 --constraint="V3|V4|V5|V6" --mem=15G -n 1 -c 1 --pty $SHELL
…nor with a modified version of your sbatch above (JobId 30059441). Is the error still present?
Two weeks ago we had another report with a similar error, but again I was not able to reproduce it and the user has not provided feedback yet.
Thx, bye,
Luca
Hi Luca,
I can’t reproduce either, using the same script as before. Strange.
Hi there,
OK, considering the issue temporary (and solved) for the moment, feel free to come back if it happens again.
Thx, bye,
Luca
Hi,
Just to say this happens to me from time to time when I specify mem-per-cpu as below:
#SBATCH --job-name=S1_1bit_tuscany # create a name for your job
#SBATCH --partition=public-cpu,public-bigmem
#SBATCH --ntasks=67 # total number of tasks
#SBATCH --cpus-per-task=1 # cpu-cores per task
#SBATCH --mem-per-cpu=12G # memory per cpu-core
#SBATCH --time=2-00:00:00 # total run time limit (D-HH:MM:SS)
#SBATCH --output="outslurm/slurm-%j-%x.out"
If I replace #SBATCH --mem-per-cpu= by #SBATCH --mem=, it fixes the issue. However, this is annoying because specifying memory per node is less convenient than specifying memory per CPU, as I don’t care how many nodes the tasks run on.
Pretty mysterious error that comes and goes… Some days it doesn’t complain about mem-per-cpu and other days it does.
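
Since the error message names three environment variables, one way to narrow this down the next time it appears might be to print which of them are actually set inside the job just before srun is called. The snippet below is only a minimal sketch: the function name report_slurm_mem_vars is made up for illustration, and the idea that an extra variable is being inherited from the submitting shell (for example an interactive allocation) is an assumption that nobody in this thread has confirmed.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Slurm): list which of the three memory
# variables named in the srun error message are set in the current environment.
report_slurm_mem_vars() {
    local count=0 name
    for name in SLURM_MEM_PER_CPU SLURM_MEM_PER_GPU SLURM_MEM_PER_NODE; do
        # ${!name:-} is bash indirect expansion: the value of the variable
        # whose name is stored in $name, or empty if it is unset.
        if [ -n "${!name:-}" ]; then
            echo "${name}=${!name}"
            count=$((count + 1))
        fi
    done
    echo "set: ${count}"
    # Return non-zero when more than one is set, which is the situation
    # the error message describes as mutually exclusive.
    [ "$count" -le 1 ]
}

# Example call, guarded so a conflict is reported without aborting the script.
report_slurm_mem_vars || echo "more than one SLURM_MEM_PER_* variable is set" >&2
```

Putting the call right before the srun line of the batch script would record the values in the job log. If two variables do show up on a "bad" day, comparing them with the environment of the shell used for submission could show where the extra one comes from.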