New slurm error

Hi HPC team.

Some of my recent jobs have been failing, somewhat inconsistently, and it looks like it has to do with the allocation of CPUs. I should also note that the error is not reproducible 100% of the time, and it only started yesterday even though many of the submission scripts have not changed in months.

Here is the error I keep getting (this is the entirety of the log file):

srun: error: CPU binding outside of job step allocation, allocated CPUs are: 0x000E0000000002000000000000000000.
srun: error: Task launch for StepId=59004843.0 failed on node gpu023: Unable to satisfy cpu bind request
srun: error: Application launch failed: Unable to satisfy cpu bind request
srun: Job step aborted

Sometimes the allocated CPUs value is different. I have had:
0x000000F0000000000000000000000000
0x80000
etc
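
For reference, I believe the hex value is a CPU-affinity bitmask with one bit per CPU on the node (so 0x80000 would be CPU 19). The small bash sketch below is just my own way of decoding one of these masks and is not part of any submission script; the file name decode_mask.sh is only illustrative.

#!/bin/bash
# Sketch: decode a Slurm CPU-affinity mask (e.g. 0x80000) into CPU indices.
mask=${1#0x}                    # hex string from the srun error, 0x prefix removed
bit=0
i=${#mask}
while [ "$i" -gt 0 ]; do
    i=$((i - 1))                            # walk from the least-significant nibble
    nibble=$(printf '%d' "0x${mask:i:1}")   # one hex digit -> 0..15
    for b in 0 1 2 3; do
        if [ $(( (nibble >> b) & 1 )) -eq 1 ]; then
            echo "CPU $((bit + b))"
        fi
    done
    bit=$((bit + 4))
done

Running it as ./decode_mask.sh 0x80000 prints CPU 19, i.e. a single allocated core.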

I originally thought this had something to do with the way I requested multiple CPUs, but even after removing the --cpus-per-task option from the submission script the problem persists.
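
As a further check (this diagnostic is my own addition, not something from my original script), srun can be asked to report the binding it tries to apply, and scontrol can show what the job was actually allocated:

# Run inside the job or allocation: --cpu-bind=verbose makes srun report the
# binding it applies before launching the task.
srun --cpu-bind=verbose true

# Detailed view of the current job's allocation, including the CPU IDs per node.
scontrol -d show job "$SLURM_JOB_ID"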

Here is my current submission file:

#!/bin/sh

#SBATCH --job-name=ContSupervised
#SBATCH --output=%A_%a_log.out
#SBATCH --mem=32GB
#SBATCH --time=12:00:00
#SBATCH --partition=private-dpnc-gpu,shared-gpu
#SBATCH --gpus=1

#SBATCH -a 0-3

name=( 58998332_0 58998332_8 58998332_12 58998332_4 )

export XDG_RUNTIME_DIR=""
module load GCC/9.3.0 Singularity/3.7.3-GCC-9.3.0-Go-1.14
cd /home/users/l/leighm/graphnets/
srun singularity exec --nv -B /srv,/home \
   /home/users/l/leighm/Images/gnets-image \
   python train_disc.py --name ${name[`expr ${SLURM_ARRAY_TASK_ID} / 1 % 4`]}

Hi, thanks for the report. I’ll check what is going on.

Hi, thanks for the response.
So it seems there are further issues with my other scripts, and now none of my batch jobs are running.

This is my other submission script:

#!/bin/sh

#SBATCH --job-name=WithEdges
#SBATCH --output=./logs/%A_%a.out
#SBATCH --cpus-per-task=8
#SBATCH --mem=32GB
#SBATCH --partition=private-dpnc-gpu,shared-gpu
#SBATCH --gpus=1
#SBATCH --time=12:00:00

#SBATCH -a 0-3

net_conf=( config/graph_enc.yaml config/trans_enc.yaml )
data_conf=( config/data.yaml config/data_QW.yaml )
train_conf=( config/train.yaml )
num_workers=( 8 )
n_csts=( 64 )
del_r_edges=( 999 )

export XDG_RUNTIME_DIR=""
module load GCC/9.3.0 Singularity/3.7.3-GCC-9.3.0-Go-1.14
cd /home/users/l/leighm/graphnets/
srun singularity exec --nv -B /srv,/home \
   /home/users/l/leighm/Images/gnets-image \
   python train_disc.py \
       --net_conf ${net_conf[`expr ${SLURM_ARRAY_TASK_ID} / 1 % 2`]} \
       --data_conf ${data_conf[`expr ${SLURM_ARRAY_TASK_ID} / 2 % 2`]} \
       --train_conf ${train_conf[`expr ${SLURM_ARRAY_TASK_ID} / 4 % 1`]} \
       --num_workers ${num_workers[`expr ${SLURM_ARRAY_TASK_ID} / 4 % 1`]} \
       --n_csts ${n_csts[`expr ${SLURM_ARRAY_TASK_ID} / 4 % 1`]} \
       --del_r_edges ${del_r_edges[`expr ${SLURM_ARRAY_TASK_ID} / 4 % 1`]} \
       --save_dir /home/users/l/leighm/scratch/Saved_Networks/JetTag/WhitePaper/WithEdges/\
       --name ${SLURM_ARRAY_JOB_ID}_${SLURM_ARRAY_TASK_ID}\
        --tqdm_quiet
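
For clarity (this is only an illustration, not part of the submitted script), the expr arithmetic maps each array task ID onto one combination of the config arrays: the divisor controls how often an index advances and the modulus wraps it at the array length. A quick sketch of the mapping for task IDs 0-3:

# Illustration: how SLURM_ARRAY_TASK_ID 0-3 selects the configs above.
for id in 0 1 2 3; do
    net=$(( id / 1 % 2 ))     # alternates 0,1 -> graph_enc.yaml / trans_enc.yaml
    data=$(( id / 2 % 2 ))    # advances every 2 tasks -> data.yaml / data_QW.yaml
    echo "task $id -> net_conf[$net] data_conf[$data]"
done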

These are the messages (from multiple log files):

srun: error: CPU binding outside of job step allocation, allocated CPUs are: 0x000000000000000000000000000003FC.
srun: error: Task launch for StepId=59008526.0 failed on node gpu023: Unable to satisfy cpu bind request
srun: error: Application launch failed: Unable to satisfy cpu bind request
srun: Job step aborted

srun: error: CPU binding outside of job step allocation, allocated CPUs are: 0x0000000000000000000000000003FC00.
srun: error: Task launch for StepId=59008527.0 failed on node gpu023: Unable to satisfy cpu bind request
srun: error: Application launch failed: Unable to satisfy cpu bind request
srun: Job step aborted


srun: error: CPU binding outside of job step allocation, allocated CPUs are: 0x000000000000000000000000000003FC.
srun: error: Task launch for StepId=59008529.0 failed on node gpu023: Unable to satisfy cpu bind request
srun: error: Application launch failed: Unable to satisfy cpu bind request
srun: Job step aborted

Hi,

So I have been trying to narrow the problem down further, and it looks like I can't even submit a single Slurm job containing srun!

This is my Slurm script that just echoes Hello!. Note that I am no longer on a GPU node; I think this job was sent to the CPU debug node.

#!/bin/sh

#SBATCH --job-name=Hello
#SBATCH --output=%A_%a_log.out
#SBATCH --mem=4GB
#SBATCH --time=00:01:00

srun echo Hello!

And I get the same result:

srun: error: CPU binding outside of job step allocation, allocated CPUs are: 0x0001.
srun: error: Task launch for StepId=59008907.0 failed on node cpu001: Unable to satisfy cpu bind request
srun: error: Application launch failed: Unable to satisfy cpu bind request
srun: Job step aborted

Hi,

I created a directory test_sagon in your home and copy-pasted your hello script. I submitted it as you 10 times and it works fine. How do you submit your sbatch script, please?

Best

Hi,

So I went into test_sagon, ran sbatch hello.sh, and got the same error.
You can see it in the file 59009521_4294967294_log.out.

This definitely seems to be a problem with my account, but I have not changed any Slurm configuration since running jobs successfully yesterday.

Matt

Hi Yann,

I think I found the cause. I was submitting the job from a terminal while logged on to cpu277, which is where I work/code/debug before submitting GPU jobs.

When I submitted from the login node, the echo script worked.
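
My guess (completely unverified) is that sbatch exports the environment of the shell it is run from, so SLURM_* variables from my interactive session on cpu277 could be leaking into the new job and confusing the CPU binding. This is roughly what I looked at, in case it is useful:

# On cpu277, inside my working session, plenty of SLURM_* variables are set:
env | grep '^SLURM_' | head

# One thing I have not tried yet: stop sbatch from exporting my current
# environment into the batch job.
sbatch --export=NONE hello.sh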

So has something changed recently?

Matt

Hi,

As I said, I tested with your account from login2.

Can you let me know how you connect to the cluster (SSH client, OS, etc.)?
Do you load an environment or execute a custom init script before launching your sbatch script?
I've submitted 10 more jobs with your account without issue.

Can you share the full output, like this one:

(baobab)-[leighm@login2 test_sagon]$ sbatch hello.sh
Submitted batch job 59009619
(baobab)-[leighm@login2 test_sagon]$ ls -lart 59009619_4294967294_log.out
-rw-r--r-- 1 leighm private_dpnc 7 Jul 19 15:49 59009619_4294967294_log.out

I tried while connected to a CPU node; it works too.

This is how I proceed:

(baobab)-[leighm@login2 test_sagon]$ salloc
salloc: Pending job allocation 59009692
salloc: job 59009692 queued and waiting for resources
salloc: job 59009692 has been allocated resources
salloc: Granted job allocation 59009692
(baobab)-[leighm@cpu001 test_sagon]$ sbatch hello.sh
Submitted batch job 59009696
(baobab)-[leighm@cpu001 test_sagon]$ ls -lart 59009696_4294967294_log.out
-rw-r--r-- 1 leighm private_dpnc 7 Jul 19 15:59 59009696_4294967294_log.out

Please give the full output of your session to compare.