Jobs on private-kruse-gpu never start (gpu020, gpu031)


At first I thought the jobs were not starting because of a priority/resource issue that I still do not understand (see "Question about priority and job not starting while resources are available"). But there seems to be another problem.

My job stays pending indefinitely and I don’t know why. I checked that gpu020 and gpu031 have the features COMPUTE_TYPE_AMPERE and DOUBLE_PRECISION_GPU enabled, and they do.
Both nodes are also idle. The only thing I noticed is that the CPU load is shown in red and flagged with * (~ 0.90*).

I have absolutely no idea what the problem is…

After ~24h the job started on gpu030, but I don’t know why.
I am supposed to have a higher priority on this node (PRIO_JOB_FACTOR = 4), but I had to wait a day for a job of 10 to 30 min, while the same simulation starts immediately on gpu027 in the shared-gpu partition with PRIO_JOB_FACTOR = 1 (both nodes were idle).

Is there a problem with gpu[020,031]?
Or have I misunderstood something about priority/resources/constraints/partitions/etc.? I feel a bit lost…
They seem OK. I just tested launching a dummy job with your account:

[dumoulil@login2.baobab ~]$ srun --partition=private-kruse-gpu --nodelist=gpu031 hostname

When you say the nodes are idle, do you mean according to Slurm (sinfo or a similar command)?
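For reference, the node state, features and GPU resources as Slurm sees them can be listed in one call. This is only a minimal sketch: the function name `show_nodes` is made up for illustration, and the node names are the ones from this thread.

```shell
# Hypothetical helper: print state, features and GRES for a comma-separated
# list of nodes, as reported by Slurm.
show_nodes() {
    # -N : one line per node
    # -n : restrict the output to the given nodes
    # -o : output format: %N node name, %T state, %f features, %G generic resources (GPUs)
    sinfo -N -n "$1" -o "%N %T %f %G"
}

# Usage on the cluster:
#   show_nodes gpu020,gpu031
```

If the state column shows something other than idle (for example a drained or reserved node), that would already explain why a job cannot start there.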

Can you share your sbatch script, please?

Yes, according to Slurm (sinfo and pestat).

The sbatch script is the following:

#!/bin/env bash
#SBATCH --partition=private-kruse-gpu
#SBATCH --time=0-01:00:00
#SBATCH --gpus=ampere:1
#SBATCH --output=%J.out
#SBATCH --mem=3000

(job id: 58378431)

I submitted the same script (job id: 58378508) but with

#SBATCH --partition=private-kruse-gpu,shared-gpu

instead of private-kruse-gpu alone, and the job started immediately on gpu022.
The nodes gpu[020,031] are idle.
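When a job stays pending while the nodes look idle, the Reason field that Slurm attaches to the job is usually the first thing to check. The sketch below pulls that field out of `scontrol show job` output. The helper name `pending_reason` is hypothetical, and the job id in the usage comment is the one quoted above.

```shell
# Hypothetical helper: read `scontrol show job` output on stdin and print
# the value of the first Reason= field (e.g. Priority, Resources).
pending_reason() {
    grep -o 'Reason=[^ ]*' | head -n 1 | cut -d= -f2
}

# Usage on the cluster:
#   scontrol show job 58378431 | pending_reason
# The estimated start time of a pending job can also be queried with:
#   squeue -j 58378431 --start
```

The reason reported for the job submitted to private-kruse-gpu only, compared with the one submitted to both partitions, may help narrow down whether the problem is priority, the GPU request, or the nodes themselves.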

