Hello,
At first I thought the jobs were not starting for a priority/resource reason that I still do not understand (see Question about priority and job not starting while resources are available), but it seems there is another problem.
My job is pending indefinitely and I don't know why. I checked that gpu020 and gpu031 have the features COMPUTE_TYPE_AMPERE and DOUBLE_PRECISION_GPU active, and they do. Both nodes are also idle. The only thing I noticed is that the CPUload is red and flagged with * (~ 0.90*).
I have absolutely no idea what the problem is…
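For reference, something like the following should show the reason Slurm gives for the pending state and confirm the node features (a sketch, not my exact session; the format string is just an example):

squeue -u $USER -t PENDING -o "%.12i %.12P %.8T %r"   # last column is the pending reason
scontrol show node gpu020 | grep -iE 'Features|State|CPULoad'   # features, state and load of the node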
Thank you in advance,
Best,
Ludovic
After ~24h the job started on gpu030, but I don't know why.
I am supposed to have a higher priority on this node (PRIO_JOB_FACTOR = 4), but I had to wait a day for a 10-30 min job, while the same simulation starts immediately on gpu027 in the shared-gpu partition with PRIO_JOB_FACTOR = 1 (both nodes were idle).
Is there a problem with gpu[020,031]?
Or have I just misunderstood something about priority/resources/constraints/partitions/etc.? I feel a bit lost…
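In case it helps, the per-factor priority of a pending job and the PriorityJobFactor of each partition can be inspected with something like this (a sketch; <jobid> stands for the pending job's id):

sprio -l -j <jobid>                                      # breakdown of the job's priority factors
scontrol show partition private-kruse-gpu | grep -i priority   # shows PriorityJobFactor / PriorityTier
scontrol show partition shared-gpu | grep -i priority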
Thank you.
They seem OK. I just tested launching a dummy job with your account:
[dumoulil@login2.baobab ~]$ srun --partition=private-kruse-gpu --nodelist=gpu031 hostname
When you say the nodes are idle, do you mean according to Slurm (sinfo or a similar command)? Can you share your sbatch script, please?
Thank you,
Yes, according to Slurm (sinfo and pestat).
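Roughly what I looked at (a sketch; the exact format string is just an example):

sinfo -n gpu020,gpu031 -o "%.10n %.12T %.15C %.30f"   # state, CPUs (alloc/idle/other/total) and features of both nodes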
The sbatch header is the following:
#!/bin/env bash
#SBATCH --partition=private-kruse-gpu
#SBATCH --time=0-01:00:00
#SBATCH --gpus=ampere:1
#SBATCH --constraint=DOUBLE_PRECISION_GPU
#SBATCH --output=%J.out
#SBATCH --mem=3000
(job id: 58378431)
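While that job was still pending, the reason Slurm reported could be checked with something like (a sketch):

squeue -j 58378431 -o "%.12i %.8T %r"        # state and pending reason of that specific job
scontrol show job 58378431 | grep -i reason   # same information from the job record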
I did the same thing (job id: 58378508) but with
#SBATCH --partition=private-kruse-gpu,shared-gpu
instead of only private-kruse-gpu, and the job started immediately on gpu022.
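To confirm which partition and node Slurm actually used, something like this works once the job has run (a sketch):

sacct -j 58378508 --format=JobID,Partition,NodeList,State   # accounting record of the finished job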
The nodes gpu[020,031] are idle.