I have some jobs with this status. What does it mean?
Should I cancel these jobs and relaunch them?
This happened because one of your jobs was allocated a node (gpu009, for example), and during the start of the job (the prolog) Slurm encountered an error and put the node in the drain state. Because your job was expecting to use this particular node, Slurm could not relaunch it.
We have reset the nodes to the idle state, and your jobs should have started.
Hi, thanks for your answer. I have two jobs on node gpu009 that have just launched. I don't know whether they were blocked in this status or are other jobs. However, I still have a dozen other jobs in "launch failed requeued held".
Those jobs are waiting for resources, I guess.
Indeed, it seems that some of them aren't released automatically; I don't know why.
You can release them yourself (example job ID):
scontrol release 19548514
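If there are a dozen of them, releasing them one by one gets tedious. Here is a minimal sketch for doing it in bulk. The helper name held_launch_failed_ids is made up for this example, and it assumes the pending reason appears in the squeue output as "launch failed requeued held" (with spaces or underscores), so check the output on your cluster first.

```shell
# Hypothetical helper: reads lines of the form "<jobid> <reason>" on stdin
# and prints only the job IDs whose reason is "launch failed requeued held".
held_launch_failed_ids() {
    awk '{
        id = $1          # first field is the job ID
        $1 = ""          # the rest of the line is the reason
        if (tolower($0) ~ /launch[ _]failed[ _]requeued[ _]held/) print id
    }'
}

# On the cluster, feed it your pending jobs (%i = job ID, %r = reason):
#   squeue -u "$USER" -h -t PD -o "%i %r" | held_launch_failed_ids
# and, once the list looks right, release them:
#   squeue -u "$USER" -h -t PD -o "%i %r" | held_launch_failed_ids \
#       | xargs -r -n 1 scontrol release
```

Running the first pipeline alone before adding the xargs part lets you confirm that only the jobs you expect are listed.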
OK, thanks, it worked.