Primary information
Username: chindemi
Cluster: Yggdrasil
Description
After the maintenance I noticed several issues with the scheduler:
- Submission errors (e.g. "sbatch: error: Batch job submission failed: Unable to contact slurm controller (connect failure)"),
- Resource specification errors (e.g. requesting a GPU but getting a job without a GPU, see job 15141474),
- Completed jobs not releasing their resources.
Could you please have a look?
Thanks!
Giuseppe
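For scripted submissions, the transient "Unable to contact slurm controller" error quoted in the post above can be absorbed by retrying. The following is only a minimal sketch, not a recommendation from the cluster team: `submit_with_retry`, `run_sbatch`, the attempt count and the delays are hypothetical, and it assumes `sbatch` is on the PATH and that the error text on stderr contains the message quoted above.

```python
import subprocess
import time

# Substring of the error quoted in the post; matching on it is an assumption.
CONTROLLER_ERROR = "Unable to contact slurm controller"


def run_sbatch(script_path):
    """Run `sbatch <script>` and return (returncode, stdout, stderr)."""
    proc = subprocess.run(["sbatch", script_path], capture_output=True, text=True)
    return proc.returncode, proc.stdout, proc.stderr


def submit_with_retry(script_path, submit=run_sbatch, attempts=5,
                      delay=30.0, sleep=time.sleep):
    """Retry a submission only while the controller is unreachable.

    Returns the stripped stdout of the successful submission.
    Raises RuntimeError on any other error or when attempts run out.
    """
    for attempt in range(1, attempts + 1):
        code, out, err = submit(script_path)
        if code == 0:
            return out.strip()
        if CONTROLLER_ERROR not in err:
            # A different failure (bad script, bad options): do not retry.
            raise RuntimeError(f"sbatch failed: {err.strip()}")
        if attempt < attempts:
            sleep(delay * attempt)  # linear back-off between attempts
    raise RuntimeError(f"controller still unreachable after {attempts} attempts")
```

A call such as `submit_with_retry("job.sh")` would then wait 30 s, 60 s, and so on between attempts. Retrying could in principle create a duplicate job if the first request did reach the controller, so it is worth checking the queue afterwards.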
Hi @Giuseppe.Chindemi ,
I quickly solved the issue for now, but I will investigate further tomorrow morning.
It could happen again during the night.
Best regards
Thank you very much Adrien!
Hi,
I’m having the same issue on Yggdrasil: “Unable to contact slurm controller”.
Thanks!
Hi,
Same here: “Unable to contact slurm controller”.
Thank you
Hi,
I am observing jobs that time out long after their limit: for example, one job was marked as timed out after 19:33:25 although its time limit was 12h. Based on the logs, the program itself finished in 2:07:37.
Best regards,
Maciej Falkiewicz
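To make the size of that discrepancy explicit, the durations from the post above can be compared directly. This is a small sketch with hypothetical helper names; it assumes durations written as `[D-]HH:MM:SS` (the shape of the values quoted above) and does not cover the other time formats Slurm accepts.

```python
def slurm_time_to_seconds(text):
    """Convert '[D-]HH:MM:SS' (e.g. '19:33:25' or '1-02:00:00') to seconds."""
    days = 0
    if "-" in text:
        day_part, text = text.split("-", 1)
        days = int(day_part)
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def overrun(elapsed, limit):
    """Seconds by which `elapsed` exceeds `limit` (0 if within the limit)."""
    return max(0, slurm_time_to_seconds(elapsed) - slurm_time_to_seconds(limit))


def format_seconds(total):
    """Format a number of seconds as H:MM:SS."""
    hours, rest = divmod(total, 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"


# Values from the post: elapsed 19:33:25, limit 12h, program done after 2:07:37.
print(format_seconds(overrun("19:33:25", "12:00:00")))  # 7:33:25 past the limit
print(format_seconds(overrun("19:33:25", "2:07:37")))   # 17:25:48 after the program ended
```

With these numbers the job stayed allocated about 7.5 hours beyond its limit and more than 17 hours after the program had finished.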
Hello,
Over the weekend I observed the following:
- 2 jobs restarted after completion,
- Job 15244189 is still running, but the actual program terminated yesterday evening.
Cheers,
Giuseppe
Hi @maciej.falkiewicz
Could you give us the JobID?
Hi All,
The issue has been solved. Slurm can behave strangely when it fails, so we will not look into every problem reported during the failure.
If it occurs again, let us know and we will investigate.
We apologize for the inconvenience.