SLURM issues after maintenance

Primary information

Username: chindemi
Cluster: Yggdrasil

Description

After the maintenance I noticed several issues with the scheduler:

  • Submission errors (e.g. sbatch: error: Batch job submission failed: Unable to contact slurm controller (connect failure)),
  • Resource specification errors (e.g. requesting a GPU but getting a job without a GPU, see job 15141474),
  • Completed jobs not releasing their resources.
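
For anyone who wants to check their own jobs for the GPU symptom, here is a minimal Python sketch. It assumes the cluster's accounting records GPUs as a `gres/gpu` TRES and that the requested and allocated TRES strings were obtained from something like `sacct -j <jobid> --format=ReqTRES,AllocTRES --parsable2 --noheader`. The function names and the sample strings are hypothetical and are not taken from job 15141474.

```python
def parse_tres(tres: str) -> dict[str, str]:
    """Parse a TRES string such as 'cpu=4,gres/gpu=1,mem=16G' into a dict."""
    result = {}
    for item in tres.split(","):
        if "=" in item:
            key, value = item.split("=", 1)
            result[key.strip()] = value.strip()
    return result


def gpu_missing(req_tres: str, alloc_tres: str) -> bool:
    """Return True if a GPU was requested but none appears in the allocation."""
    requested = any(k.startswith("gres/gpu") for k in parse_tres(req_tres))
    allocated = any(k.startswith("gres/gpu") for k in parse_tres(alloc_tres))
    return requested and not allocated


# Hypothetical sample values, for illustration only.
req = "billing=1,cpu=1,gres/gpu=1,mem=3000M,node=1"
alloc = "billing=1,cpu=1,mem=3000M,node=1"
print(gpu_missing(req, alloc))
```

A `True` result would only suggest that the job is worth reporting; it does not show why the allocation differs from the request.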

Could you please have a look?

Thanks!

Giuseppe

Hi @Giuseppe.Chindemi,

I applied a quick fix for the issue, and I will investigate further tomorrow morning.

It could happen again during the night :confused:

Best regards

Thank you very much Adrien!


Hi,

I’m having the same issue on Yggdrasil: “unable to contact slurm controller”.

Thanks!

Hi,

Same here: “unable to contact slurm controller”.
Thank you

Hi,

I am observing jobs that are reported as timed out long after their time limit, for example after 19:33:25 when the time limit was 12h. Based on the logs, the program itself finished in 2:07:37.
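
To spot other jobs with the same symptom, a small Python sketch like the one below can compare a job's elapsed time with its time limit. It assumes the two values come from something like `sacct -j <jobid> --format=Elapsed,Timelimit --parsable2 --noheader` and are formatted as `D-HH:MM:SS`, `HH:MM:SS` or `MM:SS`. Special values such as `UNLIMITED` are not handled, and the function names are hypothetical.

```python
def slurm_time_to_seconds(value: str) -> int:
    """Convert 'D-HH:MM:SS', 'HH:MM:SS' or 'MM:SS' to a number of seconds."""
    days = 0
    if "-" in value:
        day_part, value = value.split("-", 1)
        days = int(day_part)
    parts = [int(p) for p in value.split(":")]
    # Pad missing leading fields (e.g. 'MM:SS' has no hours).
    while len(parts) < 3:
        parts.insert(0, 0)
    hours, minutes, seconds = parts
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def overran(elapsed: str, timelimit: str) -> bool:
    """Return True if the elapsed time exceeds the time limit."""
    return slurm_time_to_seconds(elapsed) > slurm_time_to_seconds(timelimit)


# Values from the report above: elapsed 19:33:25 against a 12h limit.
print(overran("19:33:25", "12:00:00"))   # True
print(slurm_time_to_seconds("2:07:37"))  # 7657
```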

Best regards,
Maciej Falkiewicz

Hello,

Over the weekend I observed the following:

  • 2 jobs restarted after completion,
  • Job 15244189 is still running, but the actual program terminated yesterday evening.

Cheers,
Giuseppe

Hi @maciej.falkiewicz

Could you give me the JobID?

Hi All,

The issue has been solved. Slurm can behave strangely when it fails, so we will not look into each of the problems reported during the failure.

If it occurs again, let us know and we will investigate.

We apologize for the inconvenience.
