SLURM timeout on Baobab 17.01.2019-18.01.2019

Since 17.01.2019, you may have experienced issues with job submissions on Baobab.

The symptom was that Slurm commands stopped responding or failed with error messages such as:

slurm_load_jobs error: Unable to contact slurm controller (connect failure)
slurm_load_jobs error: Socket timed out on send/recv operation

The cause was a very high number of jobs submitted to Baobab (~40k), which overloaded the Slurm controller.

As a reminder, if you have many similar jobs, the best practice is to group them in a job array.
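
For those who have not used job arrays before, here is a minimal sketch of what such a submission script could look like. The job name, array range, time limit, file names and the program name are all hypothetical and need to be adapted to your own work; the `%50` suffix is optional and limits how many tasks of the array run at the same time.

```shell
#!/bin/bash
# Illustrative sketch only: job name, array range, time limit and
# file names below are hypothetical placeholders.
#SBATCH --job-name=array_example
# 1000 tasks, at most 50 of them running at the same time
#SBATCH --array=1-1000%50
#SBATCH --time=00:15:00
# %A is the array job ID, %a is the index of the task
#SBATCH --output=array_%A_%a.out

# Slurm sets SLURM_ARRAY_TASK_ID for each task of the array.
# Default to 1 so the script can also be run outside Slurm for a quick check.
TASK_ID="${SLURM_ARRAY_TASK_ID:-1}"

# Select the input that belongs to this task.
INPUT_FILE="input_${TASK_ID}.dat"

echo "Task ${TASK_ID} would process ${INPUT_FILE}"
# Replace the echo above with the real command, for example:
# srun ./my_program "${INPUT_FILE}"
```

Assuming the script is saved as job_array.sh (a hypothetical name), a single `sbatch job_array.sh` replaces 1000 separate sbatch calls, and every task finds its own input through its task index.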

I have modified the Slurm configuration to mitigate this issue, and so far it appears to be working well.

The following parameters were modified in slurm.conf:

Before:
SchedulerParameters=bf_continue,bf_window=40320,bf_max_job_test=400,bf_max_job_user=50
After:
SchedulerParameters=bf_continue,bf_window=40320,bf_max_job_test=400,bf_max_job_user=50,bf_resolution=200,max_rpc_cnt=32

For reference, bf_resolution sets the time resolution (in seconds) used by the backfill scheduler, so a larger value makes each backfill cycle cheaper, and max_rpc_cnt makes the scheduler defer its work when the controller already has that many active RPC threads, which leaves it more time to answer client commands.

Best regards