Since 17.01.2019, you may have experienced issues with job submissions on Baobab.
The symptom was that Slurm commands stopped responding and returned error messages such as:
slurm_load_jobs error: Unable to contact slurm controller (connect failure)
slurm_load_jobs error: Socket timed out on send/recv operation
The cause was a very high number of jobs submitted to Baobab, about 40k.
As a reminder, if you have many similar jobs, the best practice is to group them in a job array.
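For those who have not used job arrays yet, here is a minimal sketch of a submission script. The job name, the index range, the time limit, and the `input_N.dat` naming scheme are assumptions for illustration only; adapt them to your own work. The `--array` option and the `SLURM_ARRAY_TASK_ID` variable are standard Slurm features.

```shell
#!/bin/sh
#SBATCH --job-name=array-example
#SBATCH --array=1-100%10
#SBATCH --time=00:15:00
#SBATCH --output=array-example_%A_%a.out

# Map an array index to its input file (hypothetical naming scheme).
input_for_task() {
    echo "input_$1.dat"
}

# Slurm sets SLURM_ARRAY_TASK_ID for each task of the array.
# Default to 1 so the script can also be tried outside Slurm.
task_id="${SLURM_ARRAY_TASK_ID:-1}"

# Replace this echo with the real command for one task.
echo "task ${task_id}: processing $(input_for_task "$task_id")"
```

Submitted once with `sbatch`, this creates 100 tasks in a single job record instead of 100 separate jobs, and the `%10` suffix limits it to 10 tasks running at the same time.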
I have modified the Slurm configuration to mitigate this issue, and it seems to be working well so far.
The following parameters were modified in slurm.conf:
Before:
SchedulerParameters=bf_continue,bf_window=40320,bf_max_job_test=400,bf_max_job_user=50
After:
SchedulerParameters=bf_continue,bf_window=40320,bf_max_job_test=400,bf_max_job_user=50,bf_resolution=200,max_rpc_cnt=32
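For reference, here is the new line again with a short note on the two added options. The comments paraphrase my understanding of the Slurm documentation for SchedulerParameters; please check the slurm.conf man page for the Slurm version in use before relying on them.

```
# bf_resolution=200: the backfill scheduler tracks job start and end times
#   in blocks of 200 seconds (the documented default is 60). Larger blocks
#   make backfill cycles quicker at the cost of some scheduling precision.
# max_rpc_cnt=32: slurmctld defers job scheduling while it has 32 or more
#   active threads, which leaves it time to answer pending requests.
SchedulerParameters=bf_continue,bf_window=40320,bf_max_job_test=400,bf_max_job_user=50,bf_resolution=200,max_rpc_cnt=32
```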
Best regards