Hi all,
I wanted to use the command sacct -u conradin
to check whether I have running jobs, and I got the following error:
sacct: error: slurm_persist_conn_open_without_init: failed to open persistent connection to host:slurm1:6819: Connection refused
sacct: error: Sending PersistInit msg: Connection refused
sacct: error: Problem talking to the database: Connection refused
Is it related to the maintenance on baobab?
Best,
Raphaël
Hi there,
Yes, it was. It is fixed now:
[capello@login1 ~]$ sacct -u capello
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
[capello@login1 ~]$
Sorry for the inconvenience.
Thx, bye,
Luca
Hi,
this error seems to be popping up again on Yggdrasil. Tested just now.
(noisepy) [savardg@login1.yggdrasil CCF_25sps_3c_coherence]$ sacct -j 8061735
sacct: error: slurm_persist_conn_open_without_init: failed to open persistent connection to host:slurm1:6819: Connection refused
sacct: error: Sending PersistInit msg: Connection refused
sacct: error: Problem talking to the database: Connection refused
Thanks in advance,
Genevieve
Hi, this is fixed! Thanks for letting us know.
Hi, this issue is back again for me on Ygg.
(base) [savardg@login1.yggdrasil ~]$ sacct -j 8163959
sacct: error: slurm_persist_conn_open_without_init: failed to open persistent connection to host:slurm1:6819: Connection refused
sacct: error: Sending PersistInit msg: Connection refused
sacct: error: Problem talking to the database: Connection refused
Hi, this is corrected, thanks.
Hi, I’m sorry to report the sacct error is back.
sacct: error: slurm_persist_conn_open_without_init: failed to open persistent connection to host:slurm1:6819: Connection refused
sacct: error: Sending PersistInit msg: Connection refused
sacct: error: Problem talking to the database: Connection refused
Arf, you are right! I restarted a server this morning, and that killed our SSH tunnel.
Thanks for the monitoring :)
Hi Yann, the error is back. Sorry to keep bugging you with this one. I always use sacct to check the status of completed jobs; I guess other users don’t use it as much, since they don’t seem to notice.
Perhaps there could be a script that routinely checks whether sacct fails and restarts the server if necessary? I’m not sure how feasible that would be, or whether it would affect performance.
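To make the idea concrete, here is a rough sketch of such a check in Python. It only detects the failure by looking for the error text quoted in this thread; the function names are made up for illustration, and the restart step is deliberately left out because I don't know which service or tunnel would need restarting.

```python
import subprocess

# Error text copied from the sacct output quoted in this thread.
DB_ERROR_MARKER = "Problem talking to the database"


def sacct_db_unreachable(stderr_text):
    """Return True if sacct's stderr contains the accounting database error."""
    return DB_ERROR_MARKER in stderr_text


def check_sacct(user, timeout=30):
    """Run `sacct -u <user>` and return True if the database answered.

    Hypothetical helper: a missing or hanging sacct is also reported
    as a failure so that someone gets alerted.
    """
    try:
        result = subprocess.run(
            ["sacct", "-u", user],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0 and not sacct_db_unreachable(result.stderr)
```

Something like this could be called from a cron job every few minutes, with an alert or restart triggered whenever check_sacct returns False. A single sacct call per run should be a negligible load.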
Hi, this time it is related to the Baobab maintenance… you’ll have to wait a little bit and we’ll reactivate the service. Yes, we already have a script… but it seems it doesn’t always work as expected. We’ll try to improve that in the future.
Ok no problem, good to know! Thanks, Yann.
Good evening,
I guess you know what the problem is.
Yes… Yggdrasil is using login2.baobab to communicate with Baobab, and login2 crashed.
Service restored.