[Baobab] General issues using the cluster: timeouts, hanging jobs, network issues?


I seem to be having general issues with jobs running on Baobab this morning.
I’m currently debugging some scripts in a Jupyter notebook on a CPU node and have hit the following issues:

  • Port forwarding over SSH sometimes reports a timeout or connection refused
  • Processes take a very long time to start up
  • Individual calls to the interactive kernel can hang for a very long time
  • The kernel stops responding entirely and needs to be restarted

I thought it might be a disk read issue, but even when everything is in memory it is painfully slow. Could it be a general I/O slowdown? Or perhaps an effect of sending my commands over SSH port forwarding?
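
One way to separate these possibilities is to time a purely in-memory computation and a small disk round trip directly on the node, outside the notebook. The following is a minimal sketch, assuming Python is available on the compute node; the function names (`cpu_only`, `disk_roundtrip`, `probe`) are hypothetical and not part of any Baobab tooling, and the read-back may be served from the page cache, so the disk figure mostly reflects write and sync latency.

```python
import os
import tempfile
import time


def time_call(fn):
    """Return the wall-clock seconds taken by fn()."""
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start


def cpu_only(n=200_000):
    # Pure in-memory work: no disk or network involved.
    return sum(i * i for i in range(n))


def disk_roundtrip(directory=None, size_bytes=1_000_000):
    # Write a temporary file in `directory`, force it to storage, read it back.
    # directory=None uses the system default temp location (an assumption;
    # pass a path on the filesystem you actually want to test).
    payload = os.urandom(size_bytes)
    with tempfile.NamedTemporaryFile(dir=directory) as handle:
        handle.write(payload)
        handle.flush()
        os.fsync(handle.fileno())
        handle.seek(0)
        data = handle.read()
    return len(data)


def probe(directory=None):
    # Collect both timings so they can be compared side by side.
    return {
        "cpu_seconds": time_call(cpu_only),
        "disk_seconds": time_call(lambda: disk_roundtrip(directory)),
    }


if __name__ == "__main__":
    print(probe())
```

If both numbers look normal when run from a plain SSH shell but the notebook is still sluggish, the forwarded connection becomes the more likely suspect; if the disk figure is far larger when pointed at a shared directory than at local scratch, that points toward storage.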

I know I’m not the only one with issues, though. @Matthew.Leigh’s non-interactive jobs, which were running perfectly fine first thing this morning, are now failing due to timeouts.

Any ideas?

Hi @John.Raine

Thank you for reporting. I will check right now what is going wrong.

Hi @John.Raine

Do you still have the issue?

Hi Adrien,

Now it seems that things have gone back to normal.