Gpu017 is down (once again)

gpu017 is down again with the following message:

Reason=Node unexpectedly rebooted [slurm@2022-04-20T11:21:08]


Looks like it is back up!

The title of the topic doesn’t sound very kind, and the post was written only 6 minutes after the node went down.

Anyway, the situation is the following: we are aware that this GPU server has an issue. We have to swap two CPUs on it, and we’ll do it as soon as possible; we are waiting for some thermal paste from the vendor. In the meantime, the server has been restarted and will crash… once again :wink:

Do not take it personally: we always appreciate it when users notify us of issues, but without too much pressure :slight_smile: Thanks for the notification!

Hello Yann,
Apologies if the title sounds harsh; the forum was complaining that a similar title already existed.
I posted immediately because this particular node is the most performant one (at least compared to gpu023 and gpu024, which are always slow for me) but is frequently down.

Thanks again for looking into this. It would be great to have this node be more reliably available.


No problem.

Thanks for insisting on the slowness of gpu023; this is now fixed!

