GPU017 is down (again)

Hi, just noticed that gpu017 is down.
Is there something wrong with this node? It seems to be down most of the time.

Regards,
Deb

Hi,

We are aware of it. The node does not power on.

Hi,

As this is one of the nodes we purchased, it would be great to know what the problem is. Is there an update on the current status?

I see the node is still down, and has been for over two weeks now. This is also not an uncommon occurrence: the node keeps becoming unresponsive to Slurm and ending up in the down state.

Fortunately our new GPUs have been installed recently (thanks again!) but the 3090s on gpu017 are our most powerful cards, and it would be great to have them back in production, and stable.

Cheers,
Johnny

Hi Johnny,

You are right, this isn’t a good situation. I restarted the node just now. The risk is that this node will power off again sooner or later. We had the same issue with another similar GPU node, and we had to replace the CPUs.

I checked with the vendor to see what they suggest, as this node is of course under warranty.

Best

Hi Yann,

That’s great, thanks!
I’ll keep an eye on it and let you know if or when something happens, not to apply pressure, but to at least give you one less thing to keep an eye on.

Cheers,
Johnny

Looks like no change. Still down (though with the same reason as last month).

Johnny

Hi, unfortunately the node doesn’t power on anymore :( I’ll check if someone is near the DC to see what is happening.

Hi, I was able to turn on gpu017 this morning. It is now IDLE. I think the vendor will exchange the two CPUs, but in the meantime the server should be usable.

Best
