Drained GPUs on Baobab?

Hi, there seem to be a lot of drained GPUs on Baobab.

Why is this happening?


Dear users,

For some time now, GPU nodes have been draining for several different reasons. We are working every day to bring as many GPU nodes as possible back into production.

For example, gpu044 was drained because of a broken GPU. We received the replacement part yesterday and installed it this morning.

Other GPU nodes sometimes have storage issues or stuck processes. We hope the next maintenance will solve these kinds of issues by updating the kernel and the InfiniBand fabric.

The current state of the GPU pool is fine, and your jobs will start as soon as possible:

(baobab)-[root@admin1 ~]$ sinfo | grep gpu
shared-gpu                     up   12:00:00      1   drng gpu012
shared-gpu                     up   12:00:00     27    mix gpu[007,009,011,013-016,020,022-024,027-032,035-043,045]
shared-gpu                     up   12:00:00      1  alloc gpu033
shared-gpu                     up   12:00:00     13   idle gpu[002,004-006,008,017-019,021,025-026,034,044]
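If you want to summarize this kind of listing yourself, here is a minimal sketch that sums the node count per state. It assumes the default `sinfo` column order (PARTITION, AVAIL, TIMELIMIT, NODES, STATE, NODELIST); the function name `nodes_per_state` is only illustrative, and the sample text is copied from the output above. In that output, `drng` is Slurm's short form for draining and `mix` for mixed (partially allocated).

```python
from collections import Counter

# Sample text copied from the sinfo output shown above.
SINFO_OUTPUT = """\
shared-gpu                     up   12:00:00      1   drng gpu012
shared-gpu                     up   12:00:00     27    mix gpu[007,009,011,013-016,020,022-024,027-032,035-043,045]
shared-gpu                     up   12:00:00      1  alloc gpu033
shared-gpu                     up   12:00:00     13   idle gpu[002,004-006,008,017-019,021,025-026,034,044]
"""


def nodes_per_state(sinfo_text):
    """Sum the NODES column per STATE.

    Assumes six whitespace-separated columns per line:
    PARTITION AVAIL TIMELIMIT NODES STATE NODELIST.
    """
    totals = Counter()
    for line in sinfo_text.splitlines():
        fields = line.split()
        # Skip header lines or anything that does not match the assumed layout.
        if len(fields) != 6 or not fields[3].isdigit():
            continue
        totals[fields[4]] += int(fields[3])
    return dict(totals)


if __name__ == "__main__":
    for state, count in nodes_per_state(SINFO_OUTPUT).items():
        print(f"{state}: {count}")
```

On the cluster, `sinfo -R` also lists the reason recorded for each drained or down node, which is usually the quickest way to see why a given node is unavailable.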

Best regards,