Hi, there seem to be a lot of drained GPUs on Baobab.
Why is this happening?
Cheers,
Malte
Dear users,
For some time now, GPU nodes have been going into drain for a variety of reasons, and we work every day to return as many of them as possible to production.
For example, gpu044 was drained because of a broken GPU; we received the replacement part yesterday and installed it this morning.
Other GPU nodes occasionally have storage issues or stuck processes. We hope the next maintenance will resolve these kinds of issues by updating the kernel and the InfiniBand fabric.
The GPU pool is currently in good shape, and your jobs will be started as soon as possible:
(baobab)-[root@admin1 ~]$ sinfo | grep gpu
shared-gpu up 12:00:00 1 drng gpu012
shared-gpu up 12:00:00 27 mix gpu[007,009,011,013-016,020,022-024,027-032,035-043,045]
shared-gpu up 12:00:00 1 alloc gpu033
shared-gpu up 12:00:00 13 idle gpu[002,004-006,008,017-019,021,025-026,034,044]
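
If you want to check the state of the pool yourself, here is a minimal sketch that tallies nodes per state from `sinfo` output like the lines above. The function name `count_nodes_by_state` is hypothetical, and the column order (partition, availability, time limit, node count, state, node list) is an assumption based on the output shown here; the sample simply reuses those lines.

```python
from collections import Counter


def count_nodes_by_state(sinfo_output: str) -> Counter:
    """Sum the NODES column per STATE for sinfo-style lines.

    Assumes whitespace-separated columns in this order:
    PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
    """
    counts = Counter()
    for line in sinfo_output.splitlines():
        fields = line.split()
        # Skip headers, prompts, and anything that does not match the layout.
        if len(fields) < 6 or not fields[3].isdigit():
            continue
        counts[fields[4]] += int(fields[3])
    return counts


# Sample copied from the sinfo output in this post.
SAMPLE = """\
shared-gpu up 12:00:00 1 drng gpu012
shared-gpu up 12:00:00 27 mix gpu[007,009,011,013-016,020,022-024,027-032,035-043,045]
shared-gpu up 12:00:00 1 alloc gpu033
shared-gpu up 12:00:00 13 idle gpu[002,004-006,008,017-019,021,025-026,034,044]
"""

if __name__ == "__main__":
    for state, nodes in sorted(count_nodes_by_state(SAMPLE).items()):
        print(f"{state}: {nodes}")
```

On the cluster you could feed it the text of `sinfo | grep gpu` instead of the sample; a growing `drng` or `drain` count is the signal worth reporting.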
Best regards,