Seven nodes on the private-dpt-cpu partition are in drain and seem to have been in this state for a while (I don’t know exactly how long). Because of this, jobs seem to be piling up. Can the admins please take a look?
In the same vein, gpu025, gpu026, and gpu027 have repeatedly gone into drain and then come back over the last week (currently only gpu025 is in drain). Can the admins please have a look at that as well?
We are upgrading the BIOS of every compute node (issue-grl), and you are right, this may block some of your compute nodes. We’ve restored them now.
Extra hint: on Baobab and Bamboo, we have a lot of idle compute nodes that you can use as well if 12 hours of compute time is enough for your job. In that case, specify a maximum time of 12 hours or less, and specify the partition as --partition=private-dpt-cpu,shared-cpu
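As a rough illustration of the hint above, a batch script could look something like the sketch below. The job name and the final echo line are placeholders of mine, not something from your setup; only the partition list and the 12-hour limit come from this thread.

```shell
#!/bin/sh
# Hypothetical job name, chosen for illustration only.
#SBATCH --job-name=example-job
# Let Slurm start the job on whichever listed partition is free first.
#SBATCH --partition=private-dpt-cpu,shared-cpu
# Maximum wall time of 12 hours (hours:minutes:seconds).
#SBATCH --time=12:00:00

# Placeholder workload: report the partition the job landed on.
# SLURM_JOB_PARTITION is set by Slurm inside a job; outside a job it falls back to "unknown".
msg="Job running on partition: ${SLURM_JOB_PARTITION:-unknown}"
echo "$msg"
```

You would submit it with sbatch and replace the echo line with your actual commands.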
The issue with the GPU nodes is a known one. Unfortunately, there is not much we can do other than monitor them and restore them. They are used by many people for multiple applications, and it is hard to understand what triggers them to go into drain.
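If you want to check the state yourself in the meantime, something along these lines should list the reason Slurm recorded for each drained or down node. This is only a sketch: the node range is taken from the names mentioned in this thread, and the fallback message is there so the snippet does not fail on a machine without the Slurm client tools.

```shell
#!/bin/sh
# Node range assumed from the node names mentioned in this thread.
nodes="gpu[025-027]"

if command -v sinfo >/dev/null 2>&1; then
    # -R (--list-reasons) prints the reason, user, and timestamp for nodes
    # that are down, drained, or failing; --nodes restricts the output.
    sinfo -R --nodes="$nodes"
else
    echo "sinfo not found: run this on a cluster login node"
fi
```

The reason column is filled in by whoever (or whatever) drained the node, so it may be terse or empty.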