[baobab][slurm] gpu nodes inval

Primary information

Username: falkiewi
Cluster: baobab

Description

Dear @support,

There seems to be a problem with most of the GPU nodes on the cluster:

(baobab)-[falkiewi@login1 ~]$ sinfo -R
REASON               USER      TIMESTAMP           NODELIST
issue-5894           root      2025-04-02T16:01:38 cpu242
issue-6543           root      2025-03-24T09:32:28 cpu319
issue-5958           root      2024-12-02T09:39:22 cpu246
gres/gpu GRES core s slurm     2025-04-03T15:48:07 gpu030
gres/gpu count repor slurm     2025-04-02T15:47:44 gpu002
gres/gpu GRES core s slurm     2025-04-03T15:48:07 gpu[004-009]
gres/gpu count repor slurm     2025-04-03T15:48:07 gpu010
gres/gpu GRES core s slurm     2025-04-02T15:47:44 gpu011
gres/gpu GRES core s slurm     2025-04-03T15:48:07 gpu[013-017,021-026,034-044,046-049]
gres/gpu GRES core s slurm     2025-04-03T15:48:07 gpu[018-020,027-029,031-033,045]

I kindly ask for a quick solution to the problem and thank you in advance!

Best regards,
Maciej Falkiewicz

Dear @maciej.falkiewicz

This is a bug in slurmd: 22498 – "GRES cores doesn't match socket boundaries"

In the meantime, I’ll resume the nodes manually.
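
For reference, here is a minimal sketch of what such a manual resume usually looks like with `scontrol`. The helper name `resume_nodes` and the `DRY_RUN` switch are hypothetical and only for illustration; the node list is taken from the `sinfo -R` output above. By default it only prints the command, since actually running it requires Slurm administrator privileges.

```shell
# Hypothetical helper (not part of Slurm): print, or optionally run,
# the scontrol command that returns a node list to service.
# DRY_RUN defaults to 1 (print only); set DRY_RUN=0 to execute.
resume_nodes() {
    nodelist="$1"
    cmd="scontrol update NodeName=${nodelist} State=RESUME"
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$cmd"
    else
        $cmd
    fi
}

# Example: one of the affected ranges reported by sinfo -R
resume_nodes "gpu[004-009]"
```

Afterwards, `sinfo -R` should no longer list the resumed nodes, although they may be flagged again after a slurmd restart for as long as the underlying bug is present.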