Dear HPC team,
My jobs submitted to gpu006 are not being scheduled, and scontrol shows that the node appears to be down:
Reason=health_ps___blocked
Best, Ramón.
Also, gpu009 is down:
Reason=health_cuda___GPU_broken
Hello,
The nodes have been put back into production. We have been seeing a lot of GPU issues, and I’m checking whether the NVIDIA driver is the cause.
Best regards,