Primary information
Username: falkiewi
Cluster: Baobab
Dear @support,
The majority of nodes on the cluster are in a drain, drng, or down state, as the following output shows:
$ sinfo -R
REASON               USER      TIMESTAMP           NODELIST
issue-6277 : Not res slurm     2024-08-26T14:28:56 cpu[002,005,007-008,045-056,058-059,061-062]
issue-6277           root      2024-08-22T10:44:06 cpu004
health_BEEGFS__tcp_c root      2024-09-02T11:40:07 cpu[156,173-183,187-188,190-193,196-198,208,220-221,240,242,270-271,274,287,290,292-293,295,303,308-311]
Not responding       slurm     2024-08-30T21:25:06 cpu285
health_BEEGFS__tcp_c root      2024-09-02T10:52:06 cpu[247,253]
health_BEEGFS__tcp_c root      2024-09-02T10:52:07 cpu251
health_BEEGFS__tcp_c root      2024-09-02T10:52:08 cpu252
health_BEEGFS__tcp_c root      2024-09-02T11:43:07 cpu326,gpu017
health_BEEGFS__tcp_c root      2024-09-02T10:52:05 cpu[065-066,194,223,262,273,275-276,298,301,304,317]
health_BEEGFS__tcp_c root      2024-09-02T10:58:05 cpu[157-160,162,164-169,171-172,184,189,195,200-202,216-217,225,263-264,294,306-307,332],gpu005
health_BEEGFS__tcp_c root      2024-09-02T10:58:05 cpu161
health_BEEGFS__tcp_c root      2024-09-02T10:58:06 cpu185
health_BEEGFS__tcp_c root      2024-09-02T11:40:08 cpu[199,205-206,222]
health_BEEGFS__tcp_c root      2024-09-02T10:52:05 cpu[207,249-250,254-261,299-300,305,315,319,321]
health_BEEGFS__tcp_c root      2024-09-02T10:52:08 cpu224
issue-6277 : Not res slurm     2024-08-26T14:27:16 cpu186
health_cuda___nvidia root      2024-09-02T10:55:08 gpu034
health_cuda___nvidia root      2024-09-02T00:21:47 gpu002
health_cuda___nvidia root      2024-08-30T00:46:53 gpu008
Kill task failed     root      2024-08-28T03:30:07 gpu012
Kill task failed     root      2024-08-28T21:17:07 gpu018
Kill task failed     root      2024-08-31T03:39:02 gpu025
issue-6299           root      2024-08-29T15:14:10 gpu047
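
In case it helps with triage, here is a rough sketch of how the listing above could be summarised as a node count per reason. It is only an illustration: the helper names `expand_nodelist` and `count_nodes_by_reason` are made up for this post, the bracket expansion is a simplified stand-in for what `scontrol show hostnames` does, and it assumes each line ends with the USER, TIMESTAMP and NODELIST fields.

```python
import re
from collections import Counter


def expand_nodelist(nodelist):
    """Expand e.g. 'cpu[002,007-008],gpu005' into individual node names.

    Simplified, hypothetical helper: handles one bracket group per prefix.
    """
    nodes = []
    # Split on commas that are outside the square brackets.
    for part in re.findall(r"[^,\[]+(?:\[[^\]]*\])?", nodelist):
        match = re.fullmatch(r"([^\[]+)\[([^\]]*)\]", part)
        if not match:
            nodes.append(part)  # plain name such as 'cpu285'
            continue
        prefix, ranges = match.groups()
        for item in ranges.split(","):
            low, _, high = item.partition("-")
            high = high or low  # single index such as '005'
            width = len(low)  # keep zero padding
            for n in range(int(low), int(high) + 1):
                nodes.append(f"{prefix}{n:0{width}d}")
    return nodes


def count_nodes_by_reason(sinfo_output):
    """Count nodes per REASON in text shaped like `sinfo -R` output."""
    counts = Counter()
    for line in sinfo_output.splitlines():
        # The reason may contain spaces, so split from the right.
        fields = line.rsplit(None, 3)
        if len(fields) != 4 or fields[0] == "REASON":
            continue  # skip the header and anything malformed
        reason, _user, _timestamp, nodelist = fields
        counts[reason] += len(expand_nodelist(nodelist))
    return counts


if __name__ == "__main__":
    # Three lines copied from the listing above.
    sample = """\
REASON USER TIMESTAMP NODELIST
Not responding slurm 2024-08-30T21:25:06 cpu285
health_BEEGFS__tcp_c root 2024-09-02T11:43:07 cpu326,gpu017
health_BEEGFS__tcp_c root 2024-09-02T11:40:08 cpu[199,205-206,222]
"""
    for reason, count in count_nodes_by_reason(sample).most_common():
        print(f"{count:4d}  {reason}")
```

Feeding it the full listing should make it easier to see which reason accounts for most of the unavailable nodes.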
Kind regards,
Maciej Falkiewicz