2023-02-22T10:07:00Z
Dear HPC User,
We regret to inform you that a cooling issue in DataCenter Dufour has forced us to reserve all nodes in order to prevent any new jobs from starting.
This step has been taken to ensure the preservation of your running jobs.
We have taken this measure with the hope that it will be sufficient to reduce the temperature and resolve the issue.
However, if the situation worsens, we may be forced to shut down all the machines and terminate your running jobs.
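For reference, the blocking reservation can be seen from the login node with standard Slurm commands (shown as an example only; the reservation name and node list are whatever the team configured):
# list the active reservation(s) holding the nodes
scontrol show reservation
# nodes held by a reservation are reported in the "resv" state
sinfo --states=reserved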
Please be assured that we are working diligently to address this issue and minimize any disruptions that may be caused.
We will keep you informed of the progress of the situation.
We apologize for any inconvenience this may cause and appreciate your patience and understanding.
Best regards,
2023-02-22T15:00:00Z
The situation has stabilized and no further action impacting running jobs is expected (for now).
A technical team is working to resolve the problem as soon as possible. The intervention should be completed tomorrow morning (we are waiting for a spare part), and our services should resume normal operation shortly after.
2023-02-23T13:23:00Z
The situation on Baobab has not changed, job submissions are suspended until further notice.
The login node and storage remain available for file transfer.
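For example, data can still be copied off the cluster with rsync or scp through the login node (a sketch only; the hostname, username and paths below are illustrative, use your usual connection details):
# pull a results directory from your Baobab home to your workstation
rsync -av myuser@login2.baobab.hpc.unige.ch:~/results/ ./results/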
Good News: The maintenance of Yggdrasil is now complete.
2023-02-27T14:20:00Z
The situation on Baobab has not changed, job submissions are suspended until further notice. However, we hope to bring the cluster back into production this afternoon.
2023-02-28T09:30:00Z
The repair is taking longer than expected; we hope to have the datacenter operational by the end of the week.
2023-03-01T09:15:00Z
Dear User,
We are pleased to inform you that our technical team has successfully resolved the cooling issue in our DataCenter Dufour. However, before we fully resume production, we need to test the stability of our services by progressively rebooting nodes to ensure the cooling system is fully operational.
Here is the list of nodes back in production:
(baobab)-[root@admin1 ~]$ sinfo -n cpu[004,011,019-020,061,065,167,173,176-177,182-183,185,190-191,194,196-197,207,222,243,253,255,274,290,293],gpu[007,009-010,019,026,033,036,040]
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
debug-cpu* up 15:00 1 idle cpu004
public-cpu up 4-00:00:00 4 alloc cpu[011,019-020,243]
shared-cpu up 12:00:00 9 mix cpu[167,183,190,194,197,253,274,290,293]
shared-cpu up 12:00:00 16 alloc cpu[011,019-020,061,065,173,176-177,182,185,191,196,207,222,243,255]
shared-gpu up 12:00:00 6 mix gpu[007,009,019,026,033,040]
shared-gpu up 12:00:00 2 alloc gpu[010,036]
private-wesolowski-cpu up 7-00:00:00 1 mix cpu197
private-wesolowski-cpu up 7-00:00:00 2 alloc cpu[185,196]
private-lehmann-cpu up 7-00:00:00 1 alloc cpu061
private-dpt-cpu up 7-00:00:00 1 alloc cpu065
private-cui-cpu up 7-00:00:00 2 mix cpu[167,190]
private-cui-cpu up 7-00:00:00 2 alloc cpu[191,207]
private-kruse-cpu up 7-00:00:00 1 mix cpu194
private-gap-cpu up 7-00:00:00 1 mix cpu253
private-gap-cpu up 7-00:00:00 2 alloc cpu[222,255]
private-hepia-cpu up 7-00:00:00 1 mix cpu183
private-hepia-cpu up 7-00:00:00 4 alloc cpu[173,176-177,182]
private-gervasio-cpu up 7-00:00:00 1 mix cpu274
private-salbreux-cpu up 7-00:00:00 2 mix cpu[290,293]
private-schaer-gpu up 7-00:00:00 1 mix gpu007
private-ruch-gpu up 7-00:00:00 1 mix gpu033
private-gervasio-gpu up 7-00:00:00 1 mix gpu040
private-gervasio-gpu up 7-00:00:00 1 alloc gpu036
private-cui-gpu up 7-00:00:00 2 mix gpu[009,019]
private-cui-gpu up 7-00:00:00 1 alloc gpu010
private-dpt-gpu up 7-00:00:00 1 mix gpu026
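You can check from the login node whether your usual partition already has nodes back, using standard Slurm commands (partition names as in the listing above):
# node states for the shared partitions
sinfo -p shared-cpu,shared-gpu
# condensed overview of all partitions
sinfo --summarize
# your own pending and running jobs
squeue -u $USER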
--
HPC Team
Adrien. A, Gaël. R, Yann. S