We encountered an issue with OpenOnDemand authentication. While we were working with the Authentication team to enable external (outsider) access, the authentication rules were inadvertently modified. As a result, some users with dual identities (Collaborator/Student) may have experienced account mismatches with the HPC system.
Resolution
The change has been rolled back, and the issue should no longer be present. We plan to test an alternative solution to allow external access to OpenOnDemand.
We encountered an issue where Baobab's login1 node became stuck because of a user's jobs. The node became completely unreachable, which prevented us from identifying the responsible user.
We encountered an issue with Slurm on Baobab: the database was unreachable, resulting in errors when executing Slurm commands (sinfo, sacct, etc.) and in jobs terminating too early with the TIMEOUT reason.
Resolution
There was a network issue between Slurm and slurmdbd. It is now resolved; thank you for your understanding.
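As a quick sanity check (shown for illustration; the exact output depends on your own jobs), you can confirm that the accounting database is reachable again by running an accounting query such as:

    # Query your jobs from the last two hours via the accounting database (slurmdbd).
    # If slurmdbd were still unreachable, sacct would fail with a connection error.
    sacct --starttime=now-2hours --format=JobID,JobName,State,Elapsed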
We encountered an issue with the admin servers and Slurm on Bamboo: the database was unreachable, resulting in errors when executing Slurm commands (sinfo, sacct, etc.) and in jobs terminating too early with the TIMEOUT reason.
The login1 node was temporarily unavailable due to excessive memory usage from a user session.
As a result, the node became unresponsive and could not be accessed for a period of time. The issue has since been resolved, and the system is back to normal.
Status : solved
start: 2025-04-21T19:00:00Z
end : 2025-04-22T08:20:00Z
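As a reminder, the login nodes are not intended for memory-intensive work; such workloads should be submitted through Slurm (sbatch or salloc). For illustration, a standard way to spot your own memory-hungry processes on the login node is:

    # List your own processes sorted by resident memory (RSS, in KiB), largest first.
    ps -u $USER -o pid,rss,vsz,comm --sort=-rss | head -n 10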
An unexpected behavior temporarily impacted the automatic sending of outgoing emails on one of our servers. The issue has been resolved, and email delivery is functioning normally. No email loss was detected during the incident.
Status : solved
start: 2025-04-29T09:00:00Z
end : 2025-04-30T08:21:00Z
We’ve identified a configuration issue with the /tmp directory on login1, which currently limits the available space to only 15 GB for all users. This is causing latency and write errors for some processes on the login node.
To resolve this, we will perform a short maintenance and reboot the node at 13:30 today.
This intervention should be quick and will restore proper /tmp storage capacity.
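For reference, you can check the capacity currently available on /tmp of the node you are logged in to with:

    # Show the size, usage and mount point of the filesystem backing /tmp.
    df -h /tmp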
Thank you for your understanding,
Update:
2025-06-09T22:00:00Z Maintenance started at 14:00 and finished at 14:30 without any issues.
Status : Resolved
start: 2025-06-10T06:00:00Z
end : 2025-06-10T12:30:00Z
Under specific circumstances, you may be able to use a resource, such as a GPU, that is already assigned to another user. This can happen when you request resources with salloc and later connect to the compute node using ssh. Alternatively, Slurm may allocate you a resource that is still in use by a previous job. A fix is already in place, but it will only take full effect once all resources held by currently running jobs have been released.
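To make sure you only use the GPU(s) Slurm actually allocated to you, prefer starting your shell through Slurm rather than with a plain ssh. A minimal sketch (the partition name and time limit below are placeholders, adjust them to your own setup):

    # Request one GPU through Slurm (placeholder partition and time values).
    salloc --partition=shared-gpu --gres=gpu:1 --time=01:00:00
    # Open a shell inside the allocation so Slurm enforces the GPU binding.
    srun --pty $SHELL
    # Only the GPU(s) granted to this job should be visible here.
    nvidia-smi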
Status : solved
start: 2025-06-25T07:37:00Z
end : 2025-06-25T11:45:00Z
We are currently experiencing an issue with the Bamboo scratch and home storage. These shares are temporarily unavailable.
Our team is actively investigating the problem, and we will provide an update as soon as more information becomes available.
We apologize for the inconvenience and appreciate your patience.
Status : alert
start: 2025-07-07T14:50:00Z
end : 2025-07-07T15:10:00Z
Update
Filesystem impacted: scratch
start: 2025-07-07T20:00:00Z
end : 2025-07-09T14:20:00Z
The scratch filesystem has been re-activated. However, despite our investigation, the root cause of the issue has not yet been determined. We remain vigilant and will continue to monitor the situation closely.