[2025] Current issues on HPC Cluster

Dear users,

We are experiencing problems with login1 on Bamboo, and we are investigating to find a solution.

For the time being, running jobs are continuing, but it isn’t possible to log in to the login node.

We’ll keep you informed. Thank you for your understanding.

Kind regards

Edit: problem solved; we’ll add the login node to our monitoring server.

Status : Solved :green_circle:

start: 2025-01-18T10:00:00Z
end: 2025-01-20T08:00:00Z

Dear users,

We are currently experiencing an issue with our storage server (home) that may result in the following error message:

I/O remote error

We are investigating to determine the root cause.

Status : Resolved :green_circle:

start: 2025-01-23T10:00:00Z
end:

Dear users,

This morning, we had an issue with the login node on Yggdrasil. A process crashed the server, but after a reboot everything is now up and running.

Root cause may be linked to I/O errors on storage.

Sorry for the inconvenience.

Status : Resolved :green_circle:

start: 2025-01-27T08:45:00Z
end: 2025-01-27T10:30:00Z

Description

We encountered an issue with OpenOnDemand authentication. While attempting to fix outsider access in collaboration with the Authentication team, the authentication rules were affected. As a result, some users with dual identities (Collaborator/Student) may have experienced account mismatches with the HPC system.

Resolution

The fix has been rolled back, and the issue should no longer be present. We plan to test an alternative solution to allow outsider access to OpenOnDemand.

Status : Resolved :green_circle:

start: 2025-03-12T08:45:00Z
end: 2025-03-13T07:30:00Z

Description

We encountered an issue with Slurm: the database is unreachable, resulting in errors when executing Slurm commands (sinfo, sacct, etc.).

Resolution

The database service has been restarted.
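For reference, a quick way for users to tell an accounting-database outage apart from a scheduler outage (a hedged sketch; `check_slurm_db` is a hypothetical helper and assumes standard Slurm client tools are on your PATH):

```shell
#!/bin/sh
# check_slurm_db: distinguish an accounting-database (slurmdbd) outage
# from a scheduler (slurmctld) outage. Prints a short diagnosis.
check_slurm_db() {
    if ! command -v scontrol >/dev/null 2>&1; then
        echo "Slurm client tools not found on this machine"
        return 0
    fi
    # scontrol ping talks only to the controller (slurmctld).
    if scontrol ping >/dev/null 2>&1; then
        echo "slurmctld: UP"
    else
        echo "slurmctld: DOWN"
    fi
    # sacct goes through slurmdbd; if it fails while the controller is up,
    # the accounting database is the likely culprit.
    if sacct -X -n >/dev/null 2>&1; then
        echo "slurmdbd: reachable"
    else
        echo "slurmdbd: unreachable (sacct failed)"
    fi
}

check_slurm_db
```

If `scontrol ping` reports the controller up while `sacct` fails, the symptom matches a database-only outage and job scheduling is usually unaffected.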

Status : Resolved :green_circle:

start: 2025-03-12T15:00:00Z
end: 2025-03-13T22:30:00Z

Description

We encountered an issue where Baobab login1 was stuck because of a user's jobs. The node became completely unreachable, preventing us from identifying the responsible user.

Resolution

The node has been rebooted.

Status : Resolved :green_circle:

start: 2025-03-20T09:20:00Z
end: 2025-03-20T09:23:00Z

Description

We encountered an issue with Slurm on Baobab: the database is unreachable, resulting in errors when executing Slurm commands (sinfo, sacct, etc.) and in jobs terminating too early with a TIMEOUT reason.

Resolution

There was a network issue between Slurm and slurmdbd. It is now solved; thanks for your understanding.

Status : Resolved :green_circle:

start: 2025-04-02T12:00:00Z
end: 2025-04-02T14:00:00Z

Description

We encountered an issue with the admin servers and Slurm on Bamboo: the database is unreachable, resulting in errors when executing Slurm commands (sinfo, sacct, etc.) and in jobs terminating too early with a TIMEOUT reason.

Resolution

We are working on it.

Status : Solved :green_circle:

start: 2025-04-04T14:00:00Z
end: 2025-04-04T14:54:00Z

Description

Status : Solved :green_circle:

start: 2025-04-09T10:00:00Z
end: 2025-04-04T11:50:00Z

Description

We encountered an issue with the InfiniBand network on Bamboo: the home storage is unreachable and every node is in a drain state.

Resolution

An inconsistent parameter had been present in one of the configuration files since the latest maintenance, and its effect was triggered today. It is now solved.

Status : Solved :green_circle:

start: 2025-04-24T09:30:00Z
end: 2025-04-24T12:15:00Z


Description

Cluster: Baobab

The login1 node was temporarily unavailable due to an issue caused by excessive memory usage from a user session.

As a result, the node became unresponsive and could not be accessed for a period of time. The issue has since been resolved, and the system is back to normal.

Status : Solved :green_circle:

start: 2025-04-21T19:00:00Z
end : 2025-04-22T08:20:00Z

Description

Cluster: Baobab

An unexpected behavior temporarily impacted the automatic sending of outgoing emails on one of our servers. The issue has been resolved, and email delivery is functioning normally. No email loss has been detected during the incident.

Status : Solved :green_circle:

start: 2025-04-29T09:00:00Z
end : 2025-04-30T08:21:00Z

Description

Cluster: multi cluster

We are reinstalling slurmdbd and for this reason commands such as sacct won’t work during the reinstallation. Thanks for your understanding.

Status : Solved :green_circle:

start: 2025-05-26T13:17:00Z
end : 2025-05-26T14:10:00Z

Description

Cluster: Baobab

We’ve identified a configuration issue with the /tmp directory on login1, which currently limits available space to only 15 GB for all users. This is causing latency and write errors for some processes on the login node.

To resolve this, we will perform a short maintenance and reboot the node at 13:30 today.

This intervention should be quick and will restore proper /tmp storage capacity.
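To check whether /tmp is short on space before or after the reboot, something like this works on any Linux login node (a hedged sketch; the 15 GB threshold simply mirrors the limit mentioned above):

```shell
#!/bin/sh
# Report free space on /tmp; warn when it drops below 15 GB.
avail_kb=$(df -Pk /tmp | awk 'NR==2 {print $4}')
avail_gb=$((avail_kb / 1024 / 1024))
echo "/tmp available: ${avail_gb} GB"
if [ "$avail_gb" -lt 15 ]; then
    echo "warning: less than 15 GB free on /tmp"
fi
```

`df -P` guarantees one unwrapped line per filesystem, so the awk column index is stable across systems.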

Thank you for your understanding,

Update:

2025-06-09T22:00:00Z Maintenance started at 14:00 and finished at 14:30 without any issue.

Status : Resolved :green_circle:

start: 2025-06-10T06:00:00Z
end : 2025-06-10T12:30:00Z

Description

Cluster: multi cluster

Under specific circumstances, it was possible to use a resource, such as a GPU, that was already allocated to another user. This could happen when you requested resources using salloc and then connected to the compute node with ssh later; alternatively, Slurm could allocate you a resource already in use by a previous job. The fix is already in place, but it will only take full effect once all resources held by currently running jobs have been released.
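After the fix, a quick way to check from inside a job that you only see your own GPUs (a hedged sketch; `gpu_visibility` is a hypothetical helper, and on nodes without NVIDIA tools it just prints a note):

```shell
#!/bin/sh
# gpu_visibility: inside a Slurm job step, CUDA_VISIBLE_DEVICES should list
# only the GPU indices Slurm granted you, and nvidia-smi should show no
# compute processes belonging to other users.
gpu_visibility() {
    echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-<unset>}"
    if command -v nvidia-smi >/dev/null 2>&1; then
        # List running compute processes on the GPUs visible to you.
        nvidia-smi --query-compute-apps=pid,used_memory --format=csv
    else
        echo "nvidia-smi not available on this node"
    fi
}

gpu_visibility
```

If the listed process IDs include jobs that are not yours, the overlap described above may still be in effect until those jobs finish.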

Status : Solved :green_circle:

start: 2025-06-25T07:37:00Z
end : 2025-06-25T11:45:00Z

Description

We are currently experiencing an issue with the Bamboo scratch and home storage. These shares are temporarily unavailable.

Our team is actively investigating the problem, and we will provide an update as soon as more information becomes available.

We apologize for the inconvenience and appreciate your patience.

Status : alert :orange_circle:

start: 2025-07-07T14:50:00Z
end : 2025-07-07T15:10:00Z

Update

Impacted filesystem: scratch
start: 2025-07-07T20:00:00Z
end : 2025-07-09T14:20:00Z

The scratch has been re-activated. However, despite our investigation, the root cause of the issue has not yet been determined. We remain vigilant and will continue to monitor the situation closely.