[2025] Current issues on HPC Cluster

Dear users,

We are having problems with login1 on Bamboo and are investigating to find a solution.

For the time being, running jobs continue, but it isn’t possible to log in to the login node.

We’ll keep you informed. Thank you for your understanding.

Kind regards

Edit: problem solved. We’ll add the login node to our monitoring server.

Status : Solved :green_circle:

start: 2025-01-18T10:00:00Z
end: 2025-01-20T08:00:00Z

Dear users,

We are currently experiencing an issue with our storage server (home) that may result in the following error message:

I/O remote error

We are investigating to determine the root cause.

Status : Resolved :green_circle:

start: 2025-01-23T10:00:00Z
end:

Dear users,

This morning, we had an issue with the login node on Yggdrasil. Some processes crashed the server, but after a reboot everything is up and running again.

The root cause may be linked to I/O errors on the storage.

Sorry for the inconvenience.

Status : Resolved :green_circle:

start: 2025-01-27T08:45:00Z
end: 2025-01-27T10:30:00Z

Description

We encountered an issue with OpenOnDemand authentication. While attempting to fix outsider access in collaboration with the Authentication team, the authentication rules were affected. As a result, some users with dual identities (Collaborator/Student) may have experienced account mismatches with the HPC system.

Resolution

The fix has been rolled back, and the issue should no longer be present. We plan to test an alternative solution to allow outsider access to OpenOnDemand.

Status : Resolved :green_circle:

start: 2025-03-12T08:45:00Z
end: 2025-03-13T07:30:00Z

Description

We encountered an issue with Slurm: the database was unreachable, resulting in errors when executing Slurm commands (sinfo, sacct, etc.).

Resolution

The database service has been restarted.

Status : Resolved :green_circle:

start: 2025-03-12T15:00:00Z
end: 2025-03-13T22:30:00Z

Description

We encountered an issue where Baobab login1 was stuck due to a user's jobs. The node became completely unreachable, preventing us from identifying the responsible user.

Resolution

The node has been rebooted.

Status : Resolved :green_circle:

start: 2025-03-20T09:20:00Z
end: 2025-03-20T09:23:00Z

Description

We encountered an issue with Slurm on Baobab: the database was unreachable, resulting in errors when executing Slurm commands (sinfo, sacct, etc.) and in jobs terminating too early with the TIMEOUT reason.

Resolution

There was a network issue between Slurm and slurmdbd. It is now solved. Thank you for your understanding.

Status : Resolved :green_circle:

start: 2025-04-02T12:00:00Z
end: 2025-04-02T14:00:00Z
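
For users who want to check whether their own jobs were hit by the early TIMEOUT during such an incident, here is a minimal sketch in Python. It is not an official tool of the HPC team. It assumes `sacct` is available on the login node and uses only its standard options (`--state`, `--starttime`, `--endtime`, `--format`, `--parsable2`). The helper names (`build_sacct_command`, `parse_sacct_output`, `ended_early`, `find_early_timeouts`) are hypothetical and chosen for illustration.

```python
import subprocess

# Columns requested from sacct; the parser below relies on this order.
FIELDS = ["JobID", "JobName", "State", "Elapsed", "Timelimit"]


def build_sacct_command(start, end, user=None):
    """Build a sacct call listing jobs in TIMEOUT state within a time window."""
    cmd = [
        "sacct",
        "--allocations",   # one line per job, no individual steps
        "--noheader",
        "--parsable2",     # fields separated by '|', no trailing delimiter
        "--state=TIMEOUT",
        f"--starttime={start}",
        f"--endtime={end}",
        "--format=" + ",".join(FIELDS),
    ]
    if user:
        cmd.append(f"--user={user}")
    return cmd


def to_seconds(value):
    """Convert '[DD-]HH:MM:SS' to seconds; return None for e.g. 'UNLIMITED'."""
    days, _, clock = value.rpartition("-")
    try:
        parts = [int(p) for p in clock.split(":")]
        day_count = int(days) if days else 0
    except ValueError:
        return None
    if len(parts) > 3:
        return None
    while len(parts) < 3:
        parts.insert(0, 0)
    hours, minutes, seconds = parts
    return day_count * 86400 + hours * 3600 + minutes * 60 + seconds


def parse_sacct_output(text):
    """Turn '|'-separated sacct lines into a list of dicts keyed by FIELDS."""
    jobs = []
    for line in text.splitlines():
        parts = line.split("|")
        if len(parts) == len(FIELDS):
            jobs.append(dict(zip(FIELDS, parts)))
    return jobs


def ended_early(job):
    """True if the job shows TIMEOUT although it ran less than its time limit."""
    elapsed = to_seconds(job["Elapsed"])
    limit = to_seconds(job["Timelimit"])
    return elapsed is not None and limit is not None and elapsed < limit


def find_early_timeouts(start, end, user=None):
    """Run sacct (must be installed) and keep only the suspicious jobs."""
    result = subprocess.run(
        build_sacct_command(start, end, user),
        capture_output=True, text=True, check=True,
    )
    return [job for job in parse_sacct_output(result.stdout) if ended_early(job)]
```

A call such as `find_early_timeouts("2025-04-02T12:00:00", "2025-04-02T14:00:00")` would cover the window above. Note that `sacct` reads these times in the cluster's local time zone, so the UTC timestamps in this post need to be converted first.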

Description

We encountered an issue with the admin servers and Slurm on Bamboo: the database was unreachable, resulting in errors when executing Slurm commands (sinfo, sacct, etc.) and in jobs terminating too early with the TIMEOUT reason.

Resolution

We are working on it.

Status : solved :green_circle:

start: 2025-04-04T14:00:00Z
end: 2025-04-04T14:54:00Z

Description

Status : solved :green_circle:

start: 2025-04-09T10:00:00Z
end: 2025-04-04T11:50:00Z

Description

We encountered an issue with the InfiniBand network on Bamboo: the home storage was unreachable and every node was in drain state.

Resolution

One of the configuration files has contained an inconsistent parameter since the latest maintenance, and its effect was triggered today. It is now solved.

Status : solved :green_circle:

start: 2025-04-24T09:30:00Z
end: 2025-04-24T12:15:00Z

Description

Cluster: Baobab

The login1 node was temporarily unavailable due to excessive memory usage from a user session.

As a result, the node became unresponsive and could not be accessed for a period of time. The issue has since been resolved, and the system is back to normal.

Status : solved :green_circle:

start: 2025-04-21T19:00:00Z
end: 2025-04-22T08:20:00Z

Description

Cluster: Baobab

An unexpected behavior temporarily impacted the automatic sending of outgoing emails on one of our servers. The issue has been resolved, and email delivery is functioning normally. No email loss has been detected during the incident.

Status : solved :green_circle:

start: 2025-04-29T09:00:00Z
end: 2025-04-30T08:21:00Z