[2026] Current issues on HPC Cluster

Description

Cluster: Baobab

Login node crashed with a kernel panic. We restarted it.

–
HPC Team

Status : Resolved :green_circle:

start: 2026-01-02T18:53:00Z
end: 2026-01-05T07:50:00Z

Description

Cluster: Bamboo

Login node crashed with a kernel panic. We restarted it.

–
HPC Team

Status : Resolved :green_circle:

start: 2025-12-31T07:17:00Z
end: 2026-01-05T07:55:00Z

Description

Cluster: Bamboo

A storage server was down, making part of the scratch storage unavailable.

–
HPC Team

Status : Solved :green_circle:

start: 2025-12-30T20:22:00Z
end: 2026-01-05T09:19:00Z

Description

Cluster: Bamboo

Login1 crashed; the server has been rebooted.

Status : Solved :green_circle:

start: 2026-01-12T10:06:00Z
end: 2026-01-11T10:15:00Z

Description

Cluster: Yggdrasil

We are experiencing an issue with Slurm that is causing delays in job management. We are actively working to resolve this incident and limit its impact.

Updates

The issue has been resolved and Slurm is back online.

Status : Solved :green_circle:

start: 2026-01-15T10:05:00Z
end: 2026-01-15T10:30:00Z

Description

Cluster: Yggdrasil

We are currently facing a power outage affecting the Yggdrasil cluster, which may result in multiple nodes being unreachable. We will provide an update once power has been fully restored.

Status : Solved :green_circle:

start: 2026-01-19T10:00:00Z
end: 2026-01-19T10:30:00Z

Description

Cluster: all

Since the latest Slurm update, some interactive jobs (started with srun or salloc) are killed prematurely. To the user it looks as if the job reached its time limit, but the admin logs show that the job was killed by an inactivity timeout. We have opened a case with SchedMD.
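To check whether one of your interactive jobs was affected, one option is to compare the job's elapsed time with its requested time limit, for example using the Elapsed and Timelimit fields reported by sacct. The sketch below is only an illustration, not an official tool: the function names and the 60-second slack are assumptions, and it expects durations in the usual D-HH:MM:SS, HH:MM:SS or MM:SS form.

```python
def slurm_time_to_seconds(value: str) -> int:
    """Convert a Slurm duration such as '1-02:03:04' or '03:04' to seconds."""
    days = 0
    if "-" in value:
        # Form with days: D-HH[:MM[:SS]]; missing fields are on the right.
        day_part, value = value.split("-", 1)
        days = int(day_part)
        parts = [int(p) for p in value.split(":")]
        parts += [0] * (3 - len(parts))
    else:
        # Form without days: [HH:]MM:SS; missing fields are on the left.
        parts = [int(p) for p in value.split(":")]
        parts = [0] * (3 - len(parts)) + parts
    hours, minutes, seconds = parts
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def ended_before_limit(elapsed: str, timelimit: str, slack_seconds: int = 60) -> bool:
    """Return True if the job stopped clearly before its time limit.

    slack_seconds is a hypothetical tolerance for scheduler granularity.
    """
    return slurm_time_to_seconds(elapsed) + slack_seconds < slurm_time_to_seconds(timelimit)
```

A job reported as timed out whose elapsed time is far below its limit, for instance `ended_before_limit("00:12:30", "01:00:00")`, is a likely candidate for this issue. Values such as UNLIMITED are not handled by this sketch.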

Status : Ongoing :orange_circle:

start: 2025-12-19T10:00:00Z
end:

Description

Cluster: Baobab

We had to reboot login1.baobab because systemctl/dbus crashed.

Status : Solved :green_circle:

start: 2026-01-29T10:45:00Z
end: 2026-01-29T11:00:00Z

Description

Cluster: Yggdrasil

Due to a cut in the optical fiber, the IT and telephone networks are unavailable at the Ecogia and Sauverny sites. The OCSIN technicians who manage our optical fibers have been informed. No resolution time has been communicated for now.

Edit:

The incident is now solved. Please note that the cluster was not reachable during the outage, but jobs kept running.

Status : Resolved :green_circle:

start: 2026-02-11T21:38:00Z
end: 2026-02-12T11:45:00Z

Dear Users,

Cluster: Yggdrasil

Description

An issue is currently affecting the Home storage on the Yggdrasil cluster.

Some users may encounter the following message:
“I/O remote error”

  • Possible disruptions accessing the Home directory
  • Risk of input/output errors during commands or job execution
  • The cluster remains available, but some operations may fail

We are working to restore normal service as soon as possible.
An update will be posted as soon as we have more information.

We apologize for the delayed response.
Our team is operating with reduced staff this week and is focused on preparing the Bamboo maintenance starting tomorrow.

Status: Resolved :green_circle:

Start: 2026-02-23T08:00:00Z
End: 2026-02-23T15:00:00Z

Dear users,

Cluster: Bamboo

Description

Following the recent maintenance on Bamboo, we upgraded the BeeGFS client from version 7.4.6 to 7.4.7.

After this update, we observed a change in behavior of the beegfs-ctl --getquota command.

When running:

(bamboo)-[alberta@login1 ~]$ beegfs-get-quota-home-scratch.sh -u alberta
home dir: /home/users/a/alberta
scratch dir: /srv/beegfs/scratch/users/a/alberta

          user/group                 ||           size          ||    chunk files
  storage     |   name        |  id  ||    used    |    hard    ||  used   |  hard
  ----------------------------|------||------------|------------||---------|---------
Unable to resolve given name: --connTCPRecvBufSize
home        | 
Unable to resolve given name: --connTCPRecvBufSize
scratch  

the command unexpectedly returns the following error:

Unable to resolve given name: --connTCPRecvBufSize

An issue has been opened with the BeeGFS team, and we are working to resolve this case as soon as possible.

Status: Workaround applied :blue_circle:

Start: 2026-02-26T15:00:00Z

End: 2026-02-27T10:00:00Z

Dear users,

Cluster: Bamboo

Description

Following the last Bamboo maintenance, the interactive RStudio application on Open OnDemand was not functioning correctly, causing sessions to terminate immediately after launch.

We have identified the cause of this behavior and applied a fix. The RStudio app should now operate as expected.

We also strongly suspect that this incident is related to a new behavior introduced by the recent upgrade of BeeGFS to version 7.4.7.

If you continue to experience any issues, please let us know.

Status: Resolved :green_circle:

Start: 2026-02-26T15:00:00Z
End: 2026-03-02T10:00:00Z