[2026] Current issues on HPC Cluster

Yann.Sagon · January 5, 2026, 7:51am

Description

Cluster: Baobab

Login node crashed with a kernel panic. We restarted it.

–
HPC Team

Status : Resolved

start: 2026-01-02T18:53:00Z
end: 2026-01-05T07:50:00Z

Yann.Sagon · January 5, 2026, 8:31am

Description

Cluster: Bamboo

Login node crashed with a kernel panic. We restarted it.

–
HPC Team

Status : Resolved

start: 2025-12-31T07:17:00Z
end: 2026-01-05T07:55:00Z

Yann.Sagon · January 5, 2026, 8:33am

Description

Cluster: Bamboo

Storage server was down for part of scratch storage.

–
HPC Team

Status : solved

start: 2025-12-30T20:22:00Z
end: 2026-01-05T09:19:00Z

Adrien.Albert · January 12, 2026, 10:15am

Description

Cluster: Bamboo

Login1 crashed; the server has been rebooted.

Status : solved

start: 2026-01-12T10:06:00Z
end: 2026-01-12T10:15:00Z

Adrien.Albert · January 15, 2026, 9:44am

Description

Cluster: Yggdrasil

We are experiencing an issue with Slurm that is causing delays in job management. We are actively working to resolve this incident and limit its impact.

Updates

The issue has been resolved and Slurm is back online.

Status : Solved

start: 2026-01-15T10:05:00Z
end: 2026-01-15T10:30:00Z

Gael.Rossignol · January 19, 2026, 10:21am

Description

Cluster: Yggdrasil

We are currently facing a power outage affecting the Yggdrasil cluster, which may result in multiple nodes being unreachable. We will provide an update once power has been fully restored.

Status : Solved

start: 2026-01-19T10:00:00Z
end: 2026-01-19T10:30:00Z

Yann.Sagon · January 19, 2026, 1:22pm

Description

Cluster: all

Since the latest Slurm update, some interactive jobs (started with srun or salloc) are killed prematurely. To the user, it looks as if the job reached its time limit, but the admin logs show that the job was killed by an inactivity timeout. We have opened a case with SchedMD.
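If you want to check whether one of your interactive jobs was affected, one rough approach is to compare its elapsed time with its requested time limit, for example from the output of `sacct -j <jobid> --format=JobID,State,Elapsed,Timelimit`. The sketch below is only an illustration, not an official diagnostic: the helper names and the 60-second slack are assumptions, it only handles numeric limits (not values such as UNLIMITED), and the inactivity-timeout reason itself is only visible in the admin logs.

```python
def slurm_time_to_seconds(value: str) -> int:
    """Convert a Slurm duration ('D-HH:MM:SS', 'HH:MM:SS' or 'MM:SS') to seconds."""
    # rpartition yields an empty day part when the string has no '-'.
    days, _, clock = value.strip().rpartition("-")
    fields = [int(x) for x in clock.split(":")]
    if len(fields) == 2:          # 'MM:SS' -> pad the missing hours
        fields = [0] + fields
    if len(fields) != 3:
        raise ValueError(f"unsupported Slurm duration: {value!r}")
    hours, minutes, seconds = fields
    return ((int(days) if days else 0) * 24 + hours) * 3600 + minutes * 60 + seconds


def ended_before_limit(elapsed: str, timelimit: str, slack_seconds: int = 60) -> bool:
    """Return True when the job stopped well short of its requested time limit.

    slack_seconds is a hypothetical tolerance so that jobs ending right at
    the limit are not flagged.
    """
    return slurm_time_to_seconds(elapsed) < slurm_time_to_seconds(timelimit) - slack_seconds


# Example values as they might appear in the Elapsed and Timelimit columns.
print(ended_before_limit("00:12:41", "01:00:00"))  # stopped long before the limit
print(ended_before_limit("01:00:12", "01:00:00"))  # ran up to the limit
```

A job that shows up as timed out but stopped long before its limit is a candidate for this issue; please include the job ID if you report it to us.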

Status : Ongoing

start: 2025-12-19T10:00:00Z
end:

Yann.Sagon · January 29, 2026, 11:01am

Description

Cluster: baobab

We had to reboot login1.baobab because systemctl/dbus crashed.

Status : Solved

start: 2026-01-29T10:45:00Z
end: 2026-01-29T11:00:00Z

Gael.Rossignol · February 12, 2026, 8:26am

Description

Cluster: yggdrasil

Due to a cut in the optical fiber, the IT and telephone networks are unavailable at the Ecogia and Sauverny sites. The OCSIN technicians who manage our optical fibers have been informed. No resolution time has been communicated for now.

Edit:

The incident is now solved. Please note that the cluster was not reachable, but jobs kept running.

Status : Resolved

start: 2026-02-11T21:38:00Z
end: 2026-02-12T11:45:00Z

Adrien.Albert · February 23, 2026, 11:00am

Dear Users,

Cluster: Yggdrasil

Description

An issue is currently affecting the Home storage on the Yggdrasil cluster.

Some users may encounter the following message:
“I/O remote error”

Possible disruptions accessing the Home directory
Risk of input/output errors during commands or job execution
The cluster remains available, but some operations may fail

We are working to restore normal service as soon as possible.
An update will be posted as soon as we have more information.

We apologize for the delayed response.
Our team is operating with reduced staff this week and is focused on preparing the Bamboo maintenance starting tomorrow.

Status: Resolved

Start: 2026-02-23T08:00:00Z
End: 2026-02-23T15:00:00Z

Adrien.Albert · February 27, 2026, 9:39am

Dear users,

Cluster: Bamboo

Description

Following the recent maintenance on Bamboo, we upgraded the BeeGFS client from version 7.4.6 to 7.4.7.

After this update, we observed a change in behavior of the beegfs-ctl --getquota command.

When running:

(bamboo)-[alberta@login1 ~]$ beegfs-get-quota-home-scratch.sh -u alberta
home dir: /home/users/a/alberta
scratch dir: /srv/beegfs/scratch/users/a/alberta

          user/group                 ||           size          ||    chunk files
  storage     |   name        |  id  ||    used    |    hard    ||  used   |  hard
  ----------------------------|------||------------|------------||---------|---------
Unable to resolve given name: --connTCPRecvBufSize
home        | 
Unable to resolve given name: --connTCPRecvBufSize
scratch

the command unexpectedly returns the following error:

Unable to resolve given name: --connTCPRecvBufSize

An issue has been opened with the BeeGFS team, and we are working to resolve this case as soon as possible.

Status: Workaround applied

Start: 2026-02-26T15:00:00Z

End: 2026-02-27T10:00:00Z

Adrien.Albert · March 2, 2026, 12:59pm

Dear users,

Cluster: Bamboo

Description

Following the last Bamboo maintenance, the interactive RStudio application on Open OnDemand was not functioning correctly, causing sessions to terminate immediately after launch.

We have identified the cause of this behavior and applied a fix. The RStudio app should now operate as expected.

We also strongly suspect that this incident is related to a new behavior introduced by the recent upgrade of BeeGFS to version 7.4.7.

If you continue to experience any issues, please let us know.

Status: Resolved

Start: 2026-02-26T15:00:00Z
End: 2026-03-02T10:00:00Z

Gael.Rossignol · March 19, 2026, 2:09pm

Cluster: ALL

Description

Dear users,

This message is only to inform you that we need to update slurmdbd to the latest version. This will not have any impact on the usage of the clusters, but some monitoring commands such as sacct will be unavailable for a few minutes.

Sorry for the inconvenience.

Status: Resolved

Start: 2026-03-19T14:00:00Z
End: 2026-03-19T16:00:00Z

Yann.Sagon · March 27, 2026, 2:41pm

Cluster: Baobab

Description

We must perform an immediate emergency shutdown of the Baobab HPC cluster due to a critical cooling issue in the datacenter.

This action is required to protect the hardware and ensure the safety and integrity of your data.

All currently running jobs will unfortunately be stopped.

Our teams are actively working with the datacenter staff to resolve the situation as quickly as possible.

The Baobab HPC Team

Status: Resolved

Start: 2026-03-27T14:00:00Z
End: 2026-03-27T17:00:00Z

Adrien.Albert · April 5, 2026, 10:18am

Cluster: Bamboo

Description

An incident is currently ongoing on the Bamboo cluster. One of our storage servers is experiencing an issue impacting the scratch and home storage.

Status: Solved

Start: 2026-04-02T20:24:00Z
End: 2026-04-07T06:00:00Z