[2026] Current issues on HPC Cluster

Description

Cluster: Baobab

Login node crashed with a kernel panic. We restarted it.


HPC Team

Status : Resolved :green_circle:

start: 2026-01-02T18:53:00Z
end: 2026-01-05T07:50:00Z

Description

Cluster: Bamboo

Login node crashed with a kernel panic. We restarted it.


HPC Team

Status : Resolved :green_circle:

start: 2025-12-31T07:17:00Z
end: 2026-01-05T07:55:00Z

Description

Cluster: Bamboo

A storage server was down, affecting part of the scratch storage.


HPC Team

Status : Resolved :green_circle:

start: 2025-12-30T20:22:00Z
end: 2026-01-05T09:19:00Z

Description

Cluster: Bamboo

Login1 crashed; the server has been rebooted.

Status : Resolved :green_circle:

start: 2026-01-12T10:06:00Z
end: 2026-01-12T10:15:00Z


Description

Cluster: Yggdrasil

We are experiencing an issue with Slurm causing delays in job management. We are actively working to resolve this incident and limit its impact.

Updates

The issue has been resolved and Slurm is back online.

Status : Resolved :green_circle:

start: 2026-01-15T10:05:00Z
end: 2026-01-15T10:30:00Z

Description

Cluster: Yggdrasil

We are currently facing a power outage affecting the Yggdrasil cluster, which may result in multiple nodes being unreachable. We will provide an update once power has been fully restored.

Status : Resolved :green_circle:

start: 2026-01-19T10:00:00Z
end: 2026-01-19T10:30:00Z

Description

Cluster: all

Since the latest Slurm update, some interactive jobs (using srun or salloc) are killed prematurely. To the user it appears as if the job reached its time limit, but the admin logs indicate the job was killed due to an inactivity timeout. We have opened a case with SchedMD.
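Until this is resolved, affected users can check whether a job really hit its time limit by comparing its elapsed time against its limit with `sacct`. A minimal sketch (the job ID below is a placeholder; requires access to a Slurm cluster):

```shell
# Placeholder job ID -- replace with your own interactive job's ID.
JOBID=123456

# A job killed by the inactivity timeout typically shows an Elapsed time
# well below its Timelimit, despite appearing "timed out" to the user.
sacct -j "$JOBID" --format=JobID,State,Elapsed,Timelimit,ExitCode
```

If Elapsed is clearly shorter than Timelimit, the job was likely a victim of this issue rather than a genuine timeout.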

Status : Ongoing :orange_circle:

start: 2025-12-19T10:00:00Z
end:

Description

Cluster: Baobab

We had to reboot login1.baobab because systemd/D-Bus crashed.

Status : Resolved :green_circle:

start: 2026-01-29T10:45:00Z
end: 2026-01-29T11:00:00Z