[2024] Current issues on HPC Cluster

Yggdrasil: Scratch storage issue

Dear users,

There is an issue with the scratch storage on Yggdrasil.

  • Our team is actively working on resolving this problem, and we apologize for any inconvenience caused. We will keep you updated on the progress.

Thank you for your understanding.

Best regards,

update 15.01.2024:

  • We will take the scratch storage completely offline until further notice.
  • We will ship the disks to the vendor today; they will try to rebuild the RAID in their lab.
  • We will temporarily increase the $HOME disk quota on Yggdrasil from 1 TB to 3 TB so that you can continue to work on the cluster.

update 16.01.2024:

  • login1.yggdrasil restarted without scratch storage
  • quota on home storage increased from 1TB to 3TB
  • The vendor received the disks and is trying to rebuild the RAID. The first attempt failed, so they are now cloning the faulty disk first.

update 26.01.2024:

  • Scratch storage restored.
  • We removed corrupted files. If any of your files were removed, you will find a file named migration in your home directory listing the erased files.

Status : Solved :green_circle:

start: 2024-01-12T06:00:00Z
end: 2024-01-26T10:00:00Z


BAOBAB: fast storage down

Dear users,

The fast storage /srv/fast has been down since the latest maintenance.

We need to go on site to put it back into production; this should be done today.

We apologize for the inconvenience.

Thank you for your understanding.

Best regards,

Status : Solved :green_circle:

start: 2024-01-10T02:26:00Z
end: 2024-01-15T15:00:00Z

Baobab: recent software not working on legacy compute nodes and login node

Dear users,

Many software packages built with the foss/2023b toolchain or GCCcore 13.2.0 are not working on the legacy compute nodes or the login node. The reason is that they were compiled on a compute node with a newer CPU architecture. We are rebuilding them to solve the issue.
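As a rough illustration of this kind of failure: binaries built on a newer CPU may use instructions (e.g. AVX-512) that older nodes lack, and they then die with "Illegal instruction". The sketch below, which is our own example and not an official cluster tool, reads the CPU feature flags on a Linux node; the `avx512f` requirement is a hypothetical stand-in for whatever the affected builds actually need.

```python
import pathlib

def cpu_flags() -> set[str]:
    """Return the CPU feature flags of the current node (Linux /proc/cpuinfo)."""
    info = pathlib.Path("/proc/cpuinfo")
    if not info.exists():          # non-Linux fallback: no flags available
        return set()
    for line in info.read_text().splitlines():
        if line.startswith("flags"):
            return set(line.split(":", 1)[1].split())
    return set()

# Hypothetical requirement of a build made on a newer node; a binary using
# these instructions crashes with "Illegal instruction" on nodes lacking them.
needed = {"avx512f"}
missing = needed - cpu_flags()
print("node can run the build" if not missing else f"missing flags: {sorted(missing)}")
```

Running this on a legacy node versus a recent node would show why the same module works on one and crashes on the other.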

Thanks for your understanding.

Best regards,

Status : Solved :white_check_mark:

start: 2024-01-12T02:26:00Z
end: 2024-01-30T14:06:00Z


Baobab: scratch issue

Dear users,

One of our scratch servers crashed due to too many open files. This happens when a buggy user application opens too many files without closing them properly; there is not much we can do about it on our side. If you suspect your application may be the cause, please contact us.

Thank you for your understanding.

Best regards,

Status : Resolved :white_check_mark:

start: 2024-02-08T16:10:00Z
end: 2024-02-09T08:00:00Z

Baobab: scratch issue

Dear users,

One of our scratch servers crashed due to unknown errors. We are working hard on it and will keep you informed of progress.

Thank you for your understanding.

Best regards,

Updates: 2024-02-21T12:15:00Z

Several RAID sets on the scratch2 server are in a degraded state. To avoid putting any more stress on the storage, scratch has been disabled while the RAID sets are rebuilt.

Before taking any action, we suspended all jobs from 2024-02-20T14:00:00Z until 2024-02-21T11:00:00Z while waiting for a precise procedure from our supplier. Jobs were then killed, and the compute and login nodes were restarted.

A temporary scratch directory has been created in each home directory to minimize the impact of this issue.

Updates: 2024-02-27T10:45:00Z

  • Scratch storage restored.
  • No data was lost; all files created in the scratch directory during the issue are present in the folder scratch_during_scratch_failure.

Status : Resolved :white_check_mark:

start: 2024-02-20T09:45:00Z
end: 2024-02-27T10:45:00Z


Yggdrasil login node issue

Dear users,

We had a DNS issue on Yggdrasil. It prevented users from running salloc and from connecting to their allocated compute nodes.

Status : Resolved :white_check_mark:

start: 2024-03-20T15:00:00Z
end: 2024-03-21T09:50:00Z

Yggdrasil login node crash

Dear users,

The login node was blocked and remained unavailable for a few minutes.

The cause is currently unknown, but the evidence points strongly to a user process running on the login node. We remind you that running compute tasks on login nodes is strictly and explicitly forbidden.

If you think your process may have caused the problem, please contact us so that we can help you.

Status : Resolved :white_check_mark:

start: 2024-04-24T07:25:00Z
end: 2024-04-24T08:37:00Z