[2024] Current issues on HPC Cluster

Yggdrasil: Scratch storage issue

Dear users,

There is an issue with the scratch storage on Yggdrasil.

  • Our team is actively working on resolving this problem, and we apologize for any inconvenience caused. We will keep you updated on the progress.

Thank you for your understanding.

Best regards,

update 15.01.2024:

  • We will stop the scratch storage completely until further notice.
  • We will send the disks to the vendor today; they will try to rebuild the RAID in their lab.
  • We will temporarily increase the $HOME disk quota on Yggdrasil from 1 TB to 3 TB so that you can continue working on the cluster.

update 16.01.2024:

  • login1.yggdrasil restarted without scratch storage
  • quota on home storage increased from 1 TB to 3 TB
  • The vendor received the disks and is trying to rebuild the RAID. The first attempt failed; they are now cloning the faulty disk first.

update 26.01.2024:

  • scratch storage restored.
  • We removed corrupted files. If any of your files were removed, you will find a file named migration in your home directory listing the erased files.

Status : Solved :green_circle:

start: 2024-01-12T06:00:00Z
end: 2024-01-26T10:00:00Z

3 Likes

BAOBAB: fast storage down

Dear users,

The fast storage /srv/fast has been down since the latest maintenance.

We need to go on site to put it back into production; this should be done today.

We apologize for the inconvenience.

Thank you for your understanding.

Best regards,

Status : Solved :green_circle:

start: 2024-01-10T02:26:00Z
end: 2024-01-15T15:00:00Z

Baobab: recent software not working on legacy compute nodes and login node

Dear users,

Many software packages built with the foss/2023b toolchain or GCCcore 13.2.0 are not working on legacy compute nodes or the login node. The reason is that they were compiled on a too-recent compute node. We are rebuilding them to solve the issue.

Thanks for your understanding.

Best regards,

Status : Solved :white_check_mark:

start: 2024-01-12T02:26:00Z
end: 2024-01-30T14:06:00Z

1 Like

Baobab: scratch issue

Dear users,

One of our scratch servers crashed due to too many open files. This happens when a buggy user application opens many files without closing them properly; there is not much we can do about it on our side. If you suspect your application may be the cause, please contact us.
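
If you want to check whether your own application might be leaking file descriptors, here is a quick sketch for Linux. Run it on the node where your job executes; `$$` (the current shell) is only a stand-in for the PID of the process you suspect:

```shell
# Show the per-process limit on open files:
ulimit -n

# Count how many file descriptors a process currently holds open.
# $$ (this shell) is a placeholder -- substitute the PID of the
# process you suspect is leaking descriptors.
ls /proc/$$/fd | wc -l
```

If the count keeps growing towards the `ulimit -n` value while the job runs, the application is probably not closing its files.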

Thank you for your understanding.

Best regards,

Status : Resolved :white_check_mark:

start: 2024-02-08T16:10:00Z
end: 2024-02-09T08:00:00Z

Baobab: scratch issue

Dear users,

One of our scratch servers crashed due to unknown errors. We are working hard on it and will keep you informed of the progress.

Thank you for your understanding.

Best regards,

Updates: 2024-02-21T12:15:00Z

Several RAID sets on the scratch2 server are in a degraded state. To avoid further stress on the storage, scratch has been disabled while the RAID sets are rebuilt.

Before taking any action, we suspended all jobs from 2024-02-20T14:00:00Z until 2024-02-21T11:00:00Z while waiting for a precise procedure from our supplier. Jobs were then killed, and the compute and login nodes were restarted.

A temporary scratch directory has been created in each home directory to minimize the impact of this issue.

Updates: 2024-02-27T10:45:00Z

  • scratch storage restored.
  • No data was lost; all files created in the scratch directory during the issue are present in the folder scratch_during_scratch_failure

Status : Resolved :white_check_mark:

start: 2024-02-20T09:45:00Z
end: 2024-02-27T10:45:00Z

6 Likes

Yggdrasil login node issue

Dear users,

We had a DNS issue on Yggdrasil. This prevented using salloc and connecting to the allocated compute node.

Status : Resolved :white_check_mark:

start: 2024-03-20T15:00:00Z
end: 2024-03-21T09:50:00Z

Yggdrasil login node crash

Dear users,

The connection node was blocked and remained unavailable for a few minutes.

The reason is currently unknown, but points strongly to a user process running on the connection node. We remind you that it is strictly and explicitly forbidden to run tasks on connection nodes.

If you think this is the cause of the problem, please contact us so that we can help you.

Status : Resolved :white_check_mark:

start: 2024-04-24T07:25:00Z
end: 2024-04-24T08:37:00Z

Baobab infiniband leaf12 crash

Dear users,

Yesterday we encountered problems on the InfiniBand switch leaf12, impacting access to the home and scratch storage for all servers connected to this switch (gpu[030-045,047], cpu[312-321,332]).

This leaf has been rebooted and all traffic is now working fine.

Sorry for the inconvenience,

Status : Resolved :white_check_mark:

start: 2024-05-16T13:00:00Z
end: 2024-05-16T14:00:00Z

Baobab: Slurm issue

Dear users,

Our Slurm database daemon (slurmdbd) crashed this weekend due to a lack of disk space. We restored it quickly, but we did not notice that the Slurm controller had crashed as well.

Both services are now restored.

Thank you for your understanding.

Best regards,


Status : Resolved :white_check_mark:

Start: 2024-06-01T06:00:00Z
End: 2024-06-03T07:00:00Z

2 Likes

Yggdrasil : Login node stuck

Dear users,

The login node (login1.yggdrasil) was blocked by a computation process.

The server has been restarted and is available again.

Thank you for your understanding.

Best regards,


Status : Resolved :white_check_mark:

Start: 2024-06-04T13:40:00Z
End: 2024-06-04T14:05:00Z

Yggdrasil : Login node stuck

Dear users,

The login node (login1.yggdrasil) was blocked by a computation process.

The server has been restarted and is available again.

Thank you for your understanding.

Best regards,


Status : Resolved :white_check_mark:

Start: 2024-06-11T07:30:00Z
End: 2024-06-11T12:15:00Z

Baobab : Storage latency

Dear users,

We are experiencing some latency on Baobab storage (home and scratch); the root cause is currently unknown. We are investigating.

Thank you for your understanding.

Best regards,


Status : In progress :orange_circle:

Start: 2024-06-19T12:00:00Z
End:

All Clusters: GIO mount storage from NASAC

Dear User,

We have recently discovered issues with mounting storage spaces using the CIFS/SMB protocol from NASAC.

The problem comes from an update of the Mellanox suite of packages, which manages our InfiniBand (fast) network. Unfortunately, these packages disable the CIFS module, preventing CIFS/SMB storage mounts.

Since the Mellanox suite is crucial for the cluster’s operation, we cannot remove it. Currently, Mellanox does not provide a workaround.

We are actively investigating a solution on our end to ensure you have the best possible experience. We appreciate your patience and will keep you updated on our progress.

Thank you for your understanding.

Update:

2024-06-24T10:30:00Z – 2024-06-24T11:45:00Z

2024-10-17T08:00:00Z

During the maintenance, the CIFS patch was deployed.

(yggdrasil)-[alberta@login1 ~]$ gio  mount   smb://isis.unige.ch/nasac/hpc_exchange/backup < .credentials
Password required for share nasac on isis.unige.ch
User [alberta]: Domain [SAMBA]: Password: 

(yggdrasil)-[alberta@login1 ~]$ ps -edf |grep gvfs
alberta   594688       1  0 09:59 ?        00:00:00 /usr/libexec/gvfsd
alberta   594693       1  0 09:59 ?        00:00:00 /usr/libexec/gvfsd-fuse /run/user/401775/gvfs -f -o big_writes

(yggdrasil)-[alberta@login1 ~]$ ls /run/user/401775/gvfs
'smb-share:server=isis.unige.ch,share=nasac'

(yggdrasil)-[alberta@login1 ~]$ ls /run/user/401775/gvfs/smb-share\:server\=isis.unige.ch\,share\=nasac/hpc_exchange/backup/
titi  toto

:warning: Note that the module has been deactivated by Mellanox (the fabric driver provider) for a good reason, so we are not immune to unexpected behaviour.

Status : Patched :orange_circle:

start: 2024-06-13T22:00:00Z
Patched: 2024-10-17T08:00:00Z

1 Like

Baobab : Storage scratch down

Dear users,

We experienced problems on the Baobab scratch storage; the root cause was the number of files open at the same time on scratch. We had some trouble rebooting the server and getting the service back up, but everything is now resolved.

Thank you for your understanding.

Best regards,


Status : Solved :green_circle:

Start: 2024-06-24T09:00:00Z
End: 2024-06-24T16:30:00Z

BAMBOO: Login node not reachable from outside of the university

Dear users,

It is not possible to connect to the Bamboo login node from outside the university. We are working to solve the issue.

In the meantime your options are:

  • use the UNIGE VPN

  • connect through Baobab or Yggdrasil login node
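
For the second option, an SSH ProxyJump entry can make the hop transparent. A minimal ~/.ssh/config sketch; the host names below are placeholders, substitute the login node addresses you normally use:

```
# ~/.ssh/config -- reach Bamboo through the Baobab login node.
# HostName values are placeholders; use your usual login node addresses.
Host bamboo-via-baobab
    HostName login1.bamboo                        # Bamboo login node (placeholder)
    User your_username
    ProxyJump your_username@login2.baobab         # Baobab login node (placeholder)
```

With this in place, `ssh bamboo-via-baobab` connects through Baobab in a single step.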

Thank you for your understanding.

Best regards,

Status : Solved :green_circle:

start: 2024-07-01T09:30:00Z
end: 2024-07-04T07:00:00Z

Baobab: DNS issue

Dear users,

We are currently experiencing a DNS issue on Baobab; some of you may have already encountered error messages like these as a result:

srun: error: _fwd_tree_get_addr: can't find address for host cpu002, check slurm.conf
srun: error: Task launch for StepId=11157586.0 failed on node cpu002: Can't find an address, check slurm.conf
srun: error: Application launch failed: Can't find an address, check slurm.conf
srun: Job step aborted
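
If you hit messages like the above, you can check from the login node whether the node name actually resolves. A small sketch, using cpu002 from the error output:

```shell
# getent queries the same name-resolution path (DNS, /etc/hosts)
# that Slurm and most other tools rely on; an empty result means
# the host name does not resolve on this machine.
getent hosts cpu002 || echo "cpu002 does not resolve here"
```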

We are working to resolve it as soon as possible.

We apologize for any inconvenience caused.

Thank you for your understanding.

Status : Solved :green_circle:

start: 2024-07-01T16:30:00Z
end: 2024-07-04T07:45:00Z

1 Like

All Clusters: DPNC beegfs stuck

Dear DPNC users,

The DPNC BeeGFS storage /dpnc/beegfs is currently blocked; as a result, all compute nodes whose jobs use this share are being drained from production. We have contacted the technical manager of this storage and are awaiting a solution.

In the meantime, we kindly ask you not to start any jobs using this storage, and to cancel those currently using it.

Thank you for your understanding.

Status : Solved :green_circle:

start: 2024-07-10T07:00:00Z
end: 2024-07-10T10:03:00Z