[2023] Current issues on HPC Cluster

Title: Slurm Controller Connectivity Issue

Hello everyone,

We wanted to inform you that we are currently experiencing an issue with our Slurm controller. When attempting to contact the controller, users may receive the message “Unable to contact slurm controller (connect failure)”.
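
For reference, this is roughly how the failure shows up when running the usual Slurm client commands (the prompt, username and job script name below are only illustrative, and the exact wording can vary slightly between commands and Slurm versions):

(baobab)-[user@login2 ~]$ squeue --me
slurm_load_jobs error: Unable to contact slurm controller (connect failure)
(baobab)-[user@login2 ~]$ sbatch job.sh
sbatch: error: Batch job submission failed: Unable to contact slurm controller (connect failure)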

We apologize for the inconvenience and appreciate your patience as we work to resolve this issue.

Best regards,

2023-02-22T10:07:00Z

Dear HPC User,

We regret to inform you that there is a cooling issue in DataCenter Dufour, which has led us to reserve all nodes in order to prevent any new jobs from starting.
This step has been taken to ensure the preservation of your running jobs.
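
For those curious how such a hold works: it is typically implemented as a Slurm maintenance reservation covering all nodes, so running jobs keep their resources while nothing new can start. A sketch of the idea (reservation name and options are illustrative, not the exact command we ran):

(baobab)-[root@admin1 ~]$ scontrol create reservation reservationname=cooling_dufour \
    starttime=now duration=infinite flags=maint,ignore_jobs nodes=ALL users=root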

We have taken this measure with the hope that it will be sufficient to reduce the temperature and resolve the issue.
However, if the situation worsens, we may be forced to shut down all the machines and terminate your running jobs.

Please be assured that we are working diligently to address this issue and minimize any disruptions that may be caused.

We will keep you informed of the progress of the situation.

We apologize for any inconvenience this may cause and appreciate your patience and understanding.

Best regards,

2023-02-22T15:00:00Z

The situation is stabilized and no further action impacting running jobs is expected (for now).

A technical team is currently working on the problem to solve it as soon as possible. The intervention should be completed tomorrow morning (we are waiting for a spare part), and our services should resume normal operation shortly after.

2023-02-23T13:23:00Z

The situation on Baobab has not changed, job submissions are suspended until further notice.

The login node and storage remain available for file transfer.

Good News: The maintenance of Yggdrasil is now complete.

2023-02-27T14:20:00Z

The situation on Baobab has not changed, job submissions are suspended until further notice. However, we hope to bring the cluster back into production this afternoon.

2023-02-28T09:30:00Z

The repair is taking longer than expected; we hope to have the datacenter operational by the end of the week.

2023-03-01T09:15:00Z

Dear User,

We are pleased to inform you that our technical team has successfully resolved the cooling issue in our DataCenter Dufour. However, before we fully resume our production, we will need to test the stability of our services by progressively rebooting nodes to ensure the cooling system is fully operational.
Here is the list of nodes back in production:

(baobab)-[root@admin1 ~]$ sinfo -n cpu[004,011,019-020,061,065,167,173,176-177,182-183,185,190-191,194,196-197,207,222,243,253,255,274,290,293],gpu[007,009-010,019,026,033,036,040] 
PARTITION                   AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug-cpu*                     up      15:00      1   idle cpu004
public-cpu                     up 4-00:00:00      4  alloc cpu[011,019-020,243]
shared-cpu                     up   12:00:00      9    mix cpu[167,183,190,194,197,253,274,290,293]
shared-cpu                     up   12:00:00     16  alloc cpu[011,019-020,061,065,173,176-177,182,185,191,196,207,222,243,255]
shared-gpu                     up   12:00:00      6    mix gpu[007,009,019,026,033,040]
shared-gpu                     up   12:00:00      2  alloc gpu[010,036]
private-wesolowski-cpu         up 7-00:00:00      1    mix cpu197
private-wesolowski-cpu         up 7-00:00:00      2  alloc cpu[185,196]
private-lehmann-cpu            up 7-00:00:00      1  alloc cpu061
private-dpt-cpu                up 7-00:00:00      1  alloc cpu065
private-cui-cpu                up 7-00:00:00      2    mix cpu[167,190]
private-cui-cpu                up 7-00:00:00      2  alloc cpu[191,207]
private-kruse-cpu              up 7-00:00:00      1    mix cpu194
private-gap-cpu                up 7-00:00:00      1    mix cpu253
private-gap-cpu                up 7-00:00:00      2  alloc cpu[222,255]
private-hepia-cpu              up 7-00:00:00      1    mix cpu183
private-hepia-cpu              up 7-00:00:00      4  alloc cpu[173,176-177,182]
private-gervasio-cpu           up 7-00:00:00      1    mix cpu274
private-salbreux-cpu           up 7-00:00:00      2    mix cpu[290,293]
private-schaer-gpu             up 7-00:00:00      1    mix gpu007
private-ruch-gpu               up 7-00:00:00      1    mix gpu033
private-gervasio-gpu           up 7-00:00:00      1    mix gpu040
private-gervasio-gpu           up 7-00:00:00      1  alloc gpu036
private-cui-gpu                up 7-00:00:00      2    mix gpu[009,019]
private-cui-gpu                up 7-00:00:00      1  alloc gpu010
private-dpt-gpu                up 7-00:00:00      1    mix gpu026


HPC Team
Adrien. A, Gaël. R, Yann. S

Dear users,

Bad news, the cooling issue is back in the Dufour datacenter.

The technical team is investigating. Right now, to mitigate the issue, we have prevented all new jobs from starting on the Baobab cluster. If this measure isn’t enough, we’ll be forced to kill running jobs; we hope the issue will be solved in the meantime.

Thanks for your understanding.

HPC team

Update 2023-03-23T11:00:00Z The technical team in charge of the building identified the faulty component and removed it. They’ll fix this part next week; until then the cluster is fully back in production and we are monitoring the temperature closely.


Dear users,

we had an issue with the scratch storage on Yggdrasil.

Symptoms:

cannot access *** : Input/output error

duration: 2023-04-16T13:20:00Z to 2023-04-17T09:09:00Z

Fixed.


Dear users,

we have an issue with scratch on Baobab.

We have reached the maximum number of files on the scratch storage, which is ~700M.

As a quick workaround, we have started cleaning up data belonging to users who no longer have an active account.

It is also important that you delete your unneeded files regularly, especially if you have a lot of them; in this case it is the number of files that matters, not their size.
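
If you want to check how many files you currently have, a simple (if slow) way is to count them with find; replace the path below with your own scratch directory:

(baobab)-[user@login2 ~]$ find /path/to/your/scratch/dir -type f | wc -l                                   # total number of files
(baobab)-[user@login2 ~]$ for d in /path/to/your/scratch/dir/*/; do echo "$(find "$d" -type f | wc -l) $d"; done | sort -n    # files per subdirectory, smallest first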

start: 2023-04-16T22:00:00Z
end: ??

update: more than 3M file slots are now free; we are continuing the cleanup.

[Baobab] 2023-04-28T13:00:00Z

correction expected on 2023-05-01T22:00:00Z

Primary information

Username: ALL
Cluster: Baobab

Description

Since the maintenance on Baobab, loading modules through Slurm sbatch does not work.
The issue doesn’t happen when using salloc.

Steps to Reproduce

(baobab)-[sagon@login2 modules]$ sbatch --wrap "ml GCC/12.2.0; which gcc"
Submitted batch job 582485
(baobab)-[sagon@login2 modules]$ cat slurm-582485.out
/var/spool/slurmd/job582485/slurm_script: line 4: ml: command not found
/usr/bin/gcc

Workaround

  1. Load the wanted module on login2 before launching your job

or

  2. Add the following line after all the #SBATCH pragmas in your sbatch script: . /etc/profile.d/modules.sh (yes, there is a dot and a space at the start of the line; a minimal script sketch is shown below)

or

  3. Change the very first line of your sbatch script to #!/bin/sh -l

And launch your job
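
A minimal sketch of a job script using option 2 (partition, time limit and module are just examples):

#!/bin/sh
#SBATCH --partition=debug-cpu
#SBATCH --time=00:05:00
# workaround: initialise the module system explicitly (note the leading dot and space)
. /etc/profile.d/modules.sh
ml GCC/12.2.0
which gcc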

Example (option 1):

(baobab)-[alberta@login2 ~]$ ml Stata/17
(baobab)-[alberta@login2 ~]$ srun stata-mp -h
srun: job 582114 queued and waiting for resources
srun: job 582114 has been allocated resources

stata-mp:  usage:  stata-mp [-h -q -s -b] ["stata command"]
        where:
            -h          show this display
            -q          suppress logo, initialization messages
            -s          "batch" mode creating .smcl log
            -b          "batch" mode creating .log file
            -rngstream# set rng to mt64s and set rngstream to #;
                          see "help rngstream" for more information;
                          note that there must be no space between
                          "rngstream" and #

        Notes:
            xstata-mp is the command to launch the GUI version of Stata/MP
            stata-mp  is the command to launch the console version of Stata/MP

            -b is better than "stata-mp < filename > filename".

The workaround works with sbatch too:

(baobab)-[alberta@login2 stata]$ sbatch test.sh
Submitted batch job 582125
(baobab)-[alberta@login2 stata]$ ll
total 3
-rw-r--r-- 1 alberta hpc_users  76 Apr 28 19:51 slurm-582125.out
-rw-r--r-- 1 alberta hpc_users  15 Apr 28 17:56 test.do
-rw-r--r-- 1 alberta hpc_users 804 Apr 28 19:51 test.log
-rw-r--r-- 1 alberta hpc_users 130 Apr 28 17:56 test.sh

We apologize for any inconvenience caused.

Best Regards,


HPC Team

Dear users,

we have an issue with OpenSSL on Baobab and Yggdrasil.

The version we provide through EasyBuild isn’t compatible with both CentOS and Rocky.

We recompiled two distinct versions, which fixed the issue.

start: 2023-05-02T22:00:00Z
end: 2023-05-16T09:27:00Z

Primary Information:

Cluster: Yggdrasil
User: ALL

Description:

We have an issue on Yggdrasil’s home. Some users may have encountered difficulties writing to or reading from this filesystem, with this kind of message:

<FILE>: Communication error on send 

Duration

start: 2023-05-29T17:20:00Z
end: 2023-05-30T07:17:00Z


Dear users,

We had an issue with login1 on Yggdrasil.

Last week’s maintenance introduced a problem on login1 which resulted in a full filesystem. We had to restart the server a few minutes ago to correct the problem.

Sorry for the inconvenience.

start: 2023-05-30T14:00:00Z
end: 2023-05-30T15:00:00Z

Dear users,

We had a hardware issue early this morning on scratch2.yggdrasil.

The scratch storage was impacted by this issue. The server has now been rebooted and services are working again.

Sorry for the inconvenience.

start: 2023-06-29T22:04:40Z
end: 2023-05-30T08:00:00Z

Dear users,

We had a second hardware issue yesterday, on scratch1.yggdrasil.

The scratch storage was impacted by this issue.

Sorry for the inconvenience.

start: 2023-06-07T15:00:00Z
end: 2023-06-07T18:00:00Z

Start: 2023-09-05T07:00:00Z

Baobab scratch issue

Dear users,

We want to inform you that we are currently experiencing an issue with our scratch storage and filesystem. This may temporarily affect your access to certain files.

Our team is actively working on resolving this problem, and we apologize for any inconvenience caused. We will keep you updated on the progress.

Thank you for your understanding.

Best regards,

Status : Resolved :white_check_mark:

start: 2023-09-04T22:00:00Z
end: 2023-09-05T10:20:00Z

Baobab infiniband issue

Dear users,

Due to an incident that occurred last night on the InfiniBand fabric of the Baobab cluster, compute nodes have been drained. We are actively working to resolve the problem. The nodes will gradually return to production.
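
If you want to check which nodes are still drained and why, sinfo can list them together with the reason we set, for example:

(baobab)-[user@login2 ~]$ sinfo -R                                  # down/drained nodes with their reason
(baobab)-[user@login2 ~]$ sinfo -t drain,down -o "%20N %10T %E"     # same, with a custom output format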

We apologize for the inconvenience.

Thank you for your understanding.

Best regards,

Status : Resolved :white_check_mark:

start: 2023-09-12T01:26:00Z
end: 2023-09-12T09:00:00Z


Status : Resolved :white_check_mark:

Start: 2023-09-12T16:00:00Z
End: 2023-09-13T10:45:00Z


Yggdrasil - Astro DNS server not working

Dear users,

We want to inform you that we are currently experiencing an issue with one of the DNS servers used by Yggdrasil. The resolution is outside our scope; we’ll keep you posted once it is solved.

One of the symptoms may be that you are sometimes unable to resolve an external hostname, for example when mounting data from the NAS.
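
To check whether name resolution is the problem for a given host, you can query it directly from a login node (the prompt and hostname below are placeholders; use the external host you are trying to reach):

(yggdrasil)-[user@login1 ~]$ getent hosts nas.example.org     # goes through the system resolver, like NFS mounts do
(yggdrasil)-[user@login1 ~]$ host nas.example.org             # queries the DNS servers directly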

Thank you for your understanding.

Best regards,

Status : Resolved :white_check_mark:

start: 2023-10-17T15:00:00Z
end: 2023-10-18T08:30:00Z

Yggdrasil - Astro DNS server not working

Dear users,

We want to inform you that we are currently experiencing an issue with one of the DNS servers used by Yggdrasil. The resolution is outside our scope; we’ll keep you posted once it is solved.

One of the symptoms may be that you are sometimes unable to resolve an external hostname, for example when mounting data from the NAS.

On the HPC side, while awaiting the complete resolution of the incident, we have implemented a workaround to minimize the impact of this issue.

Thank you for your understanding.

Best regards,

Status : Ongoing :white_check_mark:

start: 2023-11-30T21:30:00Z
end: 2023-12-01T11:00:00Z

Baobab: scratch issue

Dear users,

We had an issue with the scratch storage. The problem is that BeeGFS (our storage solution) shuts itself down when there are too many open files. There is nothing we can really do from our side; it is mainly caused by a buggy application running on the cluster.
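
If you suspect one of your own applications, you can check how many files one of its processes keeps open on the compute node where it runs (the PID and the node in the prompt are placeholders):

(baobab)-[user@cpu011 ~]$ ls /proc/<PID>/fd | wc -l     # file descriptors currently open by that process
(baobab)-[user@cpu011 ~]$ ulimit -n                     # per-process open-file limit in the current shell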

We are closely monitoring the servers. The storage is working right now.

We apologize for the inconvenience.

Thank you for your understanding.

Best regards,

Status : In progress :white_check_mark:

start: 2023-12-07T15:00:00Z
end: 2023-12-15T15:00:00Z


Yggdrasil: not reachable

Dear users,

Yggdrasil hasn’t been reachable from the internet since Saturday 9th of December. The reason is that the network link (fibre) between UNIGE and ASTRO was cut by a rodent, so there is no network connectivity at all. This is outside our scope, and we don’t know when the network is planned to be restored. We’ll keep you posted here.

We apologize for the inconvenience.

Thank you for your understanding.

Best regards,

Status : Resolved :white_check_mark:

start: 2023-12-09T15:00:00Z
end: 2023-12-13T09:00:00Z

edit: according to a user, this may be the culprit : Ratatoskr - Wikipedia :stuck_out_tongue_winking_eye:

edit2: this is also outside the scope of UNIGE; it depends on OFROU/OCSIN services. This is a maximum-priority issue and they plan to restore the service tomorrow.