One of the scratch servers crashed and we had to restart it. Unfortunately, home and scratch share some hardware, so there was a short outage on home too.
The team managing the storage for the servers performed maintenance this morning and all our admin servers crashed. We are investigating, as this setup is normally fully redundant.
In the meantime, running jobs are probably still running, but Slurm is stopped.
Edit: the service is restored. We will now investigate with the storage team why this happened.
Due to the update to Open OnDemand version 4, we are currently experiencing some portability issues. We are working on resolving them and applying the necessary adjustments.
After reinstalling the GPU nodes on Bamboo, we saw that the / partition was too small to hold all the CUDA RPMs plus tmp space. We had to reinstall all the GPU nodes.