The new cluster Yggdrasil was installed at Sauverny on 24 February.
We’ll start installing the software and doing the configuration in the coming weeks.
The cluster will be quite similar to Baobab, but you now have the opportunity to let us (and the other users) know what would be important to have in the new cluster.
Feel free to share your ideas for the new cluster in this topic.
Some pictures of the physical installation done by Dalco. Yggdrasil will span 6 racks.
4 people came from Zurich to do the actual installation. Rear view.
Front view, in progress.
Rear view, cabling in progress.
Given the memory-pressure problems we previously had on the Baobab login node, and the fact that for a few use cases the login node is a convenient place to run network services such as databases, I think having a few special nodes for launching special-purpose, computationally inexpensive tasks (less than one core of usage) would make sense.
I would see two kinds of nodes:
- Publicly accessible node: for file-transfer tasks and other services, such as databases, that need to be reachable from the outside.
- Privately accessible node: for databases and other services reachable only from the compute nodes.
I’m not sure this idea is feasible given the security implications, but I believe it could help declutter the login node of processes.
A second idea would be to enable the GitLab repository option that allows a group to store “private” repositories and to use them easily with Singularity. Another improvement for GitLab would be to have a few GitLab runners preinstalled, so that short pipelines for quick compilation and tests could be executed directly.
The core idea of these two suggestions is to offer some “auxiliary” services for preparing and managing jobs. To be clear, the proposal is not to offer a full cloud service, only a better integration of the existing ones.
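As a sketch of what such a short pipeline could look like on a preinstalled instance-wide runner, assuming a runner tagged `short` exists (the tag, the `make` target, and the test script are hypothetical placeholders, not an existing setup):

```yaml
# Hypothetical .gitlab-ci.yml: quick compile-and-test pipeline
# run by a shared runner; heavy computation stays on Slurm.
stages:
  - build
  - test

build:
  stage: build
  tags: [short]          # assumed tag of the preinstalled runner
  script:
    - make               # quick compilation only
  artifacts:
    paths: [bin/]

test:
  stage: test
  tags: [short]
  script:
    - ./bin/run_tests    # short smoke tests, well under one core-hour
```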
Thanks for this request!
For some of the Astro/CDCI activities, and also for the INTEGRAL project, I have several points:
For us, Singularity resolves most of the software needs, so as long as it is maintained, a lot of the issues are circumvented.
Our analyses are often limited by storage access. Will there be an especially fast connection to any of the storage systems at Sauverny?
Related to the points raised by @Pablo.Strasser:
We use a k8s cluster for various services, public or project-restricted, and it is linked with Baobab in several ways. We’d want to make sure this integration remains efficient and, possibly, improves.
So far we have not experienced any issues connecting from Baobab to the k8s cluster, but we’d like to scale up the data transfer, and I suppose we might benefit from the proximity of Yggdrasil and the k8s cluster.
And although we use k8s group-limited GitLab runners, it might indeed be beneficial to also have some instance-wide runners for simple projects/groups and some individual users.
Dear HPC team,
Now that the name has been chosen (I voted for this one!), I was wondering when Yggdrasil will be available.
Is it going to be exactly the same set-up as Baobab? The same environment?
I sometimes run very long simulations (>1 month); are there going to be queues with no time limit? (please!)
Or what’s the longest time limit you’ll propose?
Thanks in advance!
Hi HPC team,
Thanks for the update! The installation photos are impressive!
Is there an expected date at which we will be able to access the cluster? Like June, for instance? Or earlier, or later?
As I mentioned in my earlier message, it would be great if there could be some resources for time-consuming simulations!
I know the idea of having no time limit on jobs is not very popular with you (although it existed at the previous places I worked, which also had big clusters). But it would be great to have that, or at least to be able to compute for more than four days…
Four days is awfully short, and having to re-submit at that frequency penalizes us: even with a submission script that resubmits automatically (thanks for the pointer to do that, by the way), the priority decreases with each resubmission of this one simulation. This means that a job that already takes a long time (e.g. a month) will take much longer, because as time goes by it will spend an increasing amount of time “pending”.
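For context, a self-resubmitting chain of this kind can be sketched roughly as follows (the partition name, simulation command, its flags, and the `DONE` marker file are hypothetical placeholders; only the Slurm options themselves are real):

```shell
#!/bin/bash
# Sketch of a self-resubmitting Slurm job chain; the simulation must
# checkpoint its own state so each chunk can resume the previous one.
#SBATCH --time=4-00:00:00          # one 4-day chunk
#SBATCH --partition=public-cpu     # assumed partition name
#SBATCH --job-name=long-sim

# Run one chunk; command and flags are placeholders.
./simulation --restart restart.state --max-walltime 95h

# If the simulation has not written its completion marker,
# queue the next chunk after this job ends (any exit state).
if [ ! -f DONE ]; then
    sbatch --dependency=afterany:"$SLURM_JOB_ID" "$0"
fi
```

Note that each link of the chain still re-enters the queue as a fresh job, which is exactly where the priority decay described above bites.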
Thanks for the great work, I’m really looking forward to experimenting with Yggdrasil!
PS: I should say something about my usage of the cluster: I have long periods of inactivity, but then at some point I need to launch a bunch of such simulations (around 10 or so).
I also have a student who will at some point launch a bunch of long parallel simulations.
Dear HPC team,
I concur with the need for longer run times for jobs.
I do particle physics simulations, and jobs for very high energy particles typically have a long duration. Because of the current 12 h time limit on Baobab’s standard queue, we set up our jobs to simulate only a few events each, which makes them quite vulnerable to randomness: there is a high variance in job runtime. I therefore often book a full 12 h for jobs that may finish in 1 h (or may not; most are below 6 h). An additional problem is that small statistics per job lead to a very high number of jobs, and thus a very high number of files, which, as per Yann Sagon’s post of today, is not optimal for server performance.
If we could run longer jobs, we’d have a more stable runtime estimate and far fewer files and jobs in total.
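The variance argument can be illustrated with a small sketch (the timing distribution below is made up, not real simulation data): if per-event runtime is random and heavy-tailed, the relative spread of a job’s total runtime shrinks roughly as 1/sqrt(events per job), so longer jobs with more events fill their booked slot far more predictably.

```python
# Toy illustration: relative runtime spread vs. events per job.
# Per-event runtimes are drawn from a lognormal (heavy-tailed)
# distribution; all numbers are assumptions for illustration only.
import random
import statistics

random.seed(42)

def job_runtimes(events_per_job, n_jobs=2000):
    """Total runtimes of n_jobs jobs, each summing `events_per_job`
    lognormally distributed per-event runtimes."""
    return [
        sum(random.lognormvariate(0.0, 1.0) for _ in range(events_per_job))
        for _ in range(n_jobs)
    ]

def rel_spread(xs):
    """Coefficient of variation: standard deviation / mean."""
    return statistics.stdev(xs) / statistics.mean(xs)

small = rel_spread(job_runtimes(events_per_job=5))    # few events per job
large = rel_spread(job_runtimes(events_per_job=500))  # many events per job

print(f"relative spread,   5 events/job: {small:.2f}")
print(f"relative spread, 500 events/job: {large:.2f}")
```

With 100x more events per job, the relative spread drops by about a factor of 10 (sqrt(100)), which is why fewer, longer jobs both waste less of the booked wall time and produce far fewer files.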
I do realise, however, that there are multiple downsides to allowing longer jobs, especially the risk of clogging up the cluster for even longer periods than at present.
A graphical monitoring interface, such as Grafana, would be nice to have for visualising user, partition, and job-efficiency statistics.