We have made a change to the way GPUs are allocated.
Previously, if you requested a GPU without specifying any constraints, you might have been allocated either a high- or low-end GPU.
To prevent this from happening, we have assigned a weight to each GPU: low-end GPUs have a low weight and high-end GPUs a high weight. When you request a GPU, those with a lower weight are prioritised. Some older GPUs already had weights assigned; the weights are now up to date and based on each GPU's billing value.
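Under the hood, this kind of prioritisation can be expressed with Slurm's per-node Weight parameter (lower weight means the scheduler fills those nodes first). A minimal slurm.conf sketch — the node names, GRES strings and weight values here are illustrative, not the cluster's real configuration:

```shell
# slurm.conf sketch (illustrative values only).
# Lower Weight => the scheduler allocates these nodes first.
NodeName=gpu[001-003] Gres=gpu:nvidia_t4:4        Weight=10
NodeName=gpu[004-005] Gres=gpu:nvidia_A100_80gb:4 Weight=100
```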
This has two benefits: you'll get lower-cost GPUs if you don't need a more powerful one, and we can reserve the high-end GPUs for those who need them.
This discrepancy happens because we update the documentation manually by copy-paste — not very efficient!
We use Slurm's GPU autodetection, so the GPU names are set automatically and are not very coherent. The good news is that for a given GPU model, the name is always the same!
You can list all the GPU models with their correct name like this:
sinfo -N -o "%N %G"
We'll soon autogenerate the documentation with a script to make it more accurate.
In your case, you probably wanted to request an A100 with 40 or 80 GB? This is another change we made without announcing it, since it was for internal purposes, but unfortunately I just noticed that the "DOUBLE_PRECISION_GPU" Feature (to be used with --constraint) is missing. We'll correct that this afternoon, I hope.
In the meantime, you can identify every compute node that has an A100 and pass that nodelist explicitly.
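A minimal sketch of how that could be scripted. The node and GRES names in the here-doc are made-up examples so the snippet is self-contained; on the real cluster you would replace it with the live `sinfo -N -h -o "%N %G"` output:

```shell
# Sketch: build a comma-separated nodelist of A100 nodes.
# Sample data stands in for:  sinfo -N -h -o "%N %G"
nodelist=$(cat <<'EOF' | awk 'tolower($0) ~ /a100/ {print $1}' | sort -u | paste -sd, -
gpu001 gpu:nvidia_A100_40gb_pcie:4
gpu002 gpu:nvidia_A100_80gb_pcie:4
gpu003 gpu:tesla_p100-pcie-12gb:2
EOF
)
echo "$nodelist"   # → gpu001,gpu002
# Then submit with:  sbatch --nodelist="$nodelist" --gres=gpu:1 job.sh
```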
Sorry for the inconvenience, I'll keep you posted.
We now expose the GPU model in each node's Feature field, allowing users to request jobs matching several specific GPU models. Previously, filtering was coarse (e.g., "Ampere with ≥40GB"), which could result in receiving an unintended GPU type such as an RTX A6000. Since users can now combine multiple GPU models in a single constraint expression, we removed the outdated DOUBLE_PRECISION constraint. Researchers should now explicitly select the GPU models they want.
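In practice, a constraint combining multiple models uses Slurm's OR syntax (`|`) in --constraint. A job-script sketch — the feature names below are hypothetical examples, not the cluster's real ones; list the actual Features with `sinfo -N -o "%N %f"`:

```shell
#!/bin/bash
# Sketch: accept either A100 variant via the per-model Features.
# Feature names are hypothetical; check the real ones with: sinfo -N -o "%N %f"
#SBATCH --gres=gpu:1
#SBATCH --constraint="nvidia_a100_40gb|nvidia_a100_80gb"
srun ./my_simulation
```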
Before I update our lab's internal job submission interface to match these changes, I have a few suggestions regarding the naming conventions. Currently, there are some inconsistencies that make automated scripting difficult for users:
Vendor Prefix Consistency: Most GPUs start with nvidia_, but the P100s are still named tesla_p100-pcie-12gb. Standardizing all of them to start with the vendor name (e.g., nvidia_p100...) would make it easier to parse and select GPUs programmatically.
Naming Format: Most models follow the vendor_model_interface_vram pattern, but the A100s use a different order (nvidia_A100_vram_pcie). A unified string format across all models would be ideal.
Relevance of Interface (PCIe/NVL) vs. P2P: Since there isn't currently a choice between PCIe and NVL for the same model, including these tags might be redundant. However, knowing whether Peer-to-Peer (P2P) communication is supported is crucial. For example, node gpu[006] on Bamboo supports P2P, while gpu[004-005] do not, despite similar names. Perhaps a _p2p suffix would be more informative than the bus type?
For our lab's workflow, we typically filter GPUs based on:
Vendor: (NVIDIA/AMD/Intel) depending on the framework (CUDA, ROCm, OneAPI).
Model/Compute Capability: (e.g., CC 8.0 for double-precision FMA using Tensor Cores).
VRAM: Based on the system size.
P2P Support: Essential for multi-GPU simulations.
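To script filters like these against the current names, the first step is normalizing them. A minimal sketch — the mapping rules are illustrative and cover only the inconsistencies mentioned in this thread, not every name on the cluster:

```shell
# Sketch: fold inconsistent autodetected GRES names into a uniform
# lowercase vendor_model_vram form. Rules are illustrative only.
normalize() {
  printf '%s\n' "$1" | tr '[:upper:]' '[:lower:]' |
    sed -e 's/^tesla_/nvidia_/' \
        -e 's/-pcie-/_/' \
        -e 's/_pcie$//' \
        -e 's/_nvl$//'
}

normalize tesla_p100-pcie-12gb    # → nvidia_p100_12gb
normalize nvidia_A100_80gb_pcie   # → nvidia_a100_80gb
```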
I have already submitted the survey, but I wanted to share these specific thoughts on naming schemes. While these reflect our lab's needs, I believe a more coherent naming convention would benefit the entire user base.
As I said, we rely on the autodetected naming from NVML/Slurm (see Slurm Workload Manager - Generic Resource (GRES) Scheduling) and thus have no control over the naming scheme. But yes, I fully agree that it is incoherent! There is already a case open at SchedMD. If they change something, though, you'll have to update your configuration again :). I'll post updates in this issue.
As the word "crucial" is present in the text, it is clear that it was generated by ChatGPT. No offense.
One option would be to add a flag such as "NVLINK". However, the issue then is that there is no guarantee that, if you request two GPUs on a compute node, they will be connected to each other. Why? If the GPUs are connected in pairs, the other GPUs may already be in use, leaving no free pair. I have no idea how to ensure that two allocated GPUs are linked together, and as far as I know, Slurm doesn't have a specific flag to enforce this request.
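One workaround, since the pairing can't be enforced at submission time: check the topology inside the job and bail out (or resubmit) if the allocated pair is not linked. `nvidia-smi topo -m` marks NVLink-connected pairs with NV<n>. The matrix in the here-doc is a mocked example so the sketch is self-contained; in a real 2-GPU job you would use the live command instead:

```shell
# Sketch: detect (not enforce) whether the allocated GPU pair is
# NVLink-connected. In a real job, replace the here-doc with:
#   topo=$(nvidia-smi topo -m)
topo=$(cat <<'EOF'
      GPU0  GPU1
GPU0  X     NV2
GPU1  NV2   X
EOF
)
if printf '%s\n' "$topo" | grep -q 'NV[0-9]'; then
  echo "GPUs are NVLink-connected"
else
  echo "no NVLink between allocated GPUs" >&2
  exit 1
fi
```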