No GPU is detected on yggdrasil

Dear @Jingze.Duan

I think I got it! Your job ran fine on gpu007.yggdrasil, but no GPUs are seen when the same job runs on gpu008.yggdrasil.

gpu007 is equipped with 8 x Titan RTX cards and gpu008 is equipped with 8 x V100 cards (see hpc:hpc_clusters [eResearch Doc]). The V100 has a compute capability of 7.0. Until 6 December 2023, all the GPU software we built was compiled for compute capabilities 6.0, 6.1, 7.5, 8.0 and 8.6. We have added 7.0 since then, but not all the software has been recompiled yet.

We later found the following statement from Nvidia:

Each CUBIN file targets a specific compute capability version and is forward-compatible only with CUDA architectures of the same major version number; e.g., CUBIN files that target compute capability 1.0 are supported on all compute-capability 1.x (Tesla) devices but are not supported on compute-capability 2.0 (Fermi) devices.

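It means none of the GPU kernels we had compiled were compatible with this card: 7.5 was the only target with the same major version, and a CUBIN built for 7.5 cannot run on a device with a lower minor version such as 7.0.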
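
To make the rule concrete, here is a small Python sketch of the compatibility check. It is only an illustration, not something we run on the cluster: the function names are made up for this post, it models precompiled CUBIN code only (it ignores any PTX that the driver could JIT-compile), and it assumes the Titan RTX reports compute capability 7.5.

```python
# Illustrative sketch only: models CUBIN binary compatibility, not PTX JIT.
# All names below are hypothetical helpers written for this post.

def parse_cc(text):
    """Turn a compute capability string such as '7.5' into (7, 5)."""
    major, minor = text.split(".")
    return int(major), int(minor)


def cubin_runs_on(target, device):
    """A CUBIN built for X.y runs on a device X.z only if z >= y."""
    t_major, t_minor = parse_cc(target)
    d_major, d_minor = parse_cc(device)
    return t_major == d_major and d_minor >= t_minor


def usable_targets(targets, device):
    """Return the compiled targets whose CUBIN can run on the device."""
    return [t for t in targets if cubin_runs_on(t, device)]


# Targets we compiled for before and after December 2023 (from this post).
OLD_TARGETS = ["6.0", "6.1", "7.5", "8.0", "8.6"]
NEW_TARGETS = OLD_TARGETS + ["7.0"]

if __name__ == "__main__":
    # Titan RTX = 7.5 is an assumption stated above; V100 = 7.0 is from this post.
    for name, cc in [("Titan RTX (gpu007)", "7.5"), ("V100 (gpu008)", "7.0")]:
        print(name, "old:", usable_targets(OLD_TARGETS, cc),
              "new:", usable_targets(NEW_TARGETS, cc))
```

With the old target list, the check finds a usable binary for the Titan RTX (the 7.5 one) and nothing at all for the V100, which matches what you observed on gpu007 and gpu008.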

We are recompiling GROMACS right now.