Hi,
My GROMACS jobs on yggdrasil all failed last night. I resubmitted them this afternoon: one is running normally on gpu007, but the other ran for only a few seconds on gpu008 and then failed.
Relevant lines from the .out file of the failed job:
Program: gmx mdrun, version 2023.1
Source file: src/gromacs/taskassignment/findallgputasks.cpp (line 85)
MPI rank: 0 (out of 8)
Fatal error:
Cannot run short-ranged nonbonded interactions on a GPU because no GPU is
detected.
[1707228579.325814] [gpu008:1539109:0] ib_md.c:1234 UCX WARN IB: ibv_fork_init() was disabled or failed, yet a fork() has been issued.
[1707228579.325821] [gpu008:1539109:0] ib_md.c:1235 UCX WARN IB: data corruption might occur when using registered memory.
My batch file:
#!/bin/bash
#SBATCH --job-name="*"
#SBATCH --mail-type=ALL
#SBATCH --mail-user=*
#SBATCH --time=12:00:00
#SBATCH --ntasks-per-core=1
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=1
#SBATCH --partition=shared-gpu
#SBATCH --gpus=1
#========================================
module load GCC/11.3.0
module load OpenMPI/4.1.4
module load GROMACS/2023.1-CUDA-11.7.0
export OMP_NUM_THREADS=8
srun gmx_mpi mdrun -ntomp 8 -s *.tpr -deffnm * -nsteps -1 -v -pin on -nb gpu -noappend -cpi *.cpt
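Since the error says no GPU is detected, a small check before the srun line could show what the job actually sees on the node. This is only a sketch under assumptions: check_gpu is a hypothetical helper name, nvidia-smi is assumed to be on the compute node's PATH, and CUDA_VISIBLE_DEVICES is assumed to be set by Slurm for GPU allocations on this cluster.

```shell
# Hypothetical helper (not part of the original batch file): print what
# GPU resources the job can see, so a failing node shows up in the .out file.
check_gpu() {
    # SLURMD_NODENAME is set by Slurm inside a job; fall back to uname otherwise.
    echo "node: ${SLURMD_NODENAME:-$(uname -n)}"
    # Assumption: Slurm exports CUDA_VISIBLE_DEVICES when a GPU is allocated.
    echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-<unset>}"
    # nvidia-smi -L lists the devices the driver can see; it fails if there are none.
    if command -v nvidia-smi >/dev/null 2>&1 && nvidia-smi -L; then
        return 0
    fi
    echo "no GPU visible on ${SLURMD_NODENAME:-$(uname -n)}" >&2
    return 1
}
```

Calling `check_gpu || exit 1` just before the srun line would stop the job early and leave the node name in the log whenever the node exposes no device, which should make a problem specific to gpu008 easier to report.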
Best,
Jingze