Hi, on Baobab's shared-gpu partition I am running some GROMACS MD simulations; more precisely, replica exchange with PLUMED on 8 GPUs and 64 CPUs.
These runs go smoothly on all shared-gpu nodes except gpu021.
Could the split GPU on that node be what is causing the problems?
Is there a way to run on shared-gpu while excluding gpu021, so that runs don't randomly die? For example, would adding an exclude line to the submission script (see the sketch below) be the recommended approach?
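A minimal sketch of what I have in mind (the resource lines reflect my 8-GPU/64-CPU setup and may not match the exact partition options):

    #SBATCH --partition=shared-gpu
    #SBATCH --ntasks=8
    #SBATCH --cpus-per-task=8
    #SBATCH --gres=gpu:8
    #SBATCH --exclude=gpu021    # keep the job off the problematic node

If I understand the Slurm docs correctly, --exclude=<nodelist> should keep jobs off the listed nodes; I just want to confirm this is the right way to do it here.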
The error messages I get are:
--------------------------------------------------------------------------
WARNING: There is at least non-excluded one OpenFabrics device found,
but there are no active ports detected (or Open MPI was unable to use
them). This is most certainly not what you wanted. Check your
cables, subnet manager configuration, etc. The openib BTL will be
ignored for this job.
Local host: gpu021
--------------------------------------------------------------------------
and
[gpu021:57682:4:57808] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xfffffffe00d40e88)
[gpu021:57682:2:57817] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xfffffffe00d40e88)
[gpu021:57682:1:57824] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xfffffffe00d40e88)
[gpu021:57682:5:57813] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xfffffffe00d40e88)
[gpu021:57682:0:57849] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xfffffffe00d40e88)
srun: First task exited 30s ago
srun: StepId=47514130.0 tasks 6-7: running
srun: StepId=47514130.0 tasks 0-5: exited abnormally
srun: launch/slurm: _step_signal: Terminating StepId=47514130.0
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 47514130.0 ON gpu021 CANCELLED AT 2021-06-09T05:25:48 ***
srun: error: gpu021: task 7: Killed
srun: error: gpu021: task 6: Killed
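I am not sure whether the OpenFabrics warning is related to the crash at all. If it is only noise, I could presumably silence it by excluding the openib BTL in the job script (assuming our Open MPI honours the usual MCA environment variable):

    export OMPI_MCA_btl=^openib

but I doubt that alone explains the segfaults, since the same binary runs fine on every other node.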
This is how GROMACS is using the GPUs:
8 GPUs selected for this run.
Mapping of GPU IDs to the 16 GPU tasks in the 8 ranks on this node:
PP:0,PME:0,PP:1,PME:1,PP:2,PME:2,PP:3,PME:3,PP:4,PME:4,PP:5,PME:5,PP:6,PME:6,PP:7,PME:7
PP tasks will do (non-perturbed) short-ranged interactions on the GPU
PME tasks will do all aspects on the GPU
As said, the MD runs proceed without problems on every other GPU node; it is only gpu021 that fails.
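For completeness, the runs are launched along these lines (simplified sketch; directory names, input files and the exchange interval are placeholders):

    srun gmx_mpi mdrun -multidir rep0 rep1 rep2 rep3 rep4 rep5 rep6 rep7 \
         -plumed plumed.dat -replex 500 \
         -ntomp 8 -nb gpu -pme gpu \
         -gputasks 0011223344556677

i.e. one PP task and one PME task per rank, each pair pinned to its own GPU, which matches the mapping above.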