Primary information
Username: mongin
Cluster: Baobab
Description
When running a job with srun, I get the following errors:
srun: error: Couldn't find the specified plugin name for mpi/pmix_v3 looking at all files
srun: error: cannot find mpi plugin for mpi/pmix_v3
srun: error: MPI: Cannot create context for mpi/pmix_v3
srun: error: MPI: Unable to load any plugin
srun: error: Invalid MPI type 'pmix_v3', --mpi=list for acceptable types
Steps to Reproduce
I submit the batch file with sbatch baobab_classify_SR.bash from [mongin@login1 classify_SR].
The batch file loads the libraries and calls a Python virtual env. I am able to run each part manually myself, so the problem seems to be with srun, but I can't figure out why.
The batch file:
#!/bin/bash
#SBATCH --time=00:10:00
#SBATCH --gpus=2
#SBATCH --partition=shared-gpu
#SBATCH --gres=VramPerGpu:25G
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=30000
#SBATCH --array=1,5,13
. ~/baobab_python_env_LLM3/bin/activate
ml GCC/12.3.0 OpenMPI/4.1.5 PyTorch-bundle/2.1.2-CUDA-12.1.1
srun ~/baobab_python_env_LLM3/bin/python -u classify_SR.py ${SLURM_ARRAY_TASK_ID} > ./results/classify.out
I submit it with: sbatch baobab_classify_SR.bash
Expected Result
The batch file should launch the jobs; it was working last week. Now it stops at launch, and I get the following errors in the Slurm output files:
srun: error: Couldn't find the specified plugin name for mpi/pmix_v3 looking at all files
srun: error: cannot find mpi plugin for mpi/pmix_v3
srun: error: MPI: Cannot create context for mpi/pmix_v3
srun: error: MPI: Unable to load any plugin
srun: error: Invalid MPI type 'pmix_v3', --mpi=list for acceptable types
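The last error line points to --mpi=list, so a first diagnostic could be to compare the MPI type that srun is asked for with the types it actually offers. The sketch below is only a minimal check, not a confirmed fix. It assumes the requested type reaches srun through the SLURM_MPI_TYPE environment variable (possibly set by one of the loaded modules, which is a guess), and the helper name mpi_type_available is hypothetical. It makes no assumption about the exact layout of the --mpi=list output and simply looks for the type as a whole word.

```shell
#!/bin/bash
# Hypothetical helper: succeeds if the MPI type in $1 appears as a whole word
# in the plugin listing passed as $2 (e.g. the output of `srun --mpi=list`).
mpi_type_available() {
    local wanted="$1" listing="$2"
    printf '%s\n' "$listing" | grep -qw -- "$wanted"
}

# Only query Slurm when srun is actually on the PATH.
if command -v srun >/dev/null 2>&1; then
    # Assumption: the listing may go to stdout or stderr, so capture both.
    listing="$(srun --mpi=list 2>&1)"
    # Assumption: the requested type comes from SLURM_MPI_TYPE; fall back to "none".
    wanted="${SLURM_MPI_TYPE:-none}"
    echo "SLURM_MPI_TYPE=${SLURM_MPI_TYPE:-<unset>}"
    if mpi_type_available "$wanted" "$listing"; then
        echo "'$wanted' is offered by this srun"
    else
        echo "'$wanted' is NOT offered by this srun; available types:"
        echo "$listing"
    fi
fi
```

Running this after the ml line of the batch file (or in an interactive shell with the same modules loaded) would show whether pmix_v3 is still among the offered types. If it is not, passing one of the listed types explicitly with srun --mpi=<type> may be worth trying, but which type is appropriate on Baobab is something the cluster admins would need to confirm.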