[Bamboo] module command not found

Hello,

Primary information

Username: dumoulil
Cluster: Bamboo

Description

When submitting a job I get this error:

/var/spool/slurmd/job66710/slurm_script: line 33: module: command not found
srun: error: gpu003: task 0: Exited with exit code 2
slurmstepd: error: execve(): julia: No such file or directory

Some context

As the A100s are split across different clusters, I made a custom Julia interface that launches my job arrays on all (or only some) of the clusters. Then, in my sbatch, the clusters communicate with each other to know which array_task_id has to run.
It is working fine on Baobab, but on Bamboo I get the error above. The communication between clusters works fine.
The sbatch files are automatically generated by a Julia function, so their last lines are exactly the same, independently of the cluster. I don't understand why it works on Baobab and not on Bamboo.

Here is my sbatch

For Bamboo, the sbatch is

#!/bin/env bash
#SBATCH --array=[2-40:3,3-40:3,1-40:3]%20
#SBATCH --partition=shared-gpu
#SBATCH --time=0-01:00:00
#SBATCH --output=%J.out
#SBATCH --mem=3000  
#SBATCH --gpus=ampere:1 
#SBATCH --constraint=DOUBLE_PRECISION_GPU

export idx=${SLURM_ARRAY_TASK_ID}
echo $idx

jobID1=$( ssh dumoulil@login1.baobab.hpc.unige.ch 'squeue --me --name="testMC.sh" --Format="ArrayJobID" -h | uniq ')
if [ $(echo -n $jobID1 | wc -c) -gt 1 ]
then
    state1=$( echo -n $(ssh dumoulil@login1.baobab.hpc.unige.ch "sacct -u dumoulil --jobs='${jobID1}_${idx}.0' -n --format=state ") | wc -c)
else
    state1=0
fi

jobID3=$( ssh dumoulil@login1.yggdrasil.hpc.unige.ch 'squeue --me --name="testMC.sh" --Format="ArrayJobID" -h | uniq ')
if [ $(echo -n $jobID3 | wc -c) -gt 1 ]
then
    state3=$( echo -n $(ssh dumoulil@login1.yggdrasil.hpc.unige.ch "sacct -u dumoulil --jobs='${jobID3}_${idx}.0' -n --format=state ") | wc -c)
else
    state3=0
fi

if [ $state1 -gt 1 ] || [ $state3 -gt 1 ] || false
then 
    scancel ${SLURM_ARRAY_JOB_ID}_${SLURM_ARRAY_TASK_ID}
else
    module load Julia
    cd /srv/beegfs/scratch/users/d/dumoulil/Data/debug/test_MC/
    srun julia --optimize=3 /home/users/d/dumoulil/Code/MC/something.jl
fi

Expected Result

Julia is loaded

Actual Result

/var/spool/slurmd/job66710/slurm_script: line 33: module: command not found
srun: error: gpu003: task 0: Exited with exit code 2
slurmstepd: error: execve(): julia: No such file or directory

Thank you for your help,

PS: I know that it is not correct yet; I still need to find a way to fix this line, #SBATCH --array=[2-40:3,3-40:3,1-40:3]%20, and to check whether the job array is fully completed (not in the queue) on another cluster before checking the queue.

Dear @Ludovic.Dumoulin

First remark about your script: there is no need to ssh to each cluster, you can just use squeue (same for sacct) like this:

(bamboo)-[sagon@login1 ~]$ squeue --user dumoulil --cluster=all
CLUSTER: bamboo
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)

CLUSTER: baobab
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)

CLUSTER: yggdrasil
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
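
The same option exists for sacct, for example (the job ID here is just a placeholder):

sacct --clusters=baobab --user dumoulil --jobs=12345_7 --noheader --format=state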

Thank you,
I chose to ssh and use squeue on each cluster because users in my lab can choose to submit jobs on one, two, or all of the clusters. In fact, they only provide a list of hosts, and the sbatch lines are generated with a for loop over the host names.
I know very little about SLURM and bash, so if you have any other suggestions to improve the sbatch script, I would greatly appreciate them.

You can check the job status from any cluster by using squeue. For example, to check the jobs running on Bamboo:

squeue --cluster=bamboo --me --name="testMC.sh" --Format="ArrayJobID" --noheader
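
Applied to your script, the Baobab check could then look something like this (untested sketch; depending on the Slurm version, the squeue output may still contain a CLUSTER: banner line, hence the grep):

jobID1=$(squeue --cluster=baobab --me --name="testMC.sh" --Format="ArrayJobID" --noheader | grep -v '^CLUSTER' | uniq)
if [ $(echo -n $jobID1 | wc -c) -gt 1 ]
then
    state1=$( echo -n $(sacct --clusters=baobab --jobs="${jobID1}_${idx}.0" -n --format=state) | wc -c)
else
    state1=0
fi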

For your module error: I tried to reproduce your issue by running your sbatch, and it works for me.

How are you launching the sbatch?


Thank you! I'll update the function generating the sbatch files.

I just do sbatch testMC.sh

It is also weird because if I connect to the login node I can load Julia.

Did you have the issue multiple times? Always on the same compute node?

Maybe it was a temporary issue that has been fixed in the meantime? Can you please give it another try?

Best

Yann

I had this issue multiple times; I deleted the .out files, so I don't remember the details.
I am trying again now; the jobs are pending.

Thanks,
Ludovic

I still have the same issue, always on gpu003.

Best

Please confirm this is how you proceed to start your job:

  • you connect from your laptop to login1.bamboo.hpc.unige.ch using ssh
  • you type sbatch testMC.sh in the terminal

The sbatch files are sent to the cluster using the same Julia function; in this function, the line that starts the job is:

SSH.ssh("cd $cluster_saving_directory && sbatch $sh_name", host=h)

with

cluster_saving_directory="/srv/beegfs/scratch/users/d/dumoulil/Data/debug/test_MC/"
sh_name="testMC.sh"
h="@login1.bamboo.hpc.unige.ch" # for bamboo

and my ssh() function defined as

ssh(cmd; host=host) = readchomp(`ssh $username$host $cmd`)

It has been working well on Baobab for a few years. On Bamboo, the bash script executes correctly when the if condition is true; the problem arises only with module.

So, concretely, it is similar to typing

ssh dumoulil@login1.bamboo.hpc.unige.ch "cd /srv/beegfs/scratch/users/d/dumoulil/Data/debug/test_MC/ && sbatch testMC.sh"

Dear @Ludovic.Dumoulin, thanks for the information. I was able to reproduce the issue. In fact, based on our understanding, the behavior on Bamboo is the correct one.

You need to source a file to enable module. You can test it like this:

ssh login1.bamboo "source /etc/profile.d/modules.sh ; module load Julia; ml"
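
The same line can also go directly into the generated sbatch, so the tail of your script could look like this (assuming the same path exists on the compute nodes):

else
    source /etc/profile.d/modules.sh   # initialize the module command in the job's shell
    module load Julia
    cd /srv/beegfs/scratch/users/d/dumoulil/Data/debug/test_MC/
    srun julia --optimize=3 /home/users/d/dumoulil/Code/MC/something.jl
fi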

Best

Yann

Edit: after more investigation by my colleague @Adrien.Albert, it appears the user's home directory initially wasn't fully populated with the correct .bashrc. If interested, you can copy all the files from /etc/skel (hidden files starting with .) into your home directory. Anyway, the workaround written in this post is correct as well.
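
For example, one way to copy them (prompting before overwriting anything you may have customized):

cp -ri /etc/skel/. ~/   # -i asks before overwriting existing files in your home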
