Hello,
Primary information
Username: dumoulil
Cluster: Bamboo
Description
When submitting a job I get this error:
/var/spool/slurmd/job66710/slurm_script: line 33: module: command not found
srun: error: gpu003: task 0: Exited with exit code 2
slurmstepd: error: execve(): julia: No such file or directory
Some context
As the A100s are split across different clusters, I made a custom Julia interface that launches my job arrays on all (or only some of) the clusters. Then, in my sbatch, the clusters communicate with each other to know which array_task_id has to run.
It works fine on Baobab, but on Bamboo I get the error above. The communication between the clusters works fine.
The sbatch scripts are automatically generated by a Julia function, so their last lines are exactly the same, independently of the cluster. I don’t understand why it works on Baobab and not on Bamboo.
Here is my sbatch
For Bamboo, the sbatch is
#!/bin/env bash
#SBATCH --array=[2-40:3,3-40:3,1-40:3]%20
#SBATCH --partition=shared-gpu
#SBATCH --time=0-01:00:00
#SBATCH --output=%J.out
#SBATCH --mem=3000
#SBATCH --gpus=ampere:1
#SBATCH --constraint=DOUBLE_PRECISION_GPU
export idx=${SLURM_ARRAY_TASK_ID}
echo $idx

# Look up the matching array job on Baobab and check whether this task has already started there.
jobID1=$( ssh dumoulil@login1.baobab.hpc.unige.ch 'squeue --me --name="testMC.sh" --Format="ArrayJobID" -h | uniq ')
if [ $(echo -n $jobID1 | wc -c) -gt 1 ]
then
    state1=$( echo -n $(ssh dumoulil@login1.baobab.hpc.unige.ch "sacct -u dumoulil --jobs='${jobID1}_${idx}.0' -n --format=state ") | wc -c)
else
    state1=0
fi

# Same check on Yggdrasil.
jobID3=$( ssh dumoulil@login1.yggdrasil.hpc.unige.ch 'squeue --me --name="testMC.sh" --Format="ArrayJobID" -h | uniq ')
if [ $(echo -n $jobID3 | wc -c) -gt 1 ]
then
    state3=$( echo -n $(ssh dumoulil@login1.yggdrasil.hpc.unige.ch "sacct -u dumoulil --jobs='${jobID3}_${idx}.0' -n --format=state ") | wc -c)
else
    state3=0
fi

# If another cluster is already handling this task, cancel it here; otherwise run it.
if [ $state1 -gt 1 ] || [ $state3 -gt 1 ] || false
then
    scancel ${SLURM_ARRAY_JOB_ID}_${SLURM_ARRAY_TASK_ID}
else
    module load Julia
    cd /srv/beegfs/scratch/users/d/dumoulil/Data/debug/test_MC/
    srun julia --optimize=3 /home/users/d/dumoulil/Code/MC/something.jl
fi
Expected Result
Julia is loaded
Actual Result
/var/spool/slurmd/job66710/slurm_script: line 33: module: command not found
srun: error: gpu003: task 0: Exited with exit code 2
slurmstepd: error: execve(): julia: No such file or directory
Thank you for your help,
PS: I know that it is not correct yet. I still need to find a way to fix this line: #SBATCH --array=[2-40:3,3-40:3,1-40:3]%20
and to check whether the job array is fully completed (no longer in the queue) on another cluster before checking the queue.
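A rough sketch of what that completion check might look like (untested; it reuses the ssh lookup from above and assumes $jobID1 is the non-empty array job ID found there):
# Count the array tasks of $jobID1 on Baobab that are not yet COMPLETED.
notdone=$( ssh dumoulil@login1.baobab.hpc.unige.ch "sacct --jobs='${jobID1}' -n --format=state" | grep -c -v -e 'COMPLETED' -e '^[[:space:]]*$' )
if [ "$notdone" -eq 0 ]
then
    echo "array ${jobID1} is fully completed on Baobab"
fi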
Dear @Ludovic.Dumoulin
First remark about your script: no need to ssh to each cluster, you can just use squeue (same for sacct), like this:
(bamboo)-[sagon@login1 ~]$ squeue --user dumoulil --cluster=all
CLUSTER: bamboo
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
CLUSTER: baobab
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
CLUSTER: yggdrasil
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
Thank you,
I chose to use squeue for each cluster because users in my lab can choose to submit jobs on one, two, or all of the clusters. In fact, they only provide a list of hosts, and the sbatch lines are then generated using a for loop over the host names.
I know very little about SLURM or bash, so if you have any other suggestions to improve the sbatch script, I would greatly appreciate them.
You can check the job status from any cluster by using squeue. For example, to check the jobs running on Bamboo:
squeue --cluster=bamboo --me --name="testMC.sh" --Format="ArrayJobID" --noheader
For your module error: I tried to reproduce your issue by running your sbatch and it works.
How are you launching the sbatch?
Thank you! I’ll update the function generating the sbatch.
I just do sbatch testMC.sh
It is also weird because if I connect to the login node, I can load Julia.
Did you have the issue multiple times? Always on the same compute node?
Maybe it was a temporary issue that was fixed in the meantime? Can you please give it another try?
Best
Yann
I had this issue multiple times, but I deleted the .out files so I don’t remember the details.
I am trying again now; the jobs are pending.
Thanks,
Ludovic
I still have the same issue, always on gpu003.
Best
Please confirm this is how you proceed to start your job:
- you connect from your laptop to login1.bamboo.hpc.unige.ch using ssh
- you type sbatch testMC.sh in the terminal
The sbatch scripts are sent to the cluster using the same Julia function; in this function, the line starting the job is:
SSH.ssh("cd $cluster_saving_directory && sbatch $sh_name", host=h)
with
cluster_saving_directory="/srv/beegfs/scratch/users/d/dumoulil/Data/debug/test_MC/"
sh_name="testMC.sh"
h="@login1.bamboo.hpc.unige.ch" # for bamboo
and my ssh() function is defined as
ssh(cmd; host=host) = readchomp(`ssh $username$host $cmd`)
It has been working well on Baobab for a few years. On Bamboo, the bash script executes correctly if the if condition is true; the problem arises only with module.
So, concretely, it is similar to typing
ssh dumoulil@login1.bamboo.hpc.unige.ch "cd /srv/beegfs/scratch/users/d/dumoulil/Data/debug/test_MC/ && sbatch testMC.sh"
Dear @Ludovic.Dumoulin, thanks for the information. I was able to reproduce the issue. In fact, based on our understanding, the correct behavior is the one on Bamboo.
You need to source a file to enable module. You can test it like this:
ssh login1.bamboo "source /etc/profile.d/modules.sh ; module load Julia; ml"
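In the sbatch itself, the same workaround would look roughly like this (a sketch; it assumes /etc/profile.d/modules.sh is present at that path on the compute nodes):
# Make the module command available in non-login shells before using it.
if ! command -v module >/dev/null 2>&1
then
    source /etc/profile.d/modules.sh
fi
module load Julia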
Best
Yann
Edit: after more investigation by my colleague @Adrien.Albert, it appears the user's home directory initially wasn’t fully populated with the correct .bashrc. If interested, you can copy all the files from /etc/skel (hidden files starting with .) into your home directory. Anyway, the workaround written in this post is correct as well.
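For reference, copying those skeleton files (hidden ones included) can be done with something like this (a sketch; the -n flag avoids overwriting files you already have):
# Copy the default dotfiles from /etc/skel into the home directory.
cp -an /etc/skel/. "$HOME"/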