You guys have helped me compile Fall3d and I am now migrating my scripts from my old cluster. Below is the content of my slurm file, where I adapted everything but the last line.
Sorry for the clueless question, but how do I find the name of the actual executable to call? I.e., in the following example, I used to run the model using
fall3d.r8.x, but I get the error
No such file or directory
module load GCC/9.3.0 OpenMPI/4.0.3 fall3d/8.0.1
srun fall3d.r8.x all CC2011_mean.inp 4 4 2
Thanks! You can make fun of me during the next HPC lunch
@Yann.Sagon Sorry to hit you again. After loading Fall3d, typing
fal and hitting Tab shows only this:
[email@example.com ~]$ fal
According to the doc, I should be able to run it following something like that:
mpirun -n np Fall3d.x Task name.inp [nx ny nz]
It seems that no “exe” file is available. Can you double-check my sanity when you’ve got a sec?
Remember that Linux is case sensitive for the files:
[sagon@login2 ~] $ ml GCC/9.3.0 OpenMPI/4.0.3 fall3d/8.0.1
[sagon@login2 ~] $ which Fall3d.r8.x
All the software we install is stored in this location:
[sagon@login2 ~] $ ls $EBROOTFALL3D/bin
Do not use
mpirun -n np; use srun instead. Your sbatch is correct.
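For anyone landing here later, a minimal sbatch script along those lines might look like this (module versions, executable name and input file are the ones from this thread; the task count and wall time are just illustrative):

```shell
#!/bin/sh
#SBATCH --ntasks=32
#SBATCH --time=0-12:00:00
#SBATCH --partition=public-cpu

module load GCC/9.3.0 OpenMPI/4.0.3 fall3d/8.0.1

# srun inherits the task count from --ntasks, so no -n flag is needed
srun Fall3d.r8.x All CC2011_mean.inp 4 4 2
```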
Checked, thanks a lot. And of course, srun for MPI.
I’m staying on this thread so it is hopefully easier to read back (shout if you’d prefer me to open a different one, or if these sorts of ramblings should not be posted on the forum). I managed to get Fall3d jobs running but I have some questions/issues.
Firstly - the way I run the code. From before on Baobab (i.e., before I left Geneva), I used to run Fall3d with
srun Fall3d in a bash file and then call
./bashfile.sh to submit the job. That, for some reason, now returns the splash screen of the model I am using. The only way I found to submit my job is to keep
srun Fall3d in the bash file, but then submit it with
sbatch from the terminal. So I just wanted to make sure this was ok. Just in case, all the files are in
Second - is there a way to monitor the CPU usage of a given job on Yggdrasil? On Baobab, I remember there was a GUI. Can you do something like that from the terminal with Slurm?
Finally, some of my jobs are cancelled as soon as I submit them, but the log and error files are empty. Are there ways to activate a more complete output for debugging?
Thanks a lot!
Good morning! Here is an update. My code is running fine but got killed by timeout (set to 12h). I suspect it is not using all resources, since this is a benchmark that I was able to run in ~7h on a previous, likely older, cluster. Therefore the monitoring aspect of the previous question becomes important. Can you please let me know when you have a sec?
Thanks a lot
Yes it is better to open a new thread for the next topic. The issue with continuing an old one is that for us it appears as “solved” and we won’t notice there is a continuation.
sbatch is the way to go. If you submit your job with
./bashfile.sh, it won’t return until the job is finished, and pragmas such as
#SBATCH XXX won’t be taken into account.
Something like that? hpc:hpc_clusters [eResearch Doc]
You can as well connect using ssh to the node where your job is running and type
htop to see the live performance of the node.
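Concretely, something like this (the job id and node name below are made up for the example):

```shell
# Find which node(s) job 123456 is running on; -h drops the header line
squeue -h -j 123456 -o "%N"

# Connect to the reported node and watch the live CPU usage
ssh cpu042
htop
```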
Do you have the job IDs so we can check if there is an issue on our side? You can add debug messages in your
sbatch script using
echo blabla if you suspect that a command will crash.
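A small sketch of that idea: besides plain echo markers, sh’s set -x option prints each command to stderr before it runs (so the trace ends up in the job’s error file), and set -e stops the script at the first failing command:

```shell
#!/bin/sh
# set -e aborts the script on the first failing command,
# set -x prints each command (to stderr) before it runs.
set -ex

echo "job starting in $(pwd)"
echo "about to launch the model"
# srun Fall3d.r8.x All CC2011_mean.inp 4 4 4
```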
I checked your
sbatch script. Some comments (non-relevant lines removed):
1 #SBATCH --time=0-12:15:00
2 #SBATCH --partition=public-cpu
3 #SBATCH --output=slurm-%J.out
4 #SBATCH --ntasks=64
5 #SBATCH --error jobname-error.e%j
6 #SBATCH --output jobname-out.o%j
7 #SBATCH --mail-user=xxx
8 #SBATCH --mail-type=ALL
9 srun Fall3d.r8.x ALL CC2011_mean.inp 4 4 4 >& $SLURM_JOBID.log 2>&1
As you are using almost the full 12h, try to increase the number of tasks, reduce the time limit to 12h00, and use the partition
You are specifying the
--output option twice, with different values (l3 and l6).
On l9, you are in fact overriding what you specified on l3, l5 and l6, and the syntax used to redirect the logs there is redundant (>& already sends both stdout and stderr to the file, so the extra 2>&1 does nothing). I’ll remove the redirect on l9 and remove l3, l5 and l6.
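Putting those comments together, the cleaned-up script might look like this (mail address elided as in the original; with no --output/--error lines, logs go to the default slurm-<jobid>.out; the ntasks value here is only illustrative of “increase the tasks number”):

```shell
#!/bin/sh
#SBATCH --time=0-12:00:00
#SBATCH --partition=public-cpu
#SBATCH --ntasks=128
#SBATCH --mail-user=xxx
#SBATCH --mail-type=ALL

# stdout/stderr go to the default slurm-<jobid>.out
srun Fall3d.r8.x ALL CC2011_mean.inp 4 4 4
```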
Thanks for your answer! And sorry about the multi-threading.
I followed your advice and managed to use 256 tasks, which reduced the time to ~5h. I still think some performance is lost, though. I understand the
seff command might not always be reliable, but it shows I am using 0% CPU. Another insight is that Fall3d reports a CPU time of 14800 sec, i.e., ~4h. Does that say something about efficiency?
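For reference, one way to cross-check this from Slurm’s accounting once the job has finished (the job id is made up): seff gives the summary, and sacct exposes the raw fields behind it, where CPU efficiency is roughly TotalCPU / (Elapsed × NCPUS).

```shell
# Summary view, including CPU and memory efficiency
seff 123456

# Raw accounting fields behind the same number
sacct -j 123456 --format=JobID,Elapsed,TotalCPU,NCPUS
```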
Sorry for the cluelessness of the questions and thanks for the help. Have a good week ahead!