This is my first time using an HPC, so some advice would be nice.
I have a large number of small jobs (1350 jobs, to be precise). I have an 8-core / 16-thread i9-9900K and a Ryzen 2700X available, and on both the largest jobs run in about 20-30 minutes, though most take 1-2 minutes. Memory usage is up to about 1.5 GB.
In this case, is it best to run all of this as a job array? Each job is the same command with different parameters, but the set of parameters isn’t a product of smaller sets. What would a Slurm script look like for many small jobs? To me it sounds easiest to just generate a really long Slurm script in Python, but I’m not sure that’s the best way. (In fact, that’s what I do now: I use a Python script to run this command with different parameters in serial.)
I suppose that in this case it would be best to dedicate 1 core per job (with the default 3 GB of RAM). But what should I put as the runtime per job? Or do you specify the total runtime of all jobs? In that case, it would be better to schedule a 12-hour job, see how much it gets done in those 12 hours, and submit another job if needed, right?
Extrapolating from my workstation, the longest a single job could run is about 4-6 hours on one core (since server CPUs tend to have lower clock rates than desktop CPUs).
Thanks!
If I understand your problem correctly, you should use job arrays with 1 core per job.
Start by running a few jobs (let’s say 20) just to check the actual runtime (a single cluster CPU may be slower than the one in your machine).
For the maximum runtime it is better to aim low at first; your jobs will start faster. With a job array, you specify the maximum runtime per job, not for the whole array. If some jobs did not finish, you can rerun just those with a larger maximum runtime.
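Since your parameter sets aren’t a product of smaller sets, a common pattern is to list them in a text file, one line per job, and let each array task pick out its own line. Here is a minimal sketch of what such a script could look like; the partition name, params.txt and my_script.py are placeholders for whatever you actually use:
#!/bin/bash
#SBATCH --partition=shared-cpu      # placeholder partition name
#SBATCH --time=01:00:00             # maximum runtime PER ARRAY TASK, not for the whole array
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=3000          # in MB
#SBATCH -o logs/myjob-%A-%a.out     # %A = array job ID, %a = task ID
#SBATCH --array=0-1349              # one task per parameter set

# params.txt: one line of parameters per job, generated beforehand
# (e.g. by the Python script you already have).
PARAMS=$(sed -n "$((SLURM_ARRAY_TASK_ID + 1))p" params.txt)

srun python my_script.py $PARAMS
Command-line options override the #SBATCH directives, so to rerun only the tasks that timed out you could do something like sbatch --array=17,230,512 --time=04:00:00 myjob.sh (hypothetical task IDs and script name). You can also limit how many tasks run at once with --array=0-1349%100.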
If you need more advice/help, you can contact SciCoS.
Thanks for the advice, it makes sense.
I now tried running a test script, but nothing seems to happen.
First I installed my conda environment on the login node and activated it, and then I ran the following script with sbatch. I don’t see any logs produced, I don’t get any emails, and the job doesn’t show up in the Baobab web interface (although the ‘history’ tab of the web interface seems broken?). It just says e.g. Submitted batch job 49872535
#!/bin/bash
#SBATCH --partition=debug-cpu
#SBATCH --time=00:01:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=3000 # in MB
#SBATCH --mail-user=rik.voorhaar@unige.ch
#SBATCH --mail-type=ALL
#SBATCH -o logs/myjob-%A-%a.out
#SBATCH --array=0-19
srun python srun-benchmark.py
The script srun-benchmark.py for now just prints out the list of commands it will run, based on the SLURM_ARRAY_TASK_ID environment variable.
If I replace
srun python srun-benchmark.py
with
srun echo "I'm task_id " ${SLURM_ARRAY_TASK_ID} " on node " $(hostname)
still nothing happens, so there seems to be something fundamentally wrong with what I’m doing.
Edit: Never mind, the logs appeared in a different location; the #SBATCH -o option doesn’t seem to work?
The problem was that it couldn’t find the file, probably because the CWD is wrong.
Hi,
It should work, yes. Do you have a “logs” directory in the same location from which you launched the job?
By default the working directory is the place from which your script is submitted. You can specify the working directory if needed: #SBATCH --chdir=<directory>
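For example (myjob.sh and the project path are placeholders), assuming the missing logs directory is indeed the cause:
# The logs/ folder named in "#SBATCH -o logs/..." must exist, relative to
# the working directory, before the job starts; Slurm will not create it.
mkdir -p logs
sbatch myjob.sh

# Alternatively, pin the working directory inside the script
# (placeholder path) and keep a logs/ folder there:
#SBATCH --chdir=/path/to/your/project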