Primary information
Username: dinelli
Cluster: baobab
Subject: slurm|private-kruse-gpu
Description
I am unable to run jobs on private-kruse-gpu for more than 12 hours.
The bash file to run my jobs can be found in:
scratch/HighResolution-Lattice/HighResolution.sh
This is its content:
#!/bin/env bash
#SBATCH --array=1-7
#SBATCH --partition=private-kruse-gpu
#SBATCH --time=0-72:00:00
#SBATCH --output=%J.out
#SBATCH --mem=3000
#SBATCH --gpus=1
#SBATCH --constraint=nvidia_a100-pcie-40gb|nvidia_a100_80gb_pcie|nvidia_h100_nvl
module load Julia
cd /srv/beegfs/scratch/users/d/dinelli/HighResolution-Lattice/
srun julia --optimize=3 /home/users/d/dinelli/Code/HighResolution-Lattice/NematoPolar-FFT-Main.jl
Note that I had been using this script without any issue until last Thursday. Now, when I run sbatch HighResolution.sh, my jobs remain pending with the following reason:
JOBID          PARTITION          NAME      USER     ST  TIME  NODES  NODELIST(REASON)
7884341_[1-7]  private-kruse-gpu  HighReso  dinelli  PD  0:00  1      (ReqNodeNotAvail, Reserved for maintenance)
However, when I request a 12-hour time limit instead:
#SBATCH --time=0-12:00:00
my jobs run fine. I am surprised, since we should be allowed to run jobs for more than 12 hours on the private partition.
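One possible explanation, which I have not been able to confirm: Slurm does not start a job whose requested time limit would overlap a maintenance reservation, so a 72-hour job could be held while a 12-hour job still fits before the window begins. Below is a minimal sketch of that check. The helper name fits_before_maintenance and the 48-hour offset are hypothetical and chosen only for illustration; they are not taken from the cluster.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Slurm): succeeds if a job starting at
# NOW and running for LIMIT_HOURS would end no later than MAINT_START.
# NOW and MAINT_START are Unix epoch seconds.
fits_before_maintenance() {
    local now=$1 maint_start=$2 limit_hours=$3
    local job_end=$(( now + limit_hours * 3600 ))
    (( job_end <= maint_start ))
}

# Illustrative numbers only: maintenance assumed to begin 48 hours from now.
now=0
maint_start=$(( 48 * 3600 ))

for hours in 12 72; do
    if fits_before_maintenance "$now" "$maint_start" "$hours"; then
        echo "${hours}h job: can start now"
    else
        echo "${hours}h job: would overlap the maintenance window"
    fi
done
```

If this is the cause, the real start time of the maintenance window should appear in the output of scontrol show reservation, and the 72-hour jobs would presumably start once the maintenance is over.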
Could you help me understand what is going wrong?
Thanks in advance for your help!
Alberto