Hello, I’m reading and writing some files on the scratch disk, and some of my jobs failed with a FileNotFoundError even though the file clearly exists at the correct location.
I tried moving my files to my home directory and I no longer get this error. Could it be related to the scratch disk being too full? I tried deleting some of my old files, but that doesn’t solve the problem.
Thanks for your help!
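For reference, a quick way to check whether the scratch filesystem itself is short on space or inodes (assuming ${SCRATCH} is the scratch path set by the cluster environment, as in the submit scripts below):
df -h  "${SCRATCH}"     # free space on the filesystem behind ${SCRATCH}
df -hi "${SCRATCH}"     # free inodes, which can also cause file operations to fail
du -sh "${SCRATCH}"     # total size of your own scratch directory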
So I realized that the “file not found” problem was actually a memory problem: I had requested too little memory for my job. Below is an example of my submit script configuration:
#!/usr/bin/env bash
#SBATCH --time=0-00:10:00
#SBATCH --partition=debug-cpu,private-dpnc-cpu,shared-cpu,public-cpu
#SBATCH --mem=60G
#SBATCH --output=log_txt/slurm-%J.out
#SBATCH --job-name='run_GRooTrackerVtx'
With this configuration, I sometimes (seemingly at random) get the problem that files stored on the ${SCRATCH} disk are not found. This doesn’t happen if I run the exact same script but read and write the files on the ${HOME} disk instead of ${SCRATCH}. Does reading from/writing to ${SCRATCH} take more resources than reading from/writing to ${HOME}?
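One way to confirm that memory (and not the filesystem) is the limiting factor is to compare the requested and actually used memory of a finished job with sacct; this is a generic sketch where <jobid> is a placeholder for a real job ID:
sacct -j <jobid> --format=JobID,JobName,State,ExitCode,ReqMem,MaxRSS,Elapsed
A job killed for exceeding its memory request typically shows OUT_OF_MEMORY (or a failed state with a non-zero exit code) and a MaxRSS close to ReqMem.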
To be able to keep reading/writing on ${SCRATCH}, my fix was to request more memory for my jobs, e.g.:
#!/usr/bin/env bash
#SBATCH --time=0-00:10:00
#SBATCH --partition=debug-cpu,private-dpnc-cpu,shared-cpu,public-cpu
#SBATCH --mem=80G
#SBATCH --output=log_txt/slurm-%J.out
#SBATCH --job-name='run_GRooTrackerVtx'
This works, but it causes a new problem: the jobs are held longer in the queue with reason (BadConstraints). After a while they do start, but it takes much longer than usual. Is there a workaround or good practice for this case?
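In case it is relevant: the (BadConstraints) reason usually means the requested resources cannot be satisfied on one of the requested partitions, for example when --mem exceeds the memory available on that partition’s nodes. A quick sanity check (partition names copied from the script above) is to compare the request against what each partition offers:
sinfo -p debug-cpu,private-dpnc-cpu,shared-cpu,public-cpu -o "%P %m %c %l"   # partition, memory per node (MB), CPUs per node, time limit
Requesting only as much memory as the job actually needs (based on MaxRSS from sacct) keeps the job eligible for more nodes and should shorten the queueing time.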
Thank you!
Hi @Stephanie.Bron,
Can you share a jobid that had the issue please?
On which cluster do you submit your jobs?
No.
Can we see the rest of your sbatch script, please?