Hi, I am trying to run an export job (it involves PyTorch) with the following job specification:
#!/bin/sh
#SBATCH --job-name=export
#SBATCH --cpus-per-task=12
#SBATCH --time=1:00:00
#SBATCH --partition=private-dpnc-gpu,shared-gpu
#SBATCH --output=/home/users/s/senguptd/UniGe/astro/skycurtains/logs/%A_%a.out
#SBATCH --chdir=/home/users/s/senguptd/UniGe/astro/skycurtains/
#SBATCH --mem=15GB
#SBATCH --gpus=1
I get the following error: "Too many open files. Communication with the workers is no longer possible."
This seems to happen only on certain nodes (gpu023, gpu024, gpu038); the same job worked on gpu040, for instance. Is the number of CPUs requested too high for the former nodes, so that I should turn it down?
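For what it's worth, one thing I plan to check is whether the per-process file-descriptor limits differ between the failing and working nodes. A minimal probe (a hypothetical extra job step, reusing the same #SBATCH directives as above) would be:

```shell
#!/bin/sh
# Print the node name and the soft/hard file-descriptor limits,
# to compare gpu023/gpu024/gpu038 against gpu040.
echo "node: $(hostname)"
echo "soft open-file limit: $(ulimit -Sn)"
echo "hard open-file limit: $(ulimit -Hn)"
```

If the soft limit is noticeably lower on the failing nodes, that would explain the node-dependent behaviour.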
What I can confirm is that no files in the code are opened outside of a context manager.
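My understanding (please correct me if this is wrong) is that PyTorch DataLoader workers pass tensors between processes through shared-memory file descriptors, so the limit can be exhausted even when the code never explicitly opens files. As a workaround sketch, assuming the node's hard limit permits it, I could raise the soft limit in the submission script before launching the job:

```shell
#!/bin/sh
# Sketch: raise the soft file-descriptor limit to the hard limit
# before starting the export job, so DataLoader workers have more
# descriptors available. (The command launched afterwards is a
# placeholder for my actual job step.)
ulimit -Sn "$(ulimit -Hn)"
echo "soft open-file limit now: $(ulimit -Sn)"
```

Would that be an acceptable thing to do on the shared partitions, or is the limit enforced on purpose?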
Any help figuring out this issue would be appreciated. If you require additional information, please let me know.
Thanks,
Deb