Hello all,
Since this morning I have been seeing a strange problem on gpu044: CUDA fails to initialize in my training scripts.
Here’s my (minimal example) sbatch script:
#!/bin/sh
#SBATCH --job-name=curtains
#SBATCH --cpus-per-task=2
#SBATCH --time=00-10:00:00
#SBATCH --partition=private-dpnc-gpu,shared-gpu
#SBATCH --output=/home/users/s/senguptd/UniGe/Anomaly/curtains/curtains2/jobs/slurm-%A-%x_%a.out
#SBATCH --chdir=/home/users/s/senguptd/UniGe/Anomaly/curtains/curtains2
#SBATCH --mem=10GB
#SBATCH --gres=gpu:1
#SBATCH -a 0-11
export XDG_RUNTIME_DIR=""
module load GCC/9.3.0 Singularity/3.7.3-Go-1.14
export PYTHONPATH=${PWD}:${PWD}/python_install:${PYTHONPATH}
srun singularity exec --nv -B /home/users/s/senguptd/UniGe/Anomaly/curtains/curtains2,/home/users/s/senguptd/UniGe/Anomaly/curtains/curtains2/curtains,/srv/beegfs/scratch/users/s/senguptd/:/scratch/,/srv/beegfs/scratch/groups/rodem/LHCO:/lhco_dir,/srv/beegfs/scratch/groups/rodem/anomalous_jets/:/srv/beegfs/scratch/groups/rodem/anomalous_jets/,/srv/beegfs/scratch/groups/rodem/RPVMultiJets/:/rpvmj_dir /home/users/s/senguptd/UniGe/Anomaly/curtains/curtains2/container/id_init.sif \
    python3 /home/users/s/senguptd/UniGe/Anomaly/curtains/curtains2/experiments/sb_to_sb.py
Note: I have also tried using #SBATCH --gpus=1 instead of --gres=gpu:1, with similar results on gpu044.
This is the output I get.
/opt/conda/lib/python3.10/site-packages/torch/cuda/__init__.py:107: UserWarning: CUDA initialization: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero. (Triggered internally at /opt/conda/conda-bld/pytorch_1678411187366/work/c10/cuda/CUDAFunctions.cpp:109.)
  return torch._C._cuda_getDeviceCount() > 0
Jobs launched from the same script on other nodes are running as usual.
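In case it helps narrow this down, a small check along the following lines could be run in place of the training script, inside the same srun singularity exec --nv call, once on gpu044 and once on a working node, to compare what each job sees. This is only a sketch: the function name gpu_env_report is made up for illustration, and it assumes that nvidia-smi is on the PATH inside the container (which --nv would normally arrange) and that Slurm exports CUDA_VISIBLE_DEVICES for the allocation, which depends on the cluster configuration.

```python
import os
import shutil
import socket
import subprocess


def gpu_env_report(env=None):
    """Collect basic facts about the GPU environment visible to this process.

    gpu_env_report is a hypothetical helper name, not part of any library.
    """
    env = os.environ if env is None else env
    report = {
        "host": socket.gethostname(),
        "slurm_job_id": env.get("SLURM_JOB_ID"),
        # Assumption: Slurm sets this for GPU allocations on this cluster.
        "cuda_visible_devices": env.get("CUDA_VISIBLE_DEVICES"),
        "nvidia_smi": None,  # stays None if nvidia-smi is not on the PATH
    }
    smi = shutil.which("nvidia-smi")
    if smi is not None:
        try:
            # "nvidia-smi -L" lists the GPUs the driver exposes to this process.
            result = subprocess.run(
                [smi, "-L"], capture_output=True, text=True, timeout=30
            )
            report["nvidia_smi"] = result.stdout.strip() or result.stderr.strip()
        except (OSError, subprocess.TimeoutExpired) as exc:
            report["nvidia_smi"] = f"failed: {exc}"
    return report


if __name__ == "__main__":
    for key, value in gpu_env_report().items():
        print(f"{key}: {value}")
```

If nvidia-smi already fails or lists no devices on gpu044 while CUDA_VISIBLE_DEVICES is set, that would point to the node or driver rather than to PyTorch or the container image.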
Is there anyone else who’s facing a similar issue?
Current jobs running on the node: 829513_0-7
Regards,
Deb