Gpu023 being slow

Hi,

I’m currently running a grid search, and all the jobs on gpu023 are really slow. There is likely nothing wrong with the training script, as the same jobs on gpu012 run as expected.

htop results (screenshot): each process is pinned at 100% of its allocated CPU.

However, GPU utilisation is almost nil.
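
For reference, one way to watch the per-GPU load directly on the node, rather than relying on screenshots, is something like:

# refresh per-GPU utilisation and memory every 5 seconds
nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv -l 5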

Any help?

Please at least show your sbatch script. Are you sure your software is not using CPUs instead of GPUs?

Here’s the sbatch script:

#!/bin/bash
#SBATCH --job-name=clean
#SBATCH --cpus-per-task=1
#SBATCH --time=01-20:00:00
#SBATCH --partition=private-dpnc-gpu
#SBATCH --output=/home/users/s/senguptd/UniGe/Anomaly/curtains/CURTAINS/jobs/slurm-%A-%x_%a.out
#SBATCH --chdir=/home/users/s/senguptd/UniGe/Anomaly/curtains/CURTAINS
#SBATCH --mem=16GB
#SBATCH --gpus=1
#SBATCH --constraint="COMPUTE_TYPE_RTX|COMPUTE_TYPE_AMPERE"
#SBATCH -a 0-4
export XDG_RUNTIME_DIR=""
module load GCCcore/8.2.0 Singularity/3.4.0-Go-1.12
bins=(250,300,350,400,450,500 300,350,400,450,500,550 350,400,450,500,550,600 400,450,500,550,600,650 450,500,550,600,650,700)
mix_sb=(2)
doping=(0)
feature_type=(12)
load=(1)
load_classifiers=(0)
distance=(sinkhorn_slow)
coupling=(1)
spline=(1)
two_way=(1)
shuffle=(1)
coupling_width=(32)
coupling_depth=(2)
batch_size=(256)
epochs=(1000)
nstack=(8)
nblocks=(3)
nodes=(20)
activ=(leaky_relu)
lr=(0.0001)
reduce_lr_plat=(0)
gclip=(5)
nbins=(4)
ncond=(1)
load_best=(0)
det_beta=(0.0)
sample_m_train=(0)
oversample=(4)
use_mass_sampler=(1)
light=(0)
plot=(1)
log_dir=(paper-clean)

srun singularity exec --nv -B /srv/beegfs/scratch/groups/rodem/LHCO/ /home/users/s/senguptd/UniGe/Anomaly/curtains/CURTAINS/container/latest_latest.sif\
	python3 /home/users/s/senguptd/UniGe/Anomaly/curtains/CURTAINS/CURTAINS.py -d fat -n clean_${SLURM_ARRAY_TASK_ID} \
		--bins ${bins[`expr ${SLURM_ARRAY_TASK_ID} / 1 % 5`]}\
		--mix_sb ${mix_sb[`expr ${SLURM_ARRAY_TASK_ID} / 5 % 1`]}\
		--doping ${doping[`expr ${SLURM_ARRAY_TASK_ID} / 5 % 1`]}\
		--feature_type ${feature_type[`expr ${SLURM_ARRAY_TASK_ID} / 5 % 1`]}\
		--load ${load[`expr ${SLURM_ARRAY_TASK_ID} / 5 % 1`]}\
		--load_classifiers ${load_classifiers[`expr ${SLURM_ARRAY_TASK_ID} / 5 % 1`]}\
		--distance ${distance[`expr ${SLURM_ARRAY_TASK_ID} / 5 % 1`]}\
		--coupling ${coupling[`expr ${SLURM_ARRAY_TASK_ID} / 5 % 1`]}\
		--spline ${spline[`expr ${SLURM_ARRAY_TASK_ID} / 5 % 1`]}\
		--two_way ${two_way[`expr ${SLURM_ARRAY_TASK_ID} / 5 % 1`]}\
		--shuffle ${shuffle[`expr ${SLURM_ARRAY_TASK_ID} / 5 % 1`]}\
		--coupling_width ${coupling_width[`expr ${SLURM_ARRAY_TASK_ID} / 5 % 1`]}\
		--coupling_depth ${coupling_depth[`expr ${SLURM_ARRAY_TASK_ID} / 5 % 1`]}\
		--batch_size ${batch_size[`expr ${SLURM_ARRAY_TASK_ID} / 5 % 1`]}\
		--epochs ${epochs[`expr ${SLURM_ARRAY_TASK_ID} / 5 % 1`]}\
		--nstack ${nstack[`expr ${SLURM_ARRAY_TASK_ID} / 5 % 1`]}\
		--nblocks ${nblocks[`expr ${SLURM_ARRAY_TASK_ID} / 5 % 1`]}\
		--nodes ${nodes[`expr ${SLURM_ARRAY_TASK_ID} / 5 % 1`]}\
		--activ ${activ[`expr ${SLURM_ARRAY_TASK_ID} / 5 % 1`]}\
		--lr ${lr[`expr ${SLURM_ARRAY_TASK_ID} / 5 % 1`]}\
		--reduce_lr_plat ${reduce_lr_plat[`expr ${SLURM_ARRAY_TASK_ID} / 5 % 1`]}\
		--gclip ${gclip[`expr ${SLURM_ARRAY_TASK_ID} / 5 % 1`]}\
		--nbins ${nbins[`expr ${SLURM_ARRAY_TASK_ID} / 5 % 1`]}\
		--ncond ${ncond[`expr ${SLURM_ARRAY_TASK_ID} / 5 % 1`]}\
		--load_best ${load_best[`expr ${SLURM_ARRAY_TASK_ID} / 5 % 1`]}\
		--det_beta ${det_beta[`expr ${SLURM_ARRAY_TASK_ID} / 5 % 1`]}\
		--sample_m_train ${sample_m_train[`expr ${SLURM_ARRAY_TASK_ID} / 5 % 1`]}\
		--oversample ${oversample[`expr ${SLURM_ARRAY_TASK_ID} / 5 % 1`]}\
		--use_mass_sampler ${use_mass_sampler[`expr ${SLURM_ARRAY_TASK_ID} / 5 % 1`]}\
		--light ${light[`expr ${SLURM_ARRAY_TASK_ID} / 5 % 1`]}\
		--plot ${plot[`expr ${SLURM_ARRAY_TASK_ID} / 5 % 1`]}\
		--log_dir ${log_dir[`expr ${SLURM_ARRAY_TASK_ID} / 5 % 1`]}
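
For context, the expr arithmetic just maps each SLURM_ARRAY_TASK_ID onto one point of the parameter grid: the index for a given parameter is (task_id / stride) % length, where the stride is the product of the lengths of the faster-cycling parameters. A minimal sketch of the pattern, with two hypothetical parameters:

# hypothetical grid: A has 3 values, B has 2, so 6 array tasks (-a 0-5)
A=(a0 a1 a2)
B=(b0 b1)
i=$(( SLURM_ARRAY_TASK_ID / 1 % 3 ))   # cycles through A on every task
j=$(( SLURM_ARRAY_TASK_ID / 3 % 2 ))   # advances B once per full cycle of A
echo "task ${SLURM_ARRAY_TASK_ID}: A=${A[$i]} B=${B[$j]}"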

It’s meant to use the GPUs.
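
A quick way to confirm the container actually sees the allocated card (assuming the training code is PyTorch-based, which is an assumption here) is to run a one-liner through the same image before the training command:

# hypothetical sanity check using the same container image as above
srun singularity exec --nv container/latest_latest.sif \
	python3 -c "import torch; print('CUDA available:', torch.cuda.is_available())"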

The last job that ran on gpu012 (54811591_0) ran fine and finished the training within ~12 hours. (It failed later due to a bug, but that is being fixed now.) On gpu023, the training itself (54811591_1) took really long.

Hi Yann,

On a parallel note, for debugging situations like this, do you know of a way through Slurm to access the UUID of the GPU that has been allocated to each job? That way we can be certain which card to look at with nvidia-smi.
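
If device cgroups are enforced, the job should only see the card(s) it was allocated, so the simplest workaround I can think of is to ask nvidia-smi from inside the job itself; from the controller side, scontrol seems to expose only the GRES index rather than the UUID:

# inside the job: list the visible (i.e. allocated) card(s) with their UUIDs
nvidia-smi -L
nvidia-smi --query-gpu=index,uuid --format=csv,noheader

# detailed job record (run inside the job, or give the job ID explicitly)
scontrol show job -d "$SLURM_JOB_ID" | grep -i gres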

Cheers,
Johnny

Hi, what I can see is that your job seems to use both the CPU and the GPU. According to htop, each of your processes is using 100% of its allocated CPU. On gpu012 the CPU has a frequency of 3.4 GHz, while on gpu023 the CPU frequency is only 2.2 GHz; this may be an explanation. Is it possible for your job to use more than one CPU per task? On gpu023, there are 128 CPUs.
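
A minimal sketch of that change, assuming the CPU-bound part is threaded data loading or numerics rather than a single-threaded loop:

#SBATCH --cpus-per-task=4                       # request more cores per task
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}   # let threaded libraries use them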

Hi,

At the time of writing, I am the only user running jobs on both gpu023 and gpu024.

When I check the GPU and CPU utilisation on both:

  1. On gpu023 the CPU utilisation is maxed at 100% and the GPU is at ~10%.
  2. On gpu024 the CPU utilisation is maxed at 100% and the GPU is at ~24%.

gpu023 (utilisation screenshot)

gpu024 (utilisation screenshot)


The Slurm script for these jobs had the following specs:

#!/bin/sh
#SBATCH --job-name=hpfatwide
#SBATCH --cpus-per-task=2
#SBATCH --time=01-20:00:00
#SBATCH --partition=private-dpnc-gpu
#SBATCH --output=/home/users/s/senguptd/UniGe/Anomaly/curtains/CURTAINS/jobs/slurm-%A-%x_%a.out
#SBATCH --chdir=/home/users/s/senguptd/UniGe/Anomaly/curtains/CURTAINS
#SBATCH --mem=16GB
#SBATCH --gpus=1
#SBATCH --constraint="COMPUTE_TYPE_RTX|COMPUTE_TYPE_AMPERE"
#SBATCH -a 0-80

Hi, good news: it seems one of the power supplies wasn’t working on gpu023. When this happens, the CPUs are throttled. After reseating the power supply, I can see that your jobs now have a GPU load of ~20%.
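
Throttling like this is visible directly in the reported core clocks; for example, something like the following run on the node would have shown frequencies well below nominal:

# current clock speed per core; a throttled node sits far below its nominal frequency
lscpu | grep -i mhz
grep "cpu MHz" /proc/cpuinfo | sort | uniq -c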

Now, the question is why the power supply wasn’t working in the first place :)

We are installing a new monitoring system; we’ll add this to the checks.

Best


Hi Yann,

Did the monitoring system reveal anything? It looks like gpu023 is being slow again.

Hi, you are right. And no, the monitoring didn’t catch it. We had to go physically in front of the server to see that there is an orange light. :sob: I’ve asked the vendor if there is a more modern way to do it.
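
For what it’s worth, if the node’s BMC exposes its power-supply sensors, something like ipmitool (assuming it is available for that host) might surface this without a walk to the machine room:

# query the BMC's power-supply sensors and recent event log (sensor names vary by vendor)
ipmitool sdr type "Power Supply"
ipmitool sel elist | tail -n 20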

Hi Yann,

Is it possible for DALCO to repair or replace the seemingly faulty power supply?
A 20x increase in the time required for training on a GPU due to throttling is not something we can live with.

Cheers,
Johnny

Hi, I changed the PSU on 2022-05-09 (22:00 UTC).