Slow CPU performance on gpu017

Hello,

Last week I submitted a job array to the private-dpnc-gpu partition requesting 1 GPU and 6 CPUs per job.
(I know this would saturate gpu012, but it was a small, quick test and the queue was empty.)

It was a simple training script where CPU threads read data (from the scratch disk) in parallel and send batches to the GPU to compute asynchronously. I noticed some major differences between the jobs that landed on gpu012 and those that landed on gpu017.
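
For reference, the pipeline is roughly the pattern below (a simplified PyTorch-style sketch; the dataset class, path, batch size and model are placeholders, not my actual code):

    import glob

    import torch
    from torch import nn
    from torch.utils.data import Dataset, DataLoader


    class ScratchFileDataset(Dataset):
        """Placeholder dataset: one sample per file on the scratch disk."""

        def __init__(self, pattern):
            self.files = sorted(glob.glob(pattern))

        def __len__(self):
            return len(self.files)

        def __getitem__(self, idx):
            # The real script parses its own file format; torch.load is just a stand-in.
            return torch.load(self.files[idx])


    dataset = ScratchFileDataset("/scratch/mydata/train/*.pt")   # illustrative path
    loader = DataLoader(dataset, batch_size=256, num_workers=6,  # 6 CPU workers read in parallel
                        pin_memory=True)                         # enables async copies to the GPU

    device = torch.device("cuda")
    model = nn.Linear(128, 2).to(device)                         # stand-in for the real network

    for batch in loader:
        batch = batch.to(device, non_blocking=True)   # asynchronous host-to-device transfer
        out = model(batch)                            # GPU computes while the workers keep reading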

The first step in the script was to iterate over the input files and count the total number of training samples (CPU only; roughly the loop sketched after the timings). It took:

  • 47 seconds on gpu012
  • ~ 7 minutes on gpu017
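
The counting step is essentially just the following, i.e. it reads every input file once from scratch (the path and file format are illustrative):

    import glob

    import torch


    def count_samples(path):
        # Illustrative: assume each .pt file holds a tensor whose first dimension
        # is the number of samples; the real script parses its own format.
        return torch.load(path).shape[0]


    files = sorted(glob.glob("/scratch/mydata/train/*.pt"))   # same illustrative path as above
    total = sum(count_samples(p) for p in files)              # pure CPU + disk reads, no GPU work
    print(f"{len(files)} files, {total} training samples")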

Then performing inference using the network (no back-propagation; sketched after the timings) took:

  • ~ 1 minute on gpu012
  • ~ 30 minutes on gpu017
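
The inference step is just a forward pass with gradients disabled, along these lines (continuing the placeholder names from the data-loading sketch above):

    model.eval()                                          # inference mode, no training
    with torch.no_grad():                                 # gradients are never computed
        for batch in loader:
            batch = batch.to(device, non_blocking=True)
            preds = model(batch)                          # GPU work; input still streams from scratch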

I used nvidia-smi to look at the GPU utilization during this step (polled as sketched below) and got:

  • 25% - 32% on gpu012
  • 1% - 2% on gpu017
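
For reference, I sampled the utilization by polling nvidia-smi from a small helper while the jobs were running, roughly like this:

    import subprocess
    import time

    # Print GPU utilization and memory use once per second (one line per GPU on the node).
    while True:
        result = subprocess.run(
            ["nvidia-smi",
             "--query-gpu=index,utilization.gpu,memory.used",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True,
        )
        print(result.stdout.strip())
        time.sleep(1)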

These were almost identical jobs in the same array, so they should have taken around the same time (if not less on gpu017 with the newer hardware). It seems that the CPUs weren't performing properly on gpu017.

Does anyone know what might have caused this?
Any help is greatly appreciated.

Hi,

I am having similar issues. Jobs on gpu017 that constantly need to read from disk are much slower than similar jobs on gpu012 (about 7 to 8 times slower). However, for a job where I don't need to constantly read data and can keep everything in memory, gpu017 is faster than gpu012, as expected, since gpu017 has the better GPUs.

Cheers,
Lukas

Hi there,

Thank you for the additional information. gpu017.baobab is indeed on a different InfiniBand switch than gpu012.baobab.

The Easter backlog is taking up most of my time right now, so I cannot run any tests in the short term.

In the meantime, if someone would like to confirm that this is a network issue, node[265-272].baobab are on the same InfiniBand switch as gpu017.baobab.
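
A quick way to compare raw read throughput from scratch would be something like the sketch below, run once on gpu012 and once on one of those nodes (the file path is just an example; use any large file on scratch, ideally one that is not already in the page cache):

    import time

    path = "/path/on/scratch/to/a/large/file"   # example path: any multi-GB file on scratch
    chunk = 64 * 1024 * 1024                    # read in 64 MiB chunks

    start = time.time()
    nbytes = 0
    with open(path, "rb") as f:
        while True:
            data = f.read(chunk)
            if not data:
                break
            nbytes += len(data)
    elapsed = time.time() - start
    print(f"read {nbytes / 1e6:.0f} MB in {elapsed:.1f} s "
          f"-> {nbytes / 1e6 / elapsed:.0f} MB/s")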

Thx, bye,
Luca

Dear Matthew,

We’ll try to investigate, but we need more information.

Here is some information:

  • gpu017 has 128 CPU cores @ 2.20 GHz vs 12 CPU cores @ 3.40 GHz for gpu012

  • gpu017 has 8 x RTX 3090 with 24 GB RAM vs 8 x RTX 2080 with 11 GB RAM for gpu012

As far as we understand:

You submitted a job array on gpu017 using 1 GPU and 6 CPUs per job. That means a maximum of 8 jobs running concurrently on gpu017 (48 CPUs and 8 GPUs).

How many concurrent jobs did you run on gpu012, given that it only has 12 CPUs?

Is each of your jobs reading the same dataset from the scratch space? If so, and if the bottleneck is the storage (since you are reading the same data multiple times), it may explain why you saw almost a factor of 8 on gpu017.
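
As a back-of-the-envelope illustration (the numbers below are made up, only the scaling matters): if N concurrent jobs share the same link to the storage, each one sees roughly 1/N of the bandwidth, so the read phase takes roughly N times longer per job.

    # Hypothetical numbers, only to illustrate the scaling; they are not measurements.
    link_bandwidth_mb_s = 1000        # bandwidth from the node to the storage, shared by all jobs
    dataset_mb = 50_000               # amount each job reads from scratch

    for n_jobs in (1, 8):
        per_job_bw = link_bandwidth_mb_s / n_jobs      # each job gets ~1/N of the link
        read_time_s = dataset_mb / per_job_bw
        print(f"{n_jobs} concurrent job(s): ~{read_time_s:,.0f} s of reading per job")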

What is the size of the dataset you are reading? Once read, is everything kept in memory?

As the GPUs on gpu017 have more RAM, it may seem logical that the GPU usage is lower on gpu017 than on gpu012.

gpu017 has 128 CPU cores and may be shared with other users. If they are performing I/O-intensive tasks as well, this may limit the bandwidth to the storage from this node.

Quick update:

gpu017 has an issue with the InfiniBand interconnect (needed for fast access to home and scratch). We'll update the post once we figure out what is going on.

By the way, thanks for letting us know there is an issue; we appreciate it.

Hi,
I think that is definitely the issue. My training epochs require constant reading from scratch, so the InfiniBand problem is probably causing the slowdown.

Matt

Hi again,

it is now corrected. The InfiniBand was indeed not working on this node, and all the network traffic was going over the 1 Gb Ethernet link instead.

Please give it a try and let us know if you still have the performance issue.

Best

Yann