CPU Affinity on Baobab for NVIDIA DALI

Jason.Ramapuram · April 29, 2020, 10:11pm

I’m trying to use NVIDIA DALI on Baobab through Singularity. On my local machine with 2 GPUs I see a 4-5x speedup in data-preprocessing time for machine learning jobs. I would like to launch large-scale jobs such as SimCLR, which I worked on recently at Google. However, I am seeing an issue related to CPU affinity when running these jobs.

Here is the related GitHub issue, where I am discussing this with the NVIDIA devs.

Is it possible from my side to change the affinity of a job? As an example, I have a PyTorch distributed-data-parallel job spanning 13 nodes, and the error occurs on a few of them:

As far as I can tell, the taskset required by NVIDIA is not being adhered to:

I have tried a few settings requesting a variety of CPU configurations, i.e.:

#SBATCH --cpus-per-task=2
and
#SBATCH --cpus-per-task=3
and
#SBATCH --cpus-per-task=6

to no avail.
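
For reference, one way to see which cores a given rank is actually allowed to run on is to print the affinity mask from inside the Python process itself. This is only a minimal sketch, assuming Linux (os.sched_getaffinity is not available on other platforms) and that the process was started by Slurm, which sets SLURM_PROCID and, when --cpus-per-task is given, SLURM_CPUS_PER_TASK. The helper name describe_affinity is made up for illustration.

```python
import os
import socket


def describe_affinity():
    """Return a one-line summary of the CPU cores this process may run on."""
    # Passing 0 asks for the affinity mask of the calling process (Linux only).
    cores = sorted(os.sched_getaffinity(0))
    # These variables are only present under Slurm; fall back to "?" otherwise.
    rank = os.environ.get("SLURM_PROCID", "?")
    requested = os.environ.get("SLURM_CPUS_PER_TASK", "?")
    return (
        f"host={socket.gethostname()} rank={rank} "
        f"cpus_per_task={requested} allowed_cores={cores}"
    )


if __name__ == "__main__":
    print(describe_affinity())
```

Calling this at the top of the training script on every rank should make it easier to tell whether the nodes that fail got a different core set than the ones that work.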

Jason.Ramapuram · April 29, 2020, 10:44pm

So after some more digging, it looks like the affinity does match when the job is spawned (i.e. in the bash script), e.g.:

but once it creates a child process, I see the same problem as in the first post.
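
To narrow down where the mask changes, a rough check is to compare the affinity seen by a parent process with the one seen by a child it launches. The sketch below assumes Linux and only tests plain inheritance through a fresh Python interpreter; it does not reproduce whatever DALI or PyTorch do when they start their own workers. The function names are hypothetical.

```python
import os
import subprocess
import sys


def child_affinity():
    """Launch a fresh Python interpreter and return the core list it reports."""
    code = "import os; print(sorted(os.sched_getaffinity(0)))"
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def compare_parent_and_child():
    """Return (parent_cores, child_cores, same) with the core lists as strings."""
    parent = str(sorted(os.sched_getaffinity(0)))
    child = child_affinity()
    return parent, child, parent == child


if __name__ == "__main__":
    parent, child, same = compare_parent_and_child()
    print(f"parent: {parent}")
    print(f"child:  {child}")
    print("masks match" if same else "masks differ")
```

If the masks match here but differ inside the real job, that would suggest the change happens in a library call made after the child starts, rather than in how the child is launched.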