I’m trying to use Nvidia DALI on baobab through Singularity. On my local machine with 2 GPUs I see a 4-5x speedup in data-preprocessing time for machine learning jobs. I would like to launch large-scale jobs such as SimCLR, which I worked on recently at Google. However, I am seeing an issue related to CPU affinity when running these jobs.
- Here is the related GitHub issue, where I discuss this with the Nvidia devs.
Is it possible from my side to change the CPU affinity of a job? As an example, I have a PyTorch distributed-data-parallel job spanning 13 nodes, and the error occurs on a few of them:
As far as I can tell, the taskset required by Nvidia is not being adhered to:
I have tried a few settings requesting a variety of CPU configurations, i.e.:
#SBATCH --cpus-per-task=2
and
#SBATCH --cpus-per-task=3
and
#SBATCH --cpus-per-task=6
to no avail.
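In case it helps with debugging, here is a minimal sketch of how I could print the CPU affinity each rank actually receives and compare it with what was requested. It assumes Linux (`os.sched_getaffinity` is not available elsewhere) and that Slurm exports `SLURM_CPUS_PER_TASK` and `SLURM_PROCID` into the job environment; the function name `describe_affinity` is just something I made up for illustration, not part of DALI or PyTorch.

```python
import os
import socket


def describe_affinity(environ=None):
    """Return a dict describing the CPUs this process is allowed to run on."""
    environ = os.environ if environ is None else environ

    # os.sched_getaffinity is Linux-only; report an empty list elsewhere.
    if hasattr(os, "sched_getaffinity"):
        allowed = sorted(os.sched_getaffinity(0))  # 0 means "this process"
    else:
        allowed = []

    # SLURM_CPUS_PER_TASK is only set when --cpus-per-task was given.
    requested = environ.get("SLURM_CPUS_PER_TASK")

    return {
        "host": socket.gethostname(),
        "rank": environ.get("SLURM_PROCID"),
        "allowed_cpus": allowed,
        "requested_cpus": int(requested) if requested else None,
    }


if __name__ == "__main__":
    # Run once per task (for example via srun inside the container) so the
    # output can be compared across the nodes where the error appears.
    print(describe_affinity())
```

If the number of entries in `allowed_cpus` differs between the nodes that fail and the nodes that work, that would at least confirm that the affinity mask handed to the job is what differs, rather than something inside the container.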