PyTorch DDP + SLURM + Weights & Biases: A Starting Boilerplate

Greetings :vulcan_salute:

This is for anyone diving into distributed training with PyTorch’s DistributedDataParallel (DDP) on our HPC cluster. I’ve been exploring how to integrate it with SLURM and Weights & Biases, and the process was an “enlightening” journey, with challenges and nuances that aren’t immediately obvious. :sweat_smile:

In an effort to assist fellow researchers, I’m sharing a repository: slurm-pytorch-ddp-boilerplate. It includes:

  • Integrated support for PyTorch DDP and SLURM.
  • Logging with Weights & Biases.
  • An MNIST example that demonstrates these features.
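
To give a flavor of the glue involved, here is a minimal sketch of mapping SLURM's per-task environment variables onto the rank information DDP needs. This is not necessarily how the repository does it: the names `DistInfo` and `dist_info_from_slurm` are made up for illustration, and the sketch assumes the job is launched with `srun` using one task per GPU, so that `SLURM_PROCID`, `SLURM_NTASKS`, and `SLURM_LOCALID` are set.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class DistInfo:
    """Hypothetical container for the values DDP needs per process."""
    rank: int        # global rank across all nodes
    world_size: int  # total number of processes in the job
    local_rank: int  # rank within the current node (usually the GPU index)

    @property
    def is_main(self) -> bool:
        # Convention: only the global rank 0 process logs to W&B.
        return self.rank == 0


def dist_info_from_slurm(env: Optional[Mapping[str, str]] = None) -> DistInfo:
    """Read rank information from SLURM variables set by srun.

    Falls back to a single-process setup when the variables are missing,
    e.g. when running the script directly on a workstation.
    """
    env = os.environ if env is None else env
    if "SLURM_PROCID" not in env:
        return DistInfo(rank=0, world_size=1, local_rank=0)
    return DistInfo(
        rank=int(env["SLURM_PROCID"]),
        world_size=int(env.get("SLURM_NTASKS", "1")),
        local_rank=int(env.get("SLURM_LOCALID", "0")),
    )
```

In a training script, values like these would typically be passed to `torch.distributed.init_process_group(backend="nccl", rank=info.rank, world_size=info.world_size)`, with `MASTER_ADDR` and `MASTER_PORT` exported in the job script, and the call to `wandb.init(...)` guarded by `if info.is_main:` so that only one process creates a run.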

I only started working with DDP recently, so there are probably better-optimized approaches. This is meant as a starting point, and I hope it helps you get your own research setup running faster.

If you find it useful or have ideas for improving it, feedback and pull requests are welcome.

Repository Link