PyTorch DDP + SLURM + Weights & Biases: A Starting Boilerplate

Albert.Buchard · August 9, 2023, 2:49am

Greetings,

For anyone diving into distributed training with PyTorch’s DistributedDataParallel (DDP) on our HPC cluster: I’ve been exploring how it integrates with SLURM and Weights & Biases. The process was an “enlightening” journey that surfaced challenges and nuances that aren’t immediately obvious.

In an effort to assist fellow researchers, I’m sharing a repository: slurm-pytorch-ddp-boilerplate. It includes:

Integrated support for PyTorch DDP and SLURM.
Seamless logging with Weights & Biases.
An illustrative MNIST example to demonstrate the functionalities.
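
To give a sense of the glue involved, here is a minimal sketch of how SLURM's per-task environment variables can be mapped onto the values DDP needs. It is not taken from the repository. The names DistEnv, read_slurm_env, setup_ddp and is_main_process are hypothetical, and the sketch assumes one process per GPU launched with srun, with MASTER_ADDR and MASTER_PORT already exported by the job script.

```python
import os
from dataclasses import dataclass


@dataclass
class DistEnv:
    """Hypothetical container for the values DDP needs from the scheduler."""
    rank: int        # global index of this process across all nodes
    local_rank: int  # index of this process on its own node (selects the GPU)
    world_size: int  # total number of processes in the job


def read_slurm_env(env=os.environ) -> DistEnv:
    """Read SLURM's per-task variables, falling back to a single process.

    SLURM_PROCID, SLURM_LOCALID and SLURM_NTASKS are set by srun for each task.
    """
    return DistEnv(
        rank=int(env.get("SLURM_PROCID", 0)),
        local_rank=int(env.get("SLURM_LOCALID", 0)),
        world_size=int(env.get("SLURM_NTASKS", 1)),
    )


def is_main_process(dist_env: DistEnv) -> bool:
    """True only on rank 0; a common place to create the single W&B run."""
    return dist_env.rank == 0


def setup_ddp(model, dist_env: DistEnv):
    """Join the process group and wrap the model in DDP.

    Assumes MASTER_ADDR and MASTER_PORT are already present in the environment.
    """
    # Imported lazily so the environment parsing above works without PyTorch.
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    use_cuda = torch.cuda.is_available()
    dist.init_process_group(
        backend="nccl" if use_cuda else "gloo",
        init_method="env://",
        rank=dist_env.rank,
        world_size=dist_env.world_size,
    )
    if use_cuda:
        torch.cuda.set_device(dist_env.local_rank)
        model = model.to(dist_env.local_rank)
        return DDP(model, device_ids=[dist_env.local_rank])
    return DDP(model)
```

Keeping the environment parsing separate from the PyTorch calls makes it possible to check the rank logic on a login node without a GPU. Gating logging on is_main_process is one common way to avoid creating one Weights & Biases run per process.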

I have only started working with DDP recently, so I imagine there might be more optimized approaches. This is my attempt to provide a foundational starting point, and I genuinely hope it aids in accelerating your research setups.

If you find this resource useful or see ways to refine it, feedback and pull requests are welcome.

Repository Link