Greetings
For anyone diving into distributed training with PyTorch’s DistributedDataParallel (DDP) on our HPC cluster: I’ve been exploring how it integrates with SLURM and Weights & Biases. The process was an “enlightening” journey that revealed challenges and nuances that aren’t immediately obvious.
In an effort to assist fellow researchers, I’m sharing a repository: slurm-pytorch-ddp-boilerplate. It includes:
- Integrated support for PyTorch DDP and SLURM.
- Seamless logging with Weights & Biases.
- An illustrative MNIST example to demonstrate the functionalities.
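For anyone who hasn't wired SLURM and DDP together before, here is a minimal sketch of the core idea: each task launched by srun reads its rank from SLURM's environment variables and passes it to PyTorch. This is not taken from the repository; the names `DistInfo`, `dist_info_from_slurm`, and `init_ddp` are hypothetical, and the sketch assumes one task per GPU and that `MASTER_ADDR` and `MASTER_PORT` are already exported in the job script.

```python
import os
from typing import Mapping, NamedTuple


class DistInfo(NamedTuple):
    rank: int        # global rank across all nodes
    local_rank: int  # rank within this node, typically used as the GPU index
    world_size: int  # total number of processes in the job


def dist_info_from_slurm(env: Mapping[str, str] = os.environ) -> DistInfo:
    """Read rank information from the variables srun sets for each task.

    Falls back to a single-process layout when the variables are absent,
    e.g. when running the script outside of SLURM.
    """
    return DistInfo(
        rank=int(env.get("SLURM_PROCID", 0)),
        local_rank=int(env.get("SLURM_LOCALID", 0)),
        world_size=int(env.get("SLURM_NTASKS", 1)),
    )


def init_ddp(info: DistInfo) -> None:
    """Initialise the default process group (hypothetical helper).

    Assumes MASTER_ADDR and MASTER_PORT are set in the environment,
    which the default env:// rendezvous requires.
    """
    import torch
    import torch.distributed as dist

    torch.cuda.set_device(info.local_rank)
    dist.init_process_group(
        backend="nccl", rank=info.rank, world_size=info.world_size
    )
```

With something like this in place, a common pattern is to start the Weights & Biases run only on the process where `info.rank == 0`, so that a multi-GPU job produces a single run instead of one per process.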
I have only recently started working with DDP, so there are likely better-optimized approaches. This is my attempt at a foundational starting point, and I genuinely hope it helps you get your research setups running faster.
If you find this resource useful or have ideas for refining it, feedback and pull requests are welcome.