How to determine MASTER_ADDR and MASTER_PORT to run a PyTorch training script on Baobab

Hello!

I am working on a deep learning project using PyTorch Lightning.
I want to run the training on multiple nodes with multiple GPUs on each.
I am following this tutorial, which says to run this script

# --nnodes is the number of nodes you'd like to run with
python -m torch.distributed.run \
    --nnodes=2 \
    --master_addr <MASTER_ADDR> \
    --master_port <MASTER_PORT> \
    --node_rank <NODE_RANK> \
    train.py (--arg1 ... train script args...)

on each of the nodes. So my question is: how can I determine the MASTER_ADDR and MASTER_PORT?
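
For reference, my current guess is that inside the sbatch script I could derive them from the Slurm environment, roughly like below (untested, the port 29500 is just the torch default and an arbitrary choice on my side, and I am not sure this is the right approach on Baobab):

    # first node in the allocation becomes the master (my guess, untested)
    export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
    export MASTER_PORT=29500   # any free port; 29500 is the torch default

    # one launcher per node; SLURM_NODEID should give each node its rank
    srun --ntasks=$SLURM_JOB_NUM_NODES --ntasks-per-node=1 bash -c '
        python -m torch.distributed.run \
            --nnodes=$SLURM_JOB_NUM_NODES \
            --node_rank=$SLURM_NODEID \
            --master_addr=$MASTER_ADDR \
            --master_port=$MASTER_PORT \
            train.py'

But I would like to know the recommended way of doing this on the cluster.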

Hi,

it seems they have a chapter dedicated to using PyTorch distributed with Slurm: Computing cluster — PyTorch Lightning 1.5.10 documentation. This is probably the way to go.
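
As a rough sketch of what that looks like (untested, and the job name, GPU counts and time limit below are placeholders you would adapt to the Baobab partitions): with Lightning's Slurm integration you normally do not set MASTER_ADDR/MASTER_PORT yourself, Lightning derives them from the Slurm job, e.g.:

    #!/bin/bash
    #SBATCH --job-name=lightning-multinode
    #SBATCH --nodes=2                 # must match num_nodes in the Trainer
    #SBATCH --ntasks-per-node=4       # must match the number of GPUs per node
    #SBATCH --gres=gpu:4              # GPUs per node (adjust to the partition)
    #SBATCH --time=01:00:00

    # load/activate your environment here (module load ..., source venv, etc.)

    # train.py would create something like
    #   Trainer(gpus=4, num_nodes=2, strategy="ddp")
    # and Lightning picks up the master address and port from the Slurm job itself
    srun python train.py

The key point is that srun starts one task per GPU and Lightning reads the rest (node rank, world size, master address) from the Slurm environment variables.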