How to determine MASTER_ADDR and MASTER_PORT to run a PyTorch training script on Baobab


I am working on a deep learning project using PyTorch Lightning.
I want to run the training on multiple nodes with multiple GPUs on each.
I am following this tutorial, which says to run this command

python -m torch.distributed.run
    --nnodes=2 # number of nodes you'd like to run with
    --master_addr <MASTER_ADDR>
    --master_port <MASTER_PORT>
    --node_rank <NODE_RANK>
    train.py (--arg1 ... train script args...)

on each of the nodes. So my question is: how can I determine MASTER_ADDR and MASTER_PORT?


It seems they have a chapter dedicated to using PyTorch distributed with Slurm: Computing cluster — PyTorch Lightning 1.5.10 documentation. This is probably the way to go.
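
For what it's worth, the usual convention on Slurm clusters is: MASTER_ADDR is the hostname of the first node in the allocation (you can expand the node list with `scontrol show hostnames "$SLURM_JOB_NODELIST"`), and MASTER_PORT is any free port that every node computes identically — deriving it from the job id makes it deterministic and unique per job. A minimal sketch of that idea; the helper name and the exact port scheme here are my own illustration, not Lightning's API:

```python
def master_from_slurm(hostnames, job_id, port_offset=10000):
    """Pick rendezvous coordinates that every node can compute independently.

    hostnames: the expanded node list of the job, e.g. the output lines of
               `scontrol show hostnames "$SLURM_JOB_NODELIST"` (an assumption
               about how you obtain it, not something Slurm hands you directly).
    job_id:    the Slurm job id (SLURM_JOB_ID), used to derive a stable port.
    """
    # The first node of the allocation conventionally hosts rank 0 / the
    # rendezvous, so everyone uses it as MASTER_ADDR.
    master_addr = hostnames[0]
    # Last four digits of the job id, shifted into the unprivileged range,
    # give a port that is the same on every node and unlikely to collide
    # with another job's choice.
    master_port = port_offset + int(str(job_id)[-4:])
    return master_addr, master_port

addr, port = master_from_slurm(["cpu001", "cpu002"], 45671234)
# addr == "cpu001", port == 10000 + 1234 == 11234
```

Since every node of the job sees the same SLURM_JOB_NODELIST and SLURM_JOB_ID, each one arrives at the same (addr, port) pair without any communication, which is exactly what torch.distributed needs before the process group exists. Lightning's Slurm integration does something along these lines for you automatically.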