I have been working on a mini side project to reproduce some work from my Google internship and have successfully gotten InfiniBand working with NCCL and PyTorch. I did this via a simple Miniconda install on the host, and the rest was pretty much automatic:
wrapping model with DDP...
gpu009:16463:16463 [0] NCCL INFO Bootstrap : Using [0]ib0:192.168.105.249<0>
gpu009:16463:16463 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
gpu009:16463:16463 [0] NCCL INFO NET/IB : Using [0]mlx4_0:1/IB ; OOB ib0:192.168.105.249<0>
gpu009:16463:16642 [0] NCCL INFO Setting affinity for GPU 0 to 4000
gpu009:16463:16642 [0] NCCL INFO CUDA Dev 0[6], IB NIC distance : SYS
gpu009:16463:16642 [0] NCCL INFO Ring 00 : 0 -> 1 [receive] via NET/IB/0
gpu009:16463:16642 [0] NCCL INFO Ring 00 : 1 -> 0 [send] via NET/IB/0
gpu009:16463:16642 [0] NCCL INFO comm 0x2b5ef8002d60 rank 1 nranks 2 cudaDev 0 nvmlDev 6 - Init COMPLETE
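For context, the host-side setup amounts to roughly the following (a sketch only; the environment prefix, launch flags, and train.py are illustrative placeholders, not the actual job script):
curl -L https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -o miniconda.sh
sh ./miniconda.sh -b -p $HOME/miniconda3
export PATH=$HOME/miniconda3/bin:$PATH
conda install pytorch=1.5.0 torchvision cudatoolkit=10.1 -c pytorch
# NCCL picks up the host OFED stack on its own; NCCL_DEBUG=INFO produces the log above
NCCL_DEBUG=INFO python -m torch.distributed.launch \
    --nproc_per_node=1 --nnodes=2 --node_rank=$NODE_RANK \
    --master_addr=$MASTER_ADDR --master_port=29500 train.py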
However, when I tried to replicate this with my usual Singularity container, it failed. I read the following guides:
- https://docs.mellanox.com/pages/releaseview.action?pageId=15049785
- https://github.com/hpcng/singularity/issues/876
- https://community.mellanox.com/s/article/using-hpc-x-in-a-singularity-container#jive_content_id_Build_container (old content, but it discusses Singularity).
and built the following Docker container (trimmed for demonstration here), which I then converted to Singularity in the usual way, using the build file below (after pushing the image to Docker Hub):
Bootstrap: docker
From: jramapuram/pytorch:1.5.0-cuda10.1
%post
mkdir -p /opt
chmod -R 777 /opt
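For completeness, the conversion itself follows the usual push-then-build flow; a sketch, where pytorch.def as the filename of the build file above is an assumption:
docker build -t jramapuram/pytorch:1.5.0-cuda10.1 .
docker push jramapuram/pytorch:1.5.0-cuda10.1
sudo singularity build pytorch1.5.0_cuda10.1.simg pytorch.def   # pulls the image back from Docker Hub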
The Dockerfile is listed below:
FROM nvidia/cuda:10.1-cudnn7-devel-ubuntu16.04
RUN apt-get update && apt-get install -y --no-install-recommends \
autoconf automake autotools-dev build-essential ca-certificates \
chrpath cmake curl debhelper emacs-nox ethtool flex gfortran \
git graphviz gtk-doc-tools htop ibverbs-utils imagemagick iproute2 \
iputils-ping kmod libelf1 libgl1-mesa-glx libglib2.0-0 libibverbs1 \
libltdl-dev libnl-3-200 libnl-route-3-200 libnuma1 lsb-release lsof \
m4 make net-tools openssh-server pciutils perl python-libxml2 rsync \
swig tcl tcsh tk vim wget bison libmlx5-1 dpatch && rm -rf /var/lib/apt/lists/*
# install the Mellanox userspace stack; this needs to match the host's OFED driver version (check with ofed_info)
# download the MLNX_OFED_LINUX*.tar bundle and place it in the build context
COPY MLNX_OFED_LINUX-*.tar /tmp
RUN cd /tmp && tar -xvf MLNX_OFED_LINUX-*.tar && \
MLNX_OFED_LINUX-*/mlnxofedinstall --user-space-only --without-fw-update -q && \
cd .. && rm -rf MLNX*
# Allow OpenSSH to talk to containers without asking for confirmation
# From https://docs.mellanox.com/pages/releaseview.action?pageId=15049785
RUN cat /etc/ssh/ssh_config | grep -v StrictHostKeyChecking > /etc/ssh/ssh_config.new && \
echo " StrictHostKeyChecking no" >> /etc/ssh/ssh_config.new && \
mv /etc/ssh/ssh_config.new /etc/ssh/ssh_config
# install miniconda (pytorch itself is installed via conda below)
RUN curl -L https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -o Anaconda_37.sh && \
sh ./Anaconda*.sh -b -p /opt/conda && rm ./Anaconda*.sh
ENV PATH /opt/conda/bin:$PATH
# install conda deps
RUN /opt/conda/bin/conda install pytorch=1.5.0 torchvision cudatoolkit=10.1 -c pytorch
WORKDIR /workspace
RUN chmod -R a+w /workspace #&& chmod -R a+rwx /opt/conda
However, this fails to bring up InfiniBand inside the container. The command used is:
singularity exec -B /etc/libibverbs.d:/etc/libibverbs.d --nv /home/ramapur0/docker/pytorch1.5.0_cuda10.1.simg MYCOMMAND
The logs are:
wrapping model with DDP...
gpu008:14:14 [0] NCCL INFO Bootstrap : Using [0]ib0:192.168.105.248<0>
gpu008:14:14 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
libibverbs: Warning: couldn't load driver 'mlx5': libmlx5-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'mlx4': libmlx4-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs0
gpu008:14:14 [0] NCCL INFO NET/IB : No device found.
gpu008:14:14 [0] NCCL INFO NET/Socket : Using [0]ib0:192.168.105.248<0>
gpu008:14:43 [0] NCCL INFO Setting affinity for GPU 0 to 080000
gpu008:14:43 [0] NCCL INFO CUDA Dev 0[7], Socket NIC distance : SYS
gpu008:14:43 [0] NCCL INFO Ring 00 : 0 -> 1 [receive] via NET/Socket/0
gpu008:14:43 [0] NCCL INFO NET/Socket: Using 1 threads and 1 sockets per thread
gpu008:14:43 [0] NCCL INFO Ring 00 : 1 -> 0 [send] via NET/Socket/0
gpu008:14:43 [0] NCCL INFO comm 0x2b71cc002d60 rank 1 nranks 2 cudaDev 0 nvmlDev 7 - Init COMPLETE
Running ibv_devinfo inside the container produces:
libibverbs: Warning: couldn't load driver 'mlx5': libmlx5-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'mlx4': libmlx4-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs0
No IB devices found
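The warnings point at the libibverbs provider libraries (libmlx4-rdmav2.so / libmlx5-rdmav2.so) not being resolvable inside the image, even though /etc/libibverbs.d is bind-mounted. A quick way to compare host and container (a sketch; the image path is abbreviated):
cat /etc/libibverbs.d/*.driver                    # host: which providers libibverbs expects
find / -name 'libmlx*-rdmav2.so*' 2>/dev/null     # host: where the provider libraries live
singularity exec -B /etc/libibverbs.d:/etc/libibverbs.d --nv pytorch1.5.0_cuda10.1.simg \
    find / -name 'libmlx*-rdmav2.so*' 2>/dev/null # container: same search inside the image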
It would be great if the HPC team could provide a simple prototype (like the container above) to get InfiniBand working in Singularity.
Cheers