Infiniband for Distributed Training: Works With Anaconda, Need Help With Singularity

I have been working on a small side project to reproduce some work from my Google internship and have successfully got InfiniBand working with NCCL and PyTorch. I did this via a simple Miniconda install on the host, and the rest was pretty much automatic:

wrapping model with DDP...
gpu009:16463:16463 [0] NCCL INFO Bootstrap : Using [0]ib0:192.168.105.249<0>
gpu009:16463:16463 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
gpu009:16463:16463 [0] NCCL INFO NET/IB : Using [0]mlx4_0:1/IB ; OOB ib0:192.168.105.249<0>
gpu009:16463:16642 [0] NCCL INFO Setting affinity for GPU 0 to 4000
gpu009:16463:16642 [0] NCCL INFO CUDA Dev 0[6], IB NIC distance :  SYS
gpu009:16463:16642 [0] NCCL INFO Ring 00 : 0 -> 1 [receive] via NET/IB/0
gpu009:16463:16642 [0] NCCL INFO Ring 00 : 1 -> 0 [send] via NET/IB/0
gpu009:16463:16642 [0] NCCL INFO comm 0x2b5ef8002d60 rank 1 nranks 2 cudaDev 0 nvmlDev 6 - Init COMPLETE
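
For reference, the host-side launch was roughly the following (one process per node across two nodes; the script name, node rank variable, and master address are illustrative, and NCCL_DEBUG=INFO is what produces the log above):

export NCCL_DEBUG=INFO
python -m torch.distributed.launch \
    --nproc_per_node=1 --nnodes=2 --node_rank=$NODE_RANK \
    --master_addr=192.168.105.249 --master_port=29500 \
    train.py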

However, when I tried to replicate this inside my usual Singularity container, it failed. I read the following guides:

  1. https://docs.mellanox.com/pages/releaseview.action?pageId=15049785
  2. https://github.com/hpcng/singularity/issues/876
  3. https://community.mellanox.com/s/article/using-hpc-x-in-a-singularity-container#jive_content_id_Build_container (old content, but discusses singularity).

and built the Docker container below (trimmed for demonstration here), which I then converted to Singularity in the usual way, i.e. by pushing the image to Docker Hub and building from the following definition file:

Bootstrap: docker
From: jramapuram/pytorch:1.5.0-cuda10.1

%post
    mkdir -p /opt
    chmod -R 777 /opt
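
For completeness, the "usual way" is just the standard build/push/convert sequence; the .def filename below is illustrative, and the image tag matches the From: line above:

docker build -t jramapuram/pytorch:1.5.0-cuda10.1 .
docker push jramapuram/pytorch:1.5.0-cuda10.1
# then, on a machine with Singularity installed:
singularity build pytorch1.5.0_cuda10.1.simg pytorch.def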

The Dockerfile is listed below:

FROM nvidia/cuda:10.1-cudnn7-devel-ubuntu16.04

RUN apt-get update && apt-get install -y --no-install-recommends \
    autoconf automake autotools-dev build-essential ca-certificates \
    chrpath cmake curl debhelper emacs-nox ethtool flex gfortran \
    git graphviz gtk-doc-tools htop ibverbs-utils imagemagick iproute2 \
    iputils-ping kmod libelf1 libgl1-mesa-glx libglib2.0-0 libibverbs1 \
    libltdl-dev libnl-3-200 libnl-route-3-200 libnuma1 lsb-release lsof \
    m4 make net-tools openssh-server pciutils perl python-libxml2 rsync \
    swig tcl tcsh tk vim wget bison libmlx5-1 dpatch && rm -rf /var/lib/apt/lists/*

# install the Mellanox userspace drivers; this needs to match your existing OFED host driver version (ofed_info)
# download the MLNX_OFED_LINUX*.tar and place it in the current directory
COPY MLNX_OFED_LINUX-*.tar /tmp
RUN cd /tmp && tar -xvf MLNX_OFED_LINUX-*.tar && \
    MLNX_OFED_LINUX-*/mlnxofedinstall --user-space-only --without-fw-update -q && \
    cd / && rm -rf /tmp/MLNX*

# Allow OpenSSH to talk to containers without asking for confirmation
# From https://docs.mellanox.com/pages/releaseview.action?pageId=15049785
RUN cat /etc/ssh/ssh_config | grep -v StrictHostKeyChecking > /etc/ssh/ssh_config.new && \
    echo "    StrictHostKeyChecking no" >> /etc/ssh/ssh_config.new && \
    mv /etc/ssh/ssh_config.new /etc/ssh/ssh_config


# install miniconda
RUN curl -L https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -o /tmp/miniconda.sh && \
        sh /tmp/miniconda.sh -b -p /opt/conda && rm /tmp/miniconda.sh
ENV PATH /opt/conda/bin:$PATH

# install conda deps
RUN /opt/conda/bin/conda install -y pytorch=1.5.0 torchvision cudatoolkit=10.1 -c pytorch

WORKDIR /workspace
RUN chmod -R a+w /workspace #&& chmod -R a+rwx /opt/conda
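
As a reminder of the version-matching step mentioned in the Dockerfile comment, the pre-build check on the host is roughly:

# on the build host: the userspace libraries baked into the image must match the host driver
ofed_info -s    # prints the installed version, e.g. MLNX_OFED_LINUX-4.5-1.0.1.0
# download the matching MLNX_OFED_LINUX tarball for ubuntu16.04/x86_64 from Mellanox
# and place it next to the Dockerfile so the COPY step above picks it up
ls MLNX_OFED_LINUX-*.tar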

However, this fails to bring up InfiniBand. The command used is: singularity exec -B /etc/libibverbs.d:/etc/libibverbs.d --nv /home/ramapur0/docker/pytorch1.5.0_cuda10.1.simg MYCOMMAND

Logs are:

wrapping model with DDP...
gpu008:14:14 [0] NCCL INFO Bootstrap : Using [0]ib0:192.168.105.248<0>
gpu008:14:14 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
libibverbs: Warning: couldn't load driver 'mlx5': libmlx5-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'mlx4': libmlx4-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs0
gpu008:14:14 [0] NCCL INFO NET/IB : No device found.
gpu008:14:14 [0] NCCL INFO NET/Socket : Using [0]ib0:192.168.105.248<0>
gpu008:14:43 [0] NCCL INFO Setting affinity for GPU 0 to 080000
gpu008:14:43 [0] NCCL INFO CUDA Dev 0[7], Socket NIC distance :  SYS
gpu008:14:43 [0] NCCL INFO Ring 00 : 0 -> 1 [receive] via NET/Socket/0
gpu008:14:43 [0] NCCL INFO NET/Socket: Using 1 threads and 1 sockets per thread
gpu008:14:43 [0] NCCL INFO Ring 00 : 1 -> 0 [send] via NET/Socket/0
gpu008:14:43 [0] NCCL INFO comm 0x2b71cc002d60 rank 1 nranks 2 cudaDev 0 nvmlDev 7 - Init COMPLETE

Running ibv_devinfo inside the container produces:

libibverbs: Warning: couldn't load driver 'mlx5': libmlx5-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'mlx4': libmlx4-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs0
No IB devices found

It would be great if the HPC team could provide a simple prototype (like the container above) to get InfiniBand working in Singularity.

Cheers

Hi there,

Given your comments above, I guess you are using version 4.5-1.0.1.0 (cf. Linux InfiniBand Drivers), aren’t you?

Well, the warnings are quite self-explanatory: have you checked if (and where) the libmlx[45]-rdmav2.so libraries exist in your container?

capello@login2:~$ find /usr/lib* -name libmlx[45]-rdmav2.so
/usr/lib64/mlnx_ofed/valgrind/libmlx4-rdmav2.so
/usr/lib64/mlnx_ofed/valgrind/libmlx5-rdmav2.so
/usr/lib64/libmlx4-rdmav2.so
/usr/lib64/libmlx5-rdmav2.so
find: ‘/usr/libexec/initscripts/legacy-actions/auditd’: Permission denied
capello@login2:~$ 
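
The same check inside your image should show whether they made it into the container at all, e.g. (image path copied from your exec command):

singularity exec /home/ramapur0/docker/pytorch1.5.0_cuda10.1.simg \
    find / -name 'libmlx[45]-rdmav2.so' 2>/dev/null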

Thx, bye,
Luca

Thanks for the response @Luca.Capello.

Given your comments above, I guess you are using version 4.5-1.0.1.0 (cf. Linux InfiniBand Drivers), aren’t you?

Correct, I pulled the system version and duplicated it in the container.

Well, the warnings are quite self-explanatory: have you checked if (and where) the libmlx[45]-rdmav2.so libraries exist in your container?

As far as I can tell, these are packaged with the Mellanox driver bundle, which did install correctly and was added to the ldconfig cache. The only thing I can think of is that the bind-mounted /etc/libibverbs.d from the host (CentOS 7) doesn’t match the equivalent layout for Ubuntu 16.04 in the container.
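
To check that theory, something like the following should narrow it down (image path as in my exec command above): compare what the bind-mounted config asks libibverbs to load against what the container's linker can actually resolve.

# does the container's linker know about the mlx4/mlx5 providers at all?
singularity exec /home/ramapur0/docker/pytorch1.5.0_cuda10.1.simg \
    sh -c "ldconfig -p | grep -i 'libmlx[45]'"

# which drivers does the bind-mounted CentOS 7 /etc/libibverbs.d request?
singularity exec -B /etc/libibverbs.d:/etc/libibverbs.d /home/ramapur0/docker/pytorch1.5.0_cuda10.1.simg \
    sh -c 'cat /etc/libibverbs.d/*.driver'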