Introduction
In this document, we (the Data Mining and Machine Learning group, dmml.ch) describe how we have
been using the computational resources of the Baobab cluster to conduct machine learning research.
Additionally, we share the minimal computational specifications that our applications require,
and suggest potential software updates that would improve our workflow on Baobab considerably.
We appreciate the resources provided by Baobab and hope this document will be helpful
in the design of Baobab 2.0. Feel free to contact us for any additional information.
Minimum requirements
Below are requirements that are currently in place,
whose absence would break our workflows,
and which have already caused issues:
- Docker-compatible virtualized environment (e.g. Docker, Singularity, Shifter, Kubernetes),
  or a means to deploy one (e.g. OpenStack), with GPU support
- MPI-aware virtualization: the ability to launch parallel jobs where each worker runs inside
  its own Docker (or Singularity, etc.) container
We have encountered stability issues with Singularity in the past months, but those were resolved
promptly after being reported.
One of those issues was due to Singularity not being able to run alongside CUDA.
Jobs involving Singularity containers running CUDA routines therefore crashed consistently,
leaving the affected nodes free, which attracted all the queued jobs requesting
such resources.
This resulted in the queues of GPU jobs running via Singularity being completely drained.
This issue could have been avoided by running appropriate tests on the affected nodes, or by being able
to flag nodes that cause such issues and remove them from the available node pool until fixed.
As an example, the vast majority of errors arising in our machine learning use cases
could be caught by the following test scripts for the PyTorch
and TensorFlow frameworks.
For PyTorch:
import torch

# Allocating a tensor directly on the GPU fails if CUDA is unusable on the node.
cuda = torch.device('cuda')
a = torch.zeros(10, device=cuda)
print(a)
For TensorFlow:
import tensorflow as tf
import sys

# Opening a session with device placement logging exercises the CUDA runtime.
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))

# Exit with a non-zero status if no CUDA-capable GPU is visible to TensorFlow.
has_gpu = tf.test.is_gpu_available(cuda_only=True,
                                   min_cuda_compute_capability=None)
if has_gpu:
    sys.exit(0)
else:
    sys.exit(1)
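Both scripts terminate with a non-zero status when no usable GPU is found (the PyTorch script by raising an exception, the TensorFlow script explicitly), so they could in principle serve as a per-node health check before GPU jobs are scheduled.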
When it comes to containers communicating with each other via MPI, we would benefit
from documentation explaining how to set up a container so that such communication can
be established on the cluster (this is highly dependent on the cluster configuration
and the installed software).
While trial and error is an obvious option, the sparseness of the logs produced by
MPI-related errors severely hinders this debugging approach.
A single working example of a Dockerfile enabling the launch of a parallel job with 4 workers,
each running in its own container and sharing information, would be more than enough to
understand how the containers should be designed
(e.g. the rank-0 worker, the master, computes the average of values shared
by the non-zero ranked workers).
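For concreteness, a minimal sketch of the worker logic of such a job (assuming mpi4py is available inside each container; the surrounding Dockerfile and launch configuration are exactly what we lack documentation for) could be:

from mpi4py import MPI  # assumes mpi4py is installed inside each container

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Every non-zero rank contributes a value; rank 0 (the master) contributes nothing.
value = float(rank) if rank != 0 else 0.0

# Sum the contributions on rank 0 (returns None on all other ranks).
total = comm.reduce(value, op=MPI.SUM, root=0)

if rank == 0:
    # Average over the non-zero ranked workers (assumes at least 2 ranks).
    print("average of worker values:", total / (size - 1))

Launched with four ranks (e.g. mpirun -n 4 or srun -n 4), each rank would run in its own container; it is the container setup around such a launch that we would like to see documented.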
Suggested short-term improvements
- Update the kernel to at least 3.10 (released in 2013) from the current 2.6.32 (released in 2009)
- Update glibc to a more modern version (a more recent version would solve most of our current issues)
- Provide detailed documentation on how to use Singularity together with MPI
Since the kernel is very old, Singularity does not allow us to install a modern-enough
version of glibc.
This limits the choice of Singularity images to Ubuntu Xenial (released in 2016).
In the meantime a new LTS (Long Term Support) version has been released (Bionic in 2018),
and new official images will surely be built for Bionic in priority,
so building containers will become increasingly tedious.
One solution is to update to CentOS 7, which would restore the ability to use Singularity
thanks to its up-to-date glibc version.
Note that we currently use Singularity whenever possible as it is the easiest way
to ensure the dependencies required by our applications are met.
Solving the issues raised above that impede the use of Singularity is therefore crucial for our workflows.
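For reference, a quick way to check which kernel and glibc versions a given environment (host or container) exposes, assuming a standard Python installation, is:

import platform

# Kernel release; containers share the kernel of the host.
print("kernel:", platform.release())

# glibc version the Python interpreter is linked against.
print("glibc :", platform.libc_ver())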
Hardware
Hardware-wise, we have three different kinds of needs in machine learning:
- CPU only with MPI: for the moment this has been difficult to do because of the old version of
  glibc and our inability to successfully combine MPI with Singularity.
- GPU with “normal” memory: for machine learning, the most efficient current-generation GPU is
  the RTX 2080 Ti, which supersedes the Titan model of the previous generation.
  Other, more expensive GPU models (such as the professional versions) can be used,
  but their additional features (e.g. ECC memory, double precision) are not needed and the performance gain
  is not worth the price increase.
  This category encompasses the majority of our workload.
- GPU with high memory: the only efficient GPU in this category is the Titan RTX.
  It improves on the RTX 2080 Ti by increasing the memory to 24 GB,
  while still being cheaper than professional GPUs.
  This family of GPUs is only needed for people working with large data structures, so the need for it is limited.
(Written with Lionel Blondé)