Horovod on Ibex
There are three ways to add Horovod to your software stack: Ibex modules, a conda environment, or a container.
Using Ibex installed modules
The deep learning (DL) stack is available on Ibex as modules. In a fresh terminal session, load the following:
module load dl
module load intelpython3
# If you want pytorch
module load pytorch/1.5.1
# or tensorflow
module load tensorflow/2.2
module load horovod/0.20.3
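Once the modules are loaded, a quick check that Python can locate the packages can save a failed batch job. The following is a minimal sketch using only the standard library; the helper name `missing_packages` is hypothetical, and the package names `horovod` and `tensorflow` are assumptions about what the loaded modules provide (use `torch` instead if you loaded PyTorch).

```python
import importlib.util


def missing_packages(names):
    """Return the top-level package names that Python cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Assumed package names: "horovod" plus the framework module that was loaded.
    wanted = ["horovod", "tensorflow"]
    missing = missing_packages(wanted)
    if missing:
        print("Not importable:", ", ".join(missing))
    else:
        print("All packages found:", ", ".join(wanted))
```

This only confirms that the packages are on the Python path. It does not verify that Horovod was built against the framework you loaded.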
A representative jobscript for a multi-node, multi-GPU run looks like this:
#!/bin/bash
#SBATCH --job-name=hvd_tf
#SBATCH --time=01:00:00
#SBATCH --gpus=2
#SBATCH --gpus-per-node=1
#SBATCH --constraint=v100
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=4
#SBATCH --mem=64G
module load dl
module load intelpython3
# TensorFlow is loaded here; swap in pytorch/1.5.1 for a PyTorch run
module load tensorflow/2.2
module load horovod/0.20.3
module list
export OMPI_MCA_btl_openib_warn_no_device_params_found=0
export UCX_MEMTYPE_CACHE=n
export UCX_TLS=tcp
srun -u -n ${SLURM_NTASKS} -N ${SLURM_NNODES} -c ${SLURM_CPUS_PER_TASK} --cpu-bind=cores python train.py
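With the jobscript above, srun starts one copy of train.py per task, so two tasks are expected, one on each node. Before debugging Horovod itself, it can help to confirm that layout from inside the script. The sketch below reads the per-task environment variables that Slurm sets (`SLURM_PROCID`, `SLURM_LOCALID`, `SLURM_NTASKS`, `SLURMD_NODENAME`); the helper name `slurm_task_info` is hypothetical, and the fallback values are assumptions chosen so the script also runs outside of a job. In training code, Horovod's own `hvd.rank()`, `hvd.local_rank()` and `hvd.size()` remain the source of truth.

```python
import os


def slurm_task_info(env=None):
    """Read the per-task variables that srun sets for each launched process."""
    env = os.environ if env is None else env
    return {
        "rank": int(env.get("SLURM_PROCID", 0)),        # global task index
        "local_rank": int(env.get("SLURM_LOCALID", 0)),  # task index within the node
        "world_size": int(env.get("SLURM_NTASKS", 1)),   # total number of tasks
        "node": env.get("SLURMD_NODENAME", "unknown"),   # node running this task
    }


if __name__ == "__main__":
    info = slurm_task_info()
    print("task {rank}/{world_size} local_rank={local_rank} node={node}".format(**info))
```

If both tasks report the same node, or the world size is not 2, the resource request and the srun line are worth rechecking before looking at the training code.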
Multi-GPU test on same node
First, download the test script:
wget https://raw.githubusercontent.com/horovod/horovod/master/examples/pytorch/pytorch_synthetic_benchmark.py
The following jobscript runs the test on multiple GPUs on the same node:
Multi-GPU test on multiple nodes
The following jobscript runs on multiple GPUs on multiple nodes:
Conda environment
The following GitHub repository explains how to create a conda environment with Horovod:
https://github.com/kaust-vislab/horovod-gpu-data-science-project
Horovod container
KAUST Supercomputing Lab maintains a Docker image with Horovod 0.19.2. If you wish to modify the image, here is the Dockerfile you can use to recreate it with the desired modifications (download the Mellanox OFED tarball MLNX_OFED_LINUX-5.0-2.1.8.0-ubuntu18.04-x86_64.tgz). On Ibex you can use this image to run a container with the Singularity platform. Here is an example:
On the glogin node you can pull the image from DockerHub:
Once the image has been pulled successfully, singularity converts it into a Singularity Image File (SIF), which is a monolithic and static binary file (you can copy it to /ibex/scratch if you wish).
Here is an example jobscript launching a Horovod training job as a Singularity container:
Single node single GPU
You may want to run a single-GPU job for debugging: