Horovod on Ibex

There are three ways to access Horovod in your software stack:

Using Ibex-installed modules

The deep learning (DL) stack is available on Ibex as modules. In a fresh terminal session, try the following:

module load dl
module load intelpython3
# If you want pytorch
module load pytorch/1.5.1
# or tensorflow
module load tensorflow/2.2
module load horovod/0.20.3
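
To verify what the installation was built with, Horovod ships a self-check command. As a quick sanity check (ideally run on a GPU node or inside an interactive job), it should report the frameworks (TensorFlow, PyTorch) and tensor operations (e.g. NCCL, MPI) that are available:

horovodrun --check-build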

A representative jobscript for a multi-node, multi-GPU run looks like this:

#!/bin/bash
#SBATCH --job-name=hvd_tf
#SBATCH --time=01:00:00
#SBATCH --gpus=2
#SBATCH --gpus-per-node=1
#SBATCH --constraint=v100
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=4
#SBATCH --mem=64G

module load dl
module load intelpython3
# or tensorflow
module load tensorflow/2.2
module load horovod/0.20.3
module list

export OMPI_MCA_btl_openib_warn_no_device_params_found=0
export UCX_MEMTYPE_CACHE=n
export UCX_TLS=tcp

srun -u -n ${SLURM_NTASKS} -N ${SLURM_NNODES} -c ${SLURM_CPUS_PER_TASK} --cpu-bind=cores python train.py
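
To use it, save the jobscript to a file (the filename below is just an example) and submit it with sbatch; the usual Slurm tools then let you follow the job:

sbatch hvd_multinode.slurm   # example filename
squeue -u $USER              # watch the job in the queue
cat slurm-<jobid>.out        # inspect the default output file once the job starts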

Multi-GPU test on the same node

Download the test script first:

wget https://raw.githubusercontent.com/horovod/horovod/master/examples/pytorch/pytorch_synthetic_benchmark.py
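
The container examples later on this page also use a TensorFlow 2 synthetic benchmark. If you want that script as well, it can be fetched the same way; the path below reflects the layout of the Horovod examples directory at the time of writing and may change upstream:

wget https://raw.githubusercontent.com/horovod/horovod/master/examples/tensorflow2/tensorflow2_synthetic_benchmark.py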

The following jobscript runs the test on multiple GPUs on the same node:

#!/bin/bash
#SBATCH --job-name=hvd_tf
#SBATCH --time=01:00:00
#SBATCH --gpus=8
#SBATCH --gpus-per-node=8
#SBATCH --constraint=v100
#SBATCH --ntasks=8
#SBATCH --cpus-per-task=4
#SBATCH --mem=64G

module use /sw/csgv
module load dl
module load intelpython3
# or tensorflow
module load tensorflow/2.2
module load horovod/0.20.3
module list

export OMPI_MCA_btl_openib_warn_no_device_params_found=0
export UCX_MEMTYPE_CACHE=n
export UCX_TLS=tcp

srun -u -n ${SLURM_NTASKS} -N ${SLURM_NNODES} -c ${SLURM_CPUS_PER_TASK} --cpu-bind=cores python pytorch_synthetic_benchmark.py
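
If you would rather iterate interactively than through a batch job, a comparable single-node test can be run from an allocation obtained with salloc (the resource values below are illustrative):

salloc --gpus=2 --gpus-per-node=2 --ntasks=2 --cpus-per-task=4 --constraint=v100 --mem=32G --time=00:30:00
# inside the allocation, load the same modules as in the jobscript above, then:
srun -u -n ${SLURM_NTASKS} -c ${SLURM_CPUS_PER_TASK} --cpu-bind=cores python pytorch_synthetic_benchmark.py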

Multi-GPU test on multiple nodes

The following jobscript runs the test on multiple GPUs across multiple nodes:

#!/bin/bash
#SBATCH --job-name=hvd_tf
#SBATCH --time=01:00:00
#SBATCH --gpus=8
#SBATCH --gpus-per-node=4
#SBATCH --constraint=v100
#SBATCH --ntasks=8
#SBATCH --cpus-per-task=4
#SBATCH --mem=64G

module use /sw/csgv
module load dl
module load intelpython3
# or tensorflow
module load tensorflow/2.2
module load horovod/0.20.3
module list

export OMPI_MCA_btl_openib_warn_no_device_params_found=0
export UCX_MEMTYPE_CACHE=n
export UCX_TLS=tcp

srun -u -n ${SLURM_NTASKS} -N ${SLURM_NNODES} -c ${SLURM_CPUS_PER_TASK} --cpu-bind=cores python pytorch_synthetic_benchmark.py
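
For multi-node runs it is worth confirming that NCCL is using the inter-node transport you expect. One way to check is to enable NCCL's own logging by adding the following export before the srun line; the transports and rings NCCL selects are then printed to the job log:

export NCCL_DEBUG=INFO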

Conda environment

The following GitHub repository walks through creating a conda environment with Horovod:

https://github.com/kaust-vislab/horovod-gpu-data-science-project
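
The general recipe the repository follows is to install the CUDA libraries into the environment first and then build Horovod against them with pip. Below is a minimal sketch only; the package names, versions, and build flags are illustrative, so take the authoritative versions from the repository's environment files:

conda create -n horovod-env -c conda-forge python=3.8 cudatoolkit=10.1 cudnn nccl cmake
conda activate horovod-env
pip install tensorflow==2.2.0 torch==1.5.1
# build Horovod with NCCL support for both frameworks, pointing it at the NCCL inside the conda env
HOROVOD_NCCL_HOME=$CONDA_PREFIX HOROVOD_GPU_OPERATIONS=NCCL \
HOROVOD_WITH_TENSORFLOW=1 HOROVOD_WITH_PYTORCH=1 \
pip install --no-cache-dir horovod==0.20.3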

Horovod container

The KAUST Supercomputing Lab maintains a Docker image with Horovod/0.19.2. If you wish to modify the image, here is the Dockerfile you can use to recreate it with your desired modifications (download the Mellanox OFED tarball MLNX_OFED_LINUX-5.0-2.1.8.0-ubuntu18.04-x86_64.tgz first). On Ibex you can use this image to run a container with the Singularity platform. Here is an example:

On the glogin node, you can pull the image from DockerHub:

module load singularity
cd $HOME
export SINGULARITY_TMPDIR=$HOME
singularity pull docker://krccl/horovod_gpu:0192

Once the image is pulled successfully, Singularity converts it into a Singularity Image File (SIF), a monolithic, static binary file (you can copy it to /ibex/scratch if you wish).
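
Before writing jobscripts against the image, a quick check that the container starts and can import Horovod is useful. On a GPU node (or through srun), and assuming the SIF kept the default name produced by the pull command above:

module load singularity
singularity exec --nv horovod_gpu_0192.sif python -c "import horovod.torch as hvd; print(hvd.__version__)"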

Here are example jobscripts launching Horovod training jobs as Singularity containers:

Single node, single GPU

You may want to run a single-GPU job for debugging:

#!/bin/bash
#SBATCH --gpus=1
#SBATCH --gpus-per-node=1
#SBATCH --constraint=v100
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=64G
#SBATCH --time=00:30:00

module load openmpi/4.0.3-cuda10.1
module load singularity

export IMAGE=/ibex/scratch/shaima0d/scratch/singularity_mpi_testing/images/horovod_gpu_0192.sif

echo "PyTorch with Horovod"
mpirun -np 1 singularity exec --nv $IMAGE python ./pytorch_synthetic_benchmark.py --model resnet50 --batch-size 128 --num-warmup-batches 10 --num-batches-per-iter 10 --num-iters 10 >> pytorch_1GPU.log

echo "Tensorflow2 with Horovod"
mpirun -np 1 singularity exec --nv $IMAGE python ./tensorflow2_synthetic_benchmark.py --model ResNet50 --batch-size 128 --num-warmup-batches 10 --num-batches-per-iter 10 --num-iters 10 >> TF2_1GPU.log

Single node, multi-GPU

#!/bin/bash
#SBATCH --gpus=8
#SBATCH --gpus-per-node=8
#SBATCH --constraint=v100
#SBATCH --ntasks=8
#SBATCH --cpus-per-task=4
#SBATCH --mem=64G
#SBATCH --time=00:30:00

module load openmpi/4.0.3-cuda10.1
module load singularity

export IMAGE=/ibex/scratch/shaima0d/scratch/singularity_mpi_testing/images/horovod_gpu_0192.sif

echo "PyTorch with Horovod"
mpirun -np 8 singularity exec --nv $IMAGE python ./pytorch_synthetic_benchmark.py --model resnet50 --batch-size 128 --num-warmup-batches 10 --num-batches-per-iter 10 --num-iters 10 >> pytorch_1node.log

echo "Tensorflow2 with Horovod"
mpirun -np 8 singularity exec --nv $IMAGE python ./tensorflow2_synthetic_benchmark.py --model ResNet50 --batch-size 128 --num-warmup-batches 10 --num-batches-per-iter 10 --num-iters 10 >> TF2_1node.log

Multi-node, multi-GPU

#!/bin/bash
#SBATCH --gpus=8
#SBATCH --gpus-per-node=4
#SBATCH --constraint=v100
#SBATCH --ntasks=8
#SBATCH --cpus-per-task=4
#SBATCH --mem=64G
#SBATCH --time=00:30:00

module load openmpi/4.0.3-cuda10.1
module load singularity

export IMAGE=/ibex/scratch/shaima0d/scratch/singularity_mpi_testing/images/horovod_gpu_0192.sif

echo "PyTorch with Horovod"
mpirun -np 8 -N 4 singularity exec --nv $IMAGE python ./pytorch_synthetic_benchmark.py --model resnet50 --batch-size 128 --num-warmup-batches 10 --num-batches-per-iter 10 --num-iters 10 >> pytorch_multiGPU.log

echo "Tensorflow2 with Horovod"
mpirun -np 8 -N 4 singularity exec --nv $IMAGE python ./tensorflow2_synthetic_benchmark.py --model ResNet50 --batch-size 128 --num-warmup-batches 10 --num-batches-per-iter 10 --num-iters 10 >> TF2_multiGPU.log
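
Whichever of the above jobscripts you use, the synthetic benchmarks write a throughput summary to the log files; in recent Horovod versions that line contains "Total img/sec", so something like the following pulls out the headline numbers (the exact wording may differ between versions):

grep "Total img/sec" pytorch_*.log TF2_*.log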