Analyzing performance using Horovod Timeline

Horovod can record a timeline of its events and activity, which you can use to identify bottlenecks in your training run.

To activate it, set the HOROVOD_TIMELINE environment variable to the path of the JSON file where the timeline should be written.

#!/bin/bash
#SBATCH --job-name=horovod_timeline
#SBATCH --time=00:30:00
#SBATCH --gpus=8
#SBATCH --gpus-per-node=8
#SBATCH --ntasks=8
#SBATCH --mem=100G

# Load the deep learning software stack
module load dl
module load cuda/10.2.89
module load pytorch/1.5.1
module load torchvision/0.6.1 tensorboardX
module load horovod/0.19.2

# Communication settings for Open MPI and UCX
export OMPI_MCA_btl_openib_warn_no_device_params_found=0
export UCX_MEMTYPE_CACHE=n
export UCX_TLS=tcp

# Write the Horovod timeline to a per-job JSON file in the current working directory
export HOROVOD_TIMELINE=${PWD}/timeline_${SLURM_JOBID}.json

# Launch 8 tasks on a single node
srun -n 8 -N 1 python train.py

This will create and populate a JSON file that can be viewed in Google Chrome's tracing tool (chrome://tracing). To do so, download the JSON output to your laptop/workstation and open the file in that tool.
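Before downloading, it can help to confirm on the cluster that the timeline file was actually written. The following is a minimal sketch, not part of Horovod: `check_timeline` is a hypothetical helper name. It assumes the file holds one trace event per line, each carrying a "ph" field as in the Chrome trace event format, so the count it prints is only approximate.

```shell
#!/bin/bash
# Hypothetical helper (not provided by Horovod): report the size of a
# timeline file and a rough count of the trace events it contains.
check_timeline() {
    local file="$1"

    # -s is true only if the file exists and is not empty
    if [ ! -s "$file" ]; then
        echo "missing or empty: $file" >&2
        return 1
    fi

    local bytes events
    bytes=$(wc -c < "$file" | tr -d ' ')

    # Assumption: one event per line, each with a "ph" (phase) field.
    # grep -c exits non-zero when nothing matches, hence the "|| true".
    events=$(grep -c '"ph"' "$file" || true)

    echo "$file: $bytes bytes, about $events events"
}
```

On the cluster you would call it as `check_timeline timeline_<jobid>.json`, then copy the file from your laptop/workstation with something like `scp <user>@<login-node>:<path>/timeline_<jobid>.json .`, where the user, login node, path and job ID are placeholders for your own values.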