Analyzing performance using Horovod Timeline
Horovod can record a timeline of its events and activity, which you can use to analyze a training run and identify bottlenecks.
To activate it, set the HOROVOD_TIMELINE environment variable to the path of a JSON file.
#!/bin/bash
#SBATCH --job-name=horovod_timeline
#SBATCH --time=00:30:00
#SBATCH --gpus=8
#SBATCH --gpus-per-node=8
#SBATCH --ntasks=8
#SBATCH --mem=100G
module load dl
module load cuda/10.2.89
module load pytorch/1.5.1
module load torchvision/0.6.1 tensorboardX
module load horovod/0.19.2
export OMPI_MCA_btl_openib_warn_no_device_params_found=0
export UCX_MEMTYPE_CACHE=n
export UCX_TLS=tcp
export HOROVOD_TIMELINE=${PWD}/timeline_${SLURM_JOBID}.json
srun -n 8 -N 1 python train.py
This creates and populates a JSON file that can be viewed in Google Chrome's tracing tool (chrome://tracing). To do so, download the JSON output to your laptop/workstation and open the file in chrome://tracing.
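
If several jobs have written timelines into the same directory, it can help to find the newest one before downloading. The following is a minimal sketch, not part of Horovod or Slurm: `latest_timeline` is a hypothetical helper name, and it assumes the `timeline_<jobid>.json` naming used in the job script above. The `scp` line is only a commented illustration; `USER`, `LOGIN_NODE`, the path, and `JOBID` are placeholders to replace with your own values.

```shell
#!/bin/bash
# latest_timeline DIR
# Print the most recently modified timeline_*.json in DIR
# (defaults to the current directory). Returns 1 if none exists.
# Hypothetical helper for illustration only.
latest_timeline() {
    local dir="${1:-$PWD}"
    local newest=""
    local f
    for f in "$dir"/timeline_*.json; do
        [ -e "$f" ] || continue              # glob matched nothing
        if [ -z "$newest" ] || [ "$f" -nt "$newest" ]; then
            newest="$f"                      # keep the newer file
        fi
    done
    [ -n "$newest" ] || return 1
    printf '%s\n' "$newest"
}

# On the cluster, in the directory the job was submitted from:
#   latest_timeline "$PWD"
#
# On your laptop/workstation (all upper-case names are placeholders):
#   scp USER@LOGIN_NODE:/path/to/timeline_JOBID.json .
```

Once the file is on your machine, load it from the chrome://tracing page in Chrome.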