Distributed training using PyTorch DDP

Launching distributed PyTorch training from a Slurm jobscript requires a few additional steps beyond the vanilla torchrun command line. The following jobscripts launch model training on multiple GPUs, first on a single node and then on multiple nodes of Ibex.

The same steps can be followed to launch, for example, DeepSpeed training of Large Language Models on Ibex, where torch.distributed is leveraged instead of DeepSpeed's native deepspeed launcher.
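For orientation, the snippet below sketches what that looks like on the Python side, assuming a recent DeepSpeed release: deepspeed.init_distributed() picks up the MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE environment variables set by the launcher, so no DeepSpeed hostfile is needed. The model and ds_config here are placeholders, not taken from any Ibex example.

# Sketch only: how a DeepSpeed script picks up the rendezvous information set
# by torch.distributed.launch / torchrun instead of the `deepspeed` launcher.
import deepspeed
import torch.nn as nn

model = nn.Linear(128, 10)           # placeholder model
ds_config = {                        # placeholder DeepSpeed config
    "train_micro_batch_size_per_gpu": 64,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-3}},
}

# Reads MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE and LOCAL_RANK from the
# environment (the same variables the jobscripts below pass to the launcher).
deepspeed.init_distributed(dist_backend="nccl")

engine, optimizer, _, _ = deepspeed.initialize(model=model,
                                               model_parameters=model.parameters(),
                                               config=ds_config)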


The DDP training script used as an example here trains an image classifier on the TinyImageNet dataset. You may access the training script here.
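The actual script handles the full TinyImageNet pipeline; as a rough guide, a DDP training script launched this way typically has the structure sketched below (placeholder model and data, not the code from the linked script): it reads LOCAL_RANK from the environment set by the launcher, initializes the NCCL process group via env://, wraps the model in DistributedDataParallel, and shards the data with a DistributedSampler.

# Minimal sketch of a DDP training script (not the actual ddp.py linked above).
# Assumes it is launched by torch.distributed.launch --use_env or torchrun,
# which set RANK, WORLD_SIZE and LOCAL_RANK in the environment.
import os
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset, DistributedSampler

def main():
    # Rank and world size come from the launcher's environment variables.
    local_rank = int(os.environ["LOCAL_RANK"])
    dist.init_process_group(backend="nccl", init_method="env://")
    torch.cuda.set_device(local_rank)

    # Placeholder model and dataset; the real script builds an image
    # classifier and loads TinyImageNet from $DATA_DIR.
    model = nn.Linear(128, 200).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])

    dataset = TensorDataset(torch.randn(1024, 128), torch.randint(0, 200, (1024,)))
    sampler = DistributedSampler(dataset)            # shards data across ranks
    loader = DataLoader(dataset, batch_size=64, sampler=sampler,
                        num_workers=int(os.environ.get("SLURM_CPUS_PER_TASK", 4)))

    optimizer = torch.optim.SGD(model.parameters(), lr=0.001)
    criterion = nn.CrossEntropyLoss()

    for epoch in range(10):
        sampler.set_epoch(epoch)                     # reshuffle shards each epoch
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()                          # gradients all-reduced by DDP
            optimizer.step()
        if dist.get_rank() == 0:
            print(f"epoch {epoch} loss {loss.item():.4f}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()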

Jobscripts

Single-Node Multi-GPU training job

In the jobscript below, we launch the training on 2 V100 GPUs, both on the same node. We request 16 CPUs in total on the host, to be used as DataLoader workers by PyTorch.

#!/bin/bash
#SBATCH --gpus=2
#SBATCH --gpus-per-node=2
#SBATCH --constraint=v100
#SBATCH --ntasks=1
#SBATCH --tasks-per-node=1
#SBATCH --cpus-per-task=16
#SBATCH --time=00:10:00

scontrol show job $SLURM_JOBID

# Add path to your conda environment or load the modules from Ibex
source /ibex/ai/home/$USER/miniconda3/bin/activate dist-pytorch

export NCCL_DEBUG=INFO
export PYTHONFAULTHANDLER=1
export DATA_DIR=/ibex/ai/reference/CV/tinyimagenet

# Getting the node names
nodes=$(scontrol show hostnames "$SLURM_JOB_NODELIST")
nodes_array=($nodes)
echo "Node IDs of participating nodes ${nodes_array[*]}"

# Get the IP address and set port for MASTER node
head_node="${nodes_array[0]}"
echo "Getting the IP address of the head node ${head_node}"
master_ip=$(srun -n 1 -N 1 --gpus=1 -w ${head_node} /bin/hostname -I | cut -d " " -f 2)
master_port=$(python -c 'import socket; s=socket.socket(); s.bind(("", 0)); print(s.getsockname()[1]); s.close()')
echo "head node is ${master_ip}:${master_port}"

# If NVDASHBOARD is installed, uncomment the lines below:
#nv_board=$(python -c 'import socket; s=socket.socket(); s.bind(("", 0)); print(s.getsockname()[1]); s.close()')
#nvdashboard ${nv_board} &
#sleep 5
#echo -e "
#To connect to the NVIDIA Dashboard and monitor your GPU utilization do the following:
#Copy the following command and paste in new terminal:
#ssh -L localhost:${nv_board}:${HOSTNAME}.ibex.kaust.edu.sa:${nv_board} ${USER}@glogin.ibex.kaust.edu.sa
#"

export OMP_NUM_THREADS=1

for (( i=0; i< ${SLURM_NNODES}; i++ ))
do
    srun -n 1 -N 1 -c ${SLURM_CPUS_PER_TASK} -w ${nodes_array[i]} --gpus=${SLURM_GPUS_PER_NODE} \
        python -m torch.distributed.launch --use_env \
        --nproc_per_node=${SLURM_GPUS_PER_NODE} --nnodes=${SLURM_NNODES} --node_rank=${i} \
        --master_addr=${master_ip} --master_port=${master_port} \
        ddp.py --epochs=10 --lr=0.001 --num-workers=${SLURM_CPUS_PER_TASK} --batch-size=64 &
done
wait


Multi-Node Multi-GPU training job

The jobscript remains largely identical to the single-node case, except for the resource allocation section. In this job, we scale to 4 GPUs, with 2 on each node. Each node gets 16 CPUs to serve as DataLoader workers. Note that with a per-GPU batch size of 64, the effective global batch size is 4 x 64 = 256.

#!/bin/bash
#SBATCH --gpus=4
#SBATCH --gpus-per-node=2
#SBATCH --constraint=v100
#SBATCH --ntasks=2
#SBATCH --tasks-per-node=1
#SBATCH --cpus-per-task=16
#SBATCH --time=00:10:00

scontrol show job $SLURM_JOBID

# Add path to your conda environment or load the modules from Ibex
source /ibex/ai/home/$USER/miniconda3/bin/activate dist-pytorch

export NCCL_DEBUG=INFO
export PYTHONFAULTHANDLER=1
export DATA_DIR=/ibex/ai/reference/CV/tinyimagenet

# Getting the node names
nodes=$(scontrol show hostnames "$SLURM_JOB_NODELIST")
nodes_array=($nodes)
echo "Node IDs of participating nodes ${nodes_array[*]}"

# Get the IP address and set port for MASTER node
head_node="${nodes_array[0]}"
echo "Getting the IP address of the head node ${head_node}"
master_ip=$(srun -n 1 -N 1 --gpus=1 -w ${head_node} /bin/hostname -I | cut -d " " -f 2)
master_port=$(python -c 'import socket; s=socket.socket(); s.bind(("", 0)); print(s.getsockname()[1]); s.close()')
echo "head node is ${master_ip}:${master_port}"

# If NVDASHBOARD is installed, uncomment the lines below:
#nv_board=$(python -c 'import socket; s=socket.socket(); s.bind(("", 0)); print(s.getsockname()[1]); s.close()')
#nvdashboard ${nv_board} &
#sleep 5
#echo -e "
#To connect to the NVIDIA Dashboard and monitor your GPU utilization do the following:
#Copy the following command and paste in new terminal:
#ssh -L localhost:${nv_board}:${HOSTNAME}.ibex.kaust.edu.sa:${nv_board} ${USER}@glogin.ibex.kaust.edu.sa
#"

export OMP_NUM_THREADS=1

for (( i=0; i< ${SLURM_NNODES}; i++ ))
do
    srun -n 1 -N 1 -c ${SLURM_CPUS_PER_TASK} -w ${nodes_array[i]} --gpus=${SLURM_GPUS_PER_NODE} \
        python -m torch.distributed.launch --use_env \
        --nproc_per_node=${SLURM_GPUS_PER_NODE} --nnodes=${SLURM_NNODES} --node_rank=${i} \
        --master_addr=${master_ip} --master_port=${master_port} \
        ddp.py --epochs=10 --lr=0.001 --num-workers=${SLURM_CPUS_PER_TASK} --batch-size=64 &
done
wait
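If you want to convince yourself that the rendezvous works before running the full training, you can temporarily substitute ddp.py with a small throwaway script like the sketch below (not part of the linked example): with 2 nodes and 2 GPUs per node it should report a world size of 4, with global ranks 0 and 1 on the first node and 2 and 3 on the second.

# Sketch: sanity-check the rank layout produced by the jobscript above.
# Uses the gloo backend so it works even without touching the GPUs.
import os
import torch.distributed as dist

dist.init_process_group(backend="gloo", init_method="env://")
print(f"host={os.uname().nodename} "
      f"rank={dist.get_rank()}/{dist.get_world_size()} "
      f"local_rank={os.environ['LOCAL_RANK']}")
dist.destroy_process_group()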

NVDASHBOARD is a useful tool for understanding the utilization pattern of your training jobs. It is important that your GPUs are well utilized; adding more GPUs when their utilization is already low will adversely affect training times.
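If NVDASHBOARD is not available, a lightweight alternative is to log utilization from inside the training script by querying NVML directly; the sketch below assumes the pynvml bindings are installed in your conda environment.

# Sketch: periodically log GPU utilization and memory from within a training
# script, assuming the `pynvml` package is available in the environment.
import os
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(int(os.environ.get("LOCAL_RANK", 0)))

def log_gpu_stats(step):
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)   # .gpu and .memory in %
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)           # bytes
    print(f"step {step}: gpu {util.gpu}% mem {mem.used / 2**30:.1f} GiB")

# Call log_gpu_stats(step) every few hundred iterations inside the training loop;
# sustained low `gpu` percentages usually point to a dataloading or host-side bottleneck.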