Profiling DL workloads on GPUs
Deep learning workloads can be profiled to see how they use the GPU(s) and to identify hotspots worth optimizing. Tools that can be used for profiling include:
nvprof is a command-line tool that is bundled with the CUDA toolkit. It can be used to profile GPU workloads and generate a report that shows the time spent in different functions and kernels.
Nsight Systems with NVTX instrumentation combines Nsight Systems and the NVTX profiling API. NVTX allows you to annotate your code with events, which Nsight Systems can track. This can help identify specific areas of your code that are causing performance problems.
This blog post shows how to use each tool to profile a deep learning workload. The example scripts mentioned are here. Check the src folder, where you will find train.py, train-profiler.py, and train_nvtx.py.
Using nvprof
nvprof is a popular profiling tool for GPU workloads on NVIDIA GPUs. It is bundled with the CUDA toolkit and can be used from the command line or in a jobscript. When it is run in a jobscript with an output file name ending in .nvvp, that file is written at the end of the profiling run. The file can then be opened in NVIDIA's Visual Profiler, nvvp.
The following is an example jobscript that generates the profile. The training script trains resnet50 from scratch using Tiny ImageNet (200 classes) for the first epoch.
This script uses three modules available on Ibex:
dl ← this module gives you access to different versions of PyTorch
pytorch/1.9.0 ← this is a PyTorch built from source for Ibex
torchvision ← same as before
Beware that the software stack is always evolving, so adjust the script to your needs.
#!/bin/bash --login
#SBATCH --time=00:10:00
#SBATCH --nodes=1
#SBATCH --gpus-per-node=1
#SBATCH --cpus-per-gpu=8
#SBATCH --constraint=v100
#SBATCH --partition=batch
#SBATCH --job-name=nvprof
#SBATCH --mail-type=ALL
#SBATCH --output=%x-%j-slurm.out
#SBATCH --error=%x-%j-slurm.err
module load dl torchvision pytorch/1.9.0
cmd="python ./train.py"
nvprof -o profile.${SLURM_JOBID}.nvvp ${cmd}
Inside the Slurm output file %x-%j-slurm.out (where %x is the job name and %j the job ID) you will find the notification that NVPROF is going to profile the application you have launched:
==124733== NVPROF is profiling process 124733, command: python train.py
device: cuda:0
/sw/csgv/dl/apps/pytorch/1.9.0/lib/python3.7/site-packages/torch/nn/functional.py:718: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at ../c10/core/TensorImpl.h:1153.)
return torch.max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode)
Epoch 1, batch 1/1563, loss: 7.122538089752197
..
Epoch 1, batch 1501/1563, loss: 5.0063157081604
Val accuracy: 0.003
Epoch 2, batch 1/1563, loss: 4.820196628570557
..
Epoch 2, batch 1501/1563, loss: 4.754003047943115
Val accuracy: 0.0353
Test accuracy: 0.0327
==124733== Generated result file: /ibex/user/barradd/ksl_postmaint_tests/nvprof_cuda/profile.1234.nvvp
The output of the jobscript is a file called profile.${SLURM_JOBID}.nvvp. This file can be opened in NVIDIA's Visual Profiler, nvvp, to view the profile of the training script.
To visualize the profile, log in to an Ibex glogin node (OpenGL support is required) and load the same three modules.
Make sure the CUDA module is loaded and type in the terminal:
nvvp
This opens the usual file dialog, where you can select your profile file.
A GUI then opens that lets you explore the different parts of the process you ran.
Using Nsight Systems
Nsight Systems is the profiling tool that replaces nvprof in newer CUDA releases. It provides a more detailed view of the workload than nvprof and can be used to identify bottlenecks and optimize performance.
To collect the profiling information, submit a job as follows. The job is similar to the one above, but it uses Nsight Systems as the profiler, loads the machine learning module, and runs train_nvtx.py.
#!/bin/bash -l
#SBATCH --job-name=nsys
#SBATCH --time=00:30:00
#SBATCH --gres=gpu:1
#SBATCH --constraint=v100
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --partition=batch
#SBATCH --mail-type=ALL
#SBATCH --output=%x-%j-slurm.out
#SBATCH --error=%x-%j-slurm.err
module load machine_learning/2023.04
cmd="python ./train_nvtx.py"
nsys profile --trace='cuda','cublas','cudnn','osrt' --stats='true' --sample=none --export=sqlite -o profile.${SLURM_JOBID} ${cmd}
The above jobscript launches our Python training script under the nsys profiler. Notice that we are loading only the machine learning module. The command line accepts the tracers you would like to use to trace the different API calls made by your code. In the jobscript above, we trace cuda, cublas, and cudnn API calls, as well as osrt, the OS runtime calls (e.g. I/O calls). --stats=true prints a concise report in your SLURM output file for quick examination. In addition, --export=sqlite instructs nsys to export the collected data as a SQLite database, which can be easily searched.
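Because the job also exports a SQLite database, kernel timings can be inspected outside the GUI as well. The following is a minimal sketch, not a definitive recipe. It assumes the export contains a CUPTI_ACTIVITY_KIND_KERNEL table with start, end, and shortName columns, and a StringIds table that maps name ids to strings. That layout is an assumption and can differ between Nsight Systems versions, so check it against your own file. The function name top_kernels and the in-memory demo rows are hypothetical and only there to make the sketch runnable.

```python
import sqlite3


def top_kernels(conn, limit=5):
    """Return (kernel name, calls, total time in ns) rows, slowest first.

    ASSUMPTION: the table and column names below reflect one nsys export
    layout and may differ between Nsight Systems versions.
    """
    query = """
        SELECT s.value, COUNT(*), SUM(k."end" - k."start") AS total_ns
        FROM CUPTI_ACTIVITY_KIND_KERNEL AS k
        JOIN StringIds AS s ON s.id = k.shortName
        GROUP BY s.value
        ORDER BY total_ns DESC
        LIMIT ?
    """
    return conn.execute(query, (limit,)).fetchall()


# Stand-in database with made-up rows so the sketch runs without a profile.
# For a real run, use sqlite3.connect() on the .sqlite file written by nsys.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE StringIds (id INTEGER PRIMARY KEY, value TEXT)")
demo.execute(
    'CREATE TABLE CUPTI_ACTIVITY_KIND_KERNEL '
    '("start" INTEGER, "end" INTEGER, shortName INTEGER)'
)
demo.executemany("INSERT INTO StringIds VALUES (?, ?)",
                 [(1, "kernel_a"), (2, "kernel_b")])
demo.executemany("INSERT INTO CUPTI_ACTIVITY_KIND_KERNEL VALUES (?, ?, ?)",
                 [(0, 100, 1), (200, 350, 1), (400, 420, 2)])

for name, calls, total_ns in top_kernels(demo):
    print(f"{name}: {calls} calls, {total_ns} ns")
```

Grouping by kernel name and sorting by total time gives a quick list of the kernels that dominate the run, which is usually the first thing to look at before zooming into the timeline.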
To visualize the profile, log in to an Ibex glogin node (OpenGL support is required) and run:
nsight-sys profile.11264040.nsys-rep
where profile.11264040.nsys-rep is our profile.
The output is a stacked time series of all the resources and events traced. Hover your mouse over the event profile bar of CUDA HW(0000:b2:00.0Tesla V100-SXM2-32GB) and you will notice how busy your GPU has been. You can zoom into the time series to inspect events at short time scales, down to microseconds or even nanoseconds. You can also expand the tab to show more events at finer granularity and see the timing and sequence of the different kernels. (Right-click on the CUDA HW(0000:b2:00.0Tesla V100-SXM2-32GB) tab and choose Show in Events View to inspect the table of the kernels profiled.)
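The impression of "how busy" the GPU has been can also be put into a number. As a rough sketch, assuming you already have the kernel start and end timestamps (for instance read off the Events View table), the hypothetical helper below merges overlapping intervals and reports the fraction of the profiled span during which at least one kernel was running. The function name busy_fraction and the sample intervals are illustrative assumptions, not part of Nsight Systems.

```python
def busy_fraction(intervals):
    """Fraction of the profiled span covered by at least one kernel.

    `intervals` is a list of (start, end) timestamps in the same unit
    (for example nanoseconds). Overlapping kernels are counted once.
    """
    if not intervals:
        return 0.0
    ordered = sorted(intervals)
    span_start = ordered[0][0]
    span_end = max(end for _, end in ordered)
    busy = 0
    cur_start, cur_end = ordered[0]
    for start, end in ordered[1:]:
        if start <= cur_end:
            # Overlaps or touches the current block: extend it.
            cur_end = max(cur_end, end)
        else:
            # Gap found: close the current block and start a new one.
            busy += cur_end - cur_start
            cur_start, cur_end = start, end
    busy += cur_end - cur_start
    span = span_end - span_start
    return busy / span if span > 0 else 0.0


# Hypothetical kernel intervals in nanoseconds, for illustration only.
kernels = [(0, 40), (30, 60), (100, 120)]
print(f"GPU busy fraction: {busy_fraction(kernels):.2f}")
```

A low value suggests the GPU is waiting on something else, such as data loading or host-side work, which is exactly the kind of gap the timeline view makes visible.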
Nsight Systems with NVTX instrumentation
In a typical epoch of DL training, many mini-batches are processed, and it is often tricky to tell in the profile where one mini-batch ends and the next one starts. NVIDIA Tools Extension, or NVTX, is a way to instrument your training script and annotate the different operations involved in training a mini-batch. The code requires minimal changes:
If you are using the machine learning module, you can directly add this line to your code:
#load nvtx package
import nvtx
Annotate various operations of your training process
for epoch in range(5):
    for i, (images, labels) in enumerate(train_loader):
        with nvtx.annotate("Batch" + str(i), color="green"):
            # Load images and labels to device
            with nvtx.annotate("Copy to device", color="red"):
                images, labels = images.to(device), labels.to(device)
            # Forward pass
            with nvtx.annotate("Forward Pass", color="yellow"):
                outputs = model(images)
            # Calculate the loss
            loss = criterion(outputs, labels)
            # Backpropagate the loss
            optimizer.zero_grad()
            with nvtx.annotate("Backward Pass", color="blue"):
                loss.backward()
            with nvtx.annotate("Optimizer step", color="orange"):
                optimizer.step()
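The same annotation can also be applied to whole functions, since nvtx.annotate works as a decorator as well as a context manager. The sketch below shows this together with an optional fallback so that the script still runs on a machine where the nvtx package is not installed. The fallback class and the validate function are hypothetical stand-ins for illustration and are not taken from the example scripts.

```python
from contextlib import ContextDecorator

try:
    import nvtx  # provided by the `nvtx` Python package
    annotate = nvtx.annotate
except ImportError:
    class annotate(ContextDecorator):
        """No-op stand-in (assumption: only message/color/domain are used)."""

        def __init__(self, message=None, color=None, domain=None):
            self.message = message

        def __enter__(self):
            return self

        def __exit__(self, *exc):
            # Returning False lets exceptions propagate unchanged.
            return False


@annotate("Validation", color="orange")
def validate(batches):
    # Placeholder body: counts batches instead of evaluating a model.
    return sum(1 for _ in batches)


with annotate("Epoch 0", color="green"):
    n = validate(range(3))
print(n)
```

With the decorator form, every call to the function shows up as its own range in the timeline, which keeps the training loop itself free of extra nesting.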
To instruct the nsys profiler to collect the annotated ranges in the training loop, add the nvtx tracer to the launch command:
nsys profile --trace='cuda','cublas','cudnn','osrt','nvtx' --stats='true' --sample=none --export=sqlite -o profile.${SLURM_JOBID} ${cmd}
When you visualize the result, you will see an annotated training profile that is easier to follow thanks to the labels and colors you selected in the script.