Profiling DL workloads on GPUs

Deep learning workloads can be profiled to see how they use the GPU(s) and to identify hotspots worth optimizing. Several tools can be used for profiling, including:

  • nvprof is a command-line tool that is bundled with the CUDA toolkit. It can be used to profile GPU workloads and generate a report that shows the time spent in different functions and kernels.

  • Nsight Systems with NVTX instrumentation combines Nsight Systems and the NVTX profiling API. NVTX allows you to annotate your code with events, which Nsight Systems can track. This can help identify specific areas of your code that are causing performance problems.

This blog post shows how to use each tool to profile a deep learning workload. The example scripts mentioned are available here; see the src folder.


Using nvprof

nvprof is a popular tool for profiling workloads on NVIDIA GPUs. It is bundled with the CUDA toolkit and can be used from the command line or in a jobscript. When used in a jobscript, an output file with the .nvvp extension is created at the end of the profiling run. This file can then be opened in NVIDIA's Visual Profiler (nvvp).

The following is an example jobscript to generate the profile. The training script trains ResNet-50 from scratch on Tiny ImageNet (200 classes) for one epoch.

This script uses three modules available on Ibex:

  • dl ← gives you access to different versions of PyTorch

  • pytorch/1.9.0 ← a PyTorch build from source for Ibex

  • torchvision ← same as above

The software stack is always evolving, so watch for changes and adjust the script to your needs.

#!/bin/bash --login
#SBATCH --time=00:10:00
#SBATCH --nodes=1
#SBATCH --gpus-per-node=1
#SBATCH --cpus-per-gpu=8
#SBATCH --constraint=v100
#SBATCH --partition=batch
#SBATCH --job-name=nvprof
#SBATCH --mail-type=ALL
#SBATCH --output=%x-%j-slurm.out
#SBATCH --error=%x-%j-slurm.err

module load dl torchvision pytorch/1.9.0

cmd="python ./"
# -o writes the collected profile to a file that nvvp can open
nvprof -o profile.${SLURM_JOBID}.nvvp ${cmd}

Inside the file %x-%j-slurm.out you will find the notification that NVPROF is going to profile the application you have launched:

==124733== NVPROF is profiling process 124733, command: python
device: cuda:0
/sw/csgv/dl/apps/pytorch/1.9.0/lib/python3.7/site-packages/torch/nn/ UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at ../c10/core/TensorImpl.h:1153.)
return torch.max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode)
Epoch 1, batch 1/1563, loss: 7.122538089752197
..
Epoch 1, batch 1501/1563, loss: 5.0063157081604
Val accuracy: 0.003
Epoch 2, batch 1/1563, loss: 4.820196628570557
..
Epoch 2, batch 1501/1563, loss: 4.754003047943115
Val accuracy: 0.0353
Test accuracy: 0.0327
==124733== Generated result file: /ibex/user/barradd/ksl_postmaint_tests/nvprof_cuda/profile.1234.nvvp

The output of the jobscript is a file called profile.${SLURM_JOBID}.nvvp. This file can be opened in NVIDIA's Visual Profiler (nvvp) to view the profile of the training script.

To visualize the profile, log in to an Ibex glogin node (OpenGL support is required) and load the same three modules.

Make sure the CUDA module is loaded, then launch the Visual Profiler by typing nvvp in the terminal.


This opens the usual file dialog, where you can select your profile file.


A GUI then opens in which you can explore the different parts of the process you ran.


Using Nsight Systems

Nsight Systems is a suite of profiling tools that supersedes nvprof in newer CUDA releases. It provides a more detailed view of the workload than nvprof and can be used to identify bottlenecks and optimize performance.

To collect the profiling information, submit a job as follows. This is the same job as above, but it uses Nsight Systems for profiling and loads the machine learning module.


The above jobscript launches our Python training script under the nsys profiler. Notice that we load only the machine learning module. The command line also accepts a list of tracers that select which API calls made by your code are traced. In the jobscript above, we trace cuda, cublas, and cudnn API calls, as well as osrt (OS runtime) calls such as I/O. The --stats=true option prints a concise report to your SLURM output file for quick examination. In addition, the jobscript instructs nsys to export the collected output as a SQLite database, which the Nsight Systems visual tool can easily search.
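The options described above can be assembled into a single nsys command line. The following is a minimal sketch in Python, not the original jobscript: the helper name build_nsys_command and the script name train.py are hypothetical, and the flags shown (--trace, --stats, --export, --output) are the nsys profile options that match the description, so check them against the nsys version installed on your system.

```python
import os
import shlex


def build_nsys_command(app_cmd, job_id,
                       tracers=("cuda", "cublas", "cudnn", "osrt")):
    """Assemble an `nsys profile` command line as a list of arguments.

    app_cmd : list of str, the application to profile (hypothetical here)
    job_id  : str, used to name the output report
    tracers : API families that nsys should trace
    """
    return [
        "nsys", "profile",
        f"--trace={','.join(tracers)}",  # which API calls to trace
        "--stats=true",                  # print a summary report to stdout
        "--export=sqlite",               # also write a SQLite database
        f"--output=profile.{job_id}",    # nsys appends its report extension
        *app_cmd,
    ]


if __name__ == "__main__":
    # train.py is a placeholder for your own training script.
    job_id = os.environ.get("SLURM_JOBID", "local")
    cmd = build_nsys_command(["python", "train.py"], job_id)
    # Print the command as it would appear in a jobscript.
    print(shlex.join(cmd))
```

Printing the command rather than running it keeps the sketch usable on a machine without nsys; in a jobscript, the printed line is what you would place after the module load step.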

Open the generated report in the Nsight Systems visual tool, where profile.11264040.nsys-rep is our profile.


The output is a stacked time series of all the resources and events traced. Hover your mouse over the event profile bar of CUDA HW(0000:b2:00.0Tesla V100-SXM2-32GB) and you will see how busy your GPU has been. The time series can be zoomed in to inspect events at short time scales, down to microseconds or even nanoseconds. You can expand this tab to show more events at finer granularity and see the timing and sequence of the different kernels. (Right-click on the CUDA HW(0000:b2:00.0Tesla V100-SXM2-32GB) tab and choose Show in Events View to inspect the table of profiled kernels.)
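A similar per-kernel table can also be pulled from the SQLite export outside the GUI. The sketch below is an illustration under assumptions: it relies on the table and column names used by the Nsight Systems SQLite exporter (CUPTI_ACTIVITY_KIND_KERNEL with start, end, and shortName, plus a StringIds lookup table), which may differ between nsys versions, and the function name top_kernels is hypothetical.

```python
import sqlite3


def top_kernels(conn, limit=10):
    """Return (kernel_name, launches, total_ns) rows, longest total time first.

    Assumes the schema written by the Nsight Systems SQLite exporter:
    kernel records in CUPTI_ACTIVITY_KIND_KERNEL, with names stored as
    integer ids that resolve through the StringIds table.
    """
    query = """
        SELECT s.value                   AS name,
               COUNT(*)                  AS launches,
               SUM(k."end" - k."start")  AS total_ns
        FROM CUPTI_ACTIVITY_KIND_KERNEL AS k
        JOIN StringIds AS s ON s.id = k.shortName
        GROUP BY s.value
        ORDER BY total_ns DESC
        LIMIT ?
    """
    return conn.execute(query, (limit,)).fetchall()


if __name__ == "__main__":
    import sys

    # Usage: python top_kernels.py <exported .sqlite file>
    if len(sys.argv) > 1:
        conn = sqlite3.connect(sys.argv[1])
        try:
            for name, launches, total_ns in top_kernels(conn):
                print(f"{name}: {launches} launches, {total_ns / 1e6:.3f} ms")
        finally:
            conn.close()
```

If the query fails with a missing table, inspect the database with the sqlite3 command-line shell first; the table is only present when CUDA kernels were actually traced.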

Nsight Systems with NVTX instrumentation

In a typical epoch of DL training, multiple mini-batches are trained, and it is often tricky to tell where one mini-batch ends and the next one starts. NVIDIA Tools Extension (NVTX) is a way to instrument your training script and annotate the different operations involved in training a mini-batch. The code requires minimal change:

  • If you are using the machine learning module, you can directly add this line to your code

  • Annotate the various operations of your training process
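The two steps above can be sketched as follows. This is a minimal illustration, assuming the annotations come from NVIDIA's nvtx Python package (the original import line is not shown, so that is an assumption); the function train_one_epoch, the range labels, and the colors are all hypothetical. When nvtx is not installed, the sketch falls back to a no-op so the script still runs.

```python
import contextlib

try:
    # Assumed: NVIDIA's "nvtx" Python package is available in the environment.
    import nvtx
    annotate = nvtx.annotate
except ImportError:
    # Fallback so the script still runs where nvtx is missing.
    @contextlib.contextmanager
    def annotate(message=None, color=None):
        yield


def train_one_epoch(batches, forward, backward, optimizer_step):
    """Run one epoch, wrapping each phase of a mini-batch in an NVTX range."""
    losses = []
    for i, batch in enumerate(batches):
        # Outer range: marks where this mini-batch starts and ends.
        with annotate(f"batch_{i}", color="blue"):
            with annotate("forward", color="green"):
                loss = forward(batch)
            with annotate("backward", color="red"):
                backward(loss)
            with annotate("optimizer_step", color="yellow"):
                optimizer_step()
        losses.append(loss)
    return losses


if __name__ == "__main__":
    # Stand-in callables; replace them with your model's real operations.
    losses = train_one_epoch(
        batches=[1.0, 2.0, 3.0],
        forward=lambda batch: batch * 0.5,
        backward=lambda loss: None,
        optimizer_step=lambda: None,
    )
    print(losses)
```

Nested ranges show up as stacked bars in the timeline, so the outer batch range makes mini-batch boundaries visible while the inner ranges separate the forward pass, backward pass, and optimizer step.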

To instruct the nsys profiler to collect the annotated ranges in the training loop, add the nvtx tracer to the launch command (for example, --trace=cuda,cublas,cudnn,osrt,nvtx).

Upon visualizing, you can see an annotated training profile that is easier to follow thanks to the labels and colors you selected in the script.