Profiling an application in a Singularity container with Intel VTune on Shaheen
Overview
On Shaheen, a number of profiling tools collect performance data from applications, either by instrumenting them or by using launcher programs such as ARM Forge's perf-report, CrayPAT's pat_run, or Intel's vtune, which can profile pre-built applications for performance analysis (some with limitations or conditions).
When running a containerized application, the environment is isolated inside the container. This may make some profilers inaccessible from within it.
In this article, we demonstrate how to run a profiling job in a Singularity container to collect performance metrics of an OpenMP program submitted as a batch job to SLURM on Shaheen compute nodes.
Compilation
Compile your code inside a container, either interactively in a shell environment or in a batch job. Let's load singularity and fire up a bash shell in the container:
shaima0d@cdl2> module load singularity
shaima0d@cdl2> singularity shell ../../mpich332_ksl_latest.sif
Singularity> ls
Makefile gauss-omp gauss-scaling-omp.sh include src
Our Makefile requires a gcc compiler (nothing fancy) and adds the OpenMP flag (-fopenmp) to compile and link with gomp support.
Singularity> cat Makefile
CC=gcc
F90=gfortran
PRGENV=${PE_ENV}
CFLAGS=-g -std=gnu99
SOURCE=src
INC= -I ./include
#ifeq ($(PRGENV),INTEL)
OMP_FLAG=-fopenmp $(CFLAGS)
#else ifeq ($(PRGENV),GNU)
# OMP_FLAG=-fopenmp $(CFLAGS)
#else ifeq ($(PRGENV),CRAY)
# OMP_FLAG=-homp -G2
#endif
all: omp
omp:
	$(CC) $(CFLAGS) src/gaussian.c src/gauss_omp.c -o gauss-omp $(INC) $(OMP_FLAG)
clean:
	rm -r gauss-omp
The compiler needs to be installed, in user space, in your Singularity image:
Singularity> which gcc
/usr/bin/gcc
Singularity> gcc --version
gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
We run make as usual:
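Given the Makefile above, the omp target expands to a single gcc invocation, so the session looks like this:

Singularity> make
gcc -g -std=gnu99 src/gaussian.c src/gauss_omp.c -o gauss-omp -I ./include -fopenmp -g -std=gnu99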
Jobscript to launch VTune profiling
To launch Intel VTune, we first need to activate it in our host environment and then pass it through to the container runtime environment. The jobscript below does exactly that and then launches a hotspots collection for our OpenMP executable.
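A minimal sketch of such a jobscript is shown below. The module names, VTune installation path, bind mount, image path, and thread count are all assumptions to be adapted to your own environment; only vtune -collect hotspots -r and the image name come from this article.

#!/bin/bash
#SBATCH --job-name=vtune-omp
#SBATCH --nodes=1
#SBATCH --time=00:30:00

# Module names are site-specific; adjust for your Shaheen environment
module load singularity
module load vtune

export OMP_NUM_THREADS=32

# Bind the host's VTune installation into the container so the
# collector is visible inside it (installation path is illustrative)
singularity exec --bind /opt/intel/vtune:/opt/intel/vtune \
    mpich332_ksl_latest.sif \
    vtune -collect hotspots -r vtune_results ./gauss-omp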
We see the following on standard out: a summary of the hotspots collection, along with some runtime logs from VTune:
Clearly, our code needs some serious optimization.
VTune keeps the resulting performance data in a subdirectory of the current working directory, created for each profiling run.
We can visually analyze these results using vtune-gui to dive deeper and identify hotspots in our source code. Before launching the GUI, please make sure you have logged in with X11 forwarding enabled (i.e. ssh -X ..., or ssh -Y ... on macOS).
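The GUI is then pointed at the result directory. The directory name below is illustrative; by default VTune names each run rNNN followed by an analysis-type abbreviation, e.g. r000hs for a hotspots run:

shaima0d@cdl2> vtune-gui ./r000hs &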
This will open a GUI as an X window on your laptop/workstation:
Selecting the appropriate analysis view (Bottom-up in the case below), one can investigate the hotspots further: