Overview

On Shaheen, a number of profiling tools collect application performance data, either by instrumenting the code or by using special launchers such as ARM Forge’s perf-report, CrayPAT’s pat_run, or Intel’s VTune, which can profile pre-built applications for performance analysis (each with its own limitations and conditions).

When running a containerized application, the environment is isolated inside the container. This can make profilers installed on the host inaccessible from within the container.

In this article, we demonstrate how to run a profiling job in a Singularity container to collect performance metrics of an OpenMP program submitted as a batch job to SLURM on Shaheen compute nodes.

Compilation

Compile your code inside a container either interactively in a shell environment or in a batch job.

Let’s load singularity and start a bash shell in a container:

shaima0d@cdl2> module load singularity
shaima0d@cdl2> singularity shell ../../mpich332_ksl_latest.sif 
Singularity> ls
Makefile  gauss-omp  gauss-scaling-omp.sh  include  src

Our Makefile requires a gcc compiler (nothing fancy) and adds the OpenMP flag (-fopenmp) to compile and link with gomp support.

Singularity> cat Makefile 
CC=gcc
F90=gfortran
PRGENV=${PE_ENV}
CFLAGS=-g -std=gnu99
SOURCE=src
INC= -I ./include
#ifeq ($(PRGENV),INTEL)
	OMP_FLAG=-fopenmp $(CFLAGS)
#else ifeq ($(PRGENV),GNU)
#	OMP_FLAG=-fopenmp $(CFLAGS)
#else ifeq ($(PRGENV),CRAY)
#	OMP_FLAG=-homp -G2
#endif 
all: omp 

omp:
	$(CC) $(CFLAGS) src/gaussian.c src/gauss_omp.c -o gauss-omp $(INC) $(OMP_FLAG) 

clean:
	rm -r gauss-omp

The compiler must be installed inside your Singularity image, i.e., available in user space within the container:

Singularity> which gcc 
/usr/bin/gcc
Singularity> gcc --version
gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

We run make as usual:

Singularity> make VERBOSE=1
gcc -g -std=gnu99 src/gaussian.c src/gauss_omp.c -o gauss-omp -I ./include -fopenmp -g -std=gnu99 

Jobscript to launch VTune profiling

To launch Intel VTune, we first activate it in our host environment and then pass the resulting environment to the container runtime. The jobscript below does exactly that and then launches a hotspots collection for our OpenMP executable.

#!/bin/bash

#SBATCH -p debug
#SBATCH -n 1
#SBATCH -c 32

module load singularity

export IMAGE=$PWD/mpich332_ksl_latest.sif

# Set required number of threads and pass to Singularity environment
export OMP_NUM_THREADS=4
export SINGULARITYENV_OMP_NUM_THREADS=$OMP_NUM_THREADS

#Activate VTune
source /opt/intel/vtune_profiler/amplxe-vars.sh
export SINGULARITYENV_PREPEND_PATH=$PATH


srun -n 1 -c ${OMP_NUM_THREADS} --hint=nomultithread \
singularity exec -B /opt,/sw $IMAGE \
vtune -report summary -collect hotspots \
./Gauss_demo/Gauss_omp/gauss-omp -n 2048 -f /project/k01/shaima0d/tickets/32169/Gauss_demo/input_matrices/2048by2048.mat
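The `SINGULARITYENV_` prefix used in the jobscript is Singularity's convention for injecting environment variables into the container: any host variable exported as `SINGULARITYENV_NAME` appears inside the container as `NAME`. A small shell sketch of the naming rule (no container needed; the variable is just an example):

```shell
# Export a variable for the container; Singularity strips the prefix inside.
export SINGULARITYENV_OMP_NUM_THREADS=4

# What the variable will be called inside the container:
name="SINGULARITYENV_OMP_NUM_THREADS"
echo "${name#SINGULARITYENV_}"   # prints: OMP_NUM_THREADS
```

The same mechanism, via the special `SINGULARITYENV_PREPEND_PATH` variable, is what makes the host's VTune binaries visible on the container's PATH.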

We see the following output on standard out: a summary of the hotspots collection, along with some of VTune's runtime logs:

Copyright (C) 2009-2019 Intel Corporation. All rights reserved.
Intel(R) VTune(TM) Profiler 2020 (build 605129)
vtune: Collection started. To stop the collection, either press CTRL-C or enter from another console window: vtune -r /project/k01/shaima0d/tickets/32169/r002hs -command stop.
Reading in the System of linear equations from file /project/k01/shaima0d/tickets/32169/Gauss_demo/input_matrices/2048by2048.mat
Execution time=2.672976

 Writing solution to file
vtune: Collection stopped.
vtune: Using result path `/project/k01/shaima0d/tickets/32169/r002hs'
vtune: Executing actions 19 % Resolving information for `libgomp.so.1'         
vtune: Warning: Cannot locate debugging information for file `/usr/lib/x86_64-linux-gnu/libgomp.so.1'.
vtune: Executing actions 20 % Resolving information for `libpthread.so.0'      
vtune: Warning: Cannot locate debugging information for file `/lib/x86_64-linux-gnu/libpthread.so.0'.
vtune: Warning: Cannot locate debugging information for file `/lib/x86_64-linux-gnu/libc.so.6'.
Cannot match the module with the symbol file `/lib/x86_64-linux-gnu/libc-2.27.so'. Make sure to specify the correct path to the symbol file in the Binary/Symbol Search list of directories.
vtune: Executing actions 22 % Resolving information for `libtpsstool.so'       
vtune: Warning: Cannot locate debugging information for file `/opt/intel/vtune_profiler_2020.0.0.605129/lib64/libtpsstool.so'.
vtune: Executing actions 49 % Saving the resultElapsed Time: 3.992s            
    CPU Time: 11.720s
        Effective Time: 11.720s
            Idle: 0s
            Poor: 11.720s
            Ok: 0s
            Ideal: 0s
            Over: 0s
        Spin Time: 0s
        Overhead Time: 0s
    Total Thread Count: 4
    Paused Time: 0s

Top Hotspots
Function                Module        CPU Time
----------------------  ------------  --------
solve._omp_fn.0         gauss-omp      10.422s
__isoc99_fscanf         libc.so.6       1.050s
func@0xa5a0             libgomp.so.1    0.184s
GOMP_parallel           libgomp.so.1    0.012s
GOMP_loop_runtime_next  libgomp.so.1    0.012s
[Others]                N/A             0.040s
Effective Physical Core Utilization: 9.3% (2.963 out of 32)
 | The metric value is low, which may signal a poor physical CPU cores
 | utilization caused by:
 |     - load imbalance
 |     - threading runtime overhead
 |     - contended synchronization
 |     - thread/process underutilization
 |     - incorrect affinity that utilizes logical cores instead of physical
 |       cores
 | Explore sub-metrics to estimate the efficiency of MPI and OpenMP parallelism
 | or run the Locks and Waits analysis to identify parallel bottlenecks for
 | other parallel runtimes.
 |
    Effective Logical Core Utilization: 4.6% (2.963 out of 64)
     | The metric value is low, which may signal a poor logical CPU cores
     | utilization. Consider improving physical core utilization as the first
     | step and then look at opportunities to utilize logical cores, which in
     | some cases can improve processor throughput and overall performance of
     | multi-threaded applications.
     |
Collection and Platform Info
    Application Command Line: ./Gauss_demo/Gauss_omp/gauss-omp "-n" "2048" "-f" "/project/k01/shaima0d/tickets/32169/Gauss_demo/input_matrices/2048by2048.mat" 
    Operating System: 4.12.14-150.17_5.0.91-cray_ari_c NAME="Ubuntu" VERSION="18.04.4 LTS (Bionic Beaver)" ID=ubuntu ID_LIKE=debian PRETTY_NAME="Ubuntu 18.04.4 LTS" VERSION_ID="18.04" HOME_URL="https://www.ubuntu.com/" SUPPORT_URL="https://help.ubuntu.com/" BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/" PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy" VERSION_CODENAME=bionic UBUNTU_CODENAME=bionic
    Computer Name: nid00008
    Result Size: 3 MB 
    Collection start time: 12:10:17 25/08/2020 UTC
    Collection stop time: 12:10:21 25/08/2020 UTC
    Collector Type: Driverless Perf per-process counting,User-mode sampling and tracing
    CPU
        Name: Intel(R) Xeon(R) E5/E7 v3 Processor code named Haswell
        Frequency: 2.300 GHz 
        Logical CPU Count: 64

If you want to skip descriptions of detected performance issues in the report,
enter: vtune -report summary -report-knob show-issues=false -r <my_result_dir>.
Alternatively, you may view the report in the csv format: vtune -report
<report_name> -format=csv.
vtune: Executing actions 100 % done
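The low core-utilization figure can be sanity-checked by hand: the average number of concurrently busy cores is roughly the total CPU time divided by the elapsed wall-clock time. Using the numbers from the summary above:

```shell
# Average busy cores ≈ CPU time / elapsed time (values from the VTune summary)
cpu_time=11.720   # seconds
elapsed=3.992     # seconds
awk -v c="$cpu_time" -v e="$elapsed" 'BEGIN { printf "%.3f\n", c / e }'
# prints: 2.936 (close to the 2.963 reported by VTune)
```

With OMP_NUM_THREADS=4, this means the four threads were busy for most of the run, but on a 32-core node that is still under 10% physical core utilization, which is exactly what VTune flags.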

Clearly, our code needs some serious optimization.

VTune stores the resulting performance data in the current working directory, in a new result directory created for each profiling run (r000hs, r001hs, and so on).

shaima0d@cdl2> ls -l
total 155992
drwxr-sr-x 5 shaima0d k01      4096 Aug 20 13:00 Gauss_demo
-rw-r--r-- 1 shaima0d k01       577 Aug 25 15:10 jobscript.slurm
-rwxr-xr-x 1 shaima0d k01 159686656 Aug 20 13:06 mpich332_ksl_latest.sif
drwxr-sr-x 7 shaima0d k01      4096 Aug 25 15:14 r000hs
-rw-r--r-- 1 shaima0d k01      8587 Aug 25 15:14 slurm-15509692.out
-rw-r--r-- 1 shaima0d k01     17406 Aug 25 15:14 solution.out

We can visually analyze these results using vtune-gui to dive deeper and identify hotspots in our source code. Before launching the GUI, please make sure you have logged in with X11 forwarding enabled (i.e. ssh -X ..., or on macOS ssh -Y ...).

shaima0d@cdl2:/project/k01/shaima0d/tickets/32169> source /opt/intel/vtune_profiler/amplxe-vars.sh
Copyright (C) 2009-2019 Intel Corporation. All rights reserved.
Intel(R) VTune(TM) Profiler 2020 (build 605129)
shaima0d@cdl2:/project/k01/shaima0d/tickets/32169> vtune-gui r000hs  

This opens the GUI as an X window on your laptop/workstation:

Selecting the appropriate analysis view (Bottom-up in the case below), one can investigate the hotspots further:
