Using ARM Performance Reports

ARM Performance Reports is a tool for getting a high-level understanding of the CPU, memory, and I/O usage of your code, and thus for characterizing whether it is compute-, memory-, or I/O-bound.
The only mandatory requirement for a code to be profiled with the Performance Reports tool is that it is dynamically linked. ARM Performance Reports is installed on both Shaheen and Ibex.
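You can verify that an executable is dynamically linked with ldd before profiling it. A minimal sketch, assuming the executable is named heat_hybrid as in the example later on this page:

# A dynamically linked binary lists the shared libraries it loads at runtime;
# a statically linked one reports "not a dynamic executable" instead.
ldd ./heat_hybrid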

perf-report on Ibex

To run perf-report on Ibex, wrap your application launch in a SLURM jobscript. Here we profile a C code built from source with OpenMPI and dynamically linked. The executable heat_hybrid is an MPI+OpenMP code which we wish to run with 4 MPI tasks and 8 OpenMP threads per MPI task.

#!/bin/bash
#SBATCH -n 4
#SBATCH -c 8
#SBATCH -N 1
#SBATCH -t 00:10:00
#SBATCH -A ibex-cs

module load openmpi/4.0.3
module load arm-forge/20.2.1

export ALLINEA_LICENSE_FILE=$PERF_REPORTS_LICENCE_FILE
export OMP_NUM_THREADS=8

perf-report --mpi -n 4 --openmp-threads=${OMP_NUM_THREADS} ./heat_hybrid
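The jobscript is submitted like any other SLURM job. A minimal sketch, where the filename perf_report.slurm is only an illustrative name for the script above:

# Submit the profiling run and check its status
sbatch perf_report.slurm
squeue -u $USER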


When the run completes, perf-report produces two report files in the working directory: a plain-text summary and an HTML version of the same report.
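Both files carry the same content, so the text report can be read directly on the cluster while the HTML file is usually copied to a local machine and opened in a browser. A minimal sketch, assuming default output names that start with the executable name (the exact names include the process count and a timestamp and depend on your run):

# List and view the generated reports (wildcards used because the exact
# filenames depend on the run date and configuration)
ls heat_hybrid_*.txt heat_hybrid_*.html
cat heat_hybrid_*.txt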

Command:        /ibex/scratch/shaima0d/scratch/forge_test/advanced-parallel-prog/hybrid/heat-fine/c/solution/heat_hybrid
Resources:      1 node (128 physical, 128 logical cores per node)
Memory:         504 GiB per node
Tasks:          4 processes, OMP_NUM_THREADS was 8
Machine:        cn514-05-r
Start time:     Tue Jan 11 14:27:12 2022
Total time:     3 seconds
Full path:      /ibex/scratch/shaima0d/scratch/forge_test/advanced-parallel-prog/hybrid/heat-fine/c/solution

Summary: heat_hybrid is MPI-bound in this configuration
Compute:   29.6% |==|
MPI:       70.4% |======|
I/O:        0.0% |
This application run was MPI-bound. A breakdown of this time and advice for
investigating further is in the MPI section below.

CPU:
A breakdown of the 29.6% CPU time:
  Single-core code:     9.5% ||
  OpenMP regions:      90.5% |========|
  Scalar numeric ops:   1.5% ||
  Vector numeric ops:  31.4% |==|
  Memory accesses:     45.1% |====|
The per-core performance is memory-bound. Use a profiler to identify
time-consuming loops and check their cache performance.

MPI:
A breakdown of the 70.4% MPI time:
  Time in collective calls:               6.3% ||
  Time in point-to-point calls:          93.7% |========|
  Effective process collective rate:      0.00 bytes/s
  Effective process point-to-point rate:  26.8 MB/s

I/O:
A breakdown of the 0.0% I/O time:
  Time in reads:                 0.0% |
  Time in writes:                0.0% |
  Effective process read rate:   0.00 bytes/s
  Effective process write rate:  0.00 bytes/s
No time is spent in I/O operations. There's nothing to optimize here!

OpenMP:
A breakdown of the 90.5% time in OpenMP regions:
  Computation:                95.8% |=========|
  Synchronization:             4.2% ||
  Physical core utilization:  25.0% |==|
  System load:                80.6% |=======|
Physical core utilization is low and some cores may be unused. Try
increasing OMP_NUM_THREADS to improve performance.

Memory:
Per-process memory usage may also affect scaling:
  Mean process memory usage:  166 MiB
  Peak process memory usage:  208 MiB
  Peak node memory usage:    24.0% |=|
The peak node memory usage is very low. Running with fewer MPI processes
and more data on each process may be more efficient.

Energy:
A breakdown of how energy was used:
  CPU:              not supported
  System:           not supported
  Mean node power:  not supported
  Peak node power:  0.00 W
Energy metrics are not available on this system.
CPU metrics are not supported (no intel_rapl module)


The report above provides a high-level breakdown of where the CPU time went and how much memory was used. It also reports the fraction of time spent in OpenMP regions and how well the cores were utilized. The adviser text under each section offers concrete suggestions that can improve the performance of subsequent runs.
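For example, the OpenMP section above notes that physical core utilization is low and suggests increasing OMP_NUM_THREADS. A minimal sketch of one follow-up experiment (not a tuned configuration) is to rerun the same jobscript with more cores and threads per MPI task, keeping tasks times threads within the 128 physical cores of the node:

#SBATCH -c 32                  # more cores per MPI task (4 tasks x 32 threads = 128 cores)
export OMP_NUM_THREADS=32      # match the number of cores requested per task
perf-report --mpi -n 4 --openmp-threads=${OMP_NUM_THREADS} ./heat_hybrid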