Using ARM Performance Reports
ARM Performance Reports is a tool that gives a high-level overview of the CPU, memory, and I/O usage of your code, characterizing whether it is compute-, memory-, or I/O-bound.
The only mandatory requirement for profiling a code with ARM Performance Reports is that the executable is dynamically linked. ARM Performance Reports is installed on both Shaheen and Ibex.
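Whether an executable is dynamically linked can be checked with ldd before submitting a job. A minimal sketch (/bin/ls is used as a stand-in here so the snippet runs anywhere; on Ibex you would point it at your own binary, e.g. ./heat_hybrid):

```shell
# ldd succeeds only for dynamically linked executables;
# for a static binary it reports "not a dynamic executable" and exits nonzero.
BIN=/bin/ls   # stand-in; replace with ./heat_hybrid on Ibex
if ldd "$BIN" >/dev/null 2>&1; then
  LINKAGE=dynamic
else
  LINKAGE=static
fi
echo "$BIN: $LINKAGE"
```

If the check reports static, rebuild the code with dynamic linking before running perf-report.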
perf-report on Ibex
To run perf-report on Ibex, invoke it from a SLURM jobscript. Here we profile a C code built from source with OpenMPI and dynamically linked. The executable heat_hybrid is an MPI+OpenMP code which we wish to run with 4 MPI tasks and 8 OpenMP threads per MPI task.
#!/bin/bash
#SBATCH -n 4
#SBATCH -c 8
#SBATCH -N 1
#SBATCH -t 00:10:00
#SBATCH -A ibex-cs
module load openmpi/4.0.3
module load arm-forge/20.2.1
export ALLINEA_LICENSE_FILE=$PERF_REPORTS_LICENCE_FILE
export OMP_NUM_THREADS=8
perf-report --mpi -n 4 --openmp-threads=${OMP_NUM_THREADS} ./heat_hybrid
When the run completes, perf-report produces two report files: a plain-text file and an HTML file. The text report for this run is shown below.
Command: /ibex/scratch/shaima0d/scratch/forge_test/advanced-parallel-prog/hybrid/heat-fine/c/solution/heat_hybrid
Resources: 1 node (128 physical, 128 logical cores per node)
Memory: 504 GiB per node
Tasks: 4 processes, OMP_NUM_THREADS was 8
Machine: cn514-05-r
Start time: Tue Jan 11 14:27:12 2022
Total time: 3 seconds
Full path: /ibex/scratch/shaima0d/scratch/forge_test/advanced-parallel-prog/hybrid/heat-fine/c/solution
Summary: heat_hybrid is MPI-bound in this configuration
Compute: 29.6% |==|
MPI: 70.4% |======|
I/O: 0.0% |
This application run was MPI-bound. A breakdown of this time and advice for investigating further is in the MPI section below.
CPU:
A breakdown of the 29.6% CPU time:
Single-core code: 9.5% ||
OpenMP regions: 90.5% |========|
Scalar numeric ops: 1.5% ||
Vector numeric ops: 31.4% |==|
Memory accesses: 45.1% |====|
The per-core performance is memory-bound. Use a profiler to identify time-consuming loops and check their cache performance.
MPI:
A breakdown of the 70.4% MPI time:
Time in collective calls: 6.3% ||
Time in point-to-point calls: 93.7% |========|
Effective process collective rate: 0.00 bytes/s
Effective process point-to-point rate: 26.8 MB/s
I/O:
A breakdown of the 0.0% I/O time:
Time in reads: 0.0% |
Time in writes: 0.0% |
Effective process read rate: 0.00 bytes/s
Effective process write rate: 0.00 bytes/s
No time is spent in I/O operations. There's nothing to optimize here!
OpenMP:
A breakdown of the 90.5% time in OpenMP regions:
Computation: 95.8% |=========|
Synchronization: 4.2% ||
Physical core utilization: 25.0% |==|
System load: 80.6% |=======|
Physical core utilization is low and some cores may be unused. Try increasing OMP_NUM_THREADS to improve performance.
Memory:
Per-process memory usage may also affect scaling:
Mean process memory usage: 166 MiB
Peak process memory usage: 208 MiB
Peak node memory usage: 24.0% |=|
The peak node memory usage is very low. Running with fewer MPI processes and more data on each process may be more efficient.
Energy:
A breakdown of how energy was used:
CPU: not supported
System: not supported
Mean node power: not supported
Peak node power: 0.00 W
Energy metrics are not available on this system.
CPU metrics are not supported (no intel_rapl module)
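The 25.0% physical core utilization reported in the OpenMP section follows directly from the run configuration: 4 MPI tasks with 8 OpenMP threads each occupy 32 of the node's 128 physical cores. A quick check of that arithmetic:

```python
mpi_tasks = 4          # SBATCH -n 4
threads_per_task = 8   # OMP_NUM_THREADS=8
cores_per_node = 128   # "128 physical ... cores per node" in the report header

utilization = mpi_tasks * threads_per_task / cores_per_node
print(f"{utilization:.1%}")  # 25.0%
```

This is why the advice section suggests raising OMP_NUM_THREADS: with more threads per task, more of the node's cores would be busy.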
The report above provides a high-level breakdown of where the CPU time went and how much memory was used. It also reports the fraction of time spent in OpenMP regions and how well the cores were utilized. The advice in each section suggests concrete changes that can improve the performance of subsequent runs.
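If you want to track these numbers across runs, the summary percentages can be pulled out of the text report with a short script. A minimal sketch, assuming the report layout shown above (the embedded sample text and the parsing function are illustrative, not part of the tool):

```python
import re

# Summary lines as they appear in the perf-report text output above.
report = """\
Summary: heat_hybrid is MPI-bound in this configuration
Compute: 29.6% |==|
MPI: 70.4% |======|
I/O: 0.0% |
"""

def summary_percentages(text):
    """Extract the Compute/MPI/I/O percentages from a perf-report summary."""
    pattern = re.compile(r"^(Compute|MPI|I/O):\s+([\d.]+)%", re.MULTILINE)
    return {name: float(pct) for name, pct in pattern.findall(text)}

print(summary_percentages(report))
# {'Compute': 29.6, 'MPI': 70.4, 'I/O': 0.0}
```

Comparing these three numbers between runs (e.g. after changing OMP_NUM_THREADS) gives a quick view of whether a change moved the code away from being MPI-bound.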