In some situations you may want to look at the real-time utilization of the compute resources allocated to you. There are multiple ways of doing this. NVIDIA published a neat solution which visualizes the time-series output from nvidia-smi in a Bokeh dashboard. nvidia-smi uses the NVML library to collect the metrics.
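
If you only need a quick look without a dashboard, nvidia-smi itself can stream the same metrics as a time series. A minimal sketch (the one-second sampling interval and the chosen fields are just an example):

nvidia-smi --query-gpu=timestamp,utilization.gpu,memory.used,memory.total --format=csv -l 1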

To install it in your conda environment you can use pip:

pip install jupyterlab-nvdashboard
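
For example, assuming a fresh conda environment named monitor-env (the name is arbitrary), the full sequence would be:

conda create -n monitor-env python=3.10 -y
conda activate monitor-env
pip install jupyterlab-nvdashboard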

You can then launch the dashboard server in your jobscript:

#!/bin/bash
#SBATCH --gpus=1
#SBATCH --time=00:10:00

# Try a different port if the following one is occupied by another user
nvdashboard 10101 &
echo "ssh -L 10101:$(/bin/hostname):10101 $USER@glogin.ibex.kaust.edu.sa"
sleep 10

python train.py 
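
To tie the steps together, here is a sketch of submitting the jobscript and retrieving the tunnel command it echoes (the script name train_dashboard.sh and <jobid> are placeholders):

sbatch train_dashboard.sh        # submit the jobscript above
squeue -u $USER                  # check which node the job is running on, e.g. gpu212-04
grep ssh slurm-<jobid>.out       # the exact ssh tunnel command printed by the echo line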

To connect to this server, you will first establish an SSH tunnel to the compute node your training job is running on. For example, the squeue -u $USER command tells you that your job has started on the gpu212-04 node. Open a new terminal and run the following command:

ssh -L 10101:gpu212-04:10101 username@glogin.ibex.kaust.edu.sa
Note: Please use your own username in place of username in the above command.

Once logged in, use the following URL in your browser to access the dashboard:

http://localhost:10101
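
Optionally, before opening the browser, you can verify from your local machine that the tunnel is forwarding traffic (assuming port 10101 as above):

curl -sI http://localhost:10101 | head -n 1    # should print an HTTP status line if the dashboard is reachable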

The dashboard will then be displayed.

You can select the metrics of your choice. For instance, if you wish to see all GPU-related metrics:

A few important things to note: