In some situations you may want to look at the real-time utilization of the compute resources allocated to you. There are multiple ways of doing this. NVIDIA published a neat solution that visualizes the time-series output of nvidia-smi in a Bokeh dashboard. nvidia-smi uses the NVML library to collect the metrics.
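To get a feel for the raw numbers the dashboard plots, the same kind of time series can be printed straight from nvidia-smi. The sketch below is only an illustration: the function name gpu_timeseries and the choice of metrics are assumptions, not part of the dashboard itself.

```shell
# Hypothetical helper: print a CSV time series of basic GPU metrics.
#   $1 = seconds between samples (defaults to 5)
gpu_timeseries() {
    # nvidia-smi only exists on nodes with an NVIDIA driver installed
    if ! command -v nvidia-smi >/dev/null 2>&1; then
        echo "nvidia-smi not found; run this on a GPU node" >&2
        return 1
    fi
    # -l repeats the query at the given interval until interrupted
    nvidia-smi \
        --query-gpu=timestamp,index,utilization.gpu,memory.used,memory.total \
        --format=csv -l "${1:-5}"
}

# Usage on a GPU node (stop with Ctrl-C):
#   gpu_timeseries 2
```

Each line of output is one sample per GPU, which is handy for a quick check when a browser is not available.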

To install the dashboard in your conda environment, use pip:

pip install jupyterlab-nvdashboard

You can then launch the dashboard server in your jobscript:

#!/bin/bash
#SBATCH --gpus=1
#SBATCH --time=00:10:00

# Try a different port if this one is already taken by another user
nvdashboard 10101 &
# Give the dashboard server a few seconds to start
sleep 10

python train.py
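Instead of hard-coding the port, the jobscript could probe for a free one. This is a minimal sketch under some assumptions: find_free_port is a hypothetical helper, the port range is arbitrary, and the /dev/tcp probe is a bash feature, so the script must run under bash. Another user could still grab the port between the check and the launch, so treat it as a convenience rather than a guarantee.

```shell
# Hypothetical helper: print the first port in [start, end] on this host
# that does not accept a connection, i.e. that nothing is listening on.
#   $1 = first port to try (default 10101), $2 = last port to try (default 10200)
find_free_port() {
    local start=${1:-10101}
    local end=${2:-10200}
    local port=$start
    while [ "$port" -le "$end" ]; do
        # Opening /dev/tcp/<host>/<port> makes bash attempt a TCP connection;
        # the attempt fails when no process is listening on that port.
        if ! (exec 3<>"/dev/tcp/127.0.0.1/${port}") 2>/dev/null; then
            echo "$port"
            return 0
        fi
        port=$((port + 1))
    done
    echo "no free port found in ${start}-${end}" >&2
    return 1
}

# Possible use in the jobscript:
#   PORT=$(find_free_port) || exit 1
#   echo "nvdashboard is listening on $(hostname):${PORT}"
#   nvdashboard "${PORT}" &
```

Printing the host name and port to the job's output file tells you which values to use when setting up the tunnel.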

To connect to this server, first establish an SSH tunnel to the compute node your training job is running on. For example, suppose the squeue -u $USER command shows that your job has started on node gpu212-04. Open a new terminal on your laptop/workstation and run the following command:

ssh -L 10101:gpu212-04:10101 username@glogin.ibex.kaust.edu.sa

Replace username in the above command with your own username.
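The tunnel command follows a fixed pattern, so it can be generated from the node name and port. The helper below is a sketch only; tunnel_cmd is a hypothetical name, and the squeue lookup in the comments assumes a single-node job in the RUNNING state.

```shell
# Hypothetical helper: print the ssh tunnel command for one compute node.
#   $1 = compute node name (e.g. as reported by squeue)
#   $2 = port nvdashboard listens on, on that node
#   $3 = your Ibex username
#   $4 = local port on your laptop (defaults to the remote port)
tunnel_cmd() {
    local node=$1
    local remote_port=$2
    local user=$3
    local local_port=${4:-$2}
    printf 'ssh -L %s:%s:%s %s@glogin.ibex.kaust.edu.sa\n' \
        "$local_port" "$node" "$remote_port" "$user"
}

# On the login node, the node name of a running single-node job can be
# looked up first (assumption: only one job of yours is running):
#   node=$(squeue -u "$USER" -h -t RUNNING -o "%N" | head -n 1)
#   tunnel_cmd "$node" 10101 "$USER"

tunnel_cmd gpu212-04 10101 username
# prints: ssh -L 10101:gpu212-04:10101 username@glogin.ibex.kaust.edu.sa
```

The printed command is meant to be copied and run in a terminal on your laptop/workstation, not on Ibex.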

Once logged in, use the following URL in your browser to access the dashboard:

http://localhost:10101

The dashboard will then load in your browser.

You can select the metrics of your choice, for instance all the GPU-related ones.

A few important things to note:

  • For multi-node training jobs, you will need to run the nvdashboard command on every node.

  • You will also need to open a separate SSH tunnel for each node and view each dashboard in its own browser tab. Use a different localhost port for each tunnel:

    ssh -L 10101:gpu212-04:10101 username@glogin.ibex.kaust.edu.sa
    ssh -L 10102:<node2>:10101 username@glogin.ibex.kaust.edu.sa
    ssh -L 10103:<node3>:10101 username@glogin.ibex.kaust.edu.sa
    ssh -L 10104:<node4>:10101 username@glogin.ibex.kaust.edu.sa

    The above assumes connecting to 4 different nodes on Ibex; <node2>, <node3> and <node4> are placeholders for the names of the other nodes of your job. Your localhost (i.e. your laptop/workstation) will be listening on 4 different ports, one per node, so the second node's dashboard is at http://localhost:10102, and so on.

  • NVLink metrics are broken at the moment; the developer has an open Git issue to fix this.
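For multi-node jobs, the per-node tunnel commands described in the notes above can be generated in a loop. The following is a sketch under stated assumptions: print_tunnels is a hypothetical helper, nodeA and nodeB are made-up node names, and the srun line in the comments is one possible way to start a dashboard on every node (depending on the Slurm version, it may need --overlap so that later job steps are not blocked).

```shell
# Hypothetical helper: print one ssh tunnel command per node, giving each
# node its own local port (remote_port, remote_port + 1, ...).
#   $1 = your Ibex username
#   $2 = port nvdashboard listens on, on every node
#   remaining arguments = compute node names
print_tunnels() {
    local user=$1
    local remote_port=$2
    shift 2
    local local_port=$remote_port
    local node
    for node in "$@"; do
        printf 'ssh -L %s:%s:%s %s@glogin.ibex.kaust.edu.sa\n' \
            "$local_port" "$node" "$remote_port" "$user"
        local_port=$((local_port + 1))
    done
}

# Possible use inside a multi-node jobscript (assumptions, see above):
#   srun --ntasks-per-node=1 nvdashboard 10101 &   # one dashboard per node
#   print_tunnels "$USER" 10101 $(scontrol show hostnames "$SLURM_JOB_NODELIST")

print_tunnels username 10101 nodeA nodeB
# prints:
#   ssh -L 10101:nodeA:10101 username@glogin.ibex.kaust.edu.sa
#   ssh -L 10102:nodeB:10101 username@glogin.ibex.kaust.edu.sa
```

Run each printed command in its own terminal on your laptop/workstation, then open http://localhost:10101, http://localhost:10102, and so on.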