In some situations you may want to look at the real-time utilization of the compute resources allocated to you. There are multiple ways of doing this. NVIDIA publishes a neat solution, jupyterlab-nvdashboard, which visualizes the time-series output from nvidia-smi in a Bokeh dashboard. nvidia-smi uses the NVML library to collect these metrics.
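For a quick look without a dashboard, nvidia-smi itself can emit the same NVML metrics as CSV. The sketch below shows the query command and how one might extract a single field from its output; the sample line is hypothetical (hard-coded so the sketch runs without a GPU):

```shell
# On a GPU node, sample utilization every 5 seconds:
#   nvidia-smi --query-gpu=timestamp,utilization.gpu,memory.used --format=csv,noheader -l 5
# Each output line looks like the sample below; cut extracts the GPU utilization field.
sample="2024/01/01 12:00:00.000, 87 %, 10240 MiB"   # hypothetical captured line
gpu_util=$(echo "$sample" | cut -d',' -f2 | tr -dc '0-9')
echo "GPU utilization: ${gpu_util}%"
```

Redirecting such samples to a file gives you a simple time series you can plot later, which is essentially what the dashboard automates.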
To install it in your conda environment you can use pip:

```shell
pip install jupyterlab-nvdashboard
```
You can then launch the dashboard server in your jobscript:

```shell
# Try a different port if the following one is occupied by another user
nvdashboard 10101 &
echo "ssh -L 10101:$(/bin/hostname):10101 $USER@glogin.ibex.kaust.edu.sa"
```
To connect to this server, you will first establish an SSH tunnel to the compute node your training job is running on. For example, suppose the squeue -u $USER command tells you that your job has started on node gpu212-04. Open a new terminal and run the following command: