...
In the case of multi-node training jobs, you will need to run the nvdashboard command on every node. You will also need to open a separate ssh tunnel for each node and view each one in its own browser window. Use a different localhost port for each tunnel:

    ssh -L 10101:gpu212-04:10101 username@glogin.ibex.kaust.edu.sa
    ssh -L 10102:gpu212-04:10101 username@glogin.ibex.kaust.edu.sa
    ssh -L 10103:gpu212-04:10101 username@glogin.ibex.kaust.edu.sa
    ssh -L 10104:gpu212-04:10101 username@glogin.ibex.kaust.edu.sa

The above assumes connecting to 4 different nodes on Ibex, so replace gpu212-04 in each command with the hostname of the node that tunnel should reach. Your localhost (i.e. your laptop/workstation) will then be listening to these nodes on 4 different ports.
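
When a job spans many nodes, writing each tunnel command by hand gets error-prone. As a rough sketch, the per-node commands could be generated with a small shell function that assigns consecutive local ports. The function name `tunnel_cmds` and the node names other than gpu212-04 are hypothetical placeholders; take the real hostnames from your own job allocation. The script only prints the commands, it does not open any connection.

```shell
#!/bin/sh
# Print one ssh tunnel command per compute node, each on its own local port.
# LOGIN and the node names are placeholders; substitute your own values.
LOGIN="username@glogin.ibex.kaust.edu.sa"
REMOTE_PORT=10101   # remote port used in the tunnel commands above

tunnel_cmds() {
    local_port=$1   # first local port to use
    shift           # remaining arguments are node hostnames
    for node in "$@"; do
        echo "ssh -L ${local_port}:${node}:${REMOTE_PORT} ${LOGIN}"
        local_port=$((local_port + 1))
    done
}

# Hypothetical node list (only gpu212-04 appears in the original example).
tunnel_cmds 10101 gpu212-04 gpu212-05 gpu212-06 gpu212-07
```

Each printed line can be pasted into its own terminal. Since ssh accepts several `-L` options in one invocation, the forwards could also be combined into a single command if you prefer one connection.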
NVLink metrics are broken at the moment, and the developers have an open Git issue to fix it.