Using Accelerate on Ibex

Accelerate provides a simple API to make your scripts run in any kind of distributed setting (multi-GPU on one node, or multi-GPU across several nodes) while still letting you write your own training loop.

Installing Accelerate:

You’ll need to install conda first; see https://kaust-supercomputing-lab.atlassian.net/wiki/spaces/Doc/pages/636125185/Recommendations+to+fix+broken+Minicondas+Installations+on+Ibex#Mambaforge-installation-on-WekaIO---Example

You can save the following as a file named env.yml:

name: acc_env
channels:
  - conda-forge
  - pytorch
  - nvidia
  - anaconda
  - defaults
dependencies:
  - python=3.11.5
  - pip=23.2.1
  - accelerate=0.22.0
  - cudatoolkit=11.8
  - transformers=4.33.1
  - pytorch=2.0.1
  - torchvision=0.15.2
  - torchaudio=2.0.2
  - pytorch-cuda=11.8
  - scikit-learn=1.2.2
  - evaluate=0.4.0

Once you have created the file, run the following command to create the conda environment:

conda env create -f env.yml


Running Accelerate:

You can find an example Python training script in complete_nlp_example.py

Launching Accelerate in an interactive session

You can start by requesting an interactive session from Slurm with the desired number of GPUs. For example:

[elghm0a@login510-27 acc_test]$ srun -N 1 --gres=gpu:v100:8 --time=3:0:0 --pty bash

You’ll then need to activate the conda environment.
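Assuming the environment defined in env.yml above (named acc_env), this would be:

```shell
conda activate acc_env
```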

Finally, you can start the training process by calling Accelerate’s launcher.
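For example, using the example script named above (the exact arguments may vary with your setup):

```shell
accelerate launch complete_nlp_example.py
```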

(Optional) You can add --checkpointing_steps epoch at the end of the command to create checkpoints after each epoch.

The training progress will then be printed to the terminal.

Launching Accelerate in a jobscript

You can also run Accelerate through a Slurm jobscript.

Replace <conda_installation_path> with the installation path of your conda.
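A minimal jobscript sketch is shown below. The SBATCH resources mirror the interactive example above; the job name and output pattern are illustrative placeholders, so adjust them, along with <conda_installation_path>, for your own setup:

```shell
#!/bin/bash
#SBATCH --job-name=acc_test
#SBATCH --nodes=1
#SBATCH --gres=gpu:v100:8
#SBATCH --time=3:0:0
#SBATCH --output=%x-%j.out

# Activate the conda environment created from env.yml
source <conda_installation_path>/bin/activate acc_env

# Launch the training script with Accelerate
accelerate launch complete_nlp_example.py
```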


The job’s output will be redirected to a .out file.

 

Reference

Accelerate Documentation (huggingface.co)