Using Accelerate on Ibex
Accelerate provides a simple API that lets your scripts run in any distributed setting (multi-GPU on a single node, or multi-GPU across several nodes) while still letting you write your own training loop.
Installing Accelerate:
You’ll need to install conda first; see Recommendations to fix broken Miniconda Installations on Ibex or the Mambaforge installation on WekaIO example.
You can save the following as a file named env.yml
name: acc_env
channels:
- conda-forge
- pytorch
- nvidia
- anaconda
- defaults
dependencies:
- python=3.11.5
- pip=23.2.1
- accelerate=0.22.0
- cudatoolkit=11.8
- transformers=4.33.1
- pytorch=2.0.1
- torchvision=0.15.2
- torchaudio=2.0.2
- pytorch-cuda=11.8
- scikit-learn=1.2.2
- evaluate=0.4.0
Once you have created the file, run the following command to create the conda environment:
conda env create -f env.yml
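Once the environment is created, you can activate it and run a quick sanity check (a sketch, assuming conda is initialized in your shell):

```shell
# Activate the environment defined in env.yml
conda activate acc_env
# Print the installed Accelerate version to confirm the install worked
accelerate --version
```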
Running Accelerate:
You can find an example Python training file in: complete_nlp_example.py
Launching Accelerate in an interactive session
You can start by requesting an interactive session from Slurm with the desired number of GPUs. For example:
[elghm0a@login510-27 acc_test]$ srun -N 1 --gres=gpu:v100:8 --time=3:0:0 --pty bash
You’ll then need to activate the conda environment.
Finally, you can start the training process by calling Accelerate’s launcher, accelerate launch.
(Optional) You can append --checkpointing_steps epoch to create checkpoints after each epoch.
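Put together, the launch step might look like the following sketch; the --multi_gpu and --num_processes values are assumptions matched to the 8-GPU session requested above:

```shell
# Activate the environment created earlier
conda activate acc_env
# Launch the example training script on all 8 GPUs of this node,
# checkpointing after each epoch
accelerate launch --multi_gpu --num_processes 8 \
    complete_nlp_example.py --checkpointing_steps epoch
```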
The output should look like the following.
Launching Accelerate in a jobscript
You can also run Accelerate through a Slurm jobscript.
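A minimal jobscript might look like the following sketch; the job name, resource requests, and output filename pattern are assumptions mirroring the interactive example above:

```shell
#!/bin/bash
#SBATCH --job-name=acc_train
#SBATCH --nodes=1
#SBATCH --gres=gpu:v100:8
#SBATCH --time=3:00:00
#SBATCH --output=%x-%j.out

# Activate the conda environment (replace the placeholder with your
# own conda installation path)
source <conda_installation_path>/bin/activate acc_env

# Launch the example training script on all 8 GPUs of the node
accelerate launch --multi_gpu --num_processes 8 complete_nlp_example.py
```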
Replace <conda_installation_path> with the installation path of your conda.
The output will be redirected to a .out file.