This repository provides a reproducible workflow for training and evaluating Hugging Face’s open-r1 model in a high-performance computing (HPC) environment that differs from the one used by the original developers.
While the original open-r1
repository assumes access to a node with 8×H100 GPUs, many HPC setups offer nodes with only 4 GPUs. This guide demonstrates how to adapt the training process to nodes with 4 GPUs. In particular, the provided configuration supports training across 2 nodes, each equipped with 4×H100 GPUs.
Since Open R1 is an actively developed project, the upstream repository changes frequently and the official README may not always reflect the latest updates. This branch targets a specific commit of the original repository, and the steps provided here are tailored accordingly to ensure reproducibility.
The original repository recommends using uv to manage dependencies. However, uv cannot be installed directly in HPC environments. To work around this, first create a Python virtual environment and then install uv inside this virtual environment:

```bash
pip install uv
```

Once uv is installed, follow the installation steps from the original repository to install all required dependencies.
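As a rough sketch of this workaround (the module name, virtual-environment path, and final install command are placeholders; follow the upstream open-r1 README for the exact dependency installation at the pinned commit):

```bash
# Load a Python module provided by your cluster (module name is site-specific).
module load Python                      # hypothetical module name

# Create and activate a virtual environment, then install uv inside it.
python -m venv ~/venvs/openr1
source ~/venvs/openr1/bin/activate
pip install --upgrade pip
pip install uv

# With uv available, install the project's dependencies as described in the
# upstream open-r1 README (exact extras may differ for the pinned commit).
uv pip install -e .
```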
> [!NOTE]
> The training setup below is configured for 2 nodes, each equipped with 4×H100 (80 GB) GPUs.
> For different hardware configurations or topologies, adjust the per-device batch size and the number of gradient accumulation steps to maintain the same global batch size:
>
> `Global Batch Size = (Per-Device Batch Size × Number of GPUs) × Gradient Accumulation Steps`
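For example, with illustrative values (not necessarily those used in the provided recipe), a global batch size of 128 can be preserved when moving between topologies:

```bash
# 2 nodes × 4×H100 (8 GPUs total):
#   16 (per-device) × 8 (GPUs) × 1 (gradient accumulation step)  = 128
# 1 node × 4×H100 (4 GPUs total):
#   16 (per-device) × 4 (GPUs) × 2 (gradient accumulation steps) = 128
```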
To perform supervised fine-tuning (SFT) using DeepSpeed ZeRO-3 on 2 nodes, run the following command:
```bash
sbatch slurm/quick_sfttrain.slurm
```
This script fine-tunes the Qwen2.5-Math-7B-Instruct model (student) on the OpenR1-Math-220k dataset using configurations optimized for multi-node training.
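For reference, the resource request for this two-node topology typically looks like the sketch below; the actual directives live in `slurm/quick_sfttrain.slurm`, and the partition, account, CPU, and time values here are placeholders for your cluster.

```bash
#!/bin/bash
#SBATCH --job-name=openr1-sft
#SBATCH --nodes=2                 # two nodes ...
#SBATCH --gres=gpu:4              # ... with 4×H100 GPUs each
#SBATCH --ntasks-per-node=1       # one launcher process per node
#SBATCH --cpus-per-task=32        # placeholder: match your cluster's CPU layout
#SBATCH --partition=<partition>   # placeholder
#SBATCH --account=<account>       # placeholder
#SBATCH --time=24:00:00           # placeholder
```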
Once training is complete, model evaluation is performed using lighteval.
To run evaluation on 1 node with 4 GPUs, use the following Slurm command:
```bash
sbatch slurm/quick_evaluate.slurm
```
This script evaluates the fine-tuned model on the MATH-500 benchmark dataset. The results will be saved in the `data/evals` directory.
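For orientation, the upstream open-r1 repository drives lighteval through its vLLM backend, and `slurm/quick_evaluate.slurm` wraps a command of roughly the following shape; the model path, generation settings, and exact flags below are assumptions and may differ for the pinned commit and your lighteval version.

```bash
MODEL=path/to/your/fine-tuned-checkpoint     # placeholder checkpoint path
MODEL_ARGS="pretrained=$MODEL,dtype=bfloat16,max_model_length=32768,gpu_memory_utilization=0.8"
TASK=math_500                                # MATH-500 task defined by open-r1

lighteval vllm "$MODEL_ARGS" "custom|$TASK|0|0" \
    --custom-tasks src/open_r1/evaluate.py \
    --use-chat-template \
    --output-dir data/evals
```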
The following table presents the pass@1 accuracy of different models on the AIME 2024, MATH-500, and GPQA Diamond benchmarks. Evaluation followed the script described above.

The Qwen2.5-Math-7B-Instruct model (student) was fine-tuned using data distilled from the DeepSeek R1 model (teacher), resulting in a substantial performance gain on the AIME 2024 benchmark, from 13.3% to 56.7% pass@1 accuracy.
| Model | AIME 2024 pass@1 | MATH-500 pass@1 | GPQA Diamond pass@1 |
|---|---|---|---|
| Qwen2.5-Math-7B-Instruct (Original) | 13.3 | 80.2 | 28.3 |
| Qwen2.5-Math-7B-Instruct (Fine-tuned on DeepSeek R1 distilled data) | 56.7 | 89.8 | 54.5 |
| DeepSeek-R1-Distill-Qwen-7B (Teacher) | 53.3 | 93.2 | 53.0 |
I would like to thank all the contributors to the open-r1 project for their valuable work. I am also grateful to the Center for Information Services and High Performance Computing (Zentrum für Informationsdienste und Hochleistungsrechnen, ZIH) at TU Dresden for providing access to its facilities for high-throughput computations.