LLM Distillation & Fine-Tuning

Distilling DeepSeek R1 for Enhanced LLM Performance

This project demonstrates a post-training methodology that substantially improves the benchmark performance of a smaller large language model, run in a distributed HPC environment.

Brief Technique & Impact

This work post-trains a weaker LLM, Qwen2.5, by fine-tuning it on high-quality data distilled from DeepSeek R1. Training used Supervised Fine-Tuning (SFT) and Group Relative Policy Optimization (GRPO) on 8 H100 GPUs distributed over 2 HPC nodes. A key part of the project was learning, optimizing, and simplifying the deployment of the training workflow for common HPC setups.
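The core idea behind GRPO can be sketched in a few lines: several completions are sampled for the same prompt, each is scored, and every reward is normalized against its own group instead of against a learned value function. The snippet below is a minimal illustration only, not the project's training code. The function name, the `eps` constant, the use of the population standard deviation, and the binary reward values are all assumptions made for this example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize the rewards of completions sampled for one prompt.

    Each advantage is the reward's distance from the group mean,
    divided by the group standard deviation (eps avoids division by zero).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group: four sampled answers to one math prompt,
# with reward 1.0 when the final answer is correct and 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Correct completions receive a positive advantage and incorrect ones a negative advantage. When every completion in a group earns the same reward, all advantages are zero, so that group contributes no learning signal.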

Performance Highlights

Post-training raised the Qwen2.5-Math-7B-Instruct model’s pass@1 accuracy on the AIME 2024 benchmark from 13.3% to 56.7%, on MATH-500 from 80.2% to 89.8%, and on GPQA Diamond from 28.3% to 54.5%. These results indicate that fine-tuning on distilled data can bring a weaker LLM much closer to DeepSeek R1’s performance.

Model	AIME 2024 pass@1 (%)	MATH-500 pass@1 (%)	GPQA Diamond pass@1 (%)
Qwen2.5-Math-7B-Instruct (Original)	13.3	80.2	28.3
Qwen2.5-Math-7B-Instruct (Fine-tuned on DeepSeek R1 distilled data)	56.7	89.8	54.5
DeepSeek-R1-Distill-Qwen-7B (Teacher)	53.3	93.2	53.0
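For readers unfamiliar with the metric, pass@1 is the expected fraction of problems solved by a single sampled answer. The sketch below uses the standard unbiased pass@k estimator and averages it over problems. The function names and the per-problem counts are hypothetical, and the source does not state how many samples per problem were drawn, so one sample per problem is an assumption here.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples, of which c are correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_1(correct_counts, n_samples):
    """Average pass@1 over all problems, as a percentage."""
    scores = [pass_at_k(n_samples, c, 1) for c in correct_counts]
    return 100.0 * sum(scores) / len(scores)


# Hypothetical run: 30 problems (the size of AIME 2024), one sample each,
# with 17 problems answered correctly.
counts = [1] * 17 + [0] * 13
score = benchmark_pass_at_1(counts, n_samples=1)
print(round(score, 1))
```

Under that one-sample assumption, 17 of 30 problems solved corresponds to the 56.7 reported for AIME 2024, and 4 of 30 to the 13.3 baseline.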

More details can be found in the project repository on GitHub.

Fine-Tuning Llama 3.1 for Sentiment Analysis

This project fine-tunes Llama 3.1-8B Instruct to perform sentiment classification on short-form text, such as tweets. The model learns to label each text as positive, neutral, or negative, using the tweet_sentiment_extraction subset from the MTEB benchmark.

Technique & Impact

Using instruction-style prompts and a streamlined training pipeline, the model was fine-tuned to produce a single-word sentiment prediction for each input. The resulting accuracy gain makes the model better suited to real-world applications such as social media monitoring and customer feedback analysis.
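As a rough illustration of what an instruction-style prompt with a single-word answer might look like, the sketch below builds a prompt, maps a raw completion back to a label, and scores predictions. The prompt wording, the function names, and the parsing rule are assumptions for this example and are not taken from the project's code.

```python
LABELS = ("positive", "neutral", "negative")


def build_prompt(text):
    """Wrap a tweet in a hypothetical instruction-style prompt."""
    return (
        "Classify the sentiment of the following tweet as positive, "
        "neutral, or negative. Answer with a single word.\n\n"
        f"Tweet: {text}\nSentiment:"
    )


def parse_label(completion):
    """Return the label named by the first word, or None if it is not a label."""
    words = completion.strip().lower().split()
    if not words:
        return None
    first = words[0].strip(".,!:;\"'")
    return first if first in LABELS else None


def accuracy(predictions, references):
    """Percentage of predictions that match the reference labels."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return 100.0 * correct / len(references)


# Hypothetical raw completions and gold labels.
raw = [" Positive.\n", "neutral", "I am not sure"]
gold = ["positive", "negative", "neutral"]
preds = [parse_label(r) for r in raw]
print(preds, accuracy(preds, gold))
```

Treating an unparseable completion as `None` counts it as an error, which keeps the accuracy figure honest when the model drifts from the single-word format.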

Performance Highlights

On the MTEB tweet sentiment test set, fine-tuning raised accuracy by about 18 percentage points over the zero-shot baseline:

Accuracy on MTEB Tweet Sentiment Classification

Model	Accuracy (%)
Llama 3.1-8B (zero-shot)	63.41
Llama 3.1-8B (fine-tuned)	81.49

Radar plot showing model performance

More details can be found in the project repository on GitHub.