Passionate about AI with roots in theoretical neuroscience, I explore LLMs, AI agents, and human mobility, driven by a vision of accessible, everyday AI for all.
This project demonstrates a post-training workflow that substantially improves large language model performance in a distributed HPC environment.
This work post-trained a weaker LLM, Qwen2.5, by fine-tuning it on high-quality data distilled from DeepSeek R1. Training used Supervised Fine-Tuning (SFT) and Group Relative Policy Optimization (GRPO) across 8 H100 GPUs distributed over 2 HPC nodes. A key aspect of the project was learning, optimizing, and simplifying the deployment of the training workflow for common HPC setups.
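To make the GRPO step more concrete, here is a minimal sketch of its core idea: rewards for several completions sampled from the same prompt are normalized against their own group, so no separate value model is needed. This is an illustration only, not the project's training code. The function name, the `eps` constant, the 0/1 reward values, and the choice of population standard deviation are all assumptions; real implementations differ in these details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of completions for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps).
    `eps` (an assumed value) avoids division by zero when all rewards match.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards: 1.0 if a sampled answer was judged correct, else 0.0.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# Correct completions receive positive advantages, incorrect ones negative.
for r, a in zip(rewards, advantages):
    print(f"reward={r:.1f} advantage={a:+.3f}")
```

When every completion in a group earns the same reward, all advantages are zero, so that prompt contributes no learning signal for that step.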
Post-training yielded substantial gains: the Qwen2.5-Math-7B-Instruct model’s pass@1 accuracy rose from 13.3% to 56.7% on the AIME 2024 benchmark, from 80.2% to 89.8% on MATH-500, and from 28.3% to 54.5% on GPQA Diamond. This demonstrates the effectiveness of the distilled data approach in bringing weaker LLMs closer to DeepSeek R1’s performance.
| Model | AIME 2024 pass@1 (%) | MATH-500 pass@1 (%) | GPQA Diamond pass@1 (%) |
|---|---|---|---|
| Qwen2.5-Math-7B-Instruct (Original) | 13.3 | 80.2 | 28.3 |
| Qwen2.5-Math-7B-Instruct (Fine-tuned on DeepSeek R1 distilled data) | 56.7 | 89.8 | 54.5 |
| DeepSeek-R1-Distill-Qwen-7B (Teacher) | 53.3 | 93.2 | 53.0 |
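For readers unfamiliar with the metric, the sketch below shows one common way to compute pass@1: the fraction of sampled answers that are correct, averaged over problems. The function name and the example data are hypothetical. The example assumes the 30-problem AIME 2024 set with one sample per problem, which is not stated above, so the solved count of 17 is an inference from the reported 56.7%, not a figure from the project.

```python
def pass_at_1(correct_counts, samples_per_problem):
    """Return pass@1 as a percentage.

    correct_counts[i] is how many of the sampled answers for problem i
    were correct; each problem contributes its own fraction correct.
    """
    per_problem = [c / samples_per_problem for c in correct_counts]
    return 100.0 * sum(per_problem) / len(per_problem)


# Hypothetical data: 30 problems, one sample each, 17 answered correctly.
results = [1] * 17 + [0] * 13
score = pass_at_1(results, samples_per_problem=1)
print(round(score, 1))  # prints 56.7
```

With more than one sample per problem, the same function averages the per-problem success rates, which reduces the variance of the estimate on small benchmarks.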
More details can be found in the project repository on GitHub.
This project fine-tunes Llama 3.1–8B Instruct to perform sentiment classification on short-form text, such as tweets. The model learns to identify sentiments—positive, neutral, or negative—using the tweet_sentiment_extraction subset from the MTEB benchmark.
Using instruction-style prompts and a streamlined training pipeline, the model was fine-tuned to produce single-word sentiment predictions. Fine-tuning improved accuracy substantially and makes the model well suited to real-world applications such as social media monitoring and customer feedback analysis.
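As a rough illustration of the instruction-style setup, the sketch below builds a prompt that asks for a single-word answer, parses the model's output into one of the three labels, and scores predictions. The prompt wording, the function names, and the parsing rule are assumptions made for this example; the project's actual template may differ.

```python
LABELS = ("positive", "neutral", "negative")


def build_prompt(text):
    """Format a tweet as an instruction asking for a one-word sentiment."""
    return (
        "Classify the sentiment of the following tweet as positive, "
        "neutral, or negative. Answer with a single word.\n\n"
        f"Tweet: {text}\nSentiment:"
    )


def parse_label(generation):
    """Return the first word of the output if it is a known label, else None."""
    words = generation.strip().lower().split()
    if not words:
        return None
    first = words[0].strip(".,!:;\"'")
    return first if first in LABELS else None


def accuracy(predictions, gold):
    """Percentage of predictions that exactly match the gold labels."""
    matches = sum(p == g for p, g in zip(predictions, gold))
    return 100.0 * matches / len(gold)


# Hypothetical model outputs for two tweets.
outputs = [" Positive.", "I am not sure"]
predictions = [parse_label(o) for o in outputs]
print(predictions)
print(accuracy(predictions, ["positive", "negative"]))
```

Treating unparseable outputs as wrong, as done here, keeps the accuracy figure conservative: a model that ignores the single-word instruction is penalized instead of being silently skipped.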
On the MTEB tweet sentiment test set, fine-tuning raised accuracy from 63.41% to 81.49%, a gain of 18.08 percentage points:
Accuracy on MTEB Tweet Sentiment Classification
| Model | Accuracy (%) |
|---|---|
| Llama 3.1–8B (zero-shot) | 63.41 |
| Llama 3.1–8B (fine-tuned) | 81.49 |
More details can be found in the project repository on GitHub.