Wen-Chuang Chou


Passionate about AI with roots in theoretical neuroscience,
I explore LLMs, generative models, and human mobility—driven by a vision of accessible, everyday AI for all.

View My GitHub Profile

My AI Portfolio

Engineering advanced LLMs—from benchmarking and profiling performance to distilling DeepSeek R1.


Performance Benchmarking: FP32 vs. BF16 Mixed Precision

Brief Technique & Impact

The benchmarks evaluate a modern decoder-only Transformer, spanning five configurations from a “Small” base (12 layers, hidden dimension 768) up to a 2.7B parameter model. Adopting BF16 mixed precision boosts inference throughput by up to 6x and enables training larger, memory-intensive architectures that fail under full precision due to Out-of-Memory (OOM) errors.
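
For context, a minimal sketch of how BF16 mixed precision is typically enabled in PyTorch via autocast; the model below is a toy stand-in for the benchmarked decoder-only Transformer, and all sizes are illustrative:

```python
import torch
import torch.nn as nn

# Toy stand-in for the benchmarked decoder-only Transformer (sizes are illustrative only).
model = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True),
    num_layers=12,
).cuda()
x = torch.randn(8, 1024, 768, device="cuda")  # (batch, sequence, hidden)

# Inference: matmuls run in bfloat16 while numerically sensitive ops stay in FP32.
with torch.no_grad(), torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    out = model(x)

# Training: the same context wraps the forward pass and loss. Unlike FP16, BF16 needs
# no GradScaler because its exponent range matches FP32.
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    loss = model(x).float().mean()  # placeholder loss
loss.backward()
```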

Performance Highlights

The benchmarks demonstrate BF16’s superior scalability. In inference, the largest 2.7B model achieves a nearly 6x speedup, jumping from 6.6k to 38.6k tokens/second. The small model (128M parameters) nearly triples its throughput, reaching over 400k tokens/second. Training benefits are equally critical: while the 128M model trains 2.8x faster with BF16, the technique’s true value lies in unlocking larger architectures. FP32 fails to train the ‘large’ configuration due to memory limits, whereas BF16 handles it at 24.8k tokens/second, making it essential for high-performance work under resource constraints.

Future Optimizations: Mitigating OOM

To further resolve memory bottlenecks, future work will integrate Activation Checkpointing and Flash Attention. These techniques significantly reduce memory usage, and updated benchmarks demonstrating their impact are in progress.


Profiling Arithmetic Intensity: MatMul vs. Softmax in Self-Attention

Theoretical Computational Complexity (FLOPs)

In the Transformer self-attention mechanism, we compare two primary operations: Matrix Multiplication (MatMul) for score calculation ($QK^T$) and the Softmax normalization layer. Let $S$ represent the sequence length and $D$ represent the head dimension.
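
As a quick sanity check on the counts used below (one multiply plus one add per accumulation for the score matrix, and roughly five elementwise or reduction passes for a numerically stable softmax):

$$\text{FLOPs}_{\text{MatMul}} = 2S^2D, \qquad \text{FLOPs}_{\text{Softmax}} \approx 5S^2, \qquad \frac{2S^2D}{5S^2} = \frac{2D}{5} = \frac{2 \cdot 64}{5} = 25.6$$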

Performance Profiling & Observations

Using PyTorch NVTX annotations and the NVIDIA Nsight Systems (nsys) profiler, I isolated these operations during a forward pass in transformer blocks with the configuration $S=128$ and $D=64$.
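
A simplified, standalone version of that setup is sketched below; it assumes the script is launched as nsys profile python attention_profile.py so the named NVTX ranges appear in the timeline:

```python
import torch

S, D = 128, 64  # sequence length and head dimension used in the profile
q = torch.randn(1, S, D, device="cuda")
k = torch.randn(1, S, D, device="cuda")

# Each NVTX range shows up as a named span in the Nsight Systems timeline.
torch.cuda.nvtx.range_push("attention_matmul")
scores = q @ k.transpose(-2, -1) / D ** 0.5  # (1, S, S) attention scores
torch.cuda.synchronize()
torch.cuda.nvtx.range_pop()

torch.cuda.nvtx.range_push("attention_softmax")
probs = torch.softmax(scores, dim=-1)
torch.cuda.synchronize()
torch.cuda.nvtx.range_pop()
```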

| Metric | Matrix Multiplication | Softmax | Ratio (MatMul/Softmax) |
| --- | --- | --- | --- |
| Runtime | 226.06 $\mu s$ | 467.49 $\mu s$ | ~0.48x |
| Theoretical FLOPs (at $D=64$) | $2S^2D$ | $5S^2$ | ~25.6x |

The Efficiency Paradox

The data reveals a striking discrepancy: while MatMul performs ~25.6 times more mathematical work than Softmax, it completes in less than half the time. This paradox highlights the difference between Compute-Bound and Memory-Bound operations:

  1. MatMul (Compute-Bound): Leveraging cuBLAS and hardware-level Tensor Cores (XMMA), the GPU performs massive parallel calculations on data already loaded into registers. It exhibits high Arithmetic Intensity.
  2. Softmax (Memory-Bound): Because Softmax is implemented as a series of separate CUDA kernels (Reduce, Exp, Add, Div), the GPU must move the ($S \times S$) matrix between VRAM and the cache for every single operation. This constant data movement creates a bottleneck, as memory bandwidth cannot keep up with the processing speed.

Solution: Fused Kernels

These results demonstrate that “FLOPs” are a poor predictor of actual runtime on modern GPUs. To optimize the Attention layer, one must implement Fused Kernels. Fusion allows the GPU to perform all Softmax steps while the data remains in fast Shared Memory, eliminating the costly round-trips to global memory.
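
A small sketch of the difference, assuming PyTorch 2.x where fused (FlashAttention-style) kernels are exposed through torch.nn.functional.scaled_dot_product_attention; shapes mirror the profiled configuration:

```python
import torch
import torch.nn.functional as F

# (batch, heads, sequence, head_dim), matching S=128 and D=64 from the profile.
q, k, v = (torch.randn(1, 12, 128, 64, device="cuda", dtype=torch.bfloat16) for _ in range(3))

# Naive formulation: materializes the (S x S) score matrix in global memory and runs
# softmax as separate kernels -- the memory-bound path measured above.
scores = (q @ k.transpose(-2, -1)) / 64 ** 0.5
naive = torch.softmax(scores, dim=-1) @ v

# Fused formulation: scores and softmax stay in on-chip memory within a single kernel.
fused = F.scaled_dot_product_attention(q, k, v)

print((naive - fused).abs().max())  # small numerical difference, same result
```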


AI Agent: Agentic Retrieval-Augmented Generation (RAG)

My work on Agentic RAG significantly enhances Large Language Model (LLM) performance on complex information-seeking tasks. This project integrates intelligent AI agents into the RAG pipeline, yielding markedly more accurate, robust, and contextually rich responses than traditional RAG.

Brief Technique & Impact

I developed an AI agent (using the smolagents package) capable of dynamic decision-making, iterative query reformulation, and intelligent document evaluation. A key contribution is an optimized parallel processing pipeline for efficient FAISS-based vector database creation from technical documentation. This framework fundamentally improves LLM output grounding through advanced reasoning and self-correction.
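
A minimal sketch of the indexing idea, assuming faiss and sentence-transformers; the encoder model, chunking, and worker counts are illustrative rather than the project's exact pipeline:

```python
import faiss
import numpy as np
from concurrent.futures import ProcessPoolExecutor
from sentence_transformers import SentenceTransformer

def embed_batch(batch):
    # Each worker loads its own encoder; reloading per batch keeps the sketch simple.
    model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
    return model.encode(batch, normalize_embeddings=True)

def build_index(chunks, batch_size=256, workers=4):
    batches = [chunks[i:i + batch_size] for i in range(0, len(chunks), batch_size)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        embeddings = np.vstack(list(pool.map(embed_batch, batches)))
    index = faiss.IndexFlatIP(embeddings.shape[1])  # inner product == cosine for normalized vectors
    index.add(embeddings.astype(np.float32))
    return index

if __name__ == "__main__":
    # Placeholder chunks; in the project these come from parsed technical documentation.
    chunks = ["first documentation chunk ...", "second documentation chunk ..."]
    index = build_index(chunks, workers=2)
    print(index.ntotal)
```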

Performance Highlights

Evaluated on a technical Q&A dataset, Agentic RAG consistently demonstrated superior accuracy across various LLMs compared to both Standard RAG and standalone LLM performance:

| Model | Agentic RAG | Standard RAG | LLM Only |
| --- | --- | --- | --- |
| Gemini-1.5-flash | 91.5% | 85.4% | 35.4% |
| Gemini-2.0-flash | 90.8% | 85.4% | 64.1% |
| Gemini-2.5-flash | 90.8% | 86.2% | 63.8% |

All values above are accuracy scores (in %).

More details can be found in the project repository on GitHub.


AI Agent: Tool-Augmented Reasoning on GAIA Benchmark

I developed a tool-augmented AI code agent using the smolagents framework to tackle complex, agent-evaluating questions from the GAIA benchmark.
This system achieved a 40% correct answer rate—substantially outperforming GPT-4, which reached 14.4% under the same conditions.
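
As a rough sketch of the setup (class names follow the smolagents quickstart and may differ between versions; the question and model wrapper are placeholders):

```python
from smolagents import CodeAgent, DuckDuckGoSearchTool, HfApiModel

model = HfApiModel()  # model wrapper; the exact class name depends on the smolagents version
agent = CodeAgent(tools=[DuckDuckGoSearchTool()], model=model, add_base_tools=True)

# A GAIA-style question typically requires web search, tool use, and multi-step reasoning.
answer = agent.run("Placeholder for a GAIA benchmark question requiring multi-step research.")
print(answer)
```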

Note: This project is currently under active development to further improve accuracy and generalization.


Distilling DeepSeek R1 for Enhanced LLM Performance

This project showcases a successful methodology for significantly enhancing large language model performance through advanced fine-tuning in a distributed HPC environment.

Brief Technique & Impact

This work focused on post-training weaker LLMs by fine-tuning the Qwen2.5 model on high-quality data distilled from DeepSeek R1. Training employed Supervised Fine-Tuning (SFT) and Group Relative Policy Optimization (GRPO) across 8 H100 GPUs distributed over 2 HPC nodes; a key aspect of the project was learning, optimizing, and simplifying the deployment of this training workflow for common HPC setups.
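
A minimal sketch of the SFT stage with Hugging Face TRL; the dataset path is hypothetical (a corpus of reasoning traces distilled from DeepSeek R1), and the launch command in the comment illustrates the 2-node, 8-GPU setup:

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Launch across nodes with e.g.: torchrun --nnodes 2 --nproc_per_node 8 sft_distill.py
# The file below is a placeholder for distilled R1 reasoning traces with a "text" column.
dataset = load_dataset("json", data_files="r1_distilled_traces.jsonl", split="train")

args = SFTConfig(
    output_dir="qwen2.5-math-7b-r1-sft",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    bf16=True,
    num_train_epochs=1,
)

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-Math-7B-Instruct",
    args=args,
    train_dataset=dataset,
)
trainer.train()
# A GRPO stage can follow with trl's GRPOTrainer, using rule-based rewards
# (e.g., exact-match checks on final answers) instead of a learned reward model.
```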

Performance Highlights

The rigorous fine-tuning process yielded substantial gains, boosting the Qwen2.5-Math-7B-Instruct model’s pass@1 accuracy on the AIME 2024 benchmark from 13.3% to a remarkable 56.7%, and on GPQA Diamond from 28.3% to 54.5%. This demonstrates the effectiveness of the distilled data approach in bringing weaker LLMs closer to DeepSeek R1’s performance.

| Model | AIME 2024 pass@1 | MATH-500 pass@1 | GPQA Diamond pass@1 |
| --- | --- | --- | --- |
| Qwen2.5-Math-7B-Instruct (Original) | 13.3 | 80.2 | 28.3 |
| Qwen2.5-Math-7B-Instruct (Fine-tuned on DeepSeek R1 distilled data) | 56.7 | 89.8 | 54.5 |
| DeepSeek-R1-Distill-Qwen-7B (Teacher) | 53.3 | 93.2 | 53.0 |

More details can be found in the project repository on GitHub.


Fine-Tuning Llama 3 for Sentiment Analysis

This project fine-tunes Llama 3.1–8B Instruct to perform sentiment classification on short-form text, such as tweets. The model learns to identify sentiments—positive, neutral, or negative—using the tweet_sentiment_extraction subset from the MTEB benchmark.

Technique & Impact

Using instruction-style prompts and a streamlined training pipeline, the model was fine-tuned to produce accurate, single-word sentiment predictions. This significantly improved performance and makes the model well-suited for real-world applications like social media monitoring and customer feedback analysis.
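
A sketch of the instruction-style formatting, assuming the dataset id mteb/tweet_sentiment_extraction on the Hugging Face Hub and the usual text/label columns (the label-to-word mapping should be verified against the dataset's label_text field):

```python
from datasets import load_dataset

ds = load_dataset("mteb/tweet_sentiment_extraction", split="train")

LABELS = {0: "negative", 1: "neutral", 2: "positive"}  # assumed mapping; check against label_text

def to_example(row):
    # Instruction-style prompt asking for a single-word answer, paired with the gold label.
    prompt = (
        "Classify the sentiment of the following tweet as positive, neutral, or negative. "
        "Answer with a single word.\n\n"
        f"Tweet: {row['text']}\nSentiment:"
    )
    return {"prompt": prompt, "completion": " " + LABELS[row["label"]]}

train_examples = ds.map(to_example)
print(train_examples[0]["prompt"], train_examples[0]["completion"])
```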

Performance Highlights

On the MTEB tweet sentiment test set, the fine-tuned model achieved a notable accuracy gain:

Accuracy on MTEB Tweet Sentiment Classification

| Model | Accuracy (%) |
| --- | --- |
| Llama 3.1–8B (zero-shot) | 63.41 |
| Llama 3.1–8B (fine-tuned) | 81.49 |

Radar plot showing model performance

More details can be found in the project repository on GitHub.


Fine-Tuning Stable Diffusion with LoRA

This project fine-tunes Stable Diffusion v2 (by Stability AI) using the Hugging Face Diffusers library to generate images in a customized visual style—in this case, the Naruto anime aesthetic.

Technique & Impact

The fine-tuning was performed using LoRA (Low-Rank Adaptation), which adapts pretrained diffusion models by inserting trainable low-rank matrices into existing weights, significantly reducing memory and compute requirements. The process was accelerated using 8× H100 GPUs across 2 HPC nodes, achieving a 77% reduction in training time compared to single-GPU training.

We used the lambdalabs/naruto-blip-captions dataset to teach the model the distinct visual characteristics of Naruto anime.
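
A minimal sketch of applying the trained adapter at inference time with diffusers; the LoRA weight path is a placeholder for the project's actual output directory:

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2", torch_dtype=torch.float16
).to("cuda")
pipe.load_lora_weights("path/to/naruto-lora")  # placeholder path to the fine-tuned LoRA weights

image = pipe(
    "a portrait in the style of Naruto anime, blue background",
    num_inference_steps=30,
).images[0]
image.save("naruto_style_sample.png")
```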

Dataset Examples

(Example images from the lambdalabs/naruto-blip-captions dataset)

Performance Highlights

Prompt:
A detailed portrait of Hello Kitty, rendered in the style of Naruto anime, with a blue background.

Output examples (images): base Stable Diffusion vs. two samples from the LoRA fine-tuned model.

The base model fails to capture Naruto’s stylistic elements, while the fine-tuned model successfully generates images in the correct anime style.

Note: This project is currently under active development to further improve accuracy and generalization.


Predicting Bike Traffic

I implemented Graph Attention Networks to predict bike traffic volume using social and environmental data. The models were trained separately on datasets from Dresden, Leipzig, and Hamburg. The following plot illustrates the results across the three cities:

(Prediction results for Dresden, Leipzig, and Hamburg)

This project won second place in the data science challenge at BTW 2023. More details are available on GitHub.
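
A minimal sketch of such a model, assuming PyTorch Geometric's GATConv; the feature dimension is a placeholder for the per-node social and environmental variables:

```python
import torch
import torch.nn.functional as F
from torch_geometric.nn import GATConv

class BikeTrafficGAT(torch.nn.Module):
    def __init__(self, in_dim=16, hidden=64, heads=4):
        super().__init__()
        self.gat1 = GATConv(in_dim, hidden, heads=heads)   # multi-head attention over the road graph
        self.gat2 = GATConv(hidden * heads, hidden, heads=1)
        self.out = torch.nn.Linear(hidden, 1)               # predicted traffic volume per node

    def forward(self, x, edge_index):
        x = F.elu(self.gat1(x, edge_index))
        x = F.elu(self.gat2(x, edge_index))
        return self.out(x).squeeze(-1)
```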


Speaker Identification

I developed a speaker identification system using Transformer and Conformer encoders, improving accuracy from 53.94% to 91.8% on a validation dataset of 56,666 voice recordings. More details are available on GitHub.
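
A rough sketch of a Conformer-based classifier, assuming torchaudio's Conformer encoder over log-mel features; all sizes are illustrative, not the project's actual configuration:

```python
import torch
import torchaudio

class SpeakerIdentifier(torch.nn.Module):
    def __init__(self, n_speakers, n_mels=80):
        super().__init__()
        self.encoder = torchaudio.models.Conformer(
            input_dim=n_mels, num_heads=4, ffn_dim=256,
            num_layers=6, depthwise_conv_kernel_size=31,
        )
        self.classifier = torch.nn.Linear(n_mels, n_speakers)

    def forward(self, mels, lengths):
        # mels: (batch, time, n_mels) log-mel features; lengths: valid frames per utterance
        encoded, _ = self.encoder(mels, lengths)
        pooled = encoded.mean(dim=1)      # mean-pool over time to get one embedding per utterance
        return self.classifier(pooled)    # logits over speaker identities
```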


Anime Face Generator

Using a dataset of approximately 71,000 anime face images, I trained a diffusion probabilistic model to generate anime-style portraits. The generative network improved significantly over training iterations, as shown in the images below:
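
A minimal sketch of one denoising-diffusion training step using diffusers building blocks; the resolution and hyperparameters are placeholders, not the project's exact setup:

```python
import torch
import torch.nn.functional as F
from diffusers import DDPMScheduler, UNet2DModel

model = UNet2DModel(sample_size=64, in_channels=3, out_channels=3)  # assumed 64x64 RGB face crops
scheduler = DDPMScheduler(num_train_timesteps=1000)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

def train_step(images):
    # images: (batch, 3, 64, 64) scaled to [-1, 1]
    noise = torch.randn_like(images)
    t = torch.randint(0, scheduler.config.num_train_timesteps, (images.shape[0],))
    noisy = scheduler.add_noise(images, noise, t)   # forward diffusion process
    pred = model(noisy, t).sample                   # network predicts the added noise
    loss = F.mse_loss(pred, noise)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

print(train_step(torch.randn(4, 3, 64, 64)))
```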

After 1,000 iterations (left) vs. 20,000 iterations (right):

(Generated samples after 1,000 and 20,000 iterations)

More details are available on GitHub.