I implemented PPO, GRPO, and DPO from scratch on the same model and compared them — the ranking completely reversed after hyperparameter tuning
Over the last couple of months I built a full LLM training pipeline from scratch in PyTorch: architecture, pretraining, SFT, reward modeling, and three post-training alignment methods. No pretrained weights, no alignment libraries.
I just published the final comparison study. The short version:
Phase 1 results (baseline hyperparameters, average reward on 16 fixed prompts): PPO +3.99, GRPO -0.12, DPO +2.40
Phase 5 results (after targeted tuning): DPO +4.15, SFT +4.13, GRPO +3.31, PPO +3.52
The Phase 1 winner became the Phase 5 loser. A few things I found interesting:
GRPO group collapse is real and diagnosable. With k=4, two of my 16 prompts had group std = 0, so no gradient flowed at all on those prompts. Increasing k to 8 and the generation temperature to 1.0 fixed it completely. The +3.43 improvement is the clearest causal result in the whole study.
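To make the collapse mechanism concrete, here is a minimal sketch of GRPO's group-relative advantage computation (an illustrative reconstruction, not the author's exact code): each sample's reward is normalized against the mean and std of its group, so a group where all k rewards tie produces all-zero advantages and therefore zero gradient for that prompt.

```python
import torch

def group_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """rewards: (num_prompts, k) -- k sampled completions per prompt."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    # When every sample in a group gets the same reward, std == 0 and the
    # normalized advantages are all zero: no learning signal for that prompt.
    return (rewards - mean) / (std + eps)

# Collapsed group: identical rewards -> zero advantages everywhere.
collapsed = torch.tensor([[2.0, 2.0, 2.0, 2.0]])
print(group_advantages(collapsed))  # all zeros
```

A larger k and a higher sampling temperature both make identical-reward groups less likely, which matches the fix described above.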
DPO reward margin explosion is a training signal, not a success metric. With β=0.1, the margin grew from ~1 to 599 by step 150, and the loss collapsed to zero by step 30. The model was overfitting each pair rather than learning a general preference. Increasing β to 0.3 slowed this down and produced actual negative margins at some steps, which sounds bad but is the loss function doing its job correctly.
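For readers unfamiliar with the margin term, here is the standard DPO loss (the form from the DPO paper; variable names are illustrative, not the author's code). The "reward margin" is the log-ratio gap between chosen and rejected responses, and because the loss is -logsigmoid(β·margin), a margin in the hundreds drives the loss to numerically zero, which is exactly the collapse described above.

```python
import torch
import torch.nn.functional as F

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Inputs are summed log-probs of each response under the trained policy
    (pi_*) and the frozen reference model (ref_*)."""
    # Implicit reward margin: how much more the policy prefers the chosen
    # response than the reference does, minus the same for the rejected one.
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    loss = -F.logsigmoid(beta * margin)
    return loss.mean(), margin.mean()

# A margin of 599 at beta=0.1 gives -logsigmoid(59.9): effectively zero loss,
# so gradient updates keep pushing the already-separated pair further apart.
```

A larger β scales the margin inside the sigmoid, so the loss saturates sooner and penalizes pairs the policy gets wrong (negative margin) more sharply.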
PPO over-correction goes in both directions. kl_coef=0.01 was too weak (forgetting SFT-strong prompts), kl_coef=0.1 was too strong (over-constraining the policy). The optimal value is somewhere between them.
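The knob in question is the KL penalty against the reference policy. A common shaping for PPO-style RLHF (assumed here; names are illustrative, not the author's implementation) subtracts kl_coef times a per-token KL estimate from the reward, with the reward-model score added at the final token. Too small a coefficient lets the policy drift from the SFT model; too large pins it there.

```python
import torch

def shaped_rewards(rm_reward: float, logp_policy: torch.Tensor,
                   logp_ref: torch.Tensor, kl_coef: float) -> torch.Tensor:
    """logp_*: (seq_len,) log-probs of the sampled tokens under each model."""
    # Per-token KL estimate between policy and reference on this trajectory.
    kl = logp_policy - logp_ref
    shaped = -kl_coef * kl          # penalize drift from the reference
    shaped = shaped.clone()
    shaped[-1] += rm_reward         # reward-model score on the final token
    return shaped
```

With kl_coef=0.01 the penalty is almost invisible next to the RM score (drift, forgetting); with kl_coef=0.1 it can dominate the signal (over-constraint), which is why the sweet spot sits between them.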
Evaluation temperature matters independently of training. SFT improved by +1.12 with zero retraining just by changing from temperature=0.7 to temperature=0.3. Phase 1 underestimated SFT's ceiling.
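The mechanism is just logit rescaling before the softmax, which is worth seeing once: a lower temperature concentrates probability mass on the top token, so greedy-ish decoding at T=0.3 surfaces the model's best answer more reliably than T=0.7 (toy numbers below, not from the study).

```python
import torch

logits = torch.tensor([2.0, 1.0, 0.5])
for temperature in (0.7, 0.3):
    probs = torch.softmax(logits / temperature, dim=-1)
    print(f"T={temperature}: top-token prob = {probs[0].item():.3f}")
```

Nothing about the model changes; only the sampling distribution sharpens, which is why comparing methods at a single fixed temperature can understate a model's ceiling.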
Full write-up with training curves, comparison tables, per-prompt delta heatmap, and DPO/GRPO training dynamics: brayanbrayan.github.io/2026/04/02/rlhf-post-blog.html
I'm a self-taught ML engineer based in Nairobi actively looking for research or engineering roles in alignment and RL. If anything here resonates with what your team works on, feel free to reach out.