r/reinforcementlearning

I implemented PPO, GRPO, and DPO from scratch on the same model and compared them: the ranking completely reversed after hyperparameter tuning

Over the last couple of months I built a full LLM training pipeline from scratch in PyTorch: architecture, pretraining, SFT, reward modeling, and three post-training alignment methods. No pretrained weights, no alignment libraries.

I just published the final comparison study. The short version:

Phase 1 results (baseline hyperparameters): PPO: +3.99 · GRPO: -0.12 · DPO: +2.40 (average reward on 16 fixed prompts)

Phase 5 results (after targeted tuning): DPO: +4.15 · SFT: +4.13 · GRPO: +3.31 · PPO: +3.52

The Phase 1 winner became the Phase 5 loser. A few things I found interesting:

GRPO group collapse is real and diagnosable. With k=4, two of my 16 prompts had group std=0, so no gradient flowed at all on those prompts. Increasing k to 8 and generation temperature to 1.0 fixed it completely. The +3.43 improvement is the clearest causal result in the whole study.
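For anyone unfamiliar with the collapse mode, it falls straight out of the group normalization. A minimal sketch (function name and the plain-Python shape are mine, not the exact pipeline code):

```python
from statistics import mean, pstdev

def grpo_advantages(group_rewards, eps=1e-8):
    """Group-relative advantages for one prompt's k sampled completions:
    center by the group mean and scale by the group std."""
    mu = mean(group_rewards)
    sigma = pstdev(group_rewards)
    # If all k completions score the same, sigma == 0 and every
    # advantage is exactly 0: the prompt contributes no gradient at all.
    return [(r - mu) / (sigma + eps) for r in group_rewards]

# k=4 at low temperature: identical rewards -> zero advantages everywhere
print(grpo_advantages([1.0, 1.0, 1.0, 1.0]))  # [0.0, 0.0, 0.0, 0.0]

# k=8 at temperature 1.0: varied rewards -> nonzero learning signal
print(grpo_advantages([0.2, 0.9, 0.4, 0.7, 0.1, 0.8, 0.5, 0.6]))
```

Larger k and a hotter sampler both make an all-identical reward group less likely, which is why the two changes together eliminated the dead prompts.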

DPO reward margin explosion is a training signal, not a success metric. With β=0.1, the margin grew from ~1 to 599 by step 150. Loss collapsed to zero by step 30. The model was overfitting each pair rather than learning a general preference. Increasing β to 0.3 slowed this down and produced actual negative margins at some steps, which sounds bad but is the loss function doing its job correctly.
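The margin dynamics fall directly out of the DPO loss. A minimal per-pair sketch (helper name and the example numbers are mine, for illustration only):

```python
import math

def dpo_loss(pi_logp_w, pi_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one preference pair. The implicit rewards are
    beta-scaled log-ratios of the policy against the reference."""
    reward_w = beta * (pi_logp_w - ref_logp_w)  # chosen response
    reward_l = beta * (pi_logp_l - ref_logp_l)  # rejected response
    margin = reward_w - reward_l
    # -log sigmoid(margin): once the margin is large the loss is ~0,
    # and the only way to keep decreasing it is to inflate the margin.
    loss = -math.log(1.0 / (1.0 + math.exp(-margin)))
    return loss, margin

# Exploded-margin regime: loss is numerically ~0, gradient vanishes
loss, margin = dpo_loss(-10.0, -500.0, -12.0, -14.0, beta=0.1)

# Negative margin: the loss is large and positive, so gradient still
# flows; this is the "sounds bad but is correct" behavior at beta=0.3
loss2, margin2 = dpo_loss(-12.0, -10.0, -11.0, -11.0, beta=0.3)
```

Note that for the same log-ratio drift, a larger β produces a larger margin, so the chosen/rejected gap saturates with less movement away from the reference; that is the sense in which β=0.3 constrains the overfitting.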

PPO over-correction goes in both directions. kl_coef=0.01 was too weak (forgetting SFT-strong prompts), kl_coef=0.1 was too strong (over-constraining the policy). The optimal value is somewhere between them.
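The mechanism is the standard KL-shaped reward in RLHF-style PPO; a toy sketch (the function name and the numbers are illustrative, not from my runs):

```python
def kl_shaped_reward(task_reward, logp_policy, logp_ref, kl_coef):
    """Reward-model score minus a KL penalty against the SFT reference.
    kl_coef trades task reward against staying close to the SFT policy."""
    kl = logp_policy - logp_ref  # sample-based KL estimate
    return task_reward - kl_coef * kl

# Same policy drift under three coefficients: at 0.01 the penalty is
# negligible (SFT behaviors get forgotten), at 0.1 it eats half the
# reward (policy is over-constrained); the optimum sits between them.
drift = 5.0
for coef in (0.01, 0.05, 0.1):
    print(coef, kl_shaped_reward(1.0, drift, 0.0, coef))
```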

Evaluation temperature matters independently of training. SFT improved by +1.12 with zero retraining, just by changing the sampling temperature from 0.7 to 0.3. Phase 1 underestimated SFT's ceiling.
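The effect is plain from temperature-scaled softmax sampling; a toy sketch (the logits are made up):

```python
import math

def softmax_with_temperature(logits, temperature):
    """Next-token sampling distribution at a given temperature.
    Lower temperature concentrates mass on the top token."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

logits = [2.0, 1.0, 0.5]
p_hot = softmax_with_temperature(logits, 0.7)
p_cool = softmax_with_temperature(logits, 0.3)
# At T=0.3 the top token takes much more probability mass than at
# T=0.7, so a well-tuned SFT model makes fewer sampling slips at eval.
```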

Full write-up with training curves, comparison tables, per-prompt delta heatmap, and DPO/GRPO training dynamics: brayanbrayan.github.io/2026/04/02/rlhf-post-blog.html

I'm a self-taught ML engineer based in Nairobi actively looking for research or engineering roles in alignment and RL. If anything here resonates with what your team works on, feel free to reach out.

reddit.com
u/Public_Expression_92 — 5 hours ago

Best models to tune with GRPO for my use case?

I'm working on a project where I'll be fine-tuning LLMs with GRPO on a 170K-sample dataset for explainable LJP (legal judgment prediction, where the model predicts case outcomes and generates step-by-step reasoning citing the facts). I'm considering models like GPT OSS 20B or Qwen 3.5 27B, with a slight preference for Qwen 3.5 27B because of its strong reasoning capabilities.

I recently obtained a 96GB VRAM workstation (RTX PRO 6000) to handle the RL rollouts, which should give some solid headroom for larger models.

What are your recommendations for the best open-source models for GRPO fine-tuning in 2026? Any advice on structuring explainable LJP rewards would also be appreciated.
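For concreteness, here is the rough reward shape I have in mind, in case it helps anchor advice (the format markers, weights, and names are placeholders I invented, not a working design):

```python
import re

def ljp_reward(completion: str, gold_outcome: str, fact_ids: set) -> float:
    """Composite GRPO reward for explainable legal judgment prediction:
    outcome correctness plus checks that the reasoning cites real facts.
    All weights and the expected output format are illustrative."""
    reward = 0.0
    # 1) Verifiable outcome: does the final verdict match the gold label?
    m = re.search(r"VERDICT:\s*(\w+)", completion)
    if m and m.group(1).lower() == gold_outcome.lower():
        reward += 1.0
    # 2) Grounding: fraction of cited fact IDs present in the case record
    cited = set(re.findall(r"\[FACT-(\w+)\]", completion))
    if cited:
        reward += 0.5 * len(cited & fact_ids) / len(cited)
    # 3) Format: require step-by-step reasoning before the verdict line
    if "VERDICT:" in completion and completion.index("VERDICT:") > 50:
        reward += 0.25
    return reward

sample = ("Step 1: the contract [FACT-3] was breached. "
          "Step 2: damages follow from [FACT-7]. VERDICT: plaintiff")
print(ljp_reward(sample, "plaintiff", {"3", "7"}))  # 1.75
```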

Thanks!

u/Extra-Campaign7281 — 2 hours ago
979,200 evaluation episodes measuring RL behavioral stability - reward explains 3.7% of stability variance [results + code]

Hi everyone. Sharing the complete results from ARCUS-H, a post-hoc evaluation harness that measures behavioral stability of trained RL policies under structured stress.

What ARCUS-H does

Three-phase protocol (pre/shock/post) applied to any SB3 policy. Eight stressors across three failure axes:

  • Perception: CD (concept drift) · ON (obs noise) · SB (sensor blackout)
  • Execution: RC (reward compression) · TV (actuator corruption)
  • Feedback: VI (reward inversion) · RN (reward noise)

Five channels: Competence · Policy Consistency · Temporal Stability · Observation Reliability · Action Entropy Divergence

No retraining. No model internals.

Scale

51 (env, algo) pairs · 12 environments · 8 algorithms · 8 stressors · 10 seeds · 979,200 evaluation episodes
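For anyone checking the headline number, the totals decompose cleanly (the per-cell episode count below is inferred from the totals, not a separately stated figure):

```python
# Sanity-check of the 979,200-episode headline.
pairs, stressors, seeds = 51, 8, 10
cells = pairs * stressors * seeds      # (env, algo, stressor, seed) cells
episodes_per_cell = 979_200 // cells   # episodes per cell, i.e. split
                                       # across the pre/shock/post phases
assert cells * episodes_per_cell == 979_200
print(cells, episodes_per_cell)        # 4080 240
```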

https://preview.redd.it/6n24vpbv42tg1.png?width=1737&format=png&auto=webp&s=82b9d9d31e78587a9e422a35ec8b646a3311b2d0

Finding 1: r = +0.240 [0.111, 0.354]

This is the primary number (env stressors only, VI/RN excluded). compare.py also outputs r = +0.311 for all 8 stressors, but that number is inflated by circularity: VI and RN corrupt the reward signal, which accounts for 15% of the ARCUS score formula. Don't cite the all-stressor figure as the main result.

Spearman r = +0.180. R² = 0.057.

Earlier pilot on 47 pairs: r = 0.286 [0.149, 0.411]. The decrease to 0.240 reflects adding SpaceInvaders and Walker2d. The CI narrowed by 69%. The full evaluation is more reliable and more diverse.

Finding 2: SAC 92.5% vs TD3 61.0% under observation noise

Replicated across 51 pairs and 10 seeds.

Finding 3: Pong 41.9% vs SpaceInvaders 13.0% under obs noise

Same CNN. Same wrapper. Representation structure, not architecture.

Finding 4: Walker2d-v4 (new)

FPR = 0.053. MuJoCo fragility confirmed on a third locomotion env.

Code and data

https://github.com/karimzn00/ARCUSH

u/Less_Conclusion9066 — 20 hours ago
Week