Best models to tune with GRPO for my use case?
I'm working on a project where I'll be fine-tuning LLMs with GRPO on a 170K-sample dataset for explainable LJP (legal judgment prediction, where the model predicts case outcomes and generates step-by-step reasoning citing the facts). I'm considering models like GPT OSS 20B or Qwen 3.5 27B, with a slight preference for Qwen 3.5 27B because of its strong reasoning capabilities.
I recently obtained a 96GB VRAM workstation (RTX PRO 6000) to handle the RL rollouts, which should give some solid headroom for larger models.
What are your recommendations for the best open-source models for GRPO fine-tuning in 2026? Any advice on structuring explainable LJP rewards would also be appreciated.
Thanks!