u/EmbarrassedBeach1069

r/CLine

Running Cline/OpenHands on-prem with 4×RTX 3090: 30B vs 70–80B, vLLM/SGLang, SaaS cost reduction?

I’m evaluating an on-prem coding-agent setup to reduce our Claude/GPT API spend.

Hardware target: 4×RTX 3090, 96GB total VRAM, Linux, likely vLLM or SGLang exposing an OpenAI-compatible endpoint.
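For context, here is the back-of-envelope weight-footprint math I'm working from. This is pure arithmetic: nominal parameter counts, and it ignores KV cache, activations, and framework overhead, so real headroom is smaller than it suggests.

```python
# Rough weight footprint per model/quant combo (ignores KV cache,
# activations, and framework overhead, so real headroom is smaller).

def weight_gib(params_b: float, bits_per_weight: float) -> float:
    """GiB needed just to hold the weights."""
    return params_b * 1e9 * bits_per_weight / 8 / 2**30

for name, params_b, bits in [
    ("Qwen3-Coder-30B-A3B @ 4-bit", 30, 4),
    ("Qwen3-Coder-30B-A3B @ 8-bit", 30, 8),
    ("80B-A3B MoE @ 4-bit",         80, 4),
    ("70B dense @ FP16",            70, 16),
]:
    gib = weight_gib(params_b, bits)
    fits = "fits" if gib < 96 else "does NOT fit"
    print(f"{name}: ~{gib:.0f} GiB -> {fits} in 96 GiB total VRAM")
```

By this math a 70B dense model only fits quantized, while the 30B even at 8-bit leaves most of the 96 GiB for cache — which is part of why I'm asking whether 30B underuses the hardware.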

Tools: Cline and OpenHands.

Questions:

  1. Is Qwen3-Coder-30B-A3B enough as a daily driver, or is it underusing the hardware?
  2. Has anyone run Qwen3-Coder-Next 80B-A3B, Llama/Qwen 70B-class, or similar models on 4×3090 for coding agents?
  3. What quantization actually works well for tool-use and long-horizon coding tasks: FP8, AWQ, GPTQ, Q8, Q4?
  4. What context length is realistic before throughput collapses?
  5. Has this meaningfully reduced your Claude/OpenAI spend, or do you still need cloud fallback for hard tasks?
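On question 4, here is my rough KV-cache math. The layer count, KV-head count, and head dimension below are my assumptions for a Qwen3-30B-A3B-class GQA config — check the model's `config.json` before trusting them:

```python
# Per-token KV-cache cost: 2 tensors (K and V) per layer, each holding
# kv_heads * head_dim values at the cache dtype width.
# Architecture numbers are assumptions, not verified against config.json.

LAYERS, KV_HEADS, HEAD_DIM = 48, 4, 128  # assumed Qwen3-30B-A3B-ish GQA config
DTYPE_BYTES = 2                          # FP16/BF16 cache

def kv_bytes_per_token() -> int:
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * DTYPE_BYTES

def kv_gib(context_len: int) -> float:
    return kv_bytes_per_token() * context_len / 2**30

print(kv_bytes_per_token())  # -> 98304 bytes (~96 KiB) per token of context
print(kv_gib(65536))         # -> 6.0 GiB for one 64K-token sequence
```

Under those assumptions a single 64K sequence costs ~6 GiB of cache on top of the weights, so with 4-bit weights there's room for a handful of concurrent long contexts before the scheduler starts preempting — which is where I'd expect throughput to collapse.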

I’m especially interested in real-world results: tokens/s, accepted PRs/tasks, failure modes, model configs, and whether OpenHands/Cline behave reliably with local endpoints.
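To make the cost question concrete, this is the break-even arithmetic I'd apply to any reported numbers. Throughput, power draw, electricity price, and the API price are all placeholder assumptions — substitute your own measurements:

```python
# Electricity-only cost of generating 1M tokens locally vs. a paid API.
# All inputs are placeholder assumptions -- plug in measured values.

def local_cost_per_mtok(tok_per_s: float, kw_draw: float, usd_per_kwh: float) -> float:
    """USD of electricity to generate 1M tokens at a given throughput."""
    hours = 1e6 / tok_per_s / 3600
    return hours * kw_draw * usd_per_kwh

local = local_cost_per_mtok(tok_per_s=50, kw_draw=1.2, usd_per_kwh=0.15)
api = 15.0  # assumed $/M output tokens for a frontier API model
print(f"local ~${local:.2f}/Mtok vs API ~${api:.2f}/Mtok")
```

Note this ignores hardware amortization and the time spent babysitting failure modes, which is exactly why I'm asking for real accepted-PR numbers and not just tokens/s.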

u/EmbarrassedBeach1069 — 3 days ago