▲ 5 r/CLine
Running Cline/OpenHands on-prem with 4×RTX 3090: 30B vs 70–80B, vLLM/SGLang, SaaS cost reduction?
I’m evaluating an on-prem coding-agent setup to reduce our Claude/GPT SaaS/API spend.
Hardware target: 4×RTX 3090, 96GB total VRAM, Linux, likely vLLM or SGLang exposing an OpenAI-compatible endpoint.
Tools: Cline and OpenHands.
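For anyone replicating the setup above, a minimal launch sketch with vLLM (the model ID, quantization choice, and flag values are illustrative, not a recommendation; an AWQ run assumes an AWQ-quantized checkpoint exists for the model):

```shell
# Sketch: vLLM exposing an OpenAI-compatible endpoint across 4 GPUs.
# Model/quantization/context values below are placeholders to adapt.
vllm serve Qwen/Qwen3-Coder-30B-A3B-Instruct \
  --tensor-parallel-size 4 \
  --quantization awq \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.92 \
  --port 8000
# Then point Cline/OpenHands at http://localhost:8000/v1
# (any non-empty API key string works for a local server).
```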
Questions:
- Is Qwen3-Coder-30B-A3B enough as a daily driver, or is it underusing the hardware?
- Has anyone run Qwen3-Coder-Next 80B-A3B, Llama/Qwen 70B-class, or similar models on 4×3090 for coding agents?
- What quantization actually works well for tool-use and long-horizon coding tasks: FP8, AWQ, GPTQ, Q8, Q4?
- What context length is realistic before throughput collapses?
- Has this meaningfully reduced your Claude/OpenAI spend, or do you still need cloud fallback for hard tasks?
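For the quantization and context-length questions, a back-of-envelope VRAM budget is a useful starting point. The sketch below uses the standard weight-size and KV-cache formulas; the layer/head/dim numbers are assumed placeholders for a generic 30B-class GQA model, not Qwen3-Coder's actual architecture:

```python
def weight_gib(params_b: float, bits: int) -> float:
    """Approximate weight memory in GiB for a parameter count (in billions)
    at a given quantization width."""
    return params_b * 1e9 * bits / 8 / 1024**3

def kv_cache_gib(layers: int, kv_heads: int, head_dim: int,
                 tokens: int, bytes_per_elem: int = 2) -> float:
    """KV-cache size in GiB: K and V vectors per token per layer,
    FP16 elements by default."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens / 1024**3

# Hypothetical 30B-class GQA model (48 layers, 4 KV heads, head_dim 128
# are assumptions, not Qwen3-Coder's real config):
weights = weight_gib(30, 4)           # roughly 14 GiB at 4-bit
kv = kv_cache_gib(48, 4, 128, 65536)  # roughly 6 GiB for one 64k-token sequence
print(f"weights ~ {weights:.1f} GiB, KV ~ {kv:.1f} GiB of 96 GiB total")
```

Note that vLLM preallocates the KV-cache pool up to `--gpu-memory-utilization`, and concurrent agent sessions each hold their own KV cache, so usable context in practice is lower than this single-sequence estimate.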
I’m especially interested in real-world results: tokens/s, accepted PRs/tasks, failure modes, model configs, and whether OpenHands/Cline behave reliably with local endpoints.
u/EmbarrassedBeach1069 — 3 days ago