u/FroyoEducational4851

▲ 2 r/LocalLLM (+1 crosspost)

I don’t mind the model taking time to respond, but seeing the whole thinking/reasoning process on screen gets distracting really fast.

Is there a clean way to hide it while still letting the model think normally in the background?
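
The workaround I've been sketching is to filter the tags client-side. A rough sketch, assuming a default Ollama server on localhost:11434 and a model that emits literal <think> tags (DeepSeek-R1 / Qwen3 style); newer Ollama builds also expose a native think toggle in the API, but this doesn't depend on it:

```python
# Rough sketch: stream from a local Ollama server but swallow the
# <think>...</think> block so only the final answer hits the terminal.
# Assumptions (mine): default server at localhost:11434, and a model
# that wraps reasoning in literal <think> tags.
import json
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL = "qwen3:8b"  # placeholder tag; substitute whatever you run

def stream_without_thinking(prompt: str) -> None:
    resp = requests.post(
        OLLAMA_URL,
        json={"model": MODEL, "prompt": prompt, "stream": True},
        stream=True,
        timeout=600,
    )
    resp.raise_for_status()

    buffer = ""       # text not yet classified as answer vs. reasoning
    in_think = False  # currently inside <think>...</think>?

    for line in resp.iter_lines():
        if not line:
            continue
        chunk = json.loads(line)  # Ollama streams one JSON object per line
        buffer += chunk.get("response", "")

        while True:
            if in_think:
                end = buffer.find("</think>")
                if end == -1:
                    buffer = buffer[-7:]  # keep a tail in case a tag splits
                    break
                buffer = buffer[end + len("</think>"):]
                in_think = False
            else:
                start = buffer.find("<think>")
                if start == -1:
                    safe = buffer[:-6]  # hold back a possible partial tag
                    print(safe, end="", flush=True)
                    buffer = buffer[len(safe):]
                    break
                print(buffer[:start], end="", flush=True)
                buffer = buffer[start + len("<think>"):]
                in_think = True

        if chunk.get("done"):
            if not in_think:
                print(buffer, end="", flush=True)
            break
    print()

if __name__ == "__main__":
    stream_without_thinking("Why is the sky blue?")
```

The model still does its full reasoning pass; you just never render it.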

u/FroyoEducational4851 — 8 days ago
▲ 2 r/ollama

I’ve been testing local models for retrieval-augmented generation (RAG), document querying, and structured outputs, but I keep running into tradeoffs between reasoning quality, context handling, schema reliability, and hardware efficiency.

So far I’ve tried Gemma, Minimax, and a bit of Command-R, and I’m now looking into Qwen and LFM2. Gemma felt solid overall, but schema outputs became inconsistent under heavier workloads. Minimax felt weaker than I expected, though that might’ve been my setup.
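
For the schema part specifically, constrained decoding has been more reliable for me than prompting. A minimal sketch using Ollama's structured outputs (assuming Ollama 0.5+ on the default port; the model tag and schema are placeholders of mine):

```python
# Minimal sketch of constrained decoding via Ollama's structured outputs:
# /api/chat accepts a JSON Schema in the "format" field and the server
# constrains sampling to match it.
import json
import requests

schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "tags": {"type": "array", "items": {"type": "string"}},
        "confidence": {"type": "number"},
    },
    "required": ["title", "tags", "confidence"],
}

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "qwen3:8b",  # placeholder; substitute your model
        "messages": [{
            "role": "user",
            "content": "Give a title, topic tags, and a confidence score "
                       "for this note: quarterly sales rose 12% on new "
                       "enterprise contracts.",
        }],
        "format": schema,               # decode against the schema
        "stream": False,
        "options": {"temperature": 0},  # determinism helps schema stability
    },
    timeout=300,
)
resp.raise_for_status()
data = json.loads(resp.json()["message"]["content"])  # schema-shaped JSON
print(data)
```

Since the sampler can't emit off-schema tokens, the under-load inconsistency mostly disappears; the field values can still be wrong, though, so downstream validation is still worth keeping.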

Curious what models people are actually sticking with for serious local workflows.

u/FroyoEducational4851 — 9 days ago

I’m trying to run a local setup for retrieval-augmented generation and some machine learning work.

Curious what models people are actually using right now and how they’re performing.

u/FroyoEducational4851 — 9 days ago
▲ 0 r/ollama

Tried a longer run with Ollama and got:

  • Model: Qwen3-Coder 30B
  • ~40k tokens in ~14 min
  • ~48 tok/sec (pretty stable)

System:

  • RAM ~23GB (almost full)
  • Swap ~1.5–2GB
  • CPU ~200%+, GPU ~70%

Feels solid, but not sure if this is expected or if I’m hitting the ceiling.

Anyone getting better numbers on similar hardware? Any Ollama tweaks worth trying?
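
For comparing numbers precisely, `ollama run --verbose` prints exact rates, and the same stats come back on the final /api/generate object. A small sketch, assuming the default server and a placeholder model tag:

```python
# Quick sketch for exact numbers instead of eyeballed ones: the final
# /api/generate object carries token counts and durations (nanoseconds),
# the same stats `ollama run --verbose` prints.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "qwen3-coder:30b",  # placeholder; whatever tag you pulled
        "prompt": "Write a quicksort in Python.",
        "stream": False,
    },
    timeout=1800,
)
resp.raise_for_status()
s = resp.json()

NS = 1e9  # durations are reported in nanoseconds
print(f"prompt: {s['prompt_eval_count']} tok @ "
      f"{s['prompt_eval_count'] / (s['prompt_eval_duration'] / NS):.1f} tok/s")
print(f"output: {s['eval_count']} tok @ "
      f"{s['eval_count'] / (s['eval_duration'] / NS):.1f} tok/s")
print(f"load  : {s['load_duration'] / NS:.2f} s")
print(f"total : {s['total_duration'] / NS:.2f} s")
```

Also worth a look: `ollama ps` reports the CPU/GPU split for the loaded model; with RAM nearly full and CPU pinned above 200%, some layers are probably running on the CPU, which would explain the ~70% GPU utilization.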

u/FroyoEducational4851 — 12 days ago