u/Clean_Initial_9618

Qwen 27B + Hermes Agent: Preserve Thinking On or Off? + Sampling Tips

I’ve been messing around with a local setup and wanted to sanity check a few things with people who’ve gone deeper into this.

Right now I’m running Qwen 3.6 27B (Q5, MTP) on a 3090, hooked up to a Hermes agent setup. Mostly just experimenting, testing tool use and seeing how far I can push it.

One thing I’m not fully clear on:

For agentic workflows (Hermes or similar), do you guys usually keep “preserve thinking” on or off?

I’ve tried both, and it feels like:

  • On → better reasoning sometimes, but can get stuck in loops or overthink tool calls
  • Off → more direct, but occasionally dumber decisions

Not sure what the general consensus is here.

Also curious what sampling settings people are using for agents specifically. I’m trying to reduce:

  • repeated / looping tool calls
  • over-calling tools when it’s not needed
  • getting stuck in “thinking → tool → thinking → tool” cycles

Would really appreciate if you could share:

  • your go-to params (temp, top_p, repeat penalty, etc.)
  • any tricks to make tool usage more stable
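For reference, this is roughly the payload my agent sends right now to llama-server's OpenAI-compatible chat endpoint. The values are guesses I'm still iterating on, not recommendations, and the model name is just a placeholder:

```python
# Current request payload for llama-server's OpenAI-compatible
# /v1/chat/completions endpoint. Values are my working guesses;
# "repeat_penalty" is a llama.cpp sampler field passed alongside
# the standard OpenAI parameters.
payload = {
    "model": "qwen3.6-27b",  # placeholder; adjust to whatever your server reports
    "messages": [
        {"role": "user", "content": "Check new posts in r/LocalLLaMA"},
    ],
    "temperature": 0.3,      # low temp to keep tool-call JSON stable
    "top_p": 0.9,
    "repeat_penalty": 1.05,  # mild penalty to discourage looping tool calls
    "max_tokens": 1024,
}
```

Happy to hear if anyone runs agents hotter (or colder) than this.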

I'm still early in testing, so I'm open to completely rethinking the setup if needed. Thanks a lot for your time and advice.

reddit.com
u/Clean_Initial_9618 — 2 days ago

Hey everyone,

I’ve been experimenting with running Qwen models locally on my setup:

GPU: RTX 3090 (24GB VRAM)

RAM: 64GB

CPU: Ryzen 5700X

OS: Windows 11

What I’m currently running

Qwen 3.6 35B (UD Q4_K_M)

llama-server.exe -m "C:\Users\Dino\.lmstudio\models\unsloth\Qwen3.6-35B-A3B-GGUF\Qwen3.6-35B-A3B-UD-Q4_K_M.gguf" -ngl 99 -c 131072 -np 2 -fa on -ctk f16 -ctv f16 -b 2048 -ub 512 -t 8 --mlock -rea on --reasoning-budget 2048 --reasoning-format deepseek --jinja --metrics --slots --port 8081 --host 0.0.0.0

Qwen 3.6 27B (UD Q4_K_XL)

llama-server.exe -m "C:\Users\Dino\.lmstudio\models\unsloth\Qwen3.6-27B-GGUF\Qwen3.6-27B-UD-Q4_K_XL.gguf" -ngl 99 -c 196608 -np 1 -fa on -ctk q8_0 -ctv q8_0 -b 2048 -ub 512 -t 8 --no-mmap -rea on --reasoning-budget -1 --reasoning-format deepseek --jinja --metrics --slots --port 8081 --host 0.0.0.0
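One experiment I'm planning for the throughput question: the same 27B command with the context cut way down, to see how much the 196K KV cache is costing me. Only -c is changed from the command above:

```shell
llama-server.exe -m "C:\Users\Dino\.lmstudio\models\unsloth\Qwen3.6-27B-GGUF\Qwen3.6-27B-UD-Q4_K_XL.gguf" -ngl 99 -c 32768 -np 1 -fa on -ctk q8_0 -ctv q8_0 -b 2048 -ub 512 -t 8 --no-mmap -rea on --reasoning-budget -1 --reasoning-format deepseek --jinja --metrics --slots --port 8081 --host 0.0.0.0
```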

My use case

  • Hermes agent (on Raspberry Pi 5) → Reddit scraping, job scraping, basic automation
  • Local coding (OpenCode / QwenCode) → small scripts, debugging, patching
  • Occasional infra setup via prompts

Issues I’m facing

  • 35B is too slow
    • Even simple tasks take way too long to respond. Feels unusable for anything iterative.
  • 27B is faster but unreliable
    • Code often breaks
    • Takes 20–30 mins even for simple tasks sometimes

What I’m looking for

  1. Better model + quant recommendations
    • Something that actually works well on a 3090
    • Good balance between speed + coding reliability
  2. Ways to improve throughput (t/s)
    • Are my flags bad?
    • Context size too high?
    • Anything obvious I’m missing?
  3. Auto model loading / routing (right now I have to kill the server, paste a new command, and reload the model). Is there a way to:
    • Auto-switch models based on the request?
    • Or keep multiple models warm and route between them?
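To make the routing question concrete, here's the naive version I sketched (the second port and the keyword list are made up; I'm hoping something like llama-swap makes this unnecessary):

```python
# Naive sketch of request-based routing between two llama-server instances.
# Assumes the 27B on port 8081 and a smaller/faster model on port 8082;
# both ports and the keyword heuristic are placeholders, not a real setup.

CODING_HINTS = ("def ", "class ", "traceback", "compile", "patch")

def pick_backend(prompt: str) -> str:
    """Route coding-looking prompts to the stronger model,
    everything else to the faster one."""
    lowered = prompt.lower()
    if any(hint in lowered for hint in CODING_HINTS):
        return "http://127.0.0.1:8081/v1"  # 27B: slower, better at code
    return "http://127.0.0.1:8082/v1"      # small model: fast replies
```

Obviously this keeps both models loaded, which my 24GB can't really afford, hence the question.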

What’s your stack?

Thanks in advance for any suggestions or help. Really appreciate it.

reddit.com
u/Clean_Initial_9618 — 9 days ago
