u/Electronic-Fly-6465

▲ 74 r/Qwen_AI

Local models are no longer “toy” versions of frontier models. They are becoming serious operators

The Qwen team deserves a massive round of applause.

I’ve been running Qwen 3.5 locally for agentic software work, and recently moved to Qwen 3.6. This generation is unreal.

Not “good for a local model”.

Not “impressive considering the size”.

Actually good.

I’m running it through my own agent harness, Link, on a vLLM server. The current setup is on an RTX 6000-class workstation GPU, so no, this is not a normal gaming-PC setup. I’m seeing roughly 180 tokens/sec with multiple sequences and very little waiting.

But the impressive part is not just the hardware.

The impressive part is that a 35B-class local model is now stable enough to operate inside a complex tool environment, diagnose failures in that environment, repair them, restart the proxy serving its own tokens, and continue coherently after the restart.

That is the part that made me stop and stare.

The recent example was a vLLM proxy issue involving streamed reasoning, tool calls, and a custom reasoning guard.

The guard exists because I do not treat reasoning output as sacred “deep cognition”. In my workflow, reasoning is a controlled generation scaffold. If the model needs 10,000 tokens of reasoning, I would rather break the task into explicit steps myself. So I cap reasoning effort and push the model back into action.

The guard works extremely well. The cut-over typically happens in under 200 ms, and the model is back to producing the next useful token without runaway reasoning.

The bug was not that the guard existed.

The bug was that the proxy was cutting the reasoning stream from one generation and stitching it onto content from a second generation, with no valid boundary between them.

A Frankenstein stream.

The client was receiving reasoning from generation A and content from generation B.

Here is what blew my mind:

Link diagnosed the bug while being served through the broken proxy.

It was affected by the very issue it was diagnosing. It hit the corrupting token pattern, jittered, recovered, realised that emitting certain delimiter-like tokens was triggering the failure mode, adapted, avoided worsening the stream corruption, diagnosed the proxy bug, applied the fix, restarted the proxy underneath itself, and reported success after the model connection stabilised again.

That is not normal chatbot behaviour.

That is operational stability.

I sent the same problem to Claude first. Claude was still waiting on its first token when Link had already produced the correct diagnosis. When Claude finally responded, it was wrong. Then, when I pasted Link’s diagnosis into Claude, it initially dismissed the correct explanation and tried to steer the fix against the design philosophy of the system.

This is not a “Claude bad” post. Claude is obviously powerful.

But this experience made something very clear to me:

A smaller local model inside a well-built harness, with good tool feedback, local context, high throughput, and real operational continuity, can beat a frontier model on actual system work.

Not benchmarks.

Not toy prompts.

Real debugging.

Real files.

Real proxy.

Real restart.

Real recovery.

That is the shift.

I think a lot of people are underestimating local models because they are judging them inside weak harnesses, small context windows, bad tool loops, slow inference setups, and generic workflows.

Then they conclude the model is not good enough.

Maybe the model is not the problem.

Maybe the harness is.

Context also matters more than people admit. I ran this harness at around 50k–60k context for a while and it worked, but it was constrained. At around 180k–200k context, everything changes. For serious agentic work, especially in a large codebase or a messy system, the model needs room to understand the environment.

Below that, it can do tasks.

Above that, it starts to operate.

I’m not saying everyone needs a $15k GPU. You do not. This kind of setup can run through LM Studio, llama.cpp, or vLLM depending on the hardware. A 24GB card can do useful work. A 32GB card is better. Huge context is where it really opens up.

Local models have crossed a line.

They are no longer just “offline assistants”.

They are becoming high-speed local operators.

Link is not properly released yet. Parts are technically on npm, but I have not promoted it because it is not finished, and honestly, it may never feel finished.

What makes Link different is not any single feature. It is the operating model underneath it. Link is built around the idea that the model should not just produce tokens. It should operate.

It should read the system, use tools, watch the results, recover from failures, keep enough live context to understand what just happened, stop itself when it drifts, continue when the next step is obvious, and hand control back when it should.

The harness has side queues, review layers, per-call-type LLM profiles, circuit breakers, transient context injection, session state, proxy integration, and a delta channel for real-time observability.
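
To make a couple of those concrete: per-call-type profiles and circuit breakers, in rough shape only. This is purely illustrative; the names and values are not Link’s actual config, just the idea.

```python
# Purely illustrative sketch; not Link's real configuration or API.
# Per-call-type profiles: each kind of LLM call gets its own sampling
# settings and reasoning budget instead of one global setting.
PROFILES = {
    "planning":  {"temperature": 0.7, "max_tokens": 2048, "reasoning_cap": 1024},
    "tool_call": {"temperature": 0.2, "max_tokens": 512,  "reasoning_cap": 256},
    "review":    {"temperature": 0.0, "max_tokens": 1024, "reasoning_cap": 512},
}

class CircuitBreaker:
    """Stop retrying a failing tool or endpoint after repeated errors."""
    def __init__(self, max_failures: int = 3):
        self.max_failures = max_failures
        self.failures = 0

    def record(self, ok: bool) -> None:
        # Reset on success, count consecutive failures otherwise.
        self.failures = 0 if ok else self.failures + 1

    @property
    def open(self) -> bool:
        return self.failures >= self.max_failures
```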

Those pieces are not gimmicks. They are what let a 35B local model operate inside a complex tool environment, hit a bug in its own serving proxy, diagnose it while being affected by it, apply the fix, restart the proxy underneath itself, and recover coherently.

That is the part I find special.

Not that it can call tools.

That it can stay operational while the environment around it is unstable.

I would like to soft-release it rather than do some big fake launch. A few serious users. Real feedback. People who already run local models and understand agent workflows. Especially anyone who can help polish the frontend side, because I am much stronger on backend systems and orchestration than visual UI.

Bottom line:

Local models are not coming. They are already here. And Qwen deserves serious credit for pushing them this far.

u/Electronic-Fly-6465 — 6 days ago
▲ 13 r/Qwen_AI

I’ve been running Qwen3.6 MoE behind a vLLM proxy and hit a specific reliability issue: occasional runaway reasoning loops.

This isn’t a criticism of Qwen3.6. The model is excellent — in my setup, it’s more robust than Qwen3.5 for agentic coding, path handling, debugging, and tool-style workflows. But occasionally, especially on file-path, debugging, and code-tracing prompts, it can get stuck inside a reasoning block and repeat itself endlessly.

At 180+ tokens/sec, even a 20–30 second loop burns through a lot of tokens, blocks GPU time, and stalls agents.

So I built a Reasoning Guard at the proxy layer.

Architecture

Client → Proxy → vLLM → Model

The proxy watches the streaming response as it leaves vLLM. It doesn’t modify the model weights, it doesn’t require a second LLM call, and it doesn’t use embeddings or semantic analysis. It just applies cheap, deterministic checks while the stream is active.
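
In code terms, the proxy is just a streaming pass-through with a check on every chunk. A minimal sketch of that shape, assuming an OpenAI-compatible vLLM endpoint (the URL and the guard_should_cut check are placeholders, not the real implementation):

```python
# Minimal sketch of the pass-through shape; placeholder names throughout.
# Forward vLLM's SSE stream line by line and run cheap checks as it flows.
import httpx
from fastapi import FastAPI, Request
from fastapi.responses import StreamingResponse

app = FastAPI()
VLLM_URL = "http://localhost:8000/v1/chat/completions"  # assumed upstream

def guard_should_cut(sse_line: str) -> bool:
    # Placeholder for the deterministic checks (token caps, repetition).
    return False

async def relay(payload: dict):
    async with httpx.AsyncClient(timeout=None) as client:
        async with client.stream("POST", VLLM_URL, json=payload) as upstream:
            async for line in upstream.aiter_lines():
                if guard_should_cut(line):
                    # The real proxy switches to cut-and-continue here.
                    break
                yield line + "\n"

@app.post("/v1/chat/completions")
async def chat(request: Request):
    payload = await request.json()
    return StreamingResponse(relay(payload), media_type="text/event-stream")
```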

What It Checks

The guard currently monitors:

  • Reasoning token caps (configurable by effort level)
  • Repeated paragraph detection
  • Sliding-window n-gram repetition (sketched below)
  • Repeated sentence fingerprinting
  • Fuzzy opening-pattern detection (catches loops like “Actually, I think I’ve found it…”)
  • Cut-and-continue recovery path
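
To give a feel for the repetition side, here is roughly what the n-gram and sentence-fingerprint checks look like. The thresholds and window sizes below are made up for the sketch, not the guard’s real values:

```python
# Illustrative repetition checks; thresholds and window sizes are placeholders.
import hashlib
from collections import Counter

def ngram_repetition(tokens: list[str], n: int = 8,
                     window: int = 400, limit: int = 4) -> bool:
    """Flag when any n-gram recurs too often inside the recent window."""
    recent = tokens[-window:]
    counts = Counter(tuple(recent[i:i + n]) for i in range(len(recent) - n + 1))
    return any(c >= limit for c in counts.values())

def sentence_loop(text: str, limit: int = 3) -> bool:
    """Flag when the same normalised sentence keeps coming back."""
    sentences = [s.strip().lower() for s in text.split(".") if s.strip()]
    seen = Counter(hashlib.md5(s.encode()).hexdigest() for s in sentences)
    return any(c >= limit for c in seen.values())
```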

Recovery Flow

When the guard triggers, it:

  1. Stops the upstream stream
  2. Captures the reasoning produced so far
  3. Reissues the request with that reasoning baked in as prior assistant context (rough shape sketched after this list)
  4. Disables thinking for the continuation
  5. Merges phase 1 and phase 2 usage stats
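
The phase-2 request is the interesting part. In rough shape it looks like the sketch below, assuming the OpenAI-style chat API that vLLM exposes. How thinking gets disabled is template-specific (for Qwen-style templates it is usually a chat_template_kwargs flag, but treat that exact field as an assumption):

```python
# Rough shape of the cut-and-continue phase-2 request (illustrative only).
def build_continuation(original_request: dict, partial_reasoning: str) -> dict:
    phase2 = dict(original_request)
    phase2["messages"] = list(original_request["messages"]) + [
        # Replay the reasoning captured so far as prior assistant context,
        # so the model continues straight into the answer.
        {"role": "assistant", "content": partial_reasoning},
    ]
    # Disabling thinking is model/template specific; this field is an assumption.
    phase2["chat_template_kwargs"] = {"enable_thinking": False}
    phase2["stream"] = True
    return phase2

def merge_usage(phase1: dict, phase2: dict) -> dict:
    """Combine token counts so the client sees one total for the whole call."""
    keys = ("prompt_tokens", "completion_tokens", "total_tokens")
    return {k: phase1.get(k, 0) + phase2.get(k, 0) for k in keys}
```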

Because vLLM prefix caching is already active, the continuation is effectively seamless. Phase 2 usually resumes with ~50–100ms TTFT, so the client just sees reasoning flow directly into the final answer instead of hanging.

Good reasoning still comes through. The guard only steps in when reasoning exceeds configured limits or starts showing repetition patterns.

Why This Exists

This isn’t trying to compete with provider-side reasoning controls. OpenAI, Anthropic, DeepSeek, and others already have model/API-level systems for this. This is narrower: a practical runtime guard for teams running their own inference stack who want deterministic protection from runaway reasoning without changing the model or swapping proxies.

Observability

My proxy logs each trigger with:

  • Whether the guard fired
  • Trigger reason
  • Token cap used
  • Reasoning token count
  • Merged total usage
  • Stream-end metadata
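
Concretely, one trigger ends up looking something like this. Field names and numbers here are illustrative, not the exact log schema:

```python
# Illustrative shape of a single guard-trigger log entry (fields assumed).
guard_log_entry = {
    "guard_fired": True,
    "trigger_reason": "ngram_repetition",
    "reasoning_token_cap": 1024,
    "reasoning_tokens": 1187,
    "usage_merged": {
        "prompt_tokens": 3410,
        "completion_tokens": 1502,
        "total_tokens": 4912,
    },
    "stream_end": {"finish_reason": "stop", "phase2_ttft_ms": 72},
}
```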

I’ve tested it against both normal requests and stress cases derived from real trace logs. The loop detector catches repeated paragraphs, n-gram repetition, recurring sentence patterns, and common reasoning-loop openings. The cut-and-continue path has been validated end-to-end through the live proxy.

Result

Before: Occasional 2000+ token reasoning blocks that went nowhere.

After: The model still reasons when useful, but runaway thinking gets cut and redirected into an answer.

It’s basically a proxy-level seatbelt for local LLM inference.

Not magic. No model surgery. Just stream interception, token counting, loop detection, and a clean recovery path.

I would love to discuss other neat mitigations like this that help smaller models operate more effectively.
