u/Expensive-Register-5

▲ 50 r/Vllm+1 crossposts

TL;DR: On Qwen 3.6, using qwen3.5-enhanced.jinja with preserve_thinking=true tends to stack broken think markup in the prompt: the model sometimes emits <tool_call> without a closing </think>, the 3.5 template does not repair that, and the 3.6 assistant branch can double-wrap turns – so you get ignored tool calls, reasoning leaking into tool turns, and preserve_thinking=false as the only workaround (stripping earlier thinking from history). I ship qwen3.6-enhanced.jinja with a small self-healing step before the reasoning split, inserting </think> where needed ahead of <tool_call>, which makes preserve_thinking usable again on 3.6. Proof repo: qwen36_27B_36jinja_project; the template lives beside qwen3.5-enhanced in the same GitHub repo. The launch script in the post is what I run on vLLM v0.19.0 (qwen3_coder, preserve_thinking: true, qwen3.6-enhanced.jinja).
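To make the repair concrete, here is a minimal sketch of the self-healing rule expressed as shell string surgery – illustrative only, with a made-up turn and tool name; the real fix is a Jinja step inside qwen3.6-enhanced.jinja that runs before the reasoning split.

# A CoT-leaked assistant turn: <think> is opened but never closed before the tool call
turn='<think>plan the next edit <tool_call>{"name": "write_file"}</tool_call>'

# Rule: if the turn has a <tool_call> but no </think>, insert </think> first
case "$turn" in
  *'</think>'*) healed="$turn" ;;  # already well-formed, leave it alone
  *'<think>'*'<tool_call>'*) healed="${turn/<tool_call>/</think><tool_call>}" ;;
  *) healed="$turn" ;;  # no think block at all
esac

echo "$healed"
# -> <think>plan the next edit </think><tool_call>{"name": "write_file"}</tool_call>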

Full write-up (RCA, Jinja snippet, env + vllm serve flags, version note):
https://allanchan339.github.io/bug-fixes/2026/05/02/Qwen36-27B-updated-jinja.html

Previous write-ups: https://www.reddit.com/r/LocalLLM/comments/1sv6cqk/follow_up_tested_tool_calling_fixes_for_qwen/

u/Expensive-Register-5 — 11 days ago
▲ 12 r/Vllm+1 crossposts

Before reading the real content, can anyone tell me what's wrong with LocalLLaMA banning my post from r/LocalLLaMA? I can't agree that it's not related to local LLMs.

https://preview.redd.it/eteotpoxpaxg1.png?width=1466&format=png&auto=webp&s=a7a1032db7d6596f55bb22371dde4a8b7a41330e

TL;DR: After my deep‑dive on Qwen 3.5 tool‑calling and a trial run on Qwen 3.6-35B-A3B, I took the same enhanced.jinja setup and ran Qwen 3.6‑27B‑FP8 through an unsupervised, agentic build. The catch? Upgrading to NVIDIA Studio Driver 595.79 introduced NCCL deadlocks that required extra config overrides to fix. Once resolved, the model ran for 180,000 tokens without a single malformed tool call, and the finished project is here.

Important: The qwen3.5-enhanced.jinja template requires preserve_thinking=false; preserve_thinking is a new feature in Qwen 3.6 that this template does not support. If you accidentally set it to true, qwen3.5-enhanced.jinja will break and tool calls will fail. All examples below assume the flag is correctly set.

1. What Came Before (Recap of the Qwen 3.5 Post)

Last time I spent weeks debugging Qwen 3.5‑27B and 35B‑A3B on a mixed‑GPU rig (RTX 4090 + RTX 3090). The fixes that made agentic work possible:

  • qwen3.5‑enhanced.jinja – a custom interleaved‑thinking template that treats any unclosed <think> block as plain content, not as reasoning content. This way the harness sees the tool call directly, even when the model forgets to close the thinking tag – a pattern known as “CoT leakage.” The template must run with preserve_thinking=false (the default) or it will not function.
  • Streaming‑parser dependency – the jinja fix relies on the parser processing tokens as they stream, detecting <tool_call> even while the surrounding <think> tag is still open. On Qwen 3.5‑27B, the qwen3_xml parser works perfectly for this and is generally more robust. On Qwen 3.6, however, only the qwen3_coder parser triggers correctly with an unclosed think block; qwen3_xml fails to fire the tool call (more on this below).
  • VLLM_TEST_FORCE_FP8_MARLIN=1 – forces the 4090 (SM89) to use W8A16 instead of native W8A8, preventing precision drift between the two GPUs.
  • NCCL tuning (P2P_DISABLE, IB_DISABLE, Ring) – essential for stability on PCIe topologies.
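For reference, the GPU-level fixes from that post boil down to these exports (the full, updated 3.6 script is in section 6):

# Qwen 3.5-era stability exports, as described above
export VLLM_TEST_FORCE_FP8_MARLIN=1  # force W8A16 on the 4090 so it matches the 3090
export NCCL_P2P_DISABLE=1            # PCIe-only topology: no GPU peer-to-peer
export NCCL_IB_DISABLE=1             # no InfiniBand in this rig
export NCCL_ALGO=Ring                # pin the all-reduce algorithm for stability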

With that stack, Qwen 3.5‑27B ran a 1h 9m continuous agentic session at 138K tokens, building a complete FastAPI + React app without any tool‑calling failures.

2. Qwen 3.6‑27B – The Upgrade Path

A few weeks later I swapped the model to Qwen/Qwen3.6-27B‑FP8 while keeping the same enhanced.jinja template (with preserve_thinking=false). The parser had to change: despite qwen3_xml being the more robust choice on 3.5, on 3.6 it did not trigger tool calls when <think> remained unclosed (the exact scenario the template relies on). So I switched to qwen3_coder – a parser that, while less careful with special characters, processes streams aggressively enough to catch the tool call inside an unclosed think block. (After the bug fix introduced in this post, I may switch back to qwen3_xml on vLLM 0.20.1.) Everything worked fine on driver 591.86 — so I upgraded to Studio Driver 595.79, expecting better performance. Instead, everything broke:

  • Random NCCL deadlocks – the server would freeze hard mid‑generation, requiring a restart.
  • These deadlocks looked exactly like tool‑calling failures, but the logs pointed to NCCL timeouts, not parser errors.

3. Why qwen3_coder Over qwen3_xml: The "Bug + Bug = Feature" Effect

https://preview.redd.it/cexnoya6oaxg1.png?width=2132&format=png&auto=webp&s=b564ee47a0152fc3aa46f001abd30992c3744fee

The switch to qwen3_coder isn't just a workaround – it's a perfect case of two "bugs" combining into a feature.

  • The model's bug (CoT leakage): Qwen 3.6 sometimes fails to close its <think> tag before outputting a <tool_call>. The enhanced.jinja template deliberately ignores the unclosed block and leaves the tool call as plain content, preserving the intent.
  • The parser's "bug" (aggressive streaming): qwen3_coder is designed for code‑related outputs and is much more aggressive in detecting tool‑call patterns mid‑stream, even when the surrounding XML is malformed. It doesn't require a fully closed <think> context; it just sees <tool_call> and fires. In contrast, qwen3_xml is a proper XML parser that expects a well‑formed document, so an unclosed <think> tag scuppers its ability to find the nested <tool_call>.

When the model's "leakage" meets the parser's "roughness", you get more resilient tool‑call extraction – the exact outcome you want. Neither the model's leakage nor the parser's looseness about XML imperfections is ideal on its own, but together they become a production‑grade feature. This is why I stick with qwen3_coder for Qwen 3.6 and why qwen3_xml – despite being more robust with special characters – simply cannot handle the unclosed‑think scenario on this model.
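Here is a made-up illustration of what such a leaked turn looks like on the wire (not a real capture; the tool name is invented):

# Sample CoT-leaked turn: </think> never arrives before the tool call
cat <<'EOF'
<think>
I should inspect the config before editing... calling the tool now.
<tool_call>
{"name": "read_file", "arguments": {"path": "app/config.py"}}
</tool_call>
EOF
# qwen3_coder fires as soon as <tool_call> appears in the stream;
# qwen3_xml keeps waiting for a well-formed document and never extracts the call.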

Dear vLLM dev, please let me know if my verdict is correct.

4. The Real Culprit: Driver 595.79 Broke Things

Here's the twist: I was on 591.86 and things were mostly working. I upgraded to Studio Driver 595.79 expecting improvements — instead, it introduced NCCL deadlocks that froze the server mid‑generation.

The new driver appears to tighten NCCL behaviour on mixed‑GPU PCIe topologies, breaking vLLM's custom all‑reduce path. The fix wasn't rolling back the driver — it was enforcing the right overrides:

  1. New NCCL env vars – NCCL_SHM_DISABLE=0 and NCCL_P2P_LEVEL=LOC, plus VLLM_RPC_TIMEOUT=180 and VLLM_WORKER_MULTIPROC_METHOD=spawn on the vLLM side (all marked # NEW in the script below).
  2. --disable-custom-all-reduce – forces vLLM to use native NCCL all‑reduce instead of its custom implementation, which misbehaves on PCIe‑only topologies under this driver.

Without these overrides on 595.79, you'll get random deadlocks that masquerade as tool‑calling failures.
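Pulled out of the full script in section 6, the driver‑595.79 delta is just this (the comments are my reading of each knob, not official documentation):

# New with Studio Driver 595.79; everything else is unchanged from the 3.5 setup
export NCCL_SHM_DISABLE=0                  # keep the shared-memory transport enabled
export NCCL_P2P_LEVEL=LOC                  # restrict P2P to co-located devices (effectively off here)
export VLLM_RPC_TIMEOUT=180                # tolerate slow NCCL initialisation
export VLLM_WORKER_MULTIPROC_METHOD=spawn  # spawn workers instead of forking

# ...plus one extra serve flag:
#   vllm serve ... --disable-custom-all-reduce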

5. The 180K‑Token Agentic Run

With the driver + NCCL fixes in place, I gave Qwen 3.6‑27B full ownership of a folder and a $10k token budget. No hand‑holding.

| Prompt | Wall Time | Accumulated Tokens |
| --- | --- | --- |
| “Welcome to life, you are Qwen 3.6‑27B. Full leadership. What project do you want to build?” | 0s | 0k |
| “Don't ask me – you have full leadership. $10k token budget.” (model used a Question tool to clarify, then proceeded) | 31s | 14.0k |
| “Did you check if this is bug‑free? It's your own project.” | 17m 13s | 63.3k |
| “Deliver the first possible functional upgrade. Do it nicely.” | 11m 35s | 126.7k |
| (session ended naturally) | 10m 46s | 180.0k |

Result: The model chose to build a modern web app (React + Vite + TypeScript, with a FastAPI backend), iterated on it after critical feedback, and delivered a polished upgrade – all without a single malformed tool call. The finished code is on GitHub.

6. The Recipe (Copy‑Paste Ready)

#!/bin/bash
# -------------------------------------------------
# Qwen 3.6‑27B‑FP8 – Agentic‑Ready vLLM Launch Script
# Tested: 180K tokens, zero tool‑calling failures
# Driver: NVIDIA Studio 595.79
# -------------------------------------------------

# ---- Safe, Speed‑Focused Env Vars ----
export CUDA_DEVICE_ORDER=PCI_BUS_ID
export CUDA_VISIBLE_DEVICES=0,1
export NCCL_CUMEM_ENABLE=0
export VLLM_ENABLE_CUDAGRAPH_GC=1
export VLLM_USE_FLASHINFER_SAMPLER=1

export OMP_NUM_THREADS=8

# ---- NCCL Tuning for SYS/PCIe Topology ----
export NCCL_P2P_DISABLE=1
export NCCL_IB_DISABLE=1
export NCCL_SHM_DISABLE=0          # NEW for driver 595.79
export NCCL_ALGO=Ring
export NCCL_P2P_LEVEL=LOC          # NEW for driver 595.79

# ---- vLLM Stability (Driver‑Dependent) ----
export VLLM_RPC_TIMEOUT=180                  # NEW
export VLLM_WORKER_MULTIPROC_METHOD=spawn    # NEW

# ---- FP8 & Memory ----
export VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1
export VLLM_TEST_FORCE_FP8_MARLIN=1
export VLLM_SLEEP_WHEN_IDLE=1

# Clean stale FlashInfer cache
rm -rf ~/.cache/flashinfer

# Activate environment
source /home/cychan/vLLM/.venv/bin/activate

# MANDATORY: keep preserve_thinking false – the enhanced jinja will break if it is true.
# Parser: qwen3_coder is REQUIRED for Qwen 3.6 with enhanced.jinja; on Qwen 3.5 27B,
# qwen3_xml also works (see https://www.reddit.com/r/Vllm/comments/1suasv2/).
# --disable-custom-all-reduce is CRITICAL for driver 595.79.
vllm serve Qwen/Qwen3.6-27B-FP8 \
  --served-model-name Qwen3.5-27B \
  --chat-template qwen3.5-enhanced.jinja \
  --default-chat-template-kwargs '{"preserve_thinking": false}' \
  --attention-backend FLASHINFER \
  --trust-remote-code \
  --tensor-parallel-size 2 \
  --max-model-len 219520 \
  --gpu-memory-utilization 0.91 \
  --enable-auto-tool-choice \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --max-num-batched-tokens 12288 \
  --max-num-seqs 4 \
  --kv-cache-dtype fp8 \
  --tool-call-parser qwen3_coder \
  --reasoning-parser qwen3 \
  --no-use-tqdm-on-load \
  --host 0.0.0.0 \
  --port 8000 \
  --language-model-only \
  --disable-custom-all-reduce

7. Key Takeaways

  1. Parser choice depends on the model – The enhanced jinja template relies on a streaming parser to catch tool calls inside unclosed <think> blocks. On Qwen 3.5‑27B, the qwen3_xml parser works fine and is generally more robust (see this detailed post). On Qwen 3.6, qwen3_xml fails to trigger the tool call in that exact scenario, so I use qwen3_coder instead.
  2. preserve_thinking must be false – The enhanced jinja template will not work with preserve_thinking=true; preserve_thinking is a new Qwen 3.6 feature that qwen3.5-enhanced.jinja is not compatible with.
  3. The NVIDIA driver upgrade can break things – going from 591.86 to 595.79 introduced NCCL deadlocks on my mixed‑GPU setup. The fix requires new NCCL env vars and --disable-custom-all-reduce. If you're on 595.79 without these overrides, you'll hit random deadlocks that masquerade as tool‑calling failures.
  4. The original Qwen 3.5 fixes still stand – VLLM_TEST_FORCE_FP8_MARLIN=1 remains non‑optional on mixed‑GPU setups to avoid precision drift, and the same NCCL tuning (updated for the new driver) is mandatory.
  5. Qwen 3.6‑27B is not just an incremental step – it's a dense 27B model that beats the old MoE flagship Qwen 3.5‑397B‑A17B on core agentic‑coding benchmarks (SWE‑bench Verified 77.2 vs 76.2, Pro 53.5 vs 50.9, SkillsBench 48.2 vs 30.0), making it a generational upgrade rather than a refinement.
  6. 180K tokens is the new normal – the system handled a roughly 40‑minute uninterrupted agentic session (see the table above) with zero tool‑calling errors, demonstrating production‑grade stability on consumer hardware.

Bottom line: The enhanced.jinja template works on both Qwen 3.5 and 3.6, provided preserve_thinking=false and you choose the right parser for the model. Combined with the new driver workarounds, the stack yields a rock‑solid 180K‑token agentic run. Full recipe in section 6; background in the original Qwen 3.5 deep‑dive – go build.

u/Expensive-Register-5 — 19 days ago
▲ 52 r/Vllm+1 crossposts

Update 1: toggled preserve_thinking on to see if it fixes the tool-calling problem; it doesn't.

TL;DR: Following up on the Qwen 3.5 thread — after everyone kept asking about 3.6, I set it up using the same qwen3_xml + enhanced.jinja fixes and ran real agentic tests. Here's the honest result: my config is still the most stable, but compared to Qwen3.5-27B, Qwen3.6-35B-A3B is notably more loopy and has a higher chance of malformed tool calls interrupting an agentic process.

The Short Story

After spending weeks ironing out Qwen 3.5-27B/35B for agentic use — same fixes, same template, same GPU tuning — people on Reddit kept asking about Qwen 3.6.

So I set it up and ran real agentic tests. I gave the model full ownership of a folder and asked it to build a full-stack project (frontend and backend) under a $10k token budget. I wanted to see how it holds up in practice.

My config (enhanced.jinja + qwen3_xml) is still the most stable option. But compared to Qwen3.5-27B, Qwen3.6-35B-A3B has two new problems:

  1. More looping — the model gets stuck in reasoning loops more often

https://preview.redd.it/jbzl0ew5tcwg1.png?width=3482&format=png&auto=webp&s=fb0757f5e0d69ba6a74413506418a6b89489fa12

  2. Malformed tool calls interrupting agentic flow — higher chance of breaking mid-task, even with the same config that works perfectly on 3.5

What Carried Over (Still Works)

qwen3_xml parser

Registry-based parser handles complex tool arguments without corruption. Official docs still say qwen3_coder. I still say no.

qwen3.5-enhanced.jinja template

The interleaved thinking template works on 3.6 35B-A3B. Proper </think> tag handling, clean tool call formatting.

Precision drift on mixed GPUs

RTX 4090 (SM89) wants W8A8, RTX 3090 (SM80) falls back to W8A16. VLLM_TEST_FORCE_FP8_MARLIN=1 still forces both to match. Without it, conversations drift.

NCCL tuning

Same setup: NCCL_P2P_DISABLE=1, NCCL_IB_DISABLE=1, NCCL_ALGO=Ring. Same reason: mixed topology stability.

Real Agentic Test: Four Runs

I gave each trial the same prompt: full ownership of the folder, build a full-stack project with frontend and backend, $10k token budget.

Run 1: enhanced.jinja + qwen3_xml (my config)

This is the one that lasted the longest. The model wanted to build an oss-inspect project for autonomous codebase quality analysis.

| Prompt | Accumulated Tokens |
| --- | --- |
| Project setup | 13.9k |
| "Did you check if this is bug free? This is your own project." | 135.1k |
| DCP sweep auto-triggered | 107.0k |
| "Fix it then" | 110.0k |
| Model died – improper tool calling | 111.1k |

This config survived to ~135k accumulated tokens (13m 20s wall time) before dying from improper tool calling; the auto-triggered DCP sweep compacted the context from 135.1k down to 107.0k, but the run kept going until ~111k. For context, the 3.5 27B model with the same setup routinely goes 130k+ without any interruption.

Run 2: official.jinja + qwen3_coder

https://preview.redd.it/xruaxzmmscwg1.png?width=3512&format=png&auto=webp&s=cb4c773a36b91a4f6312b32404a453098501b4de

**For simplicity I didn't change the served-model-name in vLLM; the model actually is Qwen3.6-35B-A3B.**

This model wanted to build a knowledge-graph platform for graphify (the skill ingestion is a bit aggressive, eh?).

Died in 6m 32s — improper tool calling. Failed too early to be reliable for agentic tasks.

Run 3: official.jinja + qwen3_xml

https://preview.redd.it/1qvkpcpltcwg1.png?width=3530&format=png&auto=webp&s=95a9445b63b5c9db38d0bab1dec85d4984ed3956

This time the model wanted to build TaskFlow — a Kanban project management app with authentication, drag-and-drop task management, and a polished UI.

Died in 1m 16s — malformed tool calls inside the thinking box. Failed too early to be reliable for agentic tasks.

https://preview.redd.it/450bg6lntcwg1.png?width=3530&format=png&auto=webp&s=f0697dcae6870265de7c3de03cf9e6757315e3d1

Run 4: Enabled preserve thinking

https://preview.redd.it/05yxfedi1dwg1.png?width=3588&format=png&auto=webp&s=3f1e4d9a524acfe76d44e42b14f38ca8c4873391

This time the model wanted to build a Knowledge Discovery Engine — an end-to-end system that crawls web content with agent-browser, builds knowledge graphs with graphify, and provides an interactive visual explorer with surprising insights and knowledge gap analysis.

However, this time the model started looping: it kept trying to call the sub-agent tool (which was disabled) and kept modifying the todo list without writing a single line of code.

Verdict:

  --default-chat-template-kwargs '{"preserve_thinking": true}' \

doesn't help.

Remarks

As for the tech stack the model chose, I have zero knowledge of it.

Comparison Summary

| Config | Survival | Failure Mode |
| --- | --- | --- |
| enhanced.jinja + qwen3_xml | ~111k tokens (13m 20s) | Improper tool calling (died) |
| official.jinja + qwen3_coder | 6m 32s | Improper tool calling |
| official.jinja + qwen3_xml | ~1m 16s | Malformed tool calls in thinking box |

For comparison, the same test on Qwen3.5-27B with enhanced.jinja + qwen3_xml reliably runs 130K+ tokens before dying. 3.6 35B-A3B has a noticeably higher failure rate even with the best config. Qwen3.5-27B is still the most stable model for agentic work, despite its much slower TTFT.

New Problems Specific to Qwen3.6-35B-A3B

1. More Loopy

The model gets stuck in reasoning loops more often. It'll loop through the same analysis step multiple times, consuming tokens, before eventually moving forward. This isn't a template issue — it's a model behavior change. On 3.5 27B this happened occasionally. On 3.6 35B-A3B it's frequent enough to meaningfully impact long sessions.

2. Malformed Tool Calls Interrupt Agentic Flow

Even with enhanced.jinja + qwen3_xml (the config that works perfectly on 3.5 27B), 3.6 35B-A3B has a higher chance of generating malformed tool calls that break the agentic process. The tool calling format still uses XML and is technically correct — but the frequency is higher and the damage is worse: an interrupted session that can't recover.

On 3.5 27B, a malformed tool call is a rare edge case after patching the template. On 3.6 35B-A3B, it's a much more regular occurrence that will eventually kill a long-running agentic session, no matter which config you use.

The Fix (Partial)

OpenCode 1.4.18 helps. Older versions had client-side tool-calling issues that made things worse, especially with the "question" tool. Upgrading to 1.4.18 resolved that class of malformed-tool-call problems.
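If you installed OpenCode via npm, the upgrade is a one-liner (the package name opencode-ai is my assumption – adjust for your install method):

npm install -g opencode-ai@1.4.18   # pin to the version this post tested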

But here's the honest part: upgrading the client doesn't solve the looping or the inherently higher failure rate on 3.6. The root cause is still in the model (or template?).

My Config

vLLM version: 0.19.1
Transformers version: 5.5.4
CUDA version: 12.8.1 (nvcc 12.8.93)

export CUDA_DEVICE_ORDER=PCI_BUS_ID
export CUDA_VISIBLE_DEVICES=0,1
export NCCL_CUMEM_ENABLE=0
export VLLM_ENABLE_CUDAGRAPH_GC=1
export VLLM_USE_FLASHINFER_SAMPLER=1
export OMP_NUM_THREADS=4
export NCCL_P2P_DISABLE=1
export NCCL_IB_DISABLE=1
export NCCL_ALGO=Ring
export VLLM_TEST_FORCE_FP8_MARLIN=1
export VLLM_SLEEP_WHEN_IDLE=1

rm -rf ~/.cache/flashinfer

vllm serve Qwen/Qwen3.6-35B-A3B-FP8 \
  --served-model-name Qwen3.6-35B-A3B \
  --chat-template qwen3.5-enhanced.jinja \
  --attention-backend FLASHINFER \
  --trust-remote-code \
  --tensor-parallel-size 2 \
  --max-model-len 200000 \
  --gpu-memory-utilization 0.91 \
  --enable-auto-tool-choice \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --max-num-batched-tokens 12288 \
  --max-num-seqs 4 \
  --kv-cache-dtype fp8 \
  --tool-call-parser qwen3_xml \
  --reasoning-parser qwen3 \
  --no-use-tqdm-on-load \
  --host 0.0.0.0 \
  --port 8000 \
  --language-model-only

Bottom Line

My config (enhanced.jinja + qwen3_xml + OpenCode 1.4.18) is still the best I can do on Qwen3.6-35B-A3B. But it's worth being honest: Qwen3.6-35B-A3B is more loopy and has a higher failure rate for agentic tool calling compared to Qwen3.5-27B. It is quite surprising that the tool-calling issues show up again on 3.6 35B-A3B. The root cause is still unknown (maybe preserved thinking is one of the reasons?).

Comparing Qwen3.5-27B, Qwen3.5-35B-A3B and Qwen3.6-35B-A3B, the official templates of all three models are identical. That suggests the Qwen team may need to give these tool-calling issues special treatment if they decide to launch a Qwen 3.6 flash model.

I've decided to stick with Qwen3.5-27B-FP8. For agentic obedience — following instructions, executing tool calls cleanly, not looping — the 27B model outperforms 3.6 35B-A3B in my testing. 3.6 has much faster TTFT and similar ability to Qwen3.5-27B (per the AA benchmark), but it pays for it with looping and tool-call failures that kill long sessions. Reliability over raw intelligence for agentic work.

u/Expensive-Register-5 — 23 days ago
▲ 2 r/Vllm+1 crossposts

https://preview.redd.it/6gan8kwqn3wg1.png?width=1408&format=png&auto=webp&s=d4ae81e09cac21816b2acb6208162ea6aab684a0

TL;DR: Every tutorial says "set ANTHROPIC_CUSTOM_MODEL_OPTION and you're done." This is wrong. That config does NOT work for local models. The real solution requires 4 specific settings that no tutorial mentions together. Here's the working config so you don't hit the same blockers.

Note on vLLM setup: If you're just getting started with Qwen 3.5 on vLLM (Jinja templates, parser choices, etc.), I documented those issues here: https://www.reddit.com/r/Vllm/comments/1skks8n/qwen_35_27b35ba3b_tool_calling_issues_why_it/ - this post assumes vLLM is already running.

The Story (So You Don't Repeat It)

I've got Qwen 3.5-27B running on vLLM. Direct API calls work perfectly:

curl http://127.0.0.1:8000/v1/chat/completions -X POST \
  -H 'Content-Type: application/json' \
  -d '{"model":"Qwen3.5-27B","messages":[{"role":"user","content":"test"}]}'
# ✅ Works

So I thought "Claude Code should be easy."

Spoiler: It wasn't. After testing multiple configurations and reading through Claude Code's source code, I found the working setup. Here's what actually works.

The Trap: The "Obvious" Fix That Doesn't Work

What Every Tutorial Tells You

The official Claude Code docs say:

>Use ANTHROPIC_CUSTOM_MODEL_OPTION to add a custom entry to the /model picker. Claude Code skips validation for the model ID set in this variable.

So I set it:

{
  "ANTHROPIC_CUSTOM_MODEL_OPTION": "Qwen3.5-27B",
  "ANTHROPIC_BASE_URL": "http://127.0.0.1:8000"
}

Result: There's an issue with the selected model (Qwen3.5-27B). It may not exist or you may not have access to it.

Why It Doesn't Work

The docs are misleading. ANTHROPIC_CUSTOM_MODEL_OPTION:

  • ✅ Adds an entry to the /model picker
  • ❌ Does NOT bypass validation when using --model flag
  • ❌ Does NOT bypass validation when using settings.json
  • ❌ Only works if you manually select it from the picker (which defeats the purpose)

This is a known bug documented in GitHub issues #18025, #23266, #34821. But the docs haven't been updated.

Lesson: When the official docs don't work, read the source code.

The Breakthrough: Reading Source Code

Eventually, I gave up on tutorials and started reading Claude Code's cli.js (~50K lines of minified code).

I searched for the error message:

grep -n "There's an issue with the selected model" ~/.nvm/versions/node/*/lib/node_modules/@anthropic-ai/claude-code/cli.js

Found it around line 5146. The relevant code (deobfuscated):

if (q instanceof AnthropicError && q.status === 404) {
  // Reject custom models on 404
  return {
    content: `There's an issue with the selected model (${K}). 
              It may not exist or you may not have access to it.`,
    error: "invalid_request"
  }
}

The real issue: Claude Code makes validation requests, gets 404s from vLLM (because the model name doesn't match Anthropic's hardcoded list), and rejects it before even trying the actual API call.

This is client-side validation that happens before any network request to your server.

The Actual Fix

After testing various environment variables, I found that CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1 helps suppress some of these validation checks. This is not documented anywhere but it's critical.

This is the line every tutorial misses.

The Complete Working Config (Tested, Not Copied)

Step 1: ~/.claude/settings.json

{
  "model": "sonnet",
  "env": {
    "ANTHROPIC_BASE_URL": "http://127.0.0.1:8000",
    "ANTHROPIC_AUTH_TOKEN": "dummy",
    "ANTHROPIC_DEFAULT_OPUS_MODEL": "Qwen3.5-27B",
    "ANTHROPIC_DEFAULT_SONNET_MODEL": "Qwen3.5-27B",
    "ANTHROPIC_DEFAULT_HAIKU_MODEL": "Qwen3.5-27B",
    "API_TIMEOUT_MS": "3000000",
    "CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC": "1",
    "CLAUDE_CODE_ATTRIBUTION_HEADER": "0"
  }
}

The 4 critical settings (get any one wrong = errors; the first row covers two of them):

| Setting | Why It Matters | What Happens If Wrong |
| --- | --- | --- |
| "model": "sonnet" + ANTHROPIC_DEFAULT_SONNET_MODEL | Use the alias AND map it (both required) | Validation rejects custom names, or Claude doesn't know what "sonnet" means |
| ANTHROPIC_BASE_URL ends at :8000 | Root endpoint, not /v1 | Double /v1/v1/messages = 404 |
| CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC: "1" | Suppresses client-side validation | Intermittent validation failures |

Step 2: vLLM Setup

Assumes vLLM is already running (covered in Part 1). Just ensure:

  • --served-model-name Qwen3.5-27B matches settings.json exactly
  • No / in the model name
  • vLLM is accessible at http://127.0.0.1:8000
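Before launching claude, a quick sanity check that the served name matches – this assumes vLLM's standard OpenAI-compatible /v1/models route:

# The returned IDs must contain exactly "Qwen3.5-27B" (no "Qwen/" prefix)
curl -s http://127.0.0.1:8000/v1/models \
  | python3 -c 'import json,sys; print([m["id"] for m in json.load(sys.stdin)["data"]])'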

Step 3: Test

claude "test"
# ✅ "I'm ready to help! How can I assist you today?"

If this fails, one of the 4 critical lines is wrong. Check them in order.

My Complete Debugging Journey (So You Don't Repeat It)

Attempt 1: vLLM Official Docs

"ANTHROPIC_BASE_URL": "http://127.0.0.1:8000/v1"  // ❌

Error: API Error: 404

Why: Docs don't mention Claude adds /v1/messages automatically. Double /v1 breaks everything.

Attempt 2: GitHub Issue #18025

"model": "Qwen3.5-27B"  // ❌

Error: There's an issue with the selected model

Why: No mention of alias mapping. Claude validates against Anthropic's list.

Attempt 3: Reddit Solutions

--served-model-name Qwen/Qwen3.5-27B  // ❌ Has /

Error: Model not found

Why: Settings had Qwen3.5-27B (no /), mismatch.

Attempt 4: ANTHROPIC_CUSTOM_MODEL_OPTION (Official Docs)

{
  "ANTHROPIC_CUSTOM_MODEL_OPTION": "Qwen3.5-27B",
  "ANTHROPIC_BASE_URL": "http://127.0.0.1:8000"
}

Error: Still got validation errors

Why: The docs say this "skips validation" but it only adds an entry to the /model picker. It doesn't bypass validation when using settings.json.

This is the biggest trap. The docs are misleading.

Attempt 5: Discord Advice

"ANTHROPIC_API_KEY": "dummy"  // ❌

Error: Authentication issues

Why: ANTHROPIC_AUTH_TOKEN works better with vLLM.

Attempt 6: Missing Validation Suppression

// No CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC

Error: Intermittent validation failures (works sometimes, fails others)

Why: Claude still tries to validate custom models against Anthropic's list.

Attempt 7: The Complete Solution

{
  "model": "sonnet",
  "env": {
    "ANTHROPIC_BASE_URL": "http://127.0.0.1:8000",
    "ANTHROPIC_AUTH_TOKEN": "dummy",
    "ANTHROPIC_DEFAULT_SONNET_MODEL": "Qwen3.5-27B",
    "CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC": "1"
  }
}

Result: ✅ Finally works

Why This Works (The Part Tutorials Skip)

Model Alias Mapping

Claude Code uses three model tiers internally:

  • Opus - Complex reasoning
  • Sonnet - Daily coding (default)
  • Haiku - Fast tasks

When you set "model": "sonnet", Claude looks up what "sonnet" means via ANTHROPIC_DEFAULT_SONNET_MODEL. If you set "model": "Qwen3.5-27B" directly, Claude tries to validate it against Anthropic's hardcoded model list and rejects it.

The mapping:

"model": "sonnet"  // ← Claude sees this
"ANTHROPIC_DEFAULT_SONNET_MODEL": "Qwen3.5-27B"  // ← This tells Claude what "sonnet" means

Endpoint Path

Claude constructs URLs as:

{ANTHROPIC_BASE_URL}/v1/messages

Tutorials say ANTHROPIC_BASE_URL=http://127.0.0.1:8000/v1:

Final: http://127.0.0.1:8000/v1/v1/messages  ❌ 404

Correct:

Final: http://127.0.0.1:8000/v1/messages  ✅ Works
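You can verify the constructed path by hand – assuming your vLLM build exposes the Anthropic-style /v1/messages route this setup depends on; the headers follow the standard Anthropic API shape:

# Path check: expect a JSON reply, not a 404 (auth token is the dummy from settings.json)
curl -s -X POST http://127.0.0.1:8000/v1/messages \
  -H 'Content-Type: application/json' \
  -H 'x-api-key: dummy' \
  -H 'anthropic-version: 2023-06-01' \
  -d '{"model":"Qwen3.5-27B","max_tokens":32,"messages":[{"role":"user","content":"test"}]}'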

Validation Suppression (The Missing Piece)

CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1 tells Claude Code to skip certain validation checks and non-essential API calls. This is critical for local models because:

  1. Claude Code makes validation requests to check if models exist
  2. These requests hit Anthropic's model list, not your vLLM server
  3. Custom models fail validation and get rejected
  4. This flag suppresses some of those checks

Without this flag, you'll get intermittent validation errors even with correct alias mapping.

This is not documented anywhere. I found it by testing environment variables after reading the source code.

Common Errors (And Their Causes)

| Error | Cause |
| --- | --- |
| "There's an issue with the selected model" | Custom name used in the "model" field |
| "API Error: 404" | ANTHROPIC_BASE_URL includes /v1 |
| Model not found | --served-model-name contains / |
| Intermittent validation failures | Missing CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC |
| ANTHROPIC_CUSTOM_MODEL_OPTION doesn't work | The docs are wrong |

The Checklist (Use This, Not Tutorials)

Before running claude, verify:

□ "model": "sonnet" (NOT custom name)
□ ANTHROPIC_DEFAULT_SONNET_MODEL set to your model
□ ANTHROPIC_BASE_URL ends at :8000 (NO /v1)
□ CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC: "1" (CRITICAL!)
□ --served-model-name matches settings.json exactly (NO /)
□ vLLM is running and accessible
□ Do NOT use ANTHROPIC_CUSTOM_MODEL_OPTION (it doesn't work)

If all checked and it still fails, paste your settings.json - one of these is wrong.

Key Takeaways

  1. ANTHROPIC_CUSTOM_MODEL_OPTION does NOT work - The docs are wrong. Don't waste time on it.
  2. Use model aliases - "model": "sonnet", not your custom name
  3. Map aliases - ANTHROPIC_DEFAULT_*_MODEL tells Claude what each alias means
  4. Root endpoint - ANTHROPIC_BASE_URL should be :8000, not :8000/v1
  5. Exact model names - --served-model-name must match settings.json exactly (no /)
  6. Suppress validation - CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1 is critical (not documented!)
  7. Read source code - When docs don't work, the source code has the truth
  8. Don't trust tutorials - Most online configs miss 1-2 critical details that break everything

Resources

If you're trying to use Claude Code with local models, skip the tutorials. Use the config above. Especially skip ANTHROPIC_CUSTOM_MODEL_OPTION - it's documented but broken, and it will waste your time.

Happy coding! 🚀

u/Expensive-Register-5 — 25 days ago