u/CodeGrizzly0214

Qwen3.6-27B-int4-AutoRound with OpenCode has been a game changer

Last year, I built an AI rig. Glad I did it then; I wouldn't be able to afford the parts at this year's prices.

I recently switched from Ollama to llama-swap in my Docker stack, which opened up many more models and allowed much finer tuning of how each one is launched.
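If anyone's on the fence about the switch: the whole thing is one YAML file mapping model names to launch commands, and llama-swap starts and stops backends on demand behind a single OpenAI-compatible endpoint. A minimal sketch of that (the model files, ports, and paths here are made up for illustration):

models:
  "qwen2.5-coder-7b":
    cmd: |
      llama-server --port 9101
        -m /mnt/models/gguf/qwen2.5-coder-7b-q4_k_m.gguf
        -c 32768
    proxy: "http://127.0.0.1:9101"
    ttl: 300   # unload after 5 idle minutes
  "llama3.1-8b":
    cmd: |
      llama-server --port 9102
        -m /mnt/models/gguf/llama3.1-8b-q4_k_m.gguf
    proxy: "http://127.0.0.1:9102"

Any request to /v1/chat/completions whose "model" field matches an entry swaps that backend in, so one box can serve a whole catalog without keeping everything resident.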

I experimented with several models and configurations for local coding and have settled on OpenCode with Oh-My-OpenAgent. I set up llama-swap to load Lorbus/Qwen3.6-27B-int4-AutoRound on a pair of 3090s joined with NVLink, and OpenCode and Oh-My-OpenAgent point at that config for most things. It has been amazing: I'm getting about 80 tps and can maintain a 262K context, which is great for long coding sessions.
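Pointing OpenCode at it is just a custom OpenAI-compatible provider in opencode.json. Roughly what mine looks like; treat this as a sketch, since the provider id and key names follow my reading of the OpenCode docs and you should double-check the current schema:

{
  "provider": {
    "llama-swap": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "llama-swap (local)",
      "options": {
        "baseURL": "http://llama-swap:8080/v1"
      },
      "models": {
        "qwen3.6-27b-vllm-262k": {
          "name": "Qwen 3.6 27B INT4 (262K)"
        }
      }
    }
  }
}

The model id has to match the entry name in the llama-swap config (and the --served-model-name below) so the proxy knows which backend to spin up.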

Anyway, I thought I'd share the llama-swap configuration and see what suggestions the hive mind might have.

"qwen3.6-27b-vllm-262k":
    name: "Qwen 3.6 27B INT4 AutoRound (vLLM — NVLink Pair — 262K ctx)"
    description: "Dual-3090 recipe: MTP n=3 + fp8 KV + 262K ctx + vision + tools. ~71/89 TPS"
    checkEndpoint: /v1/models
    ttl: 0
    cmdStop: docker stop vllm-qwen36-27b-262k || true
    cmd: |
      docker run --rm --init
        --name vllm-qwen36-27b-262k
        --runtime=nvidia
        --gpus '"device=1,2"'
        --network ${docker-net}
        --shm-size=16g
        --ipc=host
        -e NCCL_P2P_DISABLE=0
        -e NCCL_P2P_LEVEL=NVL
        -e NCCL_CUMEM_ENABLE=0
        -v /mnt/models/huggingface:/root/.cache/huggingface
        -v /mnt/models/vllm-cache:/root/.cache/vllm
        -v /opt/ai/vllm-src:/opt/vllm-src:ro
        vllm/vllm-openai:latest
        --model "Lorbus/Qwen3.6-27B-int4-AutoRound"
        --served-model-name "qwen3.6-27b-vllm-262k"
        --quantization auto_round
        --dtype float16
        --tensor-parallel-size 2
        --gpu-memory-utilization 0.85
        --max-model-len 262144
        --max-num-seqs 4
        --max-num-batched-tokens 4128
        --kv-cache-dtype fp8_e5m2
        --enable-chunked-prefill
        --enable-prefix-caching
        --speculative-config '{"method":"mtp","num_speculative_tokens":3}'
        --enable-auto-tool-choice
        --tool-call-parser qwen3_coder
        --trust-remote-code
        --default-chat-template-kwargs '{"enable_thinking": false}'
    proxy: "http://vllm-qwen36-27b-262k:8000"
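
To sanity-check the chain before wiring up OpenCode, plain curl against the llama-swap endpoint works. The first chat request cold-starts the vLLM container, so it blocks until the model finishes loading (the port assumes llama-swap's default :8080 listen address; adjust for your setup):

# confirm llama-swap sees the model entry
curl -s http://localhost:8080/v1/models | jq '.data[].id'

# first request triggers the docker run above, so give it time
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "qwen3.6-27b-vllm-262k",
        "messages": [{"role": "user", "content": "Reverse a string in Python, one line."}],
        "max_tokens": 128
      }' | jq -r '.choices[0].message.content'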