u/LaughterOnWater

I spent a while getting this dialed in and wrote up the full recipe. Short version:

  • The 35B MoE model at TQ3_4S takes 12.4GB for weights
  • KV cache at q8_0/q8_0 with 262K context uses only 2.7GB, because this MoE model has only 10 attention layers out of 40
  • Total VRAM: ~16GB, leaving ~7GB headroom on a 3090
  • ~111 tok/s generation

The thing that surprised me most was how little the KV cache costs at full context. I kept expecting it to OOM and it just... didn't. The math on MoE attention layers makes a real difference at this context length.
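To make that arithmetic concrete, here is a rough back-of-envelope sketch. The KV head count and head dimension below are assumptions picked for illustration, not values read from the model's config; only the 10-of-40 attention layers, the 262K context, and the q8_0 cache type come from the numbers above.

```python
# Rough KV cache size estimate.
# kv_heads=2 and head_dim=256 are ASSUMED for illustration; check the
# model's own metadata for the real values.

# A q8_0 block stores 32 int8 values plus one fp16 scale: 34 bytes per 32 elements.
Q8_0_BYTES_PER_ELEMENT = 34 / 32


def kv_cache_bytes(attn_layers, kv_heads, head_dim, context_tokens,
                   bytes_per_element=Q8_0_BYTES_PER_ELEMENT):
    """Bytes needed to hold K and V for every attention layer at full context."""
    per_token_per_layer = 2 * kv_heads * head_dim * bytes_per_element  # K plus V
    return attn_layers * context_tokens * per_token_per_layer


GIB = 1024 ** 3
ctx = 262_144  # "262K" context

ten_layers = kv_cache_bytes(attn_layers=10, kv_heads=2, head_dim=256, context_tokens=ctx)
forty_layers = kv_cache_bytes(attn_layers=40, kv_heads=2, head_dim=256, context_tokens=ctx)

print(f"10 attention layers: {ten_layers / GIB:.2f} GiB")
print(f"40 attention layers: {forty_layers / GIB:.2f} GiB")
```

With those assumed shapes the 10-layer case lands close to the 2.7GB figure, and caching all 40 layers would cost four times as much. The cache scales linearly with the number of attention layers, which is the whole effect.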

The guide covers building llama.cpp-tq3 from source, tuning the KV cache, and wiring it up to OpenCode running in WSL. There's a WSL networking gotcha that cost me an afternoon: host.docker.internal doesn't work from inside WSL. To reach the server on the Windows side, you need the nameserver IP from /etc/resolv.conf.
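A minimal sketch of that lookup, in case it saves someone the afternoon. The helper names are made up for this example, and the port (8080) and `/v1` path are assumptions; use whatever your llama-server was actually started with.

```python
# Pull the Windows host IP out of resolv.conf-style text and build a base URL.
# Function names, port 8080 and the "/v1" suffix are illustrative assumptions.


def windows_host_ip(resolv_conf_text):
    """Return the first nameserver address found, or None if there is none."""
    for line in resolv_conf_text.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "nameserver":
            return parts[1]
    return None


def llama_base_url(resolv_conf_text, port=8080):
    """Build a base URL pointing at the Windows host from inside WSL."""
    ip = windows_host_ip(resolv_conf_text)
    if ip is None:
        raise ValueError("no nameserver entry found")
    return f"http://{ip}:{port}/v1"


# Sample text standing in for the real file; inside WSL you would pass
# open("/etc/resolv.conf").read() instead.
sample = "# generated by WSL\nnameserver 172.22.0.1\n"
print(llama_base_url(sample))
```

The IP can change between WSL restarts, so it is probably safer to look it up each time than to hard-code it in the config.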

u/LaughterOnWater — 11 days ago

https://preview.redd.it/1j1pw3dd6syg1.png?width=657&format=png&auto=webp&s=0980ca403ccb4937c629ada48fec415e6ace3981

I am trying to get a live tokens per second display running on the edge of my OpenCode TUI, but every plugin and tool I have tried has failed. I am using WSL2 on Windows 10 with OpenCode version 1.14.32. My LLM backend is llama-server.exe running natively on Windows, and my OpenCode config points to a local Qwen model. That all works fine. I just launch OpenCode in the WSL terminal with the plain opencode command.

I have tried four different tools:

  • guard22 opencode tps meter (pictured above), a TUI patcher. It failed because the auto patcher cannot patch version 1.14.31 or 1.14.32.
  • lemantorus opencode analytics, a terminal dashboard. It fails with a message saying it failed to load data because it cannot find the OpenCode server.
  • tokentop ttop, an npm package. It fails with a missing module error for lifecycle.ts.
  • opencode token tracker, which I added as a plugin to my opencode.json file. It does nothing. No notifications at all.

The TUI clearly has a running server, since I can still edit code through it. What I am looking for is a simple terminal-based display for tokens per second. It needs to work with OpenCode 1.14.x on WSL2. It can be a plugin, a sidecar tool, or even a built-in command that I have missed.
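To illustrate the sidecar idea, here is a rough sketch that reads throughput from the backend instead of from OpenCode. It assumes the server exposes a Prometheus-style `/metrics` endpoint (upstream llama.cpp does when started with `--metrics`; the tq3 fork may differ) and that the metric is named `llamacpp:predicted_tokens_seconds`. Both are assumptions to check against the actual server output. Note this would be the server's averaged figure, not a live per-response number.

```python
import urllib.request

# ASSUMED metric name; confirm against your server's /metrics output.
TPS_METRIC = "llamacpp:predicted_tokens_seconds"


def read_metric(metrics_text, name=TPS_METRIC):
    """Return a metric's value from Prometheus exposition text, or None.

    Simplified parser: skips comment lines and does not handle label
    values that contain spaces.
    """
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split()
        if len(parts) < 2:
            continue
        metric_name = parts[0].split("{", 1)[0]  # drop any {labels}
        if metric_name == name:
            try:
                return float(parts[1])
            except ValueError:
                return None
    return None


def fetch_tps(base_url):
    """Fetch /metrics from the server and return the tok/s value, or None."""
    with urllib.request.urlopen(f"{base_url}/metrics", timeout=2) as resp:
        return read_metric(resp.read().decode("utf-8"))


# Offline example using made-up sample text in the expected format.
sample = (
    "# HELP llamacpp:predicted_tokens_seconds Average generation throughput in tokens/s.\n"
    "# TYPE llamacpp:predicted_tokens_seconds gauge\n"
    "llamacpp:predicted_tokens_seconds 111.0\n"
)
print(read_metric(sample))
```

Calling `fetch_tps` in a loop from a second terminal pane would give a crude tok/s readout without touching the TUI at all.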

Does anyone have suggestions for a working TPS (tok/s) display for OpenCode on WSL2? Come to think of it, shouldn't OpenCode have this in core, as something that could be enabled or hidden in opencode.json?

Thanks in advance.

u/LaughterOnWater — 12 days ago