u/Away-Sorbet-9740

Anyone else have a dedicated screen for Opencode?

I saw a post recently of another person running 3 agents. The max I'm able to manage is about 4, with 2 of them acting as managers and auditors. Having a second screen for them is pretty handy, though. This is just a little 16" portable monitor on a mini arm.

u/Away-Sorbet-9740 — 6 days ago

If you use Claude heavily, you've felt this: every session starts from zero. You re-explain context, Claude helps, the window closes, and the next session has no idea what you decided yesterday. The standard workaround is a markdown wiki Claude reads — but as the wiki grows, every "what did we decide about X" question burns thousands of tokens grepping and re-reading whole pages.

I spent the last few weeks building a persistent memory layer to fix both problems. It runs entirely on my own machine, integrates via MCP, and lives between Claude and my existing wiki. Sharing the architecture and what I learned in case anyone wants to build their own.

What it does

  • Semantic retrieval over my wiki. Instead of Claude grepping pages, my MCP server returns the most relevant chunks for any query in ~50ms. 82% mean token reduction on a 10-query eval set vs the grep+Read baseline. F1 retrieval quality is also better — cheaper and more accurate.
  • Session crystallization. End-of-session, conversations get compressed into a structured "L4 node" with summary + decisions + open threads, indexed alongside wiki content. Tomorrow I can ask "what did we decide about X" and Claude pulls last session's decision verbatim.
  • Lazy-spawned local models. Embedder + chat model run as subprocesses that the supervisor spawns on first use and reaps after 1 hour idle. Boot cost is zero — nothing loaded until needed.
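The spawn/reap part is simpler than it sounds. Roughly this pattern, as a simplified sketch (class name, command line, and timeout here are placeholders, not my exact supervisor; a background timer would call reap_if_idle periodically):

```python
import subprocess
import threading
import time

class LazyModelProcess:
    """Spawn a model-server subprocess on first use; reap it after an idle timeout."""

    def __init__(self, cmd: list[str], idle_seconds: int = 3600):
        self.cmd = cmd
        self.idle_seconds = idle_seconds
        self.proc: subprocess.Popen | None = None
        self.last_used = 0.0
        self.lock = threading.Lock()

    def ensure_running(self) -> None:
        # Boot cost is paid only on the first call, not at startup.
        with self.lock:
            if self.proc is None or self.proc.poll() is not None:
                self.proc = subprocess.Popen(self.cmd)
            self.last_used = time.monotonic()

    def reap_if_idle(self) -> None:
        # Called periodically; terminates the subprocess after the idle window.
        with self.lock:
            if (self.proc and self.proc.poll() is None
                    and time.monotonic() - self.last_used > self.idle_seconds):
                self.proc.terminate()
                self.proc = None

# Illustrative only: the embedder as a lazily-started llama-server instance.
embedder = LazyModelProcess(
    ["llama-server", "-m", "qwen3-embedding-4b.gguf", "--embedding", "--port", "8081"]
)
```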

The architecture (four layers)

Inspired by Andrej Karpathy's writing on LLM-native wikis, then formalized into a build spec:

  • L0 — append-only event log (SQLite). Every input/output, content-hashed.
  • L1 — structured facts with confidence + decay (deferred to next phase)
  • L2/L3 — derived prose + cross-cutting summaries (the hand-edited wiki plays this role for now)
  • L4 — crystallized session nodes. Summary, decisions, open threads. Indexed in the same vector store as wiki chunks so retrieval finds both naturally.
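To make "crystallized session node" concrete, the shape is roughly this (field names are illustrative, not the exact schema):

```python
from dataclasses import dataclass, field

@dataclass
class L4Node:
    """One crystallized session. Chunked and embedded into the same vector store
    as wiki pages, so "what did we decide about X" retrieves it alongside wiki chunks."""
    session_id: str
    summary: str                                            # short prose recap of the session
    decisions: list[str] = field(default_factory=list)      # "we decided X because Y"
    open_threads: list[str] = field(default_factory=list)   # unresolved questions for next time
    created_at: str = ""                                     # ISO timestamp
```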

The stack

  • Qdrant in Docker for vector search
  • llama.cpp running Qwen3-Embedding-4B (GPU) and Qwen3.5-2B-Q4_K_M (CPU)
  • FastMCP server exposing 7 tools (retrieve, crystallize_session, list_sessions, get_l4_node, index_status, reindex, shutdown_models)
  • Cowork plugin for Claude Desktop integration; also works with Claude Code via standard MCP config

No cloud, no API keys, $0 marginal cost per query.
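If you want to build something similar, the MCP side is small. A stripped-down sketch of what the retrieve tool looks like (simplified; ports, collection name, and payload fields are placeholders, not my exact code):

```python
import httpx
from fastmcp import FastMCP
from qdrant_client import QdrantClient

mcp = FastMCP("local-knowledge-store")
qdrant = QdrantClient(url="http://localhost:6333")
http = httpx.Client()  # persistent connection -- see the latency story below

@mcp.tool()
def retrieve(query: str, top_k: int = 5) -> list[dict]:
    """Return the most relevant wiki/L4 chunks for a query."""
    # Embed the query via the local llama-server embedding endpoint (OpenAI-compatible).
    resp = http.post(
        "http://127.0.0.1:8081/v1/embeddings",
        json={"input": query, "model": "qwen3-embedding-4b"},
    )
    vector = resp.json()["data"][0]["embedding"]
    # One collection holds both wiki chunks and L4 nodes, so a single search finds both.
    hits = qdrant.search(collection_name="wiki", query_vector=vector, limit=top_k)
    return [{"score": h.score, "text": h.payload.get("text", "")} for h in hits]

if __name__ == "__main__":
    mcp.run()
```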

Numbers

  • Token reduction: 82.7% mean, 86.2% median vs grep+Read baseline
  • Retrieval F1: 0.50 vs 0.20 baseline
  • Embed cold-start: ~4s. Hot-path p95: 39ms (was 2241ms before fixing one specific bug — see below)
  • L4 session retrieval eval: 0.920 mean score (gate 0.6)
  • 738 chunks currently indexed across 104 markdown files

The most useful thing I learned

Hot-path retrieve was inexplicably stuck at 2241ms p95 even though the embedding model was fully GPU-resident on a 4070 Ti Super. Spent hours blaming GPU offload, prompt cache, KV pre-allocation. The actual cause: every httpx.post() was opening a fresh TCP connection, and Windows localhost handshakes take ~2 seconds. A 5-line change — switching to a persistent httpx.Client with keep-alive — dropped p95 to 39ms. 57× speedup.
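The change itself, roughly (illustrative, not the exact diff):

```python
import httpx

# Before: every call paid a fresh TCP connect, ~2s on Windows localhost.
def embed_slow(text: str) -> list[float]:
    r = httpx.post("http://127.0.0.1:8081/v1/embeddings", json={"input": text})
    return r.json()["data"][0]["embedding"]

# After: one long-lived client with keep-alive, so the handshake is paid once.
client = httpx.Client(base_url="http://127.0.0.1:8081", timeout=30.0)

def embed_fast(text: str) -> list[float]:
    r = client.post("/v1/embeddings", json={"input": text})
    return r.json()["data"][0]["embedding"]
```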

Lesson: latency that's suspiciously consistent (2240, 2237, 2241, 2227, 2239 ms) is a fixed cost, not a compute cost. If your local-MCP integration feels slow on Windows, check connection reuse before you blame the model.

A few other things that surprised me

  • Qwen3 thinking mode silently consumes the generation budget. Crystallization was returning empty content. Logs showed exactly 2000 tokens generated (the cap). It turned out Qwen3 emits <think>...</think> blocks that the chat handler strips before populating message.content. With JSON grammar enforced, the model spent all 2000 tokens "thinking" and never emitted JSON. Fix: pass chat_template_kwargs: {enable_thinking: false} via extra_body (requires --jinja on llama-server); see the sketch after this list.
  • The MCP plugin needed to register against the right config file. Cowork (Claude Desktop's agentic mode) doesn't read ~/.claude.json like Claude Code does. The first attempt at MCP registration silently went to the wrong file. The fix was packaging the LKS service as a proper Cowork plugin (.plugin bundle) — Cowork has a plugin system distinct from raw MCP server registration. If you're trying to wire a custom MCP server into Cowork, this is the path.
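For the thinking-mode fix in the first bullet, the request ends up looking roughly like this against llama-server's OpenAI-compatible endpoint (port and model name are placeholders; with the openai SDK the same kwarg goes under extra_body):

```python
import httpx

client = httpx.Client(base_url="http://127.0.0.1:8082")

resp = client.post("/v1/chat/completions", json={
    "model": "qwen3-chat",
    "messages": [{"role": "user", "content": "Crystallize this session as JSON ..."}],
    "max_tokens": 2000,
    # Requires llama-server started with --jinja; otherwise the template kwarg is ignored
    # and the model can burn the whole budget inside <think>...</think>.
    "chat_template_kwargs": {"enable_thinking": False},
})
print(resp.json()["choices"][0]["message"]["content"])
```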

What it doesn't do (yet)

  • No automatic conversation capture — L0 ingestion is manual or via end-of-session crystallization
  • No L1 fact extraction yet (next phase) — retrieval is over markdown chunks + L4 nodes today
  • Wiki is still source-of-truth; no automatic conflict resolution
  • Solo deployment only; no federation or multi-user
  • Tested on Windows; Linux/Mac would need a small tweak to the supervisor (it uses subprocess.CREATE_NEW_PROCESS_GROUP for clean Windows termination)
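For anyone porting it, the tweak would be roughly a platform guard like this (a sketch only, not tested on Linux/Mac):

```python
import subprocess
import sys

def spawn_model_server(cmd: list[str]) -> subprocess.Popen:
    """Spawn llama-server so the whole process group can be terminated cleanly."""
    if sys.platform == "win32":
        # Windows: put the child in its own process group for clean termination.
        return subprocess.Popen(cmd, creationflags=subprocess.CREATE_NEW_PROCESS_GROUP)
    # POSIX: start a new session so the group can be signalled via os.killpg if needed.
    return subprocess.Popen(cmd, start_new_session=True)
```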

Full write-up

Architecture, phased build narrative, all five lessons-learned bug stories, the setup walkthrough, and the roadmap: https://gist.github.com/tyoung515-svg/5fd5279f46d935f517cda89146c94685

Happy to answer questions on any piece — the MCP integration, the runtime supervisor, the eval harness, the crystallization atomicity contract, whatever's interesting.

u/Away-Sorbet-9740 — 7 days ago

Just shy of 170M tokens, $.78 total spent

I made a prior post at 100M tokens about how pleased I've been with flash v4 performance. Checking in at 170M, and it's still going great lol. These are sustained sessions working in the same codebase, so lots of cache hits and a lot of input (detailed task lists too). Still working on finding a spot for Pro v4, largely because flash is so good.

u/Away-Sorbet-9740 — 8 days ago

Thought I would share my experience over the last few days using the new flash v4 in Opencode as a scoped task worker.

My basic workflow is ideation with Claude, which then gets turned into a spec. I fire up a new instance of Claude Opus to be the project manager and decompose the spec into scoped task lists that get handed to DS flash v4 instances of Opencode. Worker reports are fed back to Opus, with some checkpoints for deeper audits using Google Gemini.

I brought flash in at phase 4 and ran it through the build-out of this 9-phase project. I burned roughly 52M credits with 2 instances doing the work over two days.

Very few errors: we're talking 2-3 over 5 phases, and the workers surfaced them themselves. They also caught around a dozen minor bugs, fixed them perfectly on their own, and documented the why.

Overall, flash has earned its spot as my main worker for my coding and automation projects. I haven't tested it outside this role, but I use multiple model providers to keep the audits adversarial to a degree. DS Pro V4 may do the job well too, but I saved around $600 on this project at zero hit to quality, and that's plenty for me.

10/10 recommend. I used a DS API key since OpenRouter had constant rate limit issues.

u/Away-Sorbet-9740 — 10 days ago