u/Exact_Law_6489

AI Basics Day 8: What do BYOK and "running your own model" actually mean, and why is half the community migrating that way?

Hello everyone!

Last time we looked at reasoning models: how RLVR training lets models develop their own internal "thinking" style, why DeepSeek R1 was a landmark open release, why closed labs started hiding their chains of thought, and why writing "think step by step" into a system prompt actively hurts a modern reasoning model. The short version: the model already knows how to reason, you cannot improve on its learned policy with hand-written instructions, and your job is to give it good targets, not to choreograph how it thinks.

For anyone who missed the earlier days:

Today we are starting the BYOK and local LLM arc: what it actually means to bring your own key or run your own model, the three tiers people pick between, and why so much of this community has been quietly migrating away from app-bundled subscriptions over the last year.

This is the orientation post. The deeper technical pieces (VRAM math, quantisation, GGUF, llama.cpp vs ollama vs LM Studio vs vLLM, what hardware runs what) we will cover over the next several days, because there is way too much to fit in one post.

The three tiers, in plain terms

Almost every way you can run a chat or RP model today falls into one of three tiers. The names vary, the lines blur a little at the edges, but the shape is consistent.

Tier 1: App-bundled. You sign up for an app (Character.ai, JanitorAI, Chub, etc.), pay a subscription or use the free version, and chat. The app picks the model, hosts the inference, writes the system prompt, sets the content rules, and bills you. You do not see the model, you do not pick the model, you do not control the prompt below your character card.

Tier 2: BYOK (Bring Your Own Key). You sign up for an API provider (OpenAI, Anthropic, DeepSeek, Google, OpenRouter, Mistral, etc.) and get an API key. You paste that key into the chat app of your choice (SillyTavern, Chub's Mercury, KoboldAI Lite, JanitorAI's BYOK mode, etc.). The app still handles the UI and the prompt building, but the actual model calls go through your key, billed to your account, on whatever provider you chose. The model is still running on someone else's servers, you are just paying for it directly.

Tier 3: Local. The model weights live on your machine. You download them, you load them into a runtime (LM Studio, ollama, koboldcpp, llama.cpp, text-generation-webui, etc.), and inference happens on your own CPU/GPU. No network call to anyone. No per-token bill. The only ongoing cost is electricity, and the upfront cost is whatever hardware you needed to fit the model.

Most people start at Tier 1, get frustrated, move to Tier 2, and a meaningful fraction eventually drift toward Tier 3 once they have the hardware or the patience for it.

What "BYOK" actually changes

The word "BYOK" gets thrown around a lot, but the practical changes when you move from a bundled app to BYOK are pretty specific.

Billing flips from flat to per-token. Instead of $X/month for unlimited (or rate-limited) chat, you pay per million tokens, in and out. For most RP usage this is cheaper if you talk casually and more expensive if you run 32k-context sessions all day on a premium model. DeepSeek V3 and the cheaper Gemini tiers are pennies per long session. GPT-5 or Claude 4.7 on long contexts adds up faster.
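
To make "pennies per long session" concrete, here is a back-of-the-envelope sketch. The prices are placeholders (USD per million tokens), not any provider's actual rate card; plug in the current numbers for whatever model you use.

```python
# Rough per-session API cost. Prices are illustrative placeholders
# (USD per million tokens), not real rates for any provider.
def request_cost(prompt_tokens: int, completion_tokens: int,
                 in_price: float, out_price: float) -> float:
    """Cost of one request, given per-million-token prices."""
    return (prompt_tokens / 1e6) * in_price + (completion_tokens / 1e6) * out_price

# A long RP session resends the growing context every turn, so the
# input side dominates: say 40 turns averaging 8k tokens in, 300 out.
total = sum(request_cost(8_000, 300, in_price=0.30, out_price=1.20)
            for _ in range(40))
print(f"~${total:.2f} for the whole session")  # ~$0.11 at these rates
```

Run the same numbers with a premium model's prices and you will see where "adds up faster" comes from.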

You pick the model. This is the big one. Bundled apps usually offer one or two models, sometimes a "premium" toggle. With BYOK you can swap between DeepSeek V3.2, Claude 4.7, GPT-5, Gemini 3, Kimi K2.5, GLM-4.7, Mistral Large, and dozens of fine-tunes hosted on OpenRouter (or other platforms), all from the same chat UI. Different models suit different scenes, and being able to switch mid-chat is genuinely useful.

Content moderation depends on the provider, not the app. This is the most misunderstood part. The chat app's ToS no longer governs what you can write, because the app is not hosting the inference. Whatever rules the API provider has are what apply. OpenAI and Anthropic have strict usage policies and will rate-limit or ban accounts for sustained policy violations. DeepSeek, Mistral, and most open-weights-via-API providers are far more permissive in practice. OpenRouter sits in the middle and depends on which underlying model you route to. None of this means "no rules at all." It means the rules move from "the app I signed up with" to "the lab whose model I'm using."

One thing worth noting: some BYOK frontends add their own filter on top, and you cannot turn it off. A handful of hosted "BYOK" wrappers screen prompts on the way out even though you are paying the provider directly. If you specifically went BYOK to escape app-level filtering, check that the app actually passes your prompt through cleanly before assuming it does. The self-hosted frontends (SillyTavern, KoboldAI Lite) do not add filtering of their own; some browser-based ones do.

Privacy shifts, but does not vanish. Your prompts and replies are now visible to the API provider you chose, not the app you typed them into. For most providers, business-tier API traffic is not used for training and is retained only briefly, but each provider has different defaults, and "the API provider can see your chats" is still true. BYOK is not the same as local.

Also worth being explicit about: some providers will train on your prompts even when you are paying for the API, unless you opt out (and sometimes there is no opt-out on the cheaper tiers). The cheaper Chinese provider tiers and Google's free Gemini tier are the well-known examples; OpenAI and Anthropic's standard API tiers do not train on your traffic by default, but their consumer chat products do. Read the privacy policy of the specific tier you signed up for. "I paid for it, so they can't use it" is not a safe assumption.

Rate limits and reliability change. Bundled apps queue you behind everyone else on their plan. BYOK gives you your own quota with the provider, which is usually much higher and much more reliable for heavy users. The flip side is that if the provider has an outage, your chat is down until they fix it.
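
If you ever do bump into your own quota, most OpenAI-compatible APIs signal it with an HTTP 429, and a little client-side backoff goes a long way. A generic sketch; the endpoint, headers, and sensible retry limits are whatever your provider documents:

```python
import random
import time

import requests  # pip install requests


def post_with_backoff(url: str, headers: dict, payload: dict,
                      max_retries: int = 5) -> requests.Response:
    """Retry on 429/5xx with exponential backoff plus jitter."""
    for attempt in range(max_retries):
        resp = requests.post(url, headers=headers, json=payload, timeout=120)
        if resp.status_code not in (429, 500, 502, 503):
            return resp
        time.sleep(min(2 ** attempt + random.random(), 30))
    return resp  # give up and hand back the last error response
```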

You become the integrator. When something breaks, the chat app blames the API and the API blames the app. There is no single support line. For most people this is a minor annoyance. For some it is the reason they bounce back to a bundled app.

What "local" actually changes

Tier 3 is a bigger jump than Tier 2, and the tradeoffs are different in kind, not just in degree.

Per-token cost goes to zero. Once the model is on your disk, you can generate as many tokens as you want for the price of electricity. No quota, no surprise bill, no provider rate limit. For heavy users this is the single biggest reason to go local.

Privacy goes to actually-private. Not "the provider promises not to train on it." The prompt never leaves your machine. For RP content specifically, this matters to a lot of people. It is also why a meaningful chunk of this subreddit ended up local: the content policy debate stops being relevant when the content does not leave the room.

No content rules, period. This is the part a lot of people care about most and the part nobody else can offer. There is no ToS, no provider filter, no app-level moderation. You can run uncensored fine-tunes, abliterated models (day 5 callback: the safety layer surgically removed from the weights), or community RP-tuned models that would get your account flagged on a hosted API. For adult RP specifically, this is the only tier where the question "is this allowed?" simply does not exist. You picked the model, you run the model, you decide what it does. The flip side is that there is also nobody stopping you from generating things you would regret, so the responsibility is fully yours; but for consenting-adult RP between you and your own machine, local is the only place where the answer is unconditionally yes.

Hardware becomes the bottleneck. This is the catch. Modern strong models are large. Frontier-quality models (DeepSeek V3.2, Kimi K2.5, GLM-4.7) are hundreds of billions of parameters and you are not running those on a gaming PC. What you can run locally is the 7B–70B range, with the 12B–32B sweet spot being where most local users live. These are not frontier models, but they are surprisingly capable, especially for RP, and they have closed a lot of the gap over the last year.
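
As a preview of the VRAM-math post: a common rule of thumb is that a ~4.5-bit GGUF quant needs about 0.6 GB per billion parameters, plus headroom for the KV cache and runtime buffers. A rough sketch; the constants are approximations, not exact figures:

```python
def vram_estimate_gb(params_b: float, bits_per_weight: float = 4.5,
                     overhead_gb: float = 1.5) -> float:
    """Very rough VRAM needed to load a quantised model.
    ~4.5 bits/weight matches a Q4_K_M-style GGUF quant; overhead_gb is
    a guess for KV cache and buffers at modest context lengths."""
    return params_b * bits_per_weight / 8 + overhead_gb

for size_b in (7, 12, 24, 32, 70):
    print(f"{size_b}B @ ~Q4: about {vram_estimate_gb(size_b):.0f} GB")
# 7B ~5 GB, 12B ~8 GB, 24B ~15 GB, 32B ~20 GB, 70B ~41 GB
```

Which is exactly why the 12B–32B sweet spot maps onto the 12–24GB consumer GPUs most local users actually own.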

Setup is a real one-time cost. Picking a runtime, picking a model, picking a quantisation, getting it to load without OOMing, configuring context length, plugging it into SillyTavern or your UI of choice: this is an evening of work the first time, and ten minutes every time after that. Not hard, but not zero either. We will spend most of next week walking through this part.
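
For a taste of what "plugging it in" looks like: most local runtimes (ollama, LM Studio, llama.cpp's llama-server, koboldcpp) expose an OpenAI-compatible endpoint, which is exactly what SillyTavern and friends talk to. A minimal sketch, assuming ollama's default port and a model you have already pulled:

```python
from openai import OpenAI  # pip install openai; works with any compatible server

# ollama's default OpenAI-compatible endpoint; LM Studio defaults to :1234/v1.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")

reply = client.chat.completions.create(
    model="mistral-small",  # whatever model name your runtime has loaded
    messages=[{"role": "user", "content": "Stay in character: a weary innkeeper."}],
)
print(reply.choices[0].message.content)
```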

Quality ceiling is lower, but not by as much as people think. A 32B local model is not Claude Opus 4.7. It is also not the 2023 disaster people remember. Modern local models in the 24B–70B range, especially the newer Qwen3.5- and Gemma 4-based fine-tunes and the Mistral Small 3 family, are good enough that a lot of users genuinely prefer them for RP over premium APIs. The difference is biggest on long, complex reasoning tasks and smallest on character-driven dialogue.

The middle road: OpenRouter

Worth its own short section, because OpenRouter is the on-ramp a lot of people use between Tier 1 and full BYOK.

OpenRouter is a single API that proxies to dozens of underlying providers. You get one API key, one bill, and access to OpenAI, Anthropic, Google, DeepSeek, Mistral, xAI, Cohere, plus a long tail of open-weights models hosted on Together, Fireworks, DeepInfra, Novita, and others. Models you would otherwise need five separate accounts to use are all one dropdown away.
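
Mechanically, OpenRouter speaks the OpenAI-compatible API, so "one dropdown away" is literally a one-string change in code. A sketch; the model IDs are illustrative, so check OpenRouter's catalogue for current ones:

```python
from openai import OpenAI

client = OpenAI(base_url="https://openrouter.ai/api/v1",
                api_key="sk-or-...")  # your OpenRouter key

for model in ("deepseek/deepseek-chat", "mistralai/mistral-large"):
    # Same key, same request shape; only the model string changes.
    reply = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "One-line scene opener: tavern at dusk."}],
    )
    print(model, "->", reply.choices[0].message.content)
```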

Why people start there:

  • One key, many models. You can switch from Claude 4.7 to DeepSeek V3.2 to a Qwen3 fine-tune without re-pasting API keys.
  • Pay-as-you-go without provider minimums. Some providers (Anthropic in particular) have business-account friction. OpenRouter is a credit-card top-up away.
  • Free tier on a rotating selection of models. OpenRouter has historically offered a free tier for some open-weights models with daily rate limits. Quality and availability fluctuate, but for "try BYOK before committing" this is the cheapest possible entry point.
  • Decent moderation posture. OpenRouter itself is fairly hands-off; the underlying provider's rules still apply, but OpenRouter is not adding its own filter on top of most models.

Why people eventually leave OpenRouter for direct provider keys:

  • Small markup. OpenRouter takes a cut. For heavy users, going direct to the provider is a few percent cheaper.
  • Latency. Routing through a proxy adds a small amount of overhead. Usually negligible, occasionally not.
  • Provider-specific features. Some provider features (Anthropic's prompt caching, OpenAI's reasoning effort dial, DeepSeek's prefix caching) work better or only work when you call the provider directly.

For most people in this community, OpenRouter is the right starting point for BYOK, and the right place to stay unless you are heavy enough on one specific model to justify a direct account.

The tradeoff at a glance

Rough shape of the three tiers, to give you something to anchor on:

  • Bundled app. Cheapest to start, easiest to use, locked model selection, app-controlled content rules, your chats live on the app's servers. Good for "I just want to chat, I don't want to think about any of this." Bad if you care about model quality, privacy, or content freedom.
  • BYOK. Moderate cost (scales with use), one-time setup, full model selection, provider-controlled content rules, your chats live on the provider's servers. Good for "I want frontier-model quality without the bundled app's restrictions." The current sweet spot for most active users in this community.
  • Local. Highest upfront cost (hardware), free per token, full model selection within hardware limits, no content rules at all, your chats never leave your machine. Good for "I run RP often enough that hardware pays for itself, and I care about privacy." Bad if your hardware is weak or you specifically need frontier-tier quality.

Cost shape, simplified:

  • Bundled: $X/month, flat.
  • BYOK: pennies per casual session, dollars per heavy session, scales linearly with use.
  • Local: hardware capex up front ($300 to $3000+ depending on what you want to run), then near-zero ongoing.

If you only chat a few hours a week, the bundled app is usually still the cheapest option in absolute terms, and there is nothing wrong with staying there. If you chat heavily, BYOK becomes cheaper than the bundled options surprisingly quickly. If you chat very heavily, local pays for the hardware in months.
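
The payback arithmetic is one line if you want to sanity-check your own situation (both numbers below are placeholders; use your real bill and a real hardware quote):

```python
gpu_cost = 600.0          # e.g. a used 16GB card
monthly_api_spend = 60.0  # a heavy BYOK month
print(f"payback: {gpu_cost / monthly_api_spend:.1f} months")  # 10.0
```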

Why the community is migrating

Three things stacked on top of each other over the past year.

Bundled apps tightened their content rules. Character.ai went through a well-documented filter tightening in 2025 and never really walked it back. Janitor, Chub, c.ai variants, and most of the bundled players have followed similar patterns: stricter moderation, more refusals, more silent quality degradation on the cheap tiers, more "premium" features behind higher subs. For RP specifically, this hit hard, because RP often touches content that mainstream chat apps want to keep at arm's length.

Open-weights models got genuinely good. This is the day-7 callback. DeepSeek R1 in January 2025 was the moment open-weights stopped being "a worse free version of GPT" and became "a real option." Since then, DeepSeek V3.2, Qwen3.5, GLM-4.7, Kimi K2.5, and the Mistral Large/Small 3 line have closed enough of the gap that for a lot of use cases (RP very much included) the open option is the better one, not just the cheaper one. BYOK gave people direct access to those models. Local gave people unlimited access.

Per-token economics started favouring BYOK. DeepSeek V3 and Gemini Flash dropped per-token prices to the point where a heavy RP user can run thousands of messages a month for less than the cost of a premium chat-app subscription. Once the economics flipped, the only reason to stay bundled was convenience, and convenience is a one-evening setup problem.

Stack those three together and the migration looks less like a trend and more like a market correction. Bundled apps were the only option in 2023. They are now one option of three, and for power users, often the least attractive of the three.

Where this leaves you

If you are reading this post, you are probably already somewhere in the middle of this migration. Most likely cases:

  • Still on a bundled app, curious about BYOK. Sign up for an OpenRouter account, drop $10 in credit, paste the key into SillyTavern or your UI of choice, and try DeepSeek V3.2 or GLM 5 for a week. The "is BYOK worth it" question answers itself in about three sessions.
  • Already on BYOK, wondering if local is worth it. Depends entirely on hardware. If you have a 12GB+ GPU, the answer is "probably yes, at least to try." If you have an 8GB GPU, the answer is "for small models, maybe." If you have a CPU and 16GB of RAM, the answer is "for tiny models with low expectations, sure." We will cover this properly in the VRAM math post.
  • Already local, looking for the next tier. Bigger models, better quantisations, faster runtimes, or multi-GPU setups. The deep-dive posts coming up are for you.

Tomorrow (day 9) we will get into the practical side of local LLMs: what actually runs where, the runtimes people use (llama.cpp, ollama, LM Studio, koboldcpp, vLLM, and a few less common ones), the differences between them, and how to pick. After that we will tackle VRAM math, quantisation and GGUF, sampling settings for local models, and finally the hardware question (which GPU, how much RAM, when does CPU inference make sense).

That's all for today. I hope this helps!
