u/Anbeeld

I made a ruleset to turn ChatGPT, Claude, Gemini into a CV writer that interviews you

My friends and I hate writing CVs. You open a doc, stare at it, list responsibilities instead of achievements, and it just doesn't sound right. And AI only made it worse at first, turning you into a "dynamic team player" just like everyone else.

So I wrote a ruleset. Not a template, but instructions you give to ChatGPT, Claude, Gemini, whatever, telling it to follow every rule strictly.

It interviews you one question at a time instead of asking you to dump your whole career at once. You can start from nothing and it walks you through. If you already have a CV or a LinkedIn profile, you paste it in and it locks the facts it finds, then asks only for what's missing.

What actually makes the CVs better:

  • It won't draft until it has real evidence and a positioning decision, not just a job title, so the bullets carry weight
  • If you can't remember exact numbers, it walks you down a ladder from direct outcomes to qualitative anchors instead of letting "I did the work" stand as a bullet
  • Market conventions are built in for many regions so you're not guessing whether a photo belongs or what personal info to include
  • Every draft self-audits before you see it, including a red-flag search that strips weasel verbs, generic phrases, and leaked process language
  • Each revision you request gets sharper without losing what was already right

I'm not job hunting myself right now (although I tested it on myself too), but a few people I know are, and they say it made their CVs stronger than what they could write themselves. The process is also much less painful, because you're just answering a series of questions instead of writing an entire document with all the details from scratch.

It's completely free on GitHub: github.com/Anbeeld/RESUME.md. I'm sharing it with the world because I feel it might help someone, and paid SaaS tools aren't always an option when you don't have a job. I'd be interested in hearing your feedback!

reddit.com
u/Anbeeld — 1 day ago

BeeLlama.cpp: advanced DFlash & TurboQuant with support for reasoning and vision. Qwen 3.6 27B Q5 with 200k context on a 3090, 2-3x faster than baseline (peak 135 tps!)

TL;DR: New llama.cpp fork! I wanted a Windows-friendly inference setup to run Qwen 3.6 27B Q5 on a single RTX 3090 with speculative decoding, high context without excessive quantization, and vision enabled. No existing option did this out of the box for me without VRAM and/or tooling issues (this was before the MTP PR for llama.cpp surfaced).

So I pulled out an old trick: stay up until 4 a.m. one too many times to cram a month-plus of work into a week or two. I probably lost a decent amount of hair getting it all to work, but now I have what seems to be a proper solution and don't mind sharing.

Anbeeld's BeeLlama.cpp

GitHub repo: https://github.com/Anbeeld/beellama.cpp

BeeLlama.cpp (or just Bee) is a performance-focused llama.cpp fork for squeezing more speed and context out of local GGUF inference. It keeps the familiar llama.cpp tools and server flow, then adds DFlash speculative decoding, adaptive draft control, TurboQuant/TCQ KV-cache compression, and reasoning-loop protection, with full multimodal support.

> Not quite a pegasus, but close enough.

Here's a plug-and-play Qwen 3.6 27B setup with a config to run it in Q5 + 200k of practically lossless KV cache + vision on a single RTX 3090 or 4090.
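
For a rough sense of why 200k fits: an fp16 KV cache at that length wouldn't fit a 24 GB card on its own, but divided by the TurboQuant/TCQ compression ratios listed under Fork Features below it does. A back-of-the-envelope sketch in Python, where the model dimensions are placeholder assumptions rather than confirmed Qwen 3.6 27B specs:

```python
# Back-of-the-envelope KV-cache sizing. Layer/head/dim values below are
# ASSUMPTIONS for illustration, not confirmed Qwen 3.6 27B specs.
n_layers = 48      # assumed
n_kv_heads = 8     # assumed (GQA)
head_dim = 128     # assumed
ctx = 200_000      # context length from the setup above
fp16_bytes = 2

# Keys + values, per token, across all layers.
per_token = 2 * n_layers * n_kv_heads * head_dim * fp16_bytes
fp16_gib = per_token * ctx / 2**30

for name, ratio in [("fp16", 1.0), ("turbo4 ~4x", 4.0), ("turbo2_tcq ~7.5x", 7.5)]:
    print(f"{name:>17}: {fp16_gib / ratio:5.1f} GiB")
```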

Fork Features

  • DFlash speculative decoding: --spec-type dflash drives a DFlash draft GGUF alongside the target model. The target captures hidden states into a per-layer 4096-slot ring buffer; the drafter cross-attends to the most recent --spec-dflash-cross-ctx hidden-state tokens and proposes drafts for target verification.
  • TurboQuant / TCQ KV-cache compression: Five cache types (turbo2, turbo3, turbo4, turbo2_tcq, turbo3_tcq) spanning from 4x to 7.5x compression, with higher-bit options being practically lossless in many cases. Set independently with --cache-type-k and --cache-type-v.
  • Adaptive draft-max control: The server adjusts the active draft horizon at runtime instead of using a fixed --spec-draft-n-max. The default profit controller compares speculative throughput against a no-spec baseline; the fringe alternative maps acceptance-rate bands to draft depth.
  • Full multimodal support: When --mmproj is active, the server keeps flat DFlash available for text generation. The model can be fully offloaded to CPU to reduce VRAM pressure, with no problems.
  • Reasoning-loop protection: The server detects repeated hidden reasoning output and intervenes. Default mode is force-close with --reasoning-loop-window and --reasoning-loop-max-period tuning available.
  • Sampled DFlash verification: --spec-draft-temp enables rejection-sampling drafter behavior. Activates when both draft and target temperature exceed zero. Draft log probabilities must be available for rejection sampling to produce correct output.
  • DDTree branch verification: optional --spec-branch-budget adds branch nodes beyond the main draft path with GPU parent_ids, tree masks, and recurrent tree kernels. Disabled automatically when the target model spans more than one GPU. This one is very much a work in progress!
  • Request-level speculative overrides: Draft-max and branch budget can be overridden per-request through JSON fields without restarting the server (see the example request below the list).
  • CopySpec model-free speculation: --spec-type copyspec provides rolling-hash suffix matching over previous tokens without a draft model.
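
To make the CopySpec idea concrete, here is a toy version of copy-based speculation: find where the current suffix of the token history occurred earlier and propose the tokens that followed it as the draft. This is a simplified illustration, not the fork's implementation, which uses rolling hashes over token IDs inside the server:

```python
# Toy copy-based speculation: propose the continuation of the most recent
# earlier occurrence of the current suffix. A dict of n-gram positions
# stands in for the rolling-hash lookup used in the real thing.
from collections import defaultdict

def copy_draft(tokens, ngram=4, max_draft=8):
    if len(tokens) < ngram:
        return []
    positions = defaultdict(list)
    for i in range(len(tokens) - ngram):          # exclude the suffix itself
        positions[tuple(tokens[i:i + ngram])].append(i)
    suffix = tuple(tokens[-ngram:])
    if not positions[suffix]:
        return []                                  # no earlier match, no draft
    start = positions[suffix][-1] + ngram          # most recent earlier match
    return tokens[start:start + max_draft]

history = "the cat sat on the mat and the cat sat on".split()
print(copy_draft(history))  # ['the', 'mat', 'and', 'the', 'cat', 'sat', 'on']
```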

For the full feature and public-repo comparison, read docs/beellama-features.md. For the complete argument reference, read docs/beellama-args.md.
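
For the request-level overrides mentioned above, here's roughly what such a request could look like against the server's OpenAI-compatible endpoint. The override field names are assumptions for illustration; the actual names are in docs/beellama-args.md:

```python
# Hypothetical per-request speculative override. "spec_draft_n_max" and
# "spec_branch_budget" are ASSUMED field names; check docs/beellama-args.md.
import json, urllib.request

payload = {
    "messages": [{"role": "user", "content": "Summarize the attached diff."}],
    "spec_draft_n_max": 8,     # assumed field name
    "spec_branch_budget": 0,   # assumed field name
}
req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",  # default llama.cpp server port
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["choices"][0]["message"]["content"])
```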

TurboQuant (WHT-based scalar quantization) originates from TheTom/llama-cpp-turboquant. TCQ (Trellis-Coded Quantization) and the basic DFlash implementation originate from spiritbuun/buun-llama-cpp (paper: Closing the Gap: Trellis-Coded Quantization for KV Cache at 2-3 Bits).

reddit.com
u/Anbeeld — 4 days ago

▲ 1.7k r/paradoxplaza+7 crossposts

Longread: how I built 3 massive AI mods for Paradox grand strategy games (Stellaris, Victoria 3, Imperator: Rome) in a scripting language that doesn't even have arrays

I'm the author of multiple AI mods for Paradox grand strategy games: Anbeeld's Revision of AI for Victoria 3, the AI in Imperator: Invictus, and my old personal Stellaris AI mod. I'm not modding much these days, but I wanted the design knowledge to live on. ARoAI never had proper documentation, for instance.

The article covers utility systems, blackboards, planners, and what happens when the scripting language can't express any of them. It goes through how each mod approximated standard game AI architectures, what each gave up, and what actually worked.
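
If "utility system" is new to you: at its core it just scores each candidate action against weighted considerations and picks the best one. A generic sketch of the pattern, not code from the mods themselves (those are written in Paradox script):

```python
# Generic utility-AI pattern: score actions by weighted considerations,
# pick the highest. Illustration only, not code from the mods.
def score(action, weights, considerations):
    total = 1.0
    for name, fn in considerations.items():
        total *= fn(action) ** weights.get(name, 1.0)  # each factor in 0..1
    return total

def pick_action(actions, weights, considerations):
    return max(actions, key=lambda a: score(a, weights, considerations))

# Toy example: choosing a construction project.
actions = [{"name": "farm", "cost": 50}, {"name": "fort", "cost": 120}]
considerations = {
    "affordable": lambda a: max(0.0, 1.0 - a["cost"] / 200),
    "economy":    lambda a: 1.0 if a["name"] == "farm" else 0.4,
}
print(pick_action(actions, {"affordable": 1.0, "economy": 2.0}, considerations))
```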

anbeeld.com
u/Anbeeld — 6 days ago

I struggle to wrap my head around all this. My goal is a local agent that solves low-complexity tasks in the same harness where I'd use frontier models. So naturally this means a large context window, because low complexity can mean a simple-ish fix in a large codebase rather than just generating some nonsense from zero.

So initially I went for Tom's turboquant fork of llama.cpp (I'm on Windows) with Qwen 3.6 Q4 and IQ4 models and a 200k context window. Well, it worked: it can read the entirety of an example project I gave it and produce an audit (as far as it's capable of one). But deep into the context window the speed is just sad, like 10-11 tps or even lower.

So I went down a rabbit hole of posts all saying they get 85-100 tps on a single 3090 with a 5 billion context window or so. I tried WSL2 + vLLM with the MTP and Genesis patches. Well, it works in the sense that it launches, but I go OOM at any adequate context window, and it also seems like there are tool issues and whatnot.

I tried the Luce DFlash solution and it turned out they didn't even have a working server. I made 2 PRs that fixed huge VRAM issues, but then it turned out it doesn't format thinking right and can't use tools whatsoever. Oh well. It was fast in the "hi" chat at least.

Now I'm trying some other llama.cpp forks and modifying them to fix obvious issues they have, but at this point I have to question it all.

What's your tps on a 3090 + Qwen 3.6 27B in real tasks? Like real coding tasks with many thousands of tokens of context, in a proper harness? From what I read, all these technologies like MTP and DFlash degrade very fast with context, since correct prediction gets hard when the draft model only sees a small part of the context at any time. Is that right?

But I also see people claiming they maintain like 30 tps in long chats. "Chats" is the key word there. All these benchmarks show numbers from feeding a model one prompt, which is so much faster than multi-step chats. But in real agentic usage you often need that back-and-forth feedback.

And yes, I do need thinking, it's crucial for coding tasks, but it seems like it hurts the speed of these prediction systems even further?

So tell me, is it a skill issue, or is it really not as simple as these posts make it seem?

reddit.com
u/Anbeeld — 11 days ago
▲ 19 r/ChatGPTPro+3 crossposts

I write with AI quite a bit, and I kept hitting the same wall: the text was technically fine, but you could tell. The polished hedging, the em dashes piling up in every paragraph, paragraphs you could swap and nobody would notice.

So I wrote down the rules I wanted the model to follow. They target the patterns that make generated text recognizable: filler, false specificity, repeated cadence, structure that's too neat. No fake typos or injecting slang. Prompt-level instructions have a ceiling, but the output comes out noticeably better than before.

A few of the rules that do the most work:

  1. Concrete over polished. Every paragraph needs at least one anchor you could check: a proper noun, a specific number, a direct quote, a named decision. "Various," "meaningful changes," and "broad implications" don't count. If the most concrete thing in a paragraph is a name and a date, it's probably still too generic.
  2. Plain words. Don't chase synonyms for basic words like problem, change, system. Repeat the ordinary word when it's the right one. "We changed it" beats "the implementation of the change." If you keep reaching for "furthermore", "moreover", or "additionally", use pronouns instead.
  3. Don't perform. No keynote cadence. No mission-statement phrasing. No applause-line endings. No service-desk tone: "Great question," "I hope this helps," "Feel free to reach out." Start where the answer starts. Stop where it stops.
  4. Watch regularity. The most visible feature of LLM writing is often its own regularity. Same punctuation move every paragraph. Three-part cadence. "Not X, but Y" rhythm. Paragraph-closing type definitions like "the kind of X where Y." Identical paragraph arcs. Break the pattern where it dominates, don't just mask it with random variation.
  5. Show concrete before generalizing. Don't lead with abstract diagnosis when the reader has nothing concrete to attach it to. The order should usually be: what happened, where it appeared, what constraint mattered, what failed, what that seems to mean.
  6. Revise by cutting. Re-read as a first-time reader. Sentences auditioning for attention can go. So can sentences whose only job is announcing the next one. Collapse paragraphs that restate each other. Replace the most generic clause with something specific, or delete it. Most edits should make the text shorter.
  7. Fit format to medium. Over-structuring casual writing makes it templated. Under-structuring technical writing makes it unusable. Don't strip useful headings or lists from docs just to look less AI-written.

The full ruleset, a harness skill, a compact version (~1000 words, for agent instructions and custom GPTs), and a mini version (~155 words, drops into AGENTS.md or CLAUDE.md) are in the repo: github.com/Anbeeld/WRITING.md

I also made global coding agent instructions (AGENTS.md / CLAUDE.md): evidence before code, small scoped changes, real verification, parallelization. github.com/Anbeeld/AGENTS.md

u/Anbeeld — 1 day ago
▲ 190 r/copywriting+10 crossposts

I use coding agents a lot, and write with LLMs enough that the same issues kept showing up. Agents would jump into code before they understood the repo, touch adjacent code I did not ask for, and say something was done without really verifying it. And text is a separate big problem, as you all know: too polished, too generic, too much AI slop even when the actual point was fine.

So I started writing down the rules I wished the agents followed, then tightened them whenever I saw the same failure happen again. Eventually that turned into two small repos I use myself:

  • AGENTS.md / CLAUDE.md — global instructions for coding agents. Evidence before code. Small scoped changes. Real verification. Better use of parallel work and subagents instead of one-step-at-a-time.
  • WRITING.md — a ruleset for cutting the patterns that make LLM text feel pasted from a chatbot: filler, fake specificity, over-neat structure, repeated cadence, and the rest. It comes in three versions: the full ruleset (~3900 words), a compact version (~1000 words) for agent instructions and custom chats like GPTs and Gemini Gems, and a mini version (~155 words) as a section in any AGENTS.md or CLAUDE.md file.

Both are public now. Use them as-is, borrow parts, disagree with the rules, or open an issue if something works differently in your setup. They solved some of the problems for me, and I'm curious what holds up for other people.

u/Anbeeld — 14 days ago