r/cpp
Virtual dispatch isn't always the slowest, and std::variant isn't always the fastest
I've been looking at how OpenJDK's GC barrier system picks its implementation at runtime using templates instead of virtual dispatch. The trick is lazy resolution: you pay once at first use instead of a vtable lookup on every call.
That got me curious enough to benchmark it against three other approaches: virtual functions, function pointers, and std::variant + std::visit. I was surprised to see std::variant being the slowest on libstdc++ while virtual dispatch beat it comfortably.
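For readers who want the three baseline styles side by side, here is a minimal sketch. It is not the benchmark code from the blog: the `Shape`, `AreaFn`, and `ShapeV` names are made up for illustration, and the lazy-resolution variant is left out.

```cpp
#include <cassert>
#include <memory>
#include <variant>

// 1. Virtual dispatch: the call goes through the object's vtable.
struct Shape {
    virtual ~Shape() = default;
    virtual int area() const = 0;
};
struct Square final : Shape {
    int side;
    explicit Square(int s) : side(s) {}
    int area() const override { return side * side; }
};
struct Rect final : Shape {
    int w, h;
    Rect(int w_, int h_) : w(w_), h(h_) {}
    int area() const override { return w * h; }
};

// 2. Function pointer: the target is picked once and stored by the caller.
using AreaFn = int (*)(int, int);
inline int square_area(int a, int) { return a * a; }
inline int rect_area(int w, int h) { return w * h; }

// 3. std::variant + std::visit: a closed set of alternatives, no heap needed.
struct SquareV { int side; };
struct RectV { int w, h; };
using ShapeV = std::variant<SquareV, RectV>;

struct AreaVisitor {
    int operator()(const SquareV& s) const { return s.side * s.side; }
    int operator()(const RectV& r) const { return r.w * r.h; }
};
inline int area(const ShapeV& s) { return std::visit(AreaVisitor{}, s); }

// Sanity check: all three styles must agree on the result.
inline int area_all_three(int w, int h) {
    const Rect r{w, h};
    const Shape& base = r;
    AreaFn fn = rect_area;
    const int a = base.area();
    assert(a == fn(w, h));
    assert(a == area(ShapeV{RectV{w, h}}));
    return a;
}
```

Which of these is fastest depends on the compiler, the standard library, and how predictable the call target is, so measure on your own toolchain.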
Please refer to the blog for my full analysis. Would love to hear what you think!
Edit: Benchmarks are on GCC 11 (Ubuntu 22.04 default). GCC 12+ significantly improves std::visit. Full compiler version comparison in the next post.
Announcing iceoryx2 v0.9: Fast and Robust Inter-Process Communication (IPC) Library
ekxide.io
Kiln - A CMake-compatible build system that can do what CMake can't
clehaxze.tw
7 New Projects Made in Unigine, Using C language
New projects and games made in Unigine. C is a primary form of coding in Unigine
citor: a header-only C++20 thread pool tuned for sub-µs dispatch
I just released citor, a small header-only C++20 thread pool / parallel runtime aimed at CPU-bound workloads where per-dispatch latency actually shows up in the profile.
Repo: https://github.com/Lallapallooza/citor
The main idea is: keep the common CPU-parallel shapes in one pool, avoid per-call allocations on the hot path, let the producer participate as slot 0, and make short repeated phases cheaper than repeatedly waking a worker team.
The simplest thing looks like what you'd expect:
citor::ThreadPool pool(8);
pool.parallelFor<citor::HintsDefaults>(
    0, data.size(),
    [&](std::size_t lo, std::size_t hi) {
        for (std::size_t i = lo; i < hi; ++i)
            data[i] *= 2;
    });
Beyond parallelFor, it has deterministic parallelReduce, parallelScan, parallelChain, runPlex for repeated phases over the same partition, recursive forkJoin with per-worker Chase-Lev deques, bulkForQueries, and submitDetached. There is also a PoolGroup that creates one arena per shared-L3 group, mostly useful on multi-CCD Zen.
A few internals that ended up mattering more than I expected:
- each worker owns a cache-line-aligned mailbox and the whole dispatch protocol is a per-slot mailbox stamp, no shared queue
- the producer can short-circuit small jobs by CAS-ing the worker's mailbox to DONE itself and running the body inline, no wake at all (worker's own ack races the producer's self-stamp, loser short-circuits);
- the join barrier is a per-slot done-epoch scan with cancellation riding the same epoch read, so no shared sense bit and no per-iteration cancel poll
- the worker's spin-entry rdtscp doubles as a store-buffer drain, so the producer sees the DONE stamp before its next mailbox read - free side benefit of timing the spin
- kCacheLine is 128 bytes rather than 64 because Zen prefetches in cache-line pairs and contended atomics get measurably worse if you size to 64.
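To make the mailbox idea concrete, here is a heavily simplified sketch. This is not citor's code: `Stamp`, `Mailbox`, `post`, and `try_claim` are hypothetical names, and the epochs, cancellation, and spin timing are left out. It only shows a padded per-slot stamp and the "one CAS wins, the loser short-circuits" race between producer and worker.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption: 128 mirrors the kCacheLine value from the post; tune per CPU.
inline constexpr std::size_t kCacheLine = 128;

enum class Stamp : std::uint32_t { Idle, Posted, Done };

// One mailbox per worker slot, padded so neighbouring slots never share a line.
struct alignas(kCacheLine) Mailbox {
    std::atomic<Stamp> stamp{Stamp::Idle};
};
static_assert(sizeof(Mailbox) == kCacheLine);
static_assert(alignof(Mailbox) == kCacheLine);

// Producer publishes a job to one slot.
inline void post(Mailbox& m) {
    // A slot must not be re-posted while a job is still pending.
    assert(m.stamp.load(std::memory_order_relaxed) != Stamp::Posted);
    m.stamp.store(Stamp::Posted, std::memory_order_release);
}

// Producer and worker both call this. Exactly one CAS can move
// Posted -> Done: the winner runs the body, the loser skips it.
inline bool try_claim(Mailbox& m) {
    Stamp expected = Stamp::Posted;
    return m.stamp.compare_exchange_strong(expected, Stamp::Done,
                                           std::memory_order_acq_rel,
                                           std::memory_order_acquire);
}
```

If the producer wins `try_claim` on a small job, it can run the body inline and the worker never needs to be woken for it.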
For perf, I wrote a comparative harness against BS::thread_pool, dp::thread_pool, task-thread-pool, riften, oneTBB, Taskflow, Eigen, OpenMP, Leopard, dispenso, libfork, and TooManyCooks. Competitor revisions are pinned, host gates are printed at startup, OpenMP wait policy is normalized, and raw samples can be exported as JSON.
In my current benchmark sweep, citor wins roughly:
- 92% of contested cells on a Ryzen 9950X3D
- 75% on a 96-core Genoa box
- 69% on a 48-core Sapphire Rapids box
Hot fan-out dispatch on the 9950X3D is usually in the 100-400 ns range depending on participant count and shape.
Please treat those as "my harness on my machines or AWS," not universal truth. If the numbers matter to your use case, run the benchmark yourself. The README has the methodology and reproduction commands.
There is real work left:
- topology detection is still shaped mostly around Zen CCDs
- multi-socket EPYC, sub-NUMA clustering, hybrid P/E cores, and Intel mesh are not first-class yet
- parallelReduce uses static contiguous chunks and does not steal after a worker finishes, so heavy-tail bodies can leave cores idle
- the coroutine wrapper queues on a per-pool driver thread rather than doing continuation stealing
- bulkForQueries only fans across queries today; a true 2D fan is probably the next useful shape
What citor is not:
- not an I/O executor
- not a general async/future abstraction
- not a TBB or OpenMP replacement for arbitrary workloads
- not tuned equally for every CPU topology
I'd especially like feedback on benchmark fairness, API shape before 1.0, missing competitors, and whether the affinity / pinning behavior is too surprising for a library like this. Suggestions for perf improvements are welcome too. If anything in the README reads like overclaiming, I'd rather fix it now.
Boost 1.91.0 is now available in both Conan and vcpkg
For those of you waiting to upgrade through your package manager, Boost 1.91.0 has landed in both Conan and vcpkg.
What's in 1.91:
- Boost.Decimal — new library implementing IEEE 754 decimal floating point arithmetic (from Matt Borland and Christopher Kormanyos)
- Asio binary versioning — optional inline namespace lets multiple Asio versions coexist in the same process without symbol conflicts
- 58 fewer internal dependencies across 55 libraries
- StaticAssert merged into Config — no code changes needed, just update your dependency declarations when ready
- CMake import std detection fix
Install:
conan install --requires=boost/1.91.0
vcpkg install boost
Clang Lifetime Safety Doc Update
Intro:
Clang Lifetime Safety Analysis is a C++ language extension which warns about potential dangling pointer defects in code. The analysis aims to detect when a pointer, reference or view type (such as std::string_view) refers to an object that is no longer alive, a condition that leads to use-after-free bugs and security vulnerabilities. Common examples include pointers to stack variables that have gone out of scope, pointers to heap objects that have been freed, fields holding views to stack-allocated objects (dangling-field), returning pointers/references to stack variables (return stack address), or iterators into container elements invalidated by container operations (e.g., std::vector::push_back).
The analysis design is inspired by Polonius, the Rust borrow checker, but adapted to C++ idioms and constraints, such as the lack of exclusivity enforcement (alias-xor-mutability). Further details on the analysis method can be found in the RFC on Discourse.
This is compile-time analysis; there is no run-time overhead. It tracks pointer validity through intra-procedural data-flow analysis. While it does not require lifetime annotations to get started, in their absence, the analysis treats function calls optimistically, assuming no lifetime effects, thereby potentially missing dangling pointer issues. As more functions are annotated with attributes like clang::lifetimebound, gsl::Owner, and gsl::Pointer, the analysis can see through these lifetime contracts and enforce lifetime safety at call sites with higher accuracy. This approach supports gradual adoption in existing codebases.
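A small illustration of the kind of lifetime contract mentioned above. `first_word` and the `LIFETIMEBOUND` macro are hypothetical names made up for this sketch; the `[[clang::lifetimebound]]` attribute itself is real and is only applied when compiling with Clang.

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Apply the annotation only where it is understood.
#if defined(__clang__)
#define LIFETIMEBOUND [[clang::lifetimebound]]
#else
#define LIFETIMEBOUND
#endif

// The returned view points into `s`, so it must not outlive `s`.
// The annotation states that contract at the declaration.
inline std::string_view first_word(const std::string& s LIFETIMEBOUND) {
    const auto end = s.find(' ');
    return std::string_view(s).substr(0, end);  // npos means "whole string"
}

// OK: `line` outlives the view.
//   std::string line = "hello world";
//   std::string_view w = first_word(line);
//
// Dangling: the temporary std::string dies at the end of the full
// expression, so `bad` would refer to freed storage.
//   std::string_view bad = first_word(std::string("temp value"));
```

Without the annotation, a call like the second one looks harmless at the call site, which is the "optimistic" treatment of function calls the doc describes.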
Managing context limits in large C++20 Module codebases with MCP (Case Study & Tool)
Hi everyone,
Working with LLMs on modern C++ codebases usually hits a wall very quickly: context windows get flooded with massive files, and most standard indexers still struggle with C++20 Module partitions and imports.
We are currently running a live development workflow on a large-scale commercial project consisting of over 7,000 source files, mostly utilizing C++20 modules.
We managed to establish a highly performant workflow using the Codex App on the desktop, combined with VS MCP, IDAP MCP, and a dedicated lightweight tool we created to bridge the C++ gap: mcp-cpp-project-indexer.
The Problem We Solved
Standard file-dumping or naive regex indexing either sends thousands of lines of irrelevant code to the LLM (costly and slow) or completely loses track of C++20 module dependencies.
Instead of trying to replace a full compiler/LSP (like clangd) or performing heavy semantic analysis, our indexer acts purely as a stream- and token-based locator. It maps out files, symbols, and module structures, providing the LLM with exact line references (startLine/endLine).
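To give a feel for what a token-based locator does, here is a tiny sketch. It is not the indexer's code (the real tool is written in Python), `ModuleRef`, `take_after`, and `locate_modules` are hypothetical names, and it records a single line number per hit rather than a startLine/endLine range. It ignores comments, strings, and macros.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

struct ModuleRef {
    std::string kind;  // "module" or "import"
    std::string name;  // e.g. "app.core" or ":detail"
    int line;          // 1-based line number
};

// If `text` starts with `keyword` plus a space, copies everything up to
// the first ';' into `rest` and returns true.
inline bool take_after(const std::string& text, const std::string& keyword,
                       std::string& rest) {
    if (text.compare(0, keyword.size(), keyword) != 0) return false;
    if (text.size() <= keyword.size() || text[keyword.size()] != ' ') return false;
    const auto semi = text.find(';', keyword.size());
    if (semi == std::string::npos) return false;
    rest = text.substr(keyword.size() + 1, semi - keyword.size() - 1);
    return !rest.empty();
}

// Scans the source line by line and records module declarations and imports.
inline std::vector<ModuleRef> locate_modules(const std::string& source) {
    std::vector<ModuleRef> out;
    std::istringstream in(source);
    std::string line;
    int number = 0;
    while (std::getline(in, line)) {
        ++number;
        const auto first = line.find_first_not_of(" \t");
        if (first == std::string::npos) continue;  // blank line
        const std::string text = line.substr(first);
        std::string name;
        if (take_after(text, "export module", name) ||
            take_after(text, "module", name)) {
            out.push_back({"module", name, number});
        } else if (take_after(text, "export import", name) ||
                   take_after(text, "import", name)) {
            out.push_back({"import", name, number});
        }
    }
    return out;
}
```

The point is that the LLM gets back names plus line numbers and can then ask for just those lines, instead of receiving the whole file.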
The Setup & Results
- The Stack: Codex App + VS MCP + IDAP MCP + mcp-cpp-project-indexer.
- Token Reduction: The LLM only requests and reads the exact code fragments it actually needs. This reduces the text sent to the LLM by up to 86%.
- Performance: Written in Python, it includes a file watcher mode that calculates hashes incrementally. It stays up-to-date in real-time during active development without hammering the CPU.
- Intelligence: Codex/ChatGPT confirmed that the context routing works flawlessly even at this 7,000-file scale.
Why share this?
When we started, we couldn't find a lightweight, production-ready way to make Claude/GPT understand a massive C++20 module graph without spending a fortune on API tokens or waiting ages for context processing. This setup proved that the Model Context Protocol (MCP) is absolutely ready for large enterprise codebases if decoupled correctly.
The project is fully open-source. If you are struggling with C++ context limits or modules in your AI workflow, feel free to check it out, spin it up, or contribute:
👉 GitHub: github.com
I’m happy to answer any questions about how we configured the MCP synergy or how the incremental indexing handles the C++20 module tree!
How I Set Up VS Code for Competitive Programming / DSA in C++ (Windows)
After trying many setups, I finally created a clean VS Code environment for C++ + Competitive Programming.
Now I have:
- Separate folders for source code and .exe
- One-click compile & run
- input.txt and output.txt
- GitHub integration
- Clean project structure
This setup feels much better than online IDEs for serious DSA practice.
1. Install VS Code
Download:
https://code.visualstudio.com/
Install normally.
2. Install MinGW (g++ compiler)
Download MinGW:
https://sourceforge.net/projects/mingw/
While installing select:
mingw32-gcc-g++
After installation, your compiler path usually becomes:
C:\MinGW\bin
3. Add MinGW to PATH
Search:
Environment Variables
Open:
Edit the system environment variables
Then:
Environment Variables
→ Path
→ Edit
→ New
Add:
C:\MinGW\bin
Click OK everywhere.
Restart VS Code completely.
4. Verify Compiler
Open terminal in VS Code:
Ctrl + `
Run:
g++ --version
If installed correctly, it will show GCC version.
5. Install VS Code Extensions
Install these:
- C/C++
- Code Runner (optional)
6. Create Folder Structure
I use this structure:
DSA/
│
├── code/
│
├── exe/
│
├── input.txt
├── output.txt
│
├── .vscode/
│ └── tasks.json
│
├── .gitignore
└── README.md
This keeps everything clean.
7. Setup Input / Output Redirection
Inside every CPP file:
freopen("input.txt", "r", stdin);
freopen("output.txt", "w", stdout);
Now:
- input comes from input.txt
- output goes to output.txt
No need to type input repeatedly in terminal.
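Here is one way to wrap the two freopen calls so a failure is visible instead of silent. The helper names `redirect_stdin`, `redirect_stdout`, and `sum_input` are made up for this sketch, and the `ONLINE_JUDGE` guard in the comment assumes your judge defines that macro (some do, not all).

```cpp
#include <cassert>
#include <cstdio>
#include <iostream>

// Returns false if the file could not be opened (e.g. input.txt is missing).
inline bool redirect_stdin(const char* path) {
    return std::freopen(path, "r", stdin) != nullptr;
}

// Returns false if the file could not be created.
inline bool redirect_stdout(const char* path) {
    return std::freopen(path, "w", stdout) != nullptr;
}

// Example solution body: sums every integer available on stdin.
inline long long sum_input() {
    long long value = 0, total = 0;
    while (std::cin >> value) total += value;
    return total;
}

// Typical use at the top of main():
//   #ifndef ONLINE_JUDGE
//   redirect_stdin("input.txt");
//   redirect_stdout("output.txt");
//   #endif
//   std::cout << sum_input() << '\n';
```

The guard keeps the redirection out of the submitted build, so the same file works locally and on the judge.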
8. Create tasks.json for One-Key Run
Create:
.vscode/tasks.json
Paste this:
{
  "version": "2.0.0",
  "tasks": [
    {
      "label": "Run C++",
      "type": "shell",
      "command": "cmd",
      "args": [
        "/c",
        "g++ \"${file}\" -o exe\\${fileBasenameNoExtension}.exe && exe\\${fileBasenameNoExtension}.exe"
      ],
      "group": {
        "kind": "build",
        "isDefault": true
      },
      "presentation": {
        "reveal": "always",
        "panel": "shared"
      },
      "problemMatcher": []
    }
  ]
}
Now press:
Ctrl + Shift + B
and current file automatically:
- compiles
- generates .exe
- runs
9. Recommended VS Code Layout
I split the screen into 3 sections:
Left:
code.cpp
Top-right:
input.txt
Bottom-right:
output.txt
Use:
Ctrl + \
to split editor.
This setup is AMAZING for CP practice.
10. GitHub Setup
Initialize repo:
git init
git add .
git commit -m "Initial commit"
Connect GitHub:
git remote add origin YOUR_REPO_URL
git branch -M main
git push -u origin main
Future workflow:
git add .
git commit -m "Solved new problem"
git push
11. Important .gitignore
Create:
.gitignore
Add:
*.exe
output.txt
Don’t push generated binaries.
Final Thoughts
This setup improved my workflow A LOT.
Benefits:
- cleaner practice
- reusable workflow
- faster debugging
- organized DSA repo
- GitHub tracking
- professional structure
If anyone wants, I can also share:
- VS Code settings
- debugging setup
- contest workflow
Built full disassembler & decompiler for Reverse Engineering | Free and open source.
I wanted a disassembler that's a single executable, loads instantly, runs everywhere. So I wrote one from scratch.
It's called Hyperion. It's written in C++, with no runtime dependencies and no installer.
What it actually does: it has a real decompiler that produces readable pseudo-C for x86/x64 and ARM64.
Formats & architectures:
| Format | Architectures |
|---|---|
| PE (exe, dll, sys) | x86, x64 |
| ELF (so, o, executables) | x86, x64, ARM, ARM64, MIPS, PPC |
| Mach-O (dylib, fat/universal) | x64, ARM64 |
| .NET (managed assemblies) | CIL/IL bytecode |
Scripting:
Embedded Lua 5.4. Drop .lua plugins in a folder. Full API: rename, comment, patch bytes, create functions, navigate, query xrefs. Register custom menu items and hotkeys from scripts.
The numbers:
| | Hyperion | IDA Pro | Ghidra |
|---|---|---|---|
| Download size | <3 MB | ~120 MB | ~500 MB |
| Runtime deps | None | Python, Qt | JVM |
| Price | Free (MIT) | $1,800/yr | Free |
| Startup time | <1s | ~3s | ~15s |
| Binary | Single exe | Installer | Installer |
Platforms: Windows, Linux, macOS (Intel + Apple Silicon).
This will stay open source and free. MIT licensed.
Simulating Infinity in Conway's Game of Life with Modern C++
ryanjk5.github.io
Neoclassical C++: segmented iterators revisited (1)
Hi,
I've written a blog post revisiting Matt Austern's great Segmented Iterators and Hierarchical Algorithms paper (2000) and benchmarking an experimental implementation I've been playing with in Boost.Container.
Quick idea: std::deque and friends are internally segmented (blocks of contiguous memory), but STL-like iterators hide that, so every ++it has to check for a block boundary. Austern's proposal splits the iterator into a segment_iterator (walks blocks) + local_iterator (inside one block), so algorithms can run a tight loop per block and only do bookkeeping at the boundaries.
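A toy illustration of the difference, assuming a `std::vector<std::vector<int>>` as a stand-in for a segmented container. This is not the Boost.Container implementation or Austern's interface; `Segmented`, `sum_flat`, and `sum_segmented` are made-up names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy segmented container: a sequence of contiguous blocks.
using Segmented = std::vector<std::vector<int>>;

// Flat style: every step has to check for a block boundary,
// which is what a classic deque iterator does inside ++it.
inline long long sum_flat(const Segmented& s) {
    long long total = 0;
    std::size_t seg = 0, pos = 0;
    while (seg < s.size()) {
        if (pos == s[seg].size()) {  // boundary check on every iteration
            ++seg;
            pos = 0;
            continue;
        }
        total += s[seg][pos++];
    }
    return total;
}

// Hierarchical style: the outer loop walks segments, the inner loop is a
// plain pointer loop over contiguous memory with no boundary bookkeeping.
inline long long sum_segmented(const Segmented& s) {
    long long total = 0;
    for (const auto& block : s) {
        const int* first = block.data();
        const int* last = first + block.size();
        for (; first != last; ++first) total += *first;
    }
    return total;
}
```

The inner loop of `sum_segmented` is the shape an auto-vectorizer can work with, since the trip count is known before the loop starts.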
I benchmarked several "simple" STL algorithms on a Boost deque, and the speedup is way bigger than Austern's original estimation when modern auto-vectorizers enter the game.
Article link: https://boostedcpp.net/2026/05/18/neoclassical-c-segmented-iterators-revisited-1/
Happy to receive feedback!
A two phase to-string API? First compute the total size of the single allocation then populate the bytes?
I'm trying to see if there is any prior art in the to-string space that accomplishes a two phase approach. In theory, you could design an API that first asks the data "How many bytes would it take to make you into a string?". From there it could allocate memory with that capacity. Then it could provide that allocation to the same data to populate the bytes.
I want it to be very lightweight and easy to add to a type. The `AbslHashValue(...)` API really nails the ergonomics of an extension point in C++, imo. But when it comes to to-string, it gets pretty hairy pretty fast.
`AbslHashValue(...)` benefits from the fact that the resulting hash has a fixed bit width. You just combine/combine_contiguous recursively and you're done.
Some hypothetical `MyToString(...)` would likely need to be split into two functions, `MyToStringSize(...)` and `MyToStringValue(...)`, which already makes it more obnoxious to add support for in your type.
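A minimal sketch of what that split could look like, just to make the shape concrete. Everything here is hypothetical (the `Point` type, the `(out, end)` signature, computing the size of an int by formatting into a scratch buffer); it is not an existing library API.

```cpp
#include <cassert>
#include <charconv>
#include <cstddef>
#include <string>

// Extension points for int: size is found by formatting into a scratch buffer.
inline std::size_t MyToStringSize(int v) {
    char buf[16];
    const auto r = std::to_chars(buf, buf + sizeof buf, v);
    return static_cast<std::size_t>(r.ptr - buf);
}
inline char* MyToStringValue(int v, char* out, char* end) {
    return std::to_chars(out, end, v).ptr;
}

struct Point { int x; int y; };

// Phase 1: exact byte count for "(x, y)".
inline std::size_t MyToStringSize(const Point& p) {
    return 4 + MyToStringSize(p.x) + MyToStringSize(p.y);  // "(" + ", " + ")"
}
// Phase 2: write exactly that many bytes, return one past the last byte.
inline char* MyToStringValue(const Point& p, char* out, char* end) {
    *out++ = '(';
    out = MyToStringValue(p.x, out, end);
    *out++ = ',';
    *out++ = ' ';
    out = MyToStringValue(p.y, out, end);
    *out++ = ')';
    return out;
}

// Top level: size first, then one resize, then fill in place.
template <typename T>
std::string MyToString(const T& value) {
    std::string s;
    s.resize(MyToStringSize(value));  // at most one heap allocation
    char* end = MyToStringValue(value, s.data(), s.data() + s.size());
    assert(end == s.data() + s.size());  // both phases must agree
    (void)end;
    return s;
}
```

The cost is visible right away: every type needs two functions that must stay in sync, and `resize` zero-fills before the real bytes are written (C++23's `resize_and_overwrite` avoids that part).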
But it gets worse. What if inclusion of the type names is important? I can imagine wanting a lever at the top level that says do or do not include them. So for the case where you do include them, how do you succinctly compute the length of namespace + scope-resolution-operator + type name?
And what about templates? Do you also include the angle brackets? Do you recursively include type names between them? And what about potential line noise like allocator types? I can see wanting to include them and not wanting to include them.
Further, what about hashtables? If you store the keys and values in separate ranges for a more data oriented design, how do you model the fact that each K-V pair goes together? You don't want to copy them because that might be expensive. So do you supply a proxy object where it has two pointers? Now that means you have to build an entire TYPE inside your type just to support to-string. Not very ergonomic.
Anyway, wanted to discuss this to see if anyone has ideas in this space. It seems to me that to-string as an operation should be unambiguously single allocation. But unless I'm mistaken, `absl::StrCat(...)` and other such APIs only "limit" the number of allocations and cannot put the upper bound at exactly 1.
Auto Non-Static Data Member Initializers are holding back lambdas in RAII (+ coroutine workaround)
TLDR: Allowing auto for non-static data members would let objects store lambdas directly, improving readability and reducing the need for type erasure.
Type deduction and auto variables are one of the defining features of modern C++, but unfortunately they are not available to class data members:
struct {
    // error: non-static data member declared with placeholder 'auto'
    auto x = 1;
    // error: invalid use of template-name 'std::vector' without an argument list
    std::vector y { 1, 2, 3 };
};
This blog post (from 2018!) by Corentin Jabot does a good job outlining this problem so I'll point to it first: The case for Auto Non-Static Data Member Initializers. However, I would like to expand specifically on lambdas as they are mostly glossed over.
Lambdas can only be stored in auto variables because each lambda is given a unique type, even if two lambdas are identical in their definition. As pointed out in the blog post, even decltype([]{}) foo = []{}; is not permitted.
Because of this it is not possible to store a lambda inside an object, even if the storage requirements can otherwise easily be determined.
Real world example
An embedded project I am working on makes heavy use of RAII: so much so that most of our subsystems have little to no functional code, just classes composed of lower level building blocks as data members (representing e.g. GPIOs, UARTs) and some minimal routing between them.
This routing usually takes the form of RAII event callback objects that store the callback function, register themselves in an intrusive list to receive the events, and unregister themselves on destruction. This ensures that we can freely shut down subsystems without worrying about lifetime issues - destruction is always in reverse order and easy to understand at a glance.
struct gpio_uart_forwarder {
    peripheral::gpio gpio_in {};
    peripheral::gpio gpio_out {};
    peripheral::uart uart {};
    evt::callback<bool> gpio_to_uart { gpio_in.on_change, [&](bool high) {
        uart.write(high ? '1' : '0');
    } };
    evt::callback<char> uart_to_gpio { uart.on_char, [&](char c) {
        if (c == '1') gpio_out.set(1);
        else if (c == '0') gpio_out.set(0);
    } };
};
The only way this is currently possible is using type erasure, i.e. std::[move_only_]function.
In an ideal world, we would instead have the callback templated on the function type:
template<typename Ev, std::invocable<const Ev &> Fn>
class callback {
    Fn f;
    ...
};
And our class would look like:
struct gpio_uart_forwarder {
    ...
    auto gpio_to_uart = evt::callback { gpio_in.on_change, [&](bool high) { ... } };
    // OR
    evt::callback uart_to_gpio { uart.on_char, [&](char c) { ... } };
};
While an std::function might not seem like a huge price to pay, across an entire program it builds up to hundreds of unnecessary heap allocations, thousands of bytes wasted and extra indirections - all for type erasure that we don't actually need! We know all the types involved, and we own the storage ourselves.
Proposal to fix
The last time a formal proposal was made to fix this was way back in 2008 by Bill Seymour: N2713 - Allow auto for non-static data members.
I understand there are complications in determining the size and layout of objects with auto members, but 18 years later this seems like pretty low hanging fruit compared to what has recently been achieved with reflection!
Edge cases such as recursive definitions and references to this or sizeof should simply be banned rather than resulting in the feature being disabled entirely.
Coroutine workaround
In my quest for a solution I have discovered that coroutines can be abused to get the best of both worlds.
If you don't need external access to the data members and just want to benefit from RAII, you can convert the class to a coroutine that suspends itself right before ending:
#include <coroutine>
#include <exception>

class scope {
public:
    struct promise_type {
        scope get_return_object() noexcept {
            return scope { std::coroutine_handle<promise_type>::from_promise(*this) };
        }
        // Run the body immediately so the locals are constructed on creation
        std::suspend_never initial_suspend() noexcept { return {}; }
        std::suspend_always final_suspend() noexcept { return {}; }
        void return_void() noexcept {}
        void unhandled_exception() { std::terminate(); }
    };
    ...
    ~scope() {
        if (!this->handle) return;
        // Destroying the frame runs the destructors of the coroutine's locals
        this->handle.destroy();
        this->handle = {};
    }
private:
    explicit scope(std::coroutine_handle<promise_type> h) noexcept : handle(h) {}
    std::coroutine_handle<promise_type> handle {};
};
scope gpio_uart_forwarder() {
    auto gpio_in = peripheral::gpio {};
    auto gpio_out = peripheral::gpio {};
    auto uart = peripheral::uart {};
    auto gpio_to_uart = evt::callback { gpio_in.on_change, [&](bool high) {
        uart.write(high ? '1' : '0');
    } };
    auto uart_to_gpio = evt::callback { uart.on_char, [&](char c) {
        if (c == '1') gpio_out.set(1);
        else if (c == '0') gpio_out.set(0);
    } };
    // All local variables remain alive until coroutine is destroyed
    co_await std::suspend_always();
    // Can't rely on final_suspend because stack is already destroyed by then
    // But we need a co_ statement anyway to turn it into a coroutine
}
scope my_gpio_uart_forwarder = gpio_uart_forwarder();
Far from perfect, but it reduces us to a single heap allocation plus the minimal overhead of launching the coroutine, no matter how many callbacks we define.
Maybe the best part is it can be used within an existing class too, preserving standard object RAII:
struct gpio_uart_forwarder {
    peripheral::gpio gpio_in {};
    peripheral::gpio gpio_out {};
    peripheral::uart uart {};
    // The () before -> is required before C++23
    scope callbacks = [&]() -> scope {
        auto gpio_to_uart = evt::callback { gpio_in.on_change, [&](bool high) {
            uart.write(high ? '1' : '0');
        } };
        auto uart_to_gpio = evt::callback { uart.on_char, [&](char c) {
            if (c == '1') gpio_out.set(1);
            else if (c == '0') gpio_out.set(0);
        } };
        co_await std::suspend_always();
    }();
};
I built SpriteForge: A free, lightweight 2D Pixel Editor in C++17 (0% idle CPU, no Electron BS)
Hey everyone!
I was getting tired of heavy, Electron-based tools eating up my RAM just to draw some pixel art and animations, so I rolled up my sleeves and built my own from scratch.
It’s called SpriteForge. It's written entirely in modern C++ (C++17) using SDL2, OpenGL 3.0, and Dear ImGui. It's completely free and open-source.
I focused heavily on making it as lightweight, fast, and portable as possible.
Some cool under-the-hood stuff I'm really proud of:
- 0% Idle CPU/GPU Usage: I hooked up the main loop to SDL_WaitEventTimeout. If you aren't actively drawing or playing an animation, the thread goes to sleep. It consumes literally zero CPU/GPU. No laptop heating up!
- Ultra-Low RAM Footprint: The undo/redo history engine just stores raw pixel byte arrays of the layers instead of duplicating heavy textures. You can run this comfortably on an ancient 256MB RAM machine.
- Standalone Portable EXE: I cross-compiled the Windows version using MinGW-w64 and statically linked the GCC/C++ runtimes (-static-libgcc -static-libstdc++). No more "missing libstdc++-6.dll" errors on fresh Windows installs. Just double-click the ~3MB .exe and it opens instantly.
- Added a fun, custom cyberpunk "YGCODES" boot splash screen just for the aesthetic. 😃
Features for actual drawing:
- Multi-threaded layer blending (Normal, Multiply, Add, Screen)
- Full animation timeline with Onion Skinning
- Tools: Bresenham interpolated pen (no gaps when drawing fast), stack-safe flood fill, shape tools, etc.
- Built-in retro palettes (PICO-8, DawnBringer)
- Custom .sforge binary project saving and SpriteSheet/PNG exports.
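For anyone curious how a gap-free pen works, here is a sketch of the standard integer Bresenham line between two mouse samples. It is not taken from the SpriteForge source; `Pixel` and `bresenham_line` are names made up for this example.

```cpp
#include <cassert>
#include <cstdlib>
#include <utility>
#include <vector>

using Pixel = std::pair<int, int>;

// Returns every pixel on the line from (x0, y0) to (x1, y1), inclusive.
// Works in all octants using only integer arithmetic.
inline std::vector<Pixel> bresenham_line(int x0, int y0, int x1, int y1) {
    std::vector<Pixel> pixels;
    const int dx = std::abs(x1 - x0);
    const int dy = -std::abs(y1 - y0);
    const int sx = x0 < x1 ? 1 : -1;  // step direction in x
    const int sy = y0 < y1 ? 1 : -1;  // step direction in y
    int err = dx + dy;                // running error term
    while (true) {
        pixels.emplace_back(x0, y0);
        if (x0 == x1 && y0 == y1) break;
        const int e2 = 2 * err;
        if (e2 >= dy) { err += dy; x0 += sx; }  // step in x
        if (e2 <= dx) { err += dx; y0 += sy; }  // step in y
    }
    return pixels;
}
```

Calling this between the previous and current mouse position on every motion event fills in the pixels the mouse skipped, which is why fast strokes have no gaps.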
I’d love for you guys to try it out, poke around the source code, or use it for your indie games.
Repo link: https://github.com/YGCODES1/SpriteAnim
Let me know what you think or if you have any feature requests!