
citor: a header-only C++20 thread pool tuned for sub-µs dispatch
I just released citor, a small header-only C++20 thread pool / parallel runtime aimed at CPU-bound workloads where per-dispatch latency actually shows up in the profile.
Repo: https://github.com/Lallapallooza/citor
The main idea is: keep the common CPU-parallel shapes in one pool, avoid per-call allocations on the hot path, let the producer participate as slot 0, and make short repeated phases cheaper than repeatedly waking a worker team.
The simplest thing looks like what you'd expect:
    citor::ThreadPool pool(8);
    pool.parallelFor<citor::HintsDefaults>(
        0, data.size(),
        [&](std::size_t lo, std::size_t hi) {
            for (std::size_t i = lo; i < hi; ++i)
                data[i] *= 2;
        });
Beyond parallelFor, it has deterministic parallelReduce, parallelScan, parallelChain, runPlex for repeated phases over the same partition, recursive forkJoin with per-worker Chase-Lev deques, bulkForQueries, and submitDetached. There is also a PoolGroup that creates one arena per shared-L3 group, mostly useful on multi-CCD Zen.
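To make "deterministic" concrete for the reduce case, here is a minimal standalone sketch of the general technique, assuming static contiguous chunks and a fixed combine order. It uses plain std::thread and is not citor's API; `deterministicSum` is a hypothetical name made up for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper, NOT part of citor: sums `data` using `parts` static
// contiguous chunks and combines the partial sums in chunk order.
double deterministicSum(const std::vector<double>& data, std::size_t parts) {
    assert(parts > 0);
    std::vector<double> partial(parts, 0.0);
    std::vector<std::thread> workers;
    workers.reserve(parts - 1);

    auto body = [&](std::size_t p) {
        // Chunk p covers [lo, hi); the bounds depend only on size and parts,
        // never on which thread happens to run first.
        const std::size_t lo = data.size() * p / parts;
        const std::size_t hi = data.size() * (p + 1) / parts;
        double acc = 0.0;
        for (std::size_t i = lo; i < hi; ++i) acc += data[i];
        partial[p] = acc;
    };

    for (std::size_t p = 1; p < parts; ++p) workers.emplace_back(body, p);
    body(0);  // the calling thread takes chunk 0 itself
    for (auto& t : workers) t.join();

    // Fixed left-to-right combine: same inputs and same `parts` give the
    // same floating-point result on every run.
    double total = 0.0;
    for (double v : partial) total += v;
    return total;
}
```

The point is only that the association order is a function of the input size and chunk count, not of thread timing. A real pool would reuse its workers and pad the partials to avoid false sharing.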
A few internals that ended up mattering more than I expected:
- each worker owns a cache-line-aligned mailbox and the whole dispatch protocol is a per-slot mailbox stamp, no shared queue
- the producer can short-circuit small jobs by CAS-ing the worker's mailbox to DONE itself and running the body inline, with no wake at all (the worker's own ack races the producer's self-stamp, and the loser short-circuits)
- the join barrier is a per-slot done-epoch scan with cancellation riding the same epoch read, so no shared sense bit and no per-iteration cancel poll
- the worker's spin-entry rdtscp doubles as a store-buffer drain, so the producer sees the DONE stamp before its next mailbox read (a free side benefit of timing the spin)
- kCacheLine is 128 bytes rather than 64 because Zen prefetches in cache-line pairs and contended atomics get measurably worse if you size to 64.
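For readers who want a mental model of "per-slot mailbox stamp, no shared queue", here is a heavily simplified sketch. It is not citor's code: `Mailbox`, `MiniPool`, and the `posted` / `done` field names are assumptions made up for illustration, workers yield instead of timing their spin, and the producer short-circuit, cancellation, and rdtscp details are left out. Only the 128-byte `kCacheLine` value comes from the list above.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

inline constexpr std::size_t kCacheLine = 128;  // value quoted in the post

// Hypothetical per-slot mailbox: one aligned line per worker, so a stamp for
// slot A never contends with a stamp for slot B.
struct alignas(kCacheLine) Mailbox {
    std::atomic<std::uint64_t> posted{0};  // epoch of the latest job posted
    std::atomic<std::uint64_t> done{0};    // epoch of the latest job finished
};

class MiniPool {
public:
    explicit MiniPool(std::size_t slots) : boxes_(slots) {
        assert(slots > 0);
        for (std::size_t s = 1; s < slots; ++s)
            workers_.emplace_back([this, s] { workerLoop(s); });
    }
    ~MiniPool() {
        stop_.store(true, std::memory_order_release);
        for (auto& t : workers_) t.join();
    }

    // Runs body(slot) once per slot; the calling thread acts as slot 0.
    void run(const std::function<void(std::size_t)>& body) {
        body_ = &body;  // published to workers by the release stores below
        const std::uint64_t epoch = ++epoch_;
        for (std::size_t s = 1; s < boxes_.size(); ++s)
            boxes_[s].posted.store(epoch, std::memory_order_release);
        body(0);
        // Join barrier: scan each slot's done epoch, no shared counter.
        for (std::size_t s = 1; s < boxes_.size(); ++s)
            while (boxes_[s].done.load(std::memory_order_acquire) != epoch)
                std::this_thread::yield();
    }

private:
    void workerLoop(std::size_t s) {
        std::uint64_t seen = 0;
        for (;;) {
            const std::uint64_t e =
                boxes_[s].posted.load(std::memory_order_acquire);
            if (e == seen) {
                if (stop_.load(std::memory_order_acquire)) return;
                std::this_thread::yield();
                continue;
            }
            (*body_)(s);
            seen = e;
            boxes_[s].done.store(e, std::memory_order_release);
        }
    }

    std::vector<Mailbox> boxes_;
    const std::function<void(std::size_t)>* body_ = nullptr;
    std::uint64_t epoch_ = 0;  // only touched by the producer thread
    std::atomic<bool> stop_{false};
    std::vector<std::thread> workers_;
};
```

Calling `pool.run(...)` twice on a `MiniPool pool(4)` executes the body exactly once per slot per call. The sketch assumes a single producer and that `run` is never called concurrently.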
For perf, I wrote a comparative harness against BS::thread_pool, dp::thread_pool, task-thread-pool, riften, oneTBB, Taskflow, Eigen, OpenMP, Leopard, dispenso, libfork, and TooManyCooks. Competitor revisions are pinned, host gates are printed at startup, OpenMP wait policy is normalized, and raw samples can be exported as JSON.
In my current benchmark sweep, citor wins roughly:
- 92% of contested cells on a Ryzen 9950X3D
- 75% on a 96-core Genoa box
- 69% on a 48-core Sapphire Rapids box
Hot fan-out dispatch on the 9950X3D is usually in the 100-400 ns range depending on participant count and shape.
Please treat those as "my harness, on my machines or AWS," not universal truth. If the numbers matter to your use case, run the benchmark yourself. The README has the methodology and reproduction commands.
There is real work left:
- topology detection is still shaped mostly around Zen CCDs
- multi-socket EPYC, sub-NUMA clustering, hybrid P/E cores, and Intel mesh are not first-class yet
- parallelReduce uses static contiguous chunks and does not steal after a worker finishes, so heavy-tail bodies can leave cores idle
- the coroutine wrapper queues on a per-pool driver thread rather than doing continuation stealing
- bulkForQueries only fans across queries today; a true 2D fan is probably the next useful shape.
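The heavy-tail point about static contiguous chunks is easy to see with a little arithmetic. The sketch below is a standalone model, not citor code: `staticChunkCosts` and `makespan` are hypothetical helpers, and the per-item costs are abstract units rather than measurements.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical model: total cost landing on each of `parts` static
// contiguous chunks, given an abstract cost per item.
std::vector<std::uint64_t> staticChunkCosts(
    const std::vector<std::uint64_t>& itemCost, std::size_t parts) {
    assert(parts > 0);
    std::vector<std::uint64_t> chunkCost(parts, 0);
    const std::size_t n = itemCost.size();
    for (std::size_t p = 0; p < parts; ++p) {
        const std::size_t lo = n * p / parts;        // chunk p is [lo, hi)
        const std::size_t hi = n * (p + 1) / parts;
        for (std::size_t i = lo; i < hi; ++i) chunkCost[p] += itemCost[i];
    }
    return chunkCost;
}

// Without stealing, the phase finishes when the most loaded chunk finishes.
std::uint64_t makespan(const std::vector<std::uint64_t>& chunkCost) {
    return chunkCost.empty()
               ? 0
               : *std::max_element(chunkCost.begin(), chunkCost.end());
}
```

With 16 items where the first 12 cost 1 unit and the last 4 cost 10 units, four static chunks cost 4, 4, 4, and 40. The phase takes 40 units while three workers sit idle after 4, against 13 units for a perfectly balanced split of the same 52 units of work.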
What citor is not:
- not an I/O executor
- not a general async/future abstraction
- not a TBB or OpenMP replacement for arbitrary workloads
- not tuned equally for every CPU topology
I'd especially like feedback on benchmark fairness, API shape before 1.0, missing competitors, and whether the affinity / pinning behavior is too surprising for a library like this. Suggestions for perf improvements are of course welcome too. If anything in the README reads like overclaiming, I'd rather fix it now.