u/ShabelonMagician

I just released citor, a small header-only C++20 thread pool / parallel runtime aimed at CPU-bound workloads where per-dispatch latency actually shows up in the profile.

Repo: https://github.com/Lallapallooza/citor

The main idea is: keep the common CPU-parallel shapes in one pool, avoid per-call allocations on the hot path, let the producer participate as slot 0, and make short repeated phases cheaper than repeatedly waking a worker team.

The simplest thing looks like what you'd expect:

citor::ThreadPool pool(8);

pool.parallelFor<citor::HintsDefaults>(
    0, data.size(),
    [&](std::size_t lo, std::size_t hi) {
        for (std::size_t i = lo; i < hi; ++i)
            data[i] *= 2;
    });

Beyond parallelFor, it has deterministic parallelReduce, parallelScan, parallelChain, runPlex for repeated phases over the same partition, recursive forkJoin with per-worker Chase-Lev deques, bulkForQueries, and submitDetached. There is also a PoolGroup that creates one arena per shared-L3 group, mostly useful on multi-CCD Zen.

A few internals that ended up mattering more than I expected:

each worker owns a cache-line-aligned mailbox, and the whole dispatch protocol is a per-slot mailbox stamp with no shared queue
the producer can short-circuit small jobs by CAS-ing the worker's mailbox to DONE itself and running the body inline, with no wake at all (the worker's own ack races the producer's self-stamp, and the loser short-circuits)
the join barrier is a per-slot done-epoch scan with cancellation riding the same epoch read, so no shared sense bit and no per-iteration cancel poll
the worker's spin-entry rdtscp doubles as a store-buffer drain, so the producer sees the DONE stamp before its next mailbox read - free side benefit of timing the spin
kCacheLine is 128 bytes rather than 64 because Zen prefetches in cache-line pairs and contended atomics get measurably worse if you size to 64.
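
To make the mailbox race above a bit more concrete, here is a minimal standalone sketch in plain standard C++. It is not citor's code: Mailbox, tryClaim, runOnce, and the four state constants are hypothetical names invented for illustration, and the real protocol (epochs, cancellation, spin timing) is not modeled. It only shows the core idea that one CAS decides whether the worker or the producer runs the body.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>

// All names here are hypothetical; this is not citor's actual layout or API.
constexpr std::size_t kCacheLine = 128;  // value taken from the post

constexpr std::uint32_t kIdle = 0;
constexpr std::uint32_t kPosted = 1;   // job published, nobody owns it yet
constexpr std::uint32_t kClaimed = 2;  // worker acked and is running the body
constexpr std::uint32_t kDone = 3;     // body finished, or producer took it inline

// One mailbox per worker slot, padded so neighbours never share a line.
struct alignas(kCacheLine) Mailbox {
    std::atomic<std::uint32_t> state{kIdle};
};
static_assert(sizeof(Mailbox) == kCacheLine);

// Exactly one caller can move the mailbox out of kPosted.
inline bool tryClaim(Mailbox& m, std::uint32_t to) {
    std::uint32_t expected = kPosted;
    return m.state.compare_exchange_strong(expected, to,
                                           std::memory_order_acq_rel,
                                           std::memory_order_acquire);
}

// Posts one job, lets worker and producer race for it, returns how many
// times the body actually ran.
inline int runOnce() {
    Mailbox box;
    std::atomic<int> bodyRuns{0};
    auto body = [&] { bodyRuns.fetch_add(1, std::memory_order_relaxed); };

    box.state.store(kPosted, std::memory_order_release);

    std::thread worker([&] {
        // Worker ack: if it wins, it runs the body and stamps DONE.
        if (tryClaim(box, kClaimed)) {
            body();
            box.state.store(kDone, std::memory_order_release);
        }
        // Otherwise the producer already self-stamped; nothing to do.
    });

    if (tryClaim(box, kDone)) {
        body();  // producer won: run inline, worker never touches the job
    } else {
        // Worker won: the producer just waits for the DONE stamp.
        while (box.state.load(std::memory_order_acquire) != kDone)
            std::this_thread::yield();
    }

    worker.join();
    return bodyRuns.load();
}
```

Calling runOnce() in a loop should always return 1, whichever side wins the CAS. A real pool would keep the worker parked or spinning rather than spawning a thread per job; the thread here only exists to create the race.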

For perf, I wrote a comparative harness against BS::thread_pool, dp::thread_pool, task-thread-pool, riften, oneTBB, Taskflow, Eigen, OpenMP, Leopard, dispenso, libfork, and TooManyCooks. Competitor revisions are pinned, host gates are printed at startup, OpenMP wait policy is normalized, and raw samples can be exported as JSON.

In my current benchmark sweep, citor wins roughly:

92% of contested cells on a Ryzen 9950X3D
75% on a 96-core Genoa box
69% on a 48-core Sapphire Rapids box

Hot fan-out dispatch on the 9950X3D is usually in the 100-400 ns range depending on participant count and shape.

Please treat those as "my harness on my machines or AWS," not universal truth. If the numbers matter to your use case, run the benchmark yourself. The README has the methodology and reproduction commands.

There is real work left:

topology detection is still shaped mostly around Zen CCDs
multi-socket EPYC, sub-NUMA clustering, hybrid P/E cores, and Intel mesh are not first-class yet
parallelReduce uses static contiguous chunks and does not steal after a worker finishes, so heavy-tail bodies can leave cores idle
the coroutine wrapper queues on a per-pool driver thread rather than doing continuation stealing
bulkForQueries only fans across queries today; a true 2D fan is probably the next useful shape
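
For anyone wondering what "static contiguous chunks, no stealing" means in practice for the reduce item above, here is a rough sketch using only the standard library. staticChunkReduce is a hypothetical helper written for this post, not citor's parallelReduce, and it spawns threads per call, which the real pool obviously avoids. The point is the shape: each slot gets one fixed range, and partials are combined in slot order, so the result is reproducible for a given slot count.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Hypothetical helper for illustration; not citor's API.
template <class T, class Op>
T staticChunkReduce(const std::vector<T>& data, std::size_t slots,
                    T identity, Op op) {
    slots = std::max<std::size_t>(1, slots);
    const std::size_t n = data.size();
    const std::size_t chunk = (n + slots - 1) / slots;  // ceil(n / slots)

    // One partial per slot. A real implementation would pad these to
    // avoid false sharing; omitted to keep the sketch short.
    std::vector<T> partial(slots, identity);

    auto runSlot = [&](std::size_t s) {
        const std::size_t lo = std::min(n, s * chunk);
        const std::size_t hi = std::min(n, lo + chunk);
        T acc = identity;
        for (std::size_t i = lo; i < hi; ++i) acc = op(acc, data[i]);
        partial[s] = acc;  // a slot that finishes early just idles until join
    };

    std::vector<std::thread> workers;
    for (std::size_t s = 1; s < slots; ++s) workers.emplace_back(runSlot, s);
    runSlot(0);  // the caller participates as slot 0
    for (auto& w : workers) w.join();

    // Fixed combine order (slot 0, 1, 2, ...) keeps the result independent
    // of which thread happened to finish first.
    T total = identity;
    for (const T& p : partial) total = op(total, p);
    return total;
}
```

The heavy-tail problem is visible in runSlot: if one range is much more expensive than the others, the remaining slots have nothing to pick up and wait at the join. Stealing would fix the idle time, but it would need care to keep the combine order fixed.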

What citor is not:

not an I/O executor
not a general async/future abstraction
not a TBB or OpenMP replacement for arbitrary workloads
not tuned equally for every CPU topology

I'd especially like feedback on benchmark fairness, API shape before 1.0, missing competitors, and whether the affinity / pinning behavior is too surprising for a library like this. Suggestions for perf improvements are very welcome too. If anything in the README reads like overclaiming, I'd rather fix it now.

citor: a header-only C++20 thread pool tuned for sub-µs dispatch