
u/Diligent_Comb5668

So I put Groth16 on a microcontroller
Yeah so I wanted to see if you can actually verify a zero-knowledge proof on a real microcontroller. Not a phone, not a Raspberry Pi 4 running Linux, an actual MCU with kilobytes of RAM.
Turns out you can. Here is the repo: Niek-Kamer/zkmcu.
What it does
It verifies Groth16 proofs over the BN254 curve on the Raspberry Pi Pico 2 W. BN254 is the same curve Ethereum uses for its pairing precompiles (EIP-196/EIP-197), so any Ethereum compatible prover produces proofs that zkmcu verifies, byte for byte.
Two firmwares, one Rust source tree. Runs on the Cortex-M33 core and on the Hazard3 RISC-V core that are both sitting on the same RP2350 die. Same crypto code, different ISA, same test vectors.
Why bother
Okay so what annoyed me into this: right now there is basically no option to verify a ZK proof on an actual device. Like a hardware wallet. Ledger doesn't do it, Trezor doesn't, Keystone/OneKey/Tangem, none of them. Every ZK verify in the wild runs on a phone or a server.
But there's a whole class of stuff that wants on device verification:
- Hardware wallets checking a ZK claim locally, without trusting the USB host.
- Offline credential validators (transit, festivals, borders).
- IoT attestation where the relying party is itself a small chip.
- Supply chain provenance tags that verify certs without internet.
Nobody ships this. The closest prior art I could find is ZPiE from 2021, which runs C on a Raspberry Pi Zero under Linux. A Pi Zero is a tiny Linux computer, not a microcontroller IMO.
The numbers
Measured on the chip, from the serial log. No emulation, no extrapolation.
| Operation | Cortex-M33 | Hazard3 RV32 | Ratio |
|---|---|---|---|
| G1 scalar mul (typical) | 16.5 M cyc / 110 ms | 10.8 M cyc / 72 ms | 0.65× (RV32 wins lol) |
| G2 scalar mul (typical) | 31.5 M cyc / 210 ms | 42.5 M cyc / 284 ms | 1.35× |
| BN254 pairing | 80.0 M cyc / 533 ms | 105.1 M cyc / 701 ms | 1.31× |
| Full Groth16 verify | 144.4 M cyc / 962 ms | 201.2 M cyc / 1341 ms | 1.39× |
962 ms for a full Groth16 verify on a $7 chip at 150 MHz. Variance across iterations: 0.06%. Pretty insane stability.
Your 9950X3D running the same substrate-bn code would have 10 to 50% variance from cache contention, frequency boost, and OS preemption. The consistency on the MCU is of course because the pipeline is in order, the clock is fixed, and nothing else is running on the chip. It's a feature for anything with timing side channel concerns or a hard verify deadline.
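If you want to sanity check the table, the ms columns are just cycles divided by the 150 MHz clock, and the ratio column is Hazard3 cycles over M33 cycles. A minimal sketch of that arithmetic (the function names are made up for illustration, they are not from the repo):

```rust
/// Core clock quoted in the post: 150 MHz on both cores.
pub const CLOCK_HZ: u64 = 150_000_000;

/// Convert a raw cycle count into milliseconds at CLOCK_HZ.
pub fn cycles_to_ms(cycles: u64) -> f64 {
    cycles as f64 * 1_000.0 / CLOCK_HZ as f64
}

/// Ratio column of the table: Hazard3 cycles divided by Cortex-M33 cycles.
/// Values below 1.0 mean the RISC-V core needed fewer cycles.
pub fn cycle_ratio(hazard3_cycles: u64, m33_cycles: u64) -> f64 {
    hazard3_cycles as f64 / m33_cycles as f64
}
```

Feeding in the full verify row (144.4 M and 201.2 M cycles) gives roughly 963 ms and a ratio of about 1.39, which matches the table up to rounding.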
RAM
Stack painting + a tracking allocator wrapped around embedded_alloc gave me these:
| Region | Bytes |
|---|---|
| .text | 73 KB |
| Peak stack during verify | 15.6 KB |
| Peak heap during verify | 79.4 KB |
| Heap arena (confirmed sufficient) | 96 KB |
| Total RAM | ~112 KB |
So zkmcu fits on any 128 KB SRAM class MCU. That includes nRF52832, STM32F405, Ledger's ST33K1M5, Infineon SLE78, basically the whole hardware wallet grade silicon category.
64 KB SRAM doesn't fit. I was optimistic about 64 KB earlier in the week and the actual measurements corrected me lol. If you want 64 KB you need serial pairings (~2× verify cost) or a Nova style native verifier, so different math, different project.
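For readers who haven't built one, a peak tracking allocator is a thin wrapper that counts live bytes and remembers the high-water mark. The sketch below shows the general idea on a host machine, with `std::alloc::System` standing in for the embedded heap. `TrackingAlloc` and its methods are hypothetical names, not the ones used in zkmcu.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

/// Wraps any allocator and records current and peak live bytes.
/// Hypothetical helper for illustration only.
pub struct TrackingAlloc<A> {
    inner: A,
    current: AtomicUsize,
    peak: AtomicUsize,
}

impl<A> TrackingAlloc<A> {
    pub const fn new(inner: A) -> Self {
        Self {
            inner,
            current: AtomicUsize::new(0),
            peak: AtomicUsize::new(0),
        }
    }

    /// Bytes currently allocated through this wrapper.
    pub fn current(&self) -> usize {
        self.current.load(Ordering::Relaxed)
    }

    /// Highest value `current` has ever reached.
    pub fn peak(&self) -> usize {
        self.peak.load(Ordering::Relaxed)
    }
}

unsafe impl<A: GlobalAlloc> GlobalAlloc for TrackingAlloc<A> {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        let ptr = unsafe { self.inner.alloc(layout) };
        if !ptr.is_null() {
            // Only count allocations that actually succeeded.
            let now = self.current.fetch_add(layout.size(), Ordering::Relaxed) + layout.size();
            self.peak.fetch_max(now, Ordering::Relaxed);
        }
        ptr
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { self.inner.dealloc(ptr, layout) };
        self.current.fetch_sub(layout.size(), Ordering::Relaxed);
    }
}
```

On a host you would register it with `#[global_allocator] static HEAP: TrackingAlloc<System> = TrackingAlloc::new(System);` and read `HEAP.peak()` after the verify call. On the MCU the wrapped allocator would be the embedded heap instead of `System`.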
The RISC-V surprise
Conventional wisdom: ARM wins at big integer math because of the SMLAL/UMAAL multiply-accumulate instructions.
Yeah well substrate-bn is pure Rust and doesn't use any intrinsics. So that ARM advantage is just not realized. Meanwhile Hazard3 has 31 general purpose registers vs Thumb-2's 13, which means way less register spilling during schoolbook Fp multiplication.
Net result: G1 scalar mul takes 35% fewer cycles on the Hazard3 RISC-V core than on the Cortex-M33. Same Rust source, same LTO pass, same 150 MHz clock. I haven't seen this called out anywhere and I looked.
On the higher order stuff (Fp², Fp¹², the full pairing) other factors flip the result and the M33 wins: Hazard3 needs about 39% more cycles for the overall verify. But the G1 number is genuinely new.
Two open problems for someone with time:
- What happens if `substrate-bn` gets hand tuned to SMLAL on Cortex-M? Gap probably inverts.
- What happens with Hazard3 `Zbb`/`Zba` bit manip extensions actually enabled? Currently unused.
Both are probably 2-3× wins. Combined, verify plausibly drops from 1 second to ~150 ms. Would be pretty damn impressive.
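To make the register pressure argument concrete, here is what a schoolbook 256-bit multiply looks like in plain Rust. This is an illustrative sketch, not the substrate-bn code: the 8 × 32-bit limb layout and the function name are assumptions picked for readability.

```rust
/// Schoolbook multiplication of two 256-bit integers stored as eight
/// little-endian 32-bit limbs. Returns the full 512-bit product.
/// Illustrative only; the limb width is an assumption, not what any
/// particular library uses.
pub fn mul_schoolbook(a: &[u32; 8], b: &[u32; 8]) -> [u32; 16] {
    let mut out = [0u32; 16];
    for i in 0..8 {
        let mut carry: u64 = 0;
        for j in 0..8 {
            // One multiply-accumulate step: product + previous partial sum + carry.
            // The maximum value is exactly 2^64 - 1, so this never overflows a u64.
            let acc = (a[i] as u64) * (b[j] as u64) + (out[i + j] as u64) + carry;
            out[i + j] = acc as u32; // low 32 bits stay in place
            carry = acc >> 32;       // high 32 bits ripple to the next limb
        }
        // Row i is the first to touch limb i + 8, so a plain store is enough.
        out[i + 8] = carry as u32;
    }
    out
}
```

The inner loop keeps a limb of `a`, a limb of `b`, the running carry, and the partial sums live at the same time. That working set is what ends up spilled to the stack when registers run short, and the single `acc` line is what a fused multiply-accumulate instruction like UMAAL would do in one go.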
What broke along the way
A few things that ate hours and deserve documenting so the next person doesn't repeat them.
- My `elf2uf2-rs` was old enough to write an RP2040 family ID. The RP2350 boot ROM silently rejected the UF2 and left the Pico sitting in BOOTSEL. No error log, nothing. Took me like 20 minutes of staring before I figured it out. Just use `picotool load -t elf` and skip UF2 entirely.
- Hazard3 boots with `mcountinhibit[CY] = 1`. The cycle counter reads zero until you write `csrw mcountinhibit, zero` somewhere early in `main`. My first RV32 run reported `cycles=0` on every benchmark. Correctness was fine, timing was just dead. Fix is literally one instruction. This is in the Hazard3 README btw, but nobody writes about it in a crypto benchmarking context.
- `arkworks` 0.5 pins `rand` to 0.8. If you bump `rand` past that, your `ark_snark::SNARK::prove` won't compile because `rand_core` 0.10 is a trait level rewrite. Wait for arkworks 0.6 or just don't bump.
- The adversarial tests I wrote found a 412 GB allocation DoS in my own VK parser lmao. `Vec::with_capacity(num_ic)` with attacker controlled `num_ic`. Set `num_ic = u32::MAX` in the input and the test runner just SIGABRTs. Would brick an MCU instantly. Patched with checked arithmetic + buffer length validation. Same pattern in `parse_public`. Both fixed in `v0.1.0`.
- `substrate-bn::Fr::from_slice` silently reduces non canonical encodings. For verify correctness it doesn't matter (Fr reduction is mod `r`, pairing result is invariant) but if you're using the raw Fr bytes as an identity (nullifier, replay tag, merkle leaf) that's a malleability vuln. Added a strict `< r` check at our call site.
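The allocation DoS fix boils down to one rule: never size an allocation from a length field until you have checked that the input actually contains that many elements. A minimal sketch of the pattern follows. The wire format here (4-byte little-endian count followed by 64-byte points) and every name in it are hypothetical, not the actual zkmcu VK format.

```rust
#[derive(Debug, PartialEq)]
pub enum ParseError {
    /// Input ended before the declared data did.
    Truncated,
    /// count * element size does not fit in usize.
    CountOverflow,
}

/// Assumed size of one encoded point in this toy format.
pub const POINT_BYTES: usize = 64;

/// Parse `count || point[0] || point[1] || ...` without trusting `count`.
pub fn parse_points(input: &[u8]) -> Result<Vec<[u8; POINT_BYTES]>, ParseError> {
    if input.len() < 4 {
        return Err(ParseError::Truncated);
    }
    let (head, rest) = input.split_at(4);
    let count = u32::from_le_bytes([head[0], head[1], head[2], head[3]]) as usize;

    // Checked arithmetic: on a 32-bit MCU this product can wrap around.
    let need = count
        .checked_mul(POINT_BYTES)
        .ok_or(ParseError::CountOverflow)?;

    // Buffer length validation: the claimed data must really be present.
    if need > rest.len() {
        return Err(ParseError::Truncated);
    }

    // Only now is it safe to allocate: count is bounded by the input size.
    let mut points = Vec::with_capacity(count);
    for chunk in rest[..need].chunks_exact(POINT_BYTES) {
        let mut point = [0u8; POINT_BYTES];
        point.copy_from_slice(chunk);
        points.push(point);
    }
    Ok(points)
}
```

With this shape, a header claiming `u32::MAX` points on a 4-byte input returns an error instead of asking the allocator for hundreds of gigabytes.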
All of this is in /research/postmortems/findings/ as dated postmortems. Future maintenance starts there.
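The strict `< r` check mentioned above can be done on the raw bytes before they ever reach the field type. Here is a sketch, assuming the scalar arrives as 32 big-endian bytes. `is_canonical_fr` is a hypothetical name, and the constant is the BN254 scalar field modulus `r`.

```rust
/// BN254 scalar field modulus r, big-endian.
/// r = 0x30644e72e131a029b85045b68181585d2833e84879b9709143e1f593f0000001
pub const FR_MODULUS_BE: [u8; 32] = [
    0x30, 0x64, 0x4e, 0x72, 0xe1, 0x31, 0xa0, 0x29,
    0xb8, 0x50, 0x45, 0xb6, 0x81, 0x81, 0x58, 0x5d,
    0x28, 0x33, 0xe8, 0x48, 0x79, 0xb9, 0x70, 0x91,
    0x43, 0xe1, 0xf5, 0x93, 0xf0, 0x00, 0x00, 0x01,
];

/// Returns true only if `bytes`, read as a big-endian integer, is
/// strictly less than r. For equal-length big-endian byte strings,
/// lexicographic order is the same as numeric order.
pub fn is_canonical_fr(bytes: &[u8; 32]) -> bool {
    bytes.as_slice() < FR_MODULUS_BE.as_slice()
}
```

Rejecting anything that fails this check means each field element has exactly one accepted byte encoding, which is the property a nullifier or replay tag needs. The comparison is not constant time, which is fine when the inputs are public.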
Public input scaling is weirder than I expected
Most treatments of Groth16 verify cost treat "cost per public input" as a constant. It's not.
The vk_x = ic[0] + Σ x[i]·ic[i+1] step is one G1 scalar mul per public input. G1 scalar mul cost depends on the numerical size of the scalar, not just the count:
| Input shape | Scalar size (bits) | Extra ms per input |
|---|---|---|
| counter / index | < 16 | ~3 ms |
| Ethereum address | ~160 | ~40 ms |
| Merkle root | 256 random | ~67 ms |
A 10-input circuit can take anywhere from 990 ms (ten counters) to 1600 ms (ten Merkle roots). That's a ~20× spread in per-input cost (3 ms vs 67 ms) just from how many bits the public inputs actually contain.
If you're designing a circuit and deciding whether to pack small values into one Fr vs leaving them separate, this matters a lot.
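One way to see why the bit count matters: in a plain double-and-add scalar multiplication that skips leading zero bits, the work is roughly one point doubling per bit of the scalar plus one point addition per set bit. The sketch below only counts those operations. It assumes that simple algorithm and a 32-byte big-endian scalar, which may not be exactly what substrate-bn does internally, and the function name is made up.

```rust
/// Rough operation count for double-and-add on a 256-bit big-endian
/// scalar, assuming leading zero bits are skipped.
/// Returns (bit_length, set_bits): about one doubling per bit and one
/// addition per set bit. A cost model, not a measurement.
pub fn double_and_add_ops(scalar_be: &[u8; 32]) -> (u32, u32) {
    let mut bit_len = 0u32;
    let mut set_bits = 0u32;
    for (i, byte) in scalar_be.iter().enumerate() {
        set_bits += byte.count_ones();
        if bit_len == 0 && *byte != 0 {
            // First non-zero byte fixes the position of the top set bit.
            let bytes_below = 31 - i as u32;
            bit_len = bytes_below * 8 + (8 - byte.leading_zeros());
        }
    }
    (bit_len, set_bits)
}
```

A 16-bit counter tops out at 16 doublings and 16 additions, while a random 256-bit value needs around 256 doublings and about 128 additions. That lines up in spirit with the ~3 ms vs ~67 ms rows in the table.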
What this unlocks
- Hardware wallets with local ZK verification on any 128 KB SRAM Secure Element.
- Offline credential verifiers for real world deployments.
- IoT attestation where the receiver is itself a $7 chip.
- First public benchmark of BN254 pairing math on real RISC-V silicon, with a surprise result.
Read it
- Code: Niek-Kamer/zkmcu, MIT OR Apache-2.0.
- Whitepaper + prior art survey + session report in `research/out/*.pdf`.
- Per run benchmark data in `benchmarks/runs/*/result.toml`.
So yeah that was my weekend.