u/Diligent_Comb5668

Yeah so I wanted to see if you can actually verify a zero-knowledge proof on a real microcontroller. Not a phone, not a Raspberry Pi 4 running Linux, an actual MCU with kilobytes of RAM.

Turns out you can. Here is the repo: Niek-Kamer/zkmcu.

What it does

It verifies Groth16 proofs over the BN254 curve on the Raspberry Pi Pico 2 W. BN254 is the same curve Ethereum uses for its pairing precompile (EIP-196/EIP-197), so any Ethereum-compatible prover produces proofs that zkmcu verifies, byte for byte.

Two firmwares, one Rust source tree. Runs on the Cortex-M33 core and on the Hazard3 RISC-V core that are both sitting on the same RP2350 die. Same crypto code, different ISA, same test vectors.

Why bother

Okay so what annoyed me into this: right now there is basically no option to verify a ZK proof on an actual device. Like a hardware wallet. Ledger doesn't do it, Trezor doesn't, Keystone/OneKey/Tangem, none of them. Every ZK verify in the wild runs on a phone or a server.

But there's a whole class of stuff that wants on-device verification:

Hardware wallets checking a ZK claim locally, without trusting the USB host.
Offline credential validators (transit, festivals, borders).
IoT attestation where the relying party is itself a small chip.
Supply chain provenance tags that verify certs without internet.

Nobody ships this. The closest prior art I could find is ZPiE from 2021, which runs C on a Raspberry Pi Zero under Linux. A Pi Zero is a tiny Linux computer, not a microcontroller IMO.

The numbers

Measured on the chip, from the serial log. No emulation, no extrapolation.

Operation	Cortex-M33	Hazard3 RV32	Ratio
G1 scalar mul (typical)	16.5 M cyc / 110 ms	10.8 M cyc / 72 ms	0.65× (RV32 wins lol)
G2 scalar mul (typical)	31.5 M cyc / 210 ms	42.5 M cyc / 284 ms	1.35×
BN254 pairing	80.0 M cyc / 533 ms	105.1 M cyc / 701 ms	1.31×
Full Groth16 verify	144.4 M cyc / 962 ms	201.2 M cyc / 1341 ms	1.39×

962 ms for a full Groth16 verify on a $7 chip at 150 MHz. Variance across iterations: 0.06%. Pretty insane stability.
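The ms and ratio columns follow directly from the cycle counts and the 150 MHz clock. A minimal sketch of that arithmetic, assuming the rounded cycle counts from the table (the function names here are made up for illustration and are not taken from the repo):

```rust
/// Core clock of the RP2350 as used in the post: 150 MHz.
pub const CLOCK_HZ: u64 = 150_000_000;

/// Convert a raw cycle count into milliseconds at a fixed clock.
pub fn cycles_to_ms(cycles: u64, clock_hz: u64) -> f64 {
    cycles as f64 * 1_000.0 / clock_hz as f64
}

/// Ratio column of the table: Hazard3 cycles divided by Cortex-M33 cycles.
/// Below 1.0 means the RISC-V core needed fewer cycles.
pub fn cycle_ratio(rv32_cycles: u64, m33_cycles: u64) -> f64 {
    rv32_cycles as f64 / m33_cycles as f64
}
```

With the table values, 144.4 M cycles comes out just under 963 ms, and 201.2 M / 144.4 M gives the 1.39× in the last row. The table entries are rounded, so the last digit can differ by one.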

Your 9950X3D running the same substrate-bn code would have 10 to 50% variance from cache contention, frequency boost, and OS preemption. The consistency on the MCU is of course because the pipeline is in order, the clock is fixed, and nothing else is running on the chip. It's a feature for anything with timing side channel concerns or a hard verify deadline.

RAM

Stack painting + a tracking allocator wrapped around embedded_alloc gave me these:

Region	Size
`.text`	73 KB
Peak stack during verify	15.6 KB
Peak heap during verify	79.4 KB
Heap arena (confirmed sufficient)	96 KB
Total RAM	~112 KB
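For anyone who wants to reproduce the heap number, here is a minimal sketch of a peak-tracking allocator wrapper. It is a host-side stand-in, not the firmware code: it wraps `std::alloc::System` instead of embedded_alloc, and the `TrackingAlloc` type and its method names are assumptions made for this example.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

/// Wraps any allocator and records live bytes plus the high-water mark.
pub struct TrackingAlloc<A> {
    inner: A,
    current: AtomicUsize,
    peak: AtomicUsize,
}

impl<A> TrackingAlloc<A> {
    pub const fn new(inner: A) -> Self {
        Self {
            inner,
            current: AtomicUsize::new(0),
            peak: AtomicUsize::new(0),
        }
    }

    /// Highest number of live bytes seen since the last reset.
    pub fn peak_bytes(&self) -> usize {
        self.peak.load(Ordering::Relaxed)
    }

    /// Restart peak tracking from the current live size (call before verify).
    pub fn reset_peak(&self) {
        self.peak
            .store(self.current.load(Ordering::Relaxed), Ordering::Relaxed);
    }
}

unsafe impl<A: GlobalAlloc> GlobalAlloc for TrackingAlloc<A> {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        let ptr = unsafe { self.inner.alloc(layout) };
        if !ptr.is_null() {
            // Count requested bytes only; allocator overhead is not visible here.
            let now = self.current.fetch_add(layout.size(), Ordering::Relaxed) + layout.size();
            self.peak.fetch_max(now, Ordering::Relaxed);
        }
        ptr
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { self.inner.dealloc(ptr, layout) };
        self.current.fetch_sub(layout.size(), Ordering::Relaxed);
    }
}

#[global_allocator]
static ALLOC: TrackingAlloc<System> = TrackingAlloc::new(System);
```

Call `ALLOC.reset_peak()` right before the operation under test and read `ALLOC.peak_bytes()` afterwards. The counter only sees requested sizes, so the real arena needs some headroom on top, which is consistent with a 79.4 KB peak living in a 96 KB arena.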

So zkmcu fits on any 128 KB SRAM class MCU. That includes nRF52832, STM32F405, Ledger's ST33K1M5, Infineon SLE78, basically the whole hardware wallet grade silicon category.

64 KB SRAM doesn't fit. I was optimistic about 64 KB earlier in the week and the actual measurements corrected me lol. If you want 64 KB you need serial pairings (~2× verify cost) or a Nova style native verifier, so different math, different project.

The RISC-V surprise

Conventional wisdom: ARM wins at big integer math because of the SMLAL/UMAAL multiply-accumulate instructions.

Yeah well substrate-bn is pure Rust and doesn't use any intrinsics. So that ARM advantage is just not realized. Meanwhile Hazard3 has 31 general purpose registers vs Thumb-2's 13, which means way less register spilling during schoolbook Fp multiplication.
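To make the multiply-accumulate point concrete, here is a minimal sketch of schoolbook multiplication on 32-bit limbs. It is illustrative only and is not substrate-bn's code: the function name, the 8-limb little-endian layout, and the omission of the modular reduction step are all assumptions of this example.

```rust
/// Schoolbook 256 x 256 -> 512 bit multiplication.
/// Limbs are little-endian u32 words (limb 0 is least significant).
/// No reduction mod p is done here; a real Fp multiply would follow
/// this with a Montgomery or Barrett reduction.
pub fn mul_schoolbook(a: &[u32; 8], b: &[u32; 8]) -> [u32; 16] {
    let mut out = [0u32; 16];
    for i in 0..8 {
        let mut carry: u64 = 0;
        for j in 0..8 {
            // 32x32 product plus two 32-bit addends always fits in 64 bits:
            // (2^32 - 1)^2 + 2 * (2^32 - 1) = 2^64 - 1.
            let t = (a[i] as u64) * (b[j] as u64) + (out[i + j] as u64) + carry;
            out[i + j] = t as u32; // low word stays in place
            carry = t >> 32; // high word moves to the next column
        }
        out[i + 8] = carry as u32;
    }
    out
}
```

The inner statement is the shape that ARM's UMAAL computes in one instruction (32×32 product plus two 32-bit addends into a 64-bit result). Whether the compiler actually emits it from plain Rust is a separate question. The loop also keeps a lot of limbs live at once, which is where the register count starts to matter.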

Net result: G1 scalar mul takes 35% fewer cycles on the Hazard3 RISC-V core than on the Cortex-M33 (10.8 M vs 16.5 M). Same Rust source, same LTO pass, same 150 MHz clock. I haven't seen this called out anywhere and I looked.

On the higher order stuff (Fp², Fp¹², the full pairing) other factors flip the result, and Hazard3 ends up needing about 39% more cycles than the M33 for the overall verify (the 1.39× in the table). But the G1 number is genuinely new.

Two open problems for someone with time:

What happens if substrate-bn gets hand tuned to SMLAL on Cortex-M? Gap probably inverts.
What happens with Hazard3 Zbb/Zba bit manip extensions actually enabled? Currently unused.

Both probably 2-3× wins. Combined, verify plausibly drops from 1 second to ~150 ms. Would be pretty damn impressive.

What broke along the way

A few things that ate hours and deserve documenting so the next person doesn't repeat them.

My elf2uf2-rs was old enough to write an RP2040 family ID. The RP2350 boot ROM silently rejected the UF2 and left the Pico sitting in BOOTSEL. No error log, nothing. Took me like 20 minutes of staring before I figured it out. Just use picotool load -t elf and skip UF2 entirely.
Hazard3 boots with mcountinhibit[CY] = 1. The cycle counter reads zero until you write csrw mcountinhibit, zero somewhere early in main. My first RV32 run reported cycles=0 on every benchmark. Correctness was fine, timing was just dead. Fix is literally one instruction. This is in the Hazard3 README btw, but nobody writes about it in a crypto benchmarking context.
arkworks 0.5 pins rand to 0.8. If you bump rand past that, your ark_snark::SNARK::prove won't compile because rand_core 0.10 is a trait level rewrite. Wait for arkworks 0.6 or just don't bump.
The adversarial tests I wrote found a 412 GB allocation DoS in my own VK parser lmao. Vec::with_capacity(num_ic) with attacker controlled num_ic. Set num_ic = u32::MAX in the input and the test runner just SIGABRTs. Would brick an MCU instantly. Patched with checked arithmetic + buffer length validation. Same pattern in parse_public. Both fixed in v0.1.0.
substrate-bn::Fr::from_slice silently reduces non-canonical encodings. For verify correctness it doesn't matter (Fr reduction is mod r, pairing result is invariant) but if you're using the raw Fr bytes as an identity (nullifier, replay tag, merkle leaf) that's a malleability vuln. Added a strict < r check at our call site.
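The allocation DoS described above is a generic pattern, so here is a minimal sketch of the bounded version. The wire format is an assumption made for this example (4-byte little-endian count followed by 64-byte uncompressed G1 points), and `parse_ic`, `ParseError`, and `G1_LEN` are hypothetical names, not the zkmcu parser's actual API.

```rust
/// Assumed size of one uncompressed G1 point on the wire (x || y, 32 bytes each).
pub const G1_LEN: usize = 64;

#[derive(Debug, PartialEq, Eq)]
pub enum ParseError {
    /// Input is shorter than the header or the declared payload.
    Truncated,
    /// num_ic * G1_LEN does not fit in usize (possible on 32-bit targets).
    LengthOverflow,
}

/// Parse `num_ic` points, refusing to allocate more than the input can back.
pub fn parse_ic(input: &[u8]) -> Result<Vec<[u8; G1_LEN]>, ParseError> {
    if input.len() < 4 {
        return Err(ParseError::Truncated);
    }
    let (head, body) = input.split_at(4);
    let num_ic = u32::from_le_bytes([head[0], head[1], head[2], head[3]]) as usize;

    // Checked arithmetic: never trust an attacker-supplied count in a multiply.
    let need = num_ic
        .checked_mul(G1_LEN)
        .ok_or(ParseError::LengthOverflow)?;

    // Buffer length validation BEFORE allocating anything.
    if body.len() < need {
        return Err(ParseError::Truncated);
    }

    // num_ic is now bounded by the real input size, so this is safe.
    let mut ic = Vec::with_capacity(num_ic);
    for chunk in body[..need].chunks_exact(G1_LEN) {
        let mut point = [0u8; G1_LEN];
        point.copy_from_slice(chunk);
        ic.push(point);
    }
    Ok(ic)
}
```

With this ordering, a header claiming `u32::MAX` entries gets rejected after reading four bytes, and the largest possible allocation is capped by the size of the buffer the caller already holds.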

All of this is in /research/postmortems/findings/ as dated postmortems. Future maintenance starts there.
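The strict < r check from the Fr finding can be sketched without any curve library at all. This assumes the scalar arrives as 32 big-endian bytes (my understanding of what `Fr::from_slice` expects, so treat that as an assumption); `is_canonical_fr` and `FR_MODULUS_BE` are hypothetical names for this example.

```rust
/// BN254 scalar field order r, big-endian:
/// 0x30644e72e131a029b85045b68181585d2833e84879b9709143e1f593f0000001
pub const FR_MODULUS_BE: [u8; 32] = [
    0x30, 0x64, 0x4e, 0x72, 0xe1, 0x31, 0xa0, 0x29,
    0xb8, 0x50, 0x45, 0xb6, 0x81, 0x81, 0x58, 0x5d,
    0x28, 0x33, 0xe8, 0x48, 0x79, 0xb9, 0x70, 0x91,
    0x43, 0xe1, 0xf5, 0x93, 0xf0, 0x00, 0x00, 0x01,
];

/// True only if the big-endian encoding is already reduced (value < r).
/// For equal-length big-endian byte strings, lexicographic order equals
/// numeric order, so a plain array comparison is enough.
/// Not constant-time, which should be fine for public inputs.
pub fn is_canonical_fr(bytes: &[u8; 32]) -> bool {
    *bytes < FR_MODULUS_BE
}
```

Run this on the raw bytes before handing them to the field constructor and reject anything that fails. That way two different byte strings can never map to the same scalar in a nullifier or replay tag.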

Public input scaling is weirder than I expected

Most treatments of Groth16 verify cost treat "cost per public input" as a constant. It's not.

The vk_x = ic[0] + Σ x[i]·ic[i+1] step is one G1 scalar mul per public input. G1 scalar mul cost depends on the numerical size of the scalar, not just the count:

Input shape	Scalar bits	Extra ms per input
counter / index	< 2^16	~3 ms
Ethereum address	~160	~40 ms
Merkle root	256 random	~67 ms

A 10-input circuit can take anywhere from 990 ms (ten counters) to 1600 ms (ten Merkle roots). That is a roughly 20× spread in per-input cost (~3 ms vs ~67 ms), just from how many bits the public inputs actually contain.
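One way to see why the scalar size matters: a left-to-right double-and-add that skips leading zero bits does work proportional to the scalar's bit length. The sketch below only counts group operations and does no curve math. It assumes that kind of algorithm (whether substrate-bn's scalar mul is exactly this is an assumption here), and `double_and_add_ops` is a made-up name.

```rust
/// Count (doublings, additions) for a left-to-right double-and-add
/// over a 32-byte big-endian scalar, skipping leading zero bits.
pub fn double_and_add_ops(scalar_be: &[u8; 32]) -> (u32, u32) {
    let mut doublings = 0u32;
    let mut additions = 0u32;
    let mut found_one = false;

    for &byte in scalar_be.iter() {
        for shift in (0..8u32).rev() {
            let bit = (byte >> shift) & 1 == 1;
            if found_one {
                // Every bit after the leading one costs a doubling...
                doublings += 1;
                // ...and every set bit costs an addition on top.
                if bit {
                    additions += 1;
                }
            } else if bit {
                // The leading one just loads the base point, no group op.
                found_one = true;
            }
        }
    }
    (doublings, additions)
}
```

A 16-bit counter tops out at 15 doublings and 15 additions, while a random 254-bit field element needs around 250 doublings plus an addition for roughly every second bit. The operation counts do not map one-to-one onto the measured milliseconds, but they point in the same direction as the table.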

If you're designing a circuit and deciding whether to pack small values into one Fr vs leaving them separate, this matters a lot.

What this unlocks

Hardware wallets with local ZK verification on any 128 KB SRAM Secure Element.
Offline credential verifiers for real world offline deployments.
IoT attestation where the receiver is itself a $7 chip.
First public benchmark of BN254 pairing math on real RISC-V silicon, with a surprise result.

Read it

Code: Niek-Kamer/zkmcu, MIT OR Apache-2.0.
Whitepaper + prior art survey + session report in research/out/*.pdf.
Per run benchmark data in benchmarks/runs/*/result.toml.

So yeah that was my weekend.


So I put Groth16 on a microcontroller