
Distributed Checkpoint Storage from scratch using 4x Raspberry Pis
- your model just finished training after 3 days.
- you go to load the checkpoint.
- disk failure.
gone.
I know the obvious answer is “just upload checkpoints to Hugging Face/S3/etc”, but I wanted to understand what actually happens underneath distributed storage systems, so I built a tiny checkpoint replication system from scratch over raw TCP sockets.
The goal was simple: replicate training checkpoints across cheap cluster nodes so a single SSD/SD-card death wouldn’t kill long-running training.
A few interesting engineering problems popped up while building it:
- checkpoint writes are not atomic → watcher sometimes detects partially-written safetensors
- slow Raspberry Pi SD cards created backpressure during parallel shard replication
- retry logic without checksums caused silent corruption bugs early on
- mDNS discovery sounds simple until nodes disappear/rejoin mid-transfer
- shard sizing mattered much more than expected because tiny shards killed throughput with socket overhead
Current design:
- coordinator splits safetensors into shards
- each shard replicated to 2 workers
- SHA-256 verification on every transfer
- automatic fallback to replica during restore
- filesystem watcher retries incomplete checkpoints until finalized
- Prometheus/Grafana/Loki stack for monitoring + alerts
Setup I tested on: Mac Mini M4 coordinator + 4 Raspberry Pi workers, though any Linux/macOS mix should work.
Honestly the most useful part wasn’t even the storage system itself — it forced me to finally understand TCP flow control, retries, backpressure, partial writes, and distributed failure handling in a very practical way.
Curious how others here handle checkpoint durability on small/home clusters without relying entirely on cloud object storage.
Fully open source.
Here's exactly how it works:
- Store: Coordinator splits the .safetensors into N shards, computes SHA-256 for each, sends in parallel with retry + exponential backoff. Every shard lives on TWO machines.
- Gather: Pull from primaries. One node dead? Silently falls back to replica and reassembles merged.safetensors.
- Watcher: Daemon auto-detects new checkpoints, syncs them live. Still writing? Goes to pending queue and retries every 10s. Fully hands-off.
- Discovery: Workers auto-advertise via mDNS. No hardcoded IPs. Add/remove nodes like magic.
Setup is whatever you already have: I used a Mac mini M4 as coordinator + 4× Raspberry Pi 4 workers. Any Linux/macOS mix works.
Monitoring? Prometheus + Grafana + Loki in Docker. Per-shard speeds, error counts, unified logs, email alerts if anything goes unrecoverable. No SSH hell.
One yaml config. One launch.sh. Done.
If you're training on a home/dorm cluster and living in fear of losing 3-day runs… this is for you.