u/East-Muffin-6472

Distributed Checkpoint Storage from scratch using 4x Raspberry Pis

  • your model just finished training after 3 days.
  • you go to load the checkpoint.
  • disk failure.

gone.

I know the obvious answer is “just upload checkpoints to Hugging Face/S3/etc”, but I wanted to understand what actually happens underneath distributed storage systems, so I built a tiny checkpoint replication system from scratch over raw TCP sockets.

The goal was simple: replicate training checkpoints across cheap cluster nodes so a single SSD/SD-card death wouldn’t kill long-running training.

A few interesting engineering problems popped up while building it:

  • checkpoint writes are not atomic → watcher sometimes detects partially-written safetensors
  • slow Raspberry Pi SD cards created backpressure during parallel shard replication
  • retry logic without checksums caused silent corruption bugs early on
  • mDNS discovery sounds simple until nodes disappear/rejoin mid-transfer
  • shard sizing mattered much more than expected because tiny shards killed throughput with socket overhead
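For the partially-written safetensors problem, one cheap check is to compare the file's size on disk with what its own header claims. This is a minimal sketch, not the project's actual watcher code. It assumes the standard safetensors layout (8-byte little-endian header length, a JSON header, then the tensor bytes), and `safetensors_is_complete` is a hypothetical helper name.

```python
import json
import struct
from pathlib import Path


def safetensors_is_complete(path: Path) -> bool:
    """Return True if the file is exactly as long as its own header says it should be."""
    size = path.stat().st_size
    if size < 8:
        return False  # not even the 8-byte header length is on disk yet
    with path.open("rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        if 8 + header_len > size:
            return False  # the JSON header itself is still being written
        raw_header = f.read(header_len)
    try:
        header = json.loads(raw_header)
        # data_offsets are [begin, end) positions relative to the end of the header
        data_end = max(
            (entry["data_offsets"][1]
             for name, entry in header.items()
             if name != "__metadata__"),
            default=0,
        )
    except (ValueError, KeyError, TypeError, AttributeError):
        return False  # garbage or half-written header
    return size == 8 + header_len + data_end
```

This only catches truncated files. A writer that preallocates the full size before filling it in would still pass, so a checksum or a writer-side "write to temp file, then rename" step is the stronger guarantee.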

Current design:

  • coordinator splits safetensors into shards
  • each shard replicated to 2 workers
  • SHA-256 verification on every transfer
  • automatic fallback to replica during restore
  • filesystem watcher retries incomplete checkpoints until finalized
  • Prometheus/Grafana/Loki stack for monitoring + alerts
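The first three bullets of the design (split, replicate to 2 workers, SHA-256 per shard) can be sketched roughly like this. The round-robin placement, the `Shard` dataclass and `plan_shards` are assumptions for illustration, not the project's real code or placement policy.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Shard:
    index: int
    data: bytes
    sha256: str
    holders: tuple  # (primary, replica) worker names


def plan_shards(blob: bytes, shard_size: int, workers: list) -> list:
    """Split blob into fixed-size shards and assign each to two distinct workers."""
    if shard_size <= 0:
        raise ValueError("shard_size must be positive")
    if len(workers) < 2:
        raise ValueError("need at least two workers to keep two copies")
    shards = []
    for i, start in enumerate(range(0, len(blob), shard_size)):
        chunk = blob[start:start + shard_size]
        # Assumed policy: round-robin primary, next worker in the ring as replica
        primary = workers[i % len(workers)]
        replica = workers[(i + 1) % len(workers)]
        digest = hashlib.sha256(chunk).hexdigest()
        shards.append(Shard(i, chunk, digest, (primary, replica)))
    return shards
```

With 4 workers, shard 0 lands on workers 0 and 1, shard 1 on workers 1 and 2, and so on, so losing any single node still leaves one copy of every shard.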

Setup I tested on: Mac Mini M4 coordinator + 4 Raspberry Pi workers, though any Linux/macOS mix should work.

Honestly the most useful part wasn’t even the storage system itself — it forced me to finally understand TCP flow control, retries, backpressure, partial writes, and distributed failure handling in a very practical way.

Curious how others here handle checkpoint durability on small/home clusters without relying entirely on cloud object storage.

Fully open source.

Here's exactly how it works:

  • Store: Coordinator splits the .safetensors into N shards, computes SHA-256 for each, sends in parallel with retry + exponential backoff. Every shard lives on TWO machines.
  • Gather: Pull from primaries. One node dead? Silently falls back to replica and reassembles merged.safetensors.
  • Watcher: Daemon auto-detects new checkpoints, syncs them live. Still writing? Goes to pending queue and retries every 10s. Fully hands-off.
  • Discovery: Workers auto-advertise via mDNS. No hardcoded IPs. Add/remove nodes like magic.
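The Store/Gather behaviour above (retry with exponential backoff, checksum check, fall back to the replica) could look something like the sketch below. The `fetch` callable stands in for the real TCP transfer and is purely hypothetical, as are the function names and default delays.

```python
import hashlib
import time


def with_backoff(op, attempts=4, base_delay=0.5, sleep=time.sleep):
    """Run op(), retrying on OSError with delays of base_delay * 1, 2, 4, ..."""
    for attempt in range(attempts):
        try:
            return op()
        except OSError:
            if attempt == attempts - 1:
                raise  # out of retries, let the caller decide what to do
            sleep(base_delay * 2 ** attempt)


def gather_shard(holders, expected_sha256, fetch, **retry_kwargs):
    """Try the primary first, then the replica; accept only data whose hash matches."""
    last_error = None
    for node in holders:
        try:
            data = with_backoff(lambda: fetch(node), **retry_kwargs)
        except OSError as exc:
            last_error = exc  # node unreachable after all retries
            continue
        if hashlib.sha256(data).hexdigest() == expected_sha256:
            return data
        last_error = ValueError(f"checksum mismatch from {node}")
    raise RuntimeError("no healthy copy of shard") from last_error
```

Verifying the hash after every fetch is what stops a retry from quietly accepting a corrupted or truncated shard, which is the failure mode described earlier in the post.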

Setup is whatever you already have: I used a Mac mini M4 as coordinator + 4× Raspberry Pi 4 workers. Any Linux/macOS mix works.

Monitoring? Prometheus + Grafana + Loki in Docker. Per-shard speeds, error counts, unified logs, email alerts if anything goes unrecoverable. No SSH hell.

One yaml config. One launch.sh. Done.

If you're training on a home/dorm cluster and living in fear of losing 3-day runs… this is for you.

u/East-Muffin-6472 — 2 days ago
▲ 34 · r/RASPBERRY_PI_PROJECTS · +4 crossposts

From Mac Minis to AI Clusters: Learning Distributed Systems For Dummies!

Hey everyone!

Over the next few weeks, I’ll be releasing blogs and guides on distributed learning and building your own small compute clusters.

  • The goal is simple: help more people get started with running and training AI models using the hardware they already have lying around. Old laptops, MacBooks, Mac minis, Jetson Nanos, Raspberry Pis, even phones and tablets.

Distributed learning often feels intimidating from the outside, but it’s genuinely one of the coolest areas in systems and AI once you start playing with it yourself.

Before we get into the fun stuff like distributed inference and training, the first few posts will focus on setting up hardware properly and building a working cluster environment, which basically means a small amount of cabling and networking!

The early guides will specifically cover setups around:

  • MacBooks and Mac minis
  • Jetson devices
  • Raspberry Pis

After that, we’ll move into quick demos (smolcluster 👀), and gradually learn the fundamentals side by side while actually running models across devices.

I’m building this alongside smolcluster, so a lot of the content will stay very hands-on and practical instead of purely theoretical.

Hopefully this helps more people realize that distributed AI systems are not something reserved only for giant datacenters anymore.

There is just one question I want to answer: are heterogeneous clusters, like the one I am trying to build above, even viable for running models?

Well, we'll find out, and until then do read my blog and let me know what you all think! Any comments, feedback, etc. are very welcome. (pls be gentle since it's my first time writing one all by myself haha)

Read -> Blog

Hail LocalAI!

u/East-Muffin-6472 — 5 days ago

So, here's an update on my GRPO training for length-constrained summarization of Reddit posts on 3x Mac minis: a new direction!

>Gist: been trying to test how good a summarization model can be trained when it has to use exactly 64 tokens!

So, once all the t-tests and evals were done for the LFM2.5-350M and Qwen2.5-0.5B-Instruct models with the length penalty and quality metrics (given below), I looked at the results and saw that BLEU and ROUGE-L were particularly low when trained from scratch.

>I hypothesized it's because of the length penalty I added to make the model output exactly 64 tokens, while it is also being penalized by the length-dependent terms inside ROUGE-L and BLEU (the brevity penalty, for example).

Well, I had a faint idea for circumventing this issue: what if I used an already fine-tuned version that outputs exactly 64 tokens? But the idea was like a flash, like zoooom and puff gone!

That is when a Redditor pointed it out and I was like "hmm well I already have a checkpoint with only length penalty added!"

Now, I could have just used SFT here, as some of you may be thinking, to fine-tune the model to output just the required number of tokens, and yes, that's the next experiment, along with a DPO comparison!

So, currently, have been training LFM2.5-350M and Qwen2.5-0.5B-Instruct for the same!

  • Eval:

>LLM-as-a-Judge (gpt-5)

>Used DeepEval to build a judge pipeline scoring each summary on 4 axes:

  • Faithfulness — no hallucinations vs. source
  • Coverage — key points captured
  • Conciseness — shorter, no redundancy
  • Clarity — readable on its own

  • Distributed Training Setup:

>3x Mac Minis in a cluster running MLX.

>One node drives training using GRPO, two push rollouts via vLLM-metal framework.

>All of the work done using smolcluster.

>Used the SyncPS arch, a synchronous parameter server architecture: the master node runs the training and the worker nodes run vLLM.
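For anyone new to the term, here is a minimal sketch of one step of a textbook synchronous parameter server. This is not smolcluster's code: in the classic form shown here the workers send gradients, whereas in the setup above the workers send rollouts and the master computes the update itself. `sync_ps_step` is a hypothetical name.

```python
def sync_ps_step(weights, worker_grads, lr):
    """One synchronous update: wait for every worker, average, then apply once."""
    if not worker_grads:
        raise ValueError("need a gradient from at least one worker")
    n = len(worker_grads)
    # zip(*...) walks the same parameter position across all workers
    avg_grad = [sum(per_param) / n for per_param in zip(*worker_grads)]
    return [w - lr * g for w, g in zip(weights, avg_grad)]
```

"Synchronous" means the server does nothing until every worker has reported for the current step, so all workers always start the next step from the same weights.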

u/East-Muffin-6472 — 15 days ago

just integrated grove into smolcluster and it's genuinely one of the cleanest pieces of infra I've plugged in

  • grove is a package built by a really sharp person. It handles zero-config node discovery and gives you a live terminal dashboard for distributed training.

I faced the same problem: having to set up SSH, networking, cables, etc. for every node I wanted to add to my cluster, ever since I began using smolcluster for my own projects, sigh... you know the pain, right?

The best I could do was search around and realize that what I needed was auto discovery of nodes, aka mDNS!

It's what AirDrop relies on for seamless auto discovery between Apple devices, with Zeroconf implementations covering non-macOS ones, though sadly I couldn't come up with a working solution myself (skill issue it seems haha).

And that's where I found grove. I didn't build grove, I just integrated it.

  • what it does:

>on Mac, nodes discover each other over mDNS: no IPs, no SSH config, nothing!

>on Linux/Jetson it falls back to TCP + mDNS

>gives you a live per-rank TUI showing rank, host, loss, grad norm, tokens/sec, network I/O in real time

  • the integration side:

>every smolcluster training algorithm (FSDP, SyncPS, ClassicDP, etc.), which I have reimplemented with pure Python sockets for educational purposes, can now be run without worrying about IPs, SSH, networking, etc., in just 2 commands! (before it was like 10 steps ufff, and it still is if you want some serious runs).

  • usage on a 3-node cluster:

>run grove start <script> -n 3 on the coordinator

>run grove join on each worker

>the cluster forms itself

that's the whole setup. no static IPs, no config files, no manual port forwarding.

been running this on my 3x Mac Minis and testing on Jetson boards soon.

check it out today at smolcluster[dot]com!

PS: shoutout to @swar_ja for releasing grove!

u/East-Muffin-6472 — 18 days ago

So, with this project I want to see if quality summarization under a hard length constraint (64 tokens only) can be done by tiny LLMs using GRPO!

https://preview.redd.it/cy661iefraxg1.png?width=2816&format=png&auto=webp&s=a1f00aeaf597058a8153ccb3debb8ffc7d4b553d

So, I trained two variants of this task:

  • using just length penalty
  • using a single quality reward, or a combination of them, plus the length penalty

I ran an LLM-as-a-Judge eval with DeepEval to check summarization quality on four axes:

  • Conciseness
  • Coverage
  • Clarity
  • Faithfulness

The results are attached, and the final one is as follows:

  • with quality (ROUGE-L + METEOR) + length penalty rewards: 2.7/4 (wins again!)
  • with just length penalty: 2.23/4

Ranking of t-test for other rewards:

Summary Table

| Reward Configuration | Composite | Faithfulness | Coverage | Conciseness | Clarity | Pass Rate |
|---|---|---|---|---|---|---|
| length-quality-meteor-rouge | 2.769 | 0.832 | 0.511 | 0.659 | 0.767 | 44.3% |
| length-quality-bleu-rouge | 2.732 | 0.810 | 0.502 | 0.650 | 0.770 | 39.1% |
| length-quality-meteor-bleu | 2.664 | 0.792 | 0.468 | 0.648 | 0.756 | 38.3% |
| length-quality-rouge-l | 2.555 | 0.725 | 0.415 | 0.637 | 0.778 | 32.4% |
| length-quality-meteor | 2.484 | 0.721 | 0.427 | 0.625 | 0.711 | |
| length-quality-bleu | 2.400 | 0.680 | 0.399 | 0.577 | 0.744 | 26.9% |
| length-only (baseline) | 2.416 | 0.678 | 0.407 | 0.592 | 0.739 | 30.7% |

>Evaluated on 200 test samples from the smoltldr dataset. Baseline: length penalty only.

All the code and wandb charts in the comments!

Setup: 3x Mac Minis in a cluster running MLX.

One node drives training using GRPO, two push rollouts via vLLM-metal framework. All of the work done using smolcluster.

Used the SyncPS arch, a synchronous parameter server architecture: the master node runs the training and the worker nodes run vLLM.

Eval:

LLM-as-a-Judge (gpt-5)

  • Used DeepEval to build a judge pipeline scoring each summary on 4 axes:

>Faithfulness: no hallucinations vs. source

>Coverage: key points captured

>Conciseness: shorter, no redundancy

>Clarity: readable on its own

The composite score is the sum of the four axis scores above (maximum 4), which is how the Composite column in the summary table is obtained.
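The Composite column in the summary table equals the four axis scores added together (not averaged), and this can be checked directly from the reported numbers. The values below are copied from the table; `ROWS`, `composite` and `mismatches` are just names for this check.

```python
# (reported composite, (faithfulness, coverage, conciseness, clarity)) per reward config
ROWS = {
    "length-quality-meteor-rouge": (2.769, (0.832, 0.511, 0.659, 0.767)),
    "length-quality-bleu-rouge": (2.732, (0.810, 0.502, 0.650, 0.770)),
    "length-quality-meteor-bleu": (2.664, (0.792, 0.468, 0.648, 0.756)),
    "length-quality-rouge-l": (2.555, (0.725, 0.415, 0.637, 0.778)),
    "length-quality-meteor": (2.484, (0.721, 0.427, 0.625, 0.711)),
    "length-quality-bleu": (2.400, (0.680, 0.399, 0.577, 0.744)),
    "length-only (baseline)": (2.416, (0.678, 0.407, 0.592, 0.739)),
}


def composite(axes):
    """Composite judge score: the four per-axis scores added together (max 4)."""
    return sum(axes)


def mismatches(rows, tol=1e-6):
    """Names of rows whose reported composite differs from the sum of their axes."""
    return [name for name, (reported, axes) in rows.items()
            if abs(composite(axes) - reported) > tol]


print(mismatches(ROWS))
```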

  • Reward system

>length_penalty : basically, -abs(response_length - MAX_LENGTH)

  • quality_rewards:

>ROUGE-L only cares about the longest common subsequence — it misses synonyms and paraphrases entirely.

>METEOR handles both: it aligns tokens with synonym matching via WordNet and balances precision + recall with a chunk-order penalty.

>BLEU, on the other hand, focuses on n-gram precision, with a brevity penalty for outputs shorter than the reference.
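To make the reward pieces above concrete, here is a small pure-Python sketch, not the training code. `length_penalty` follows the formula quoted above with `MAX_LENGTH = 64`. `rouge_l_f1` is a simplified ROUGE-L (whitespace tokens, plain F1, an assumption), and `bleu_brevity_penalty` is the standard BLEU brevity term. How the rewards are weighted and combined is not shown because the post doesn't specify it.

```python
import math

MAX_LENGTH = 64  # target summary length in tokens


def length_penalty(response_length, max_length=MAX_LENGTH):
    """0 at exactly max_length tokens, increasingly negative the further away."""
    return -abs(response_length - max_length)


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """Simplified ROUGE-L: F1 of LCS-based precision and recall on whitespace tokens."""
    c, r = candidate.split(), reference.split()
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)


def bleu_brevity_penalty(candidate_len, reference_len):
    """BLEU's brevity penalty: 1 if the candidate is longer, else exp(1 - r/c)."""
    if candidate_len == 0:
        return 0.0
    if candidate_len > reference_len:
        return 1.0
    return math.exp(1 - reference_len / candidate_len)
```

For example, "a big cat" vs. "a large cat" gets a ROUGE-L F1 of about 0.67 even though the meaning is the same, which is the synonym blindness mentioned above.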

u/East-Muffin-6472 — 25 days ago