I was tired of my team’s prompts breaking every other week, so I systematized how we version and validate them
We went from 3 people using Claude to 12. Suddenly "the prompt that worked Tuesday" was giving garbage on Wednesday, and nobody knew who changed what.
The real problem was that we were treating prompts like chat history instead of deployable code. So I built a lightweight framework that does three things (rough sketches below):
- **Catches structural failures before they reach users**
If a prompt is supposed to return 5 sections and the model only gives 3, the check stops it right there instead of us finding out 3 days later.
- **Flags when the model drops in quality mid-conversation**
We have a baseline for what "good output" looks like. If it drifts below that, the system either retries or escalates to a human.
- **Locks shipped versions**
Once a prompt works, we freeze it. If someone tweaks it later, we know exactly what changed and can roll back in seconds.
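
To make the first check concrete, here's a minimal sketch of the kind of structural validation I mean. It assumes the prompt's contract is "5 markdown `##` sections"; the section format, the count, and the function name are placeholders, not our actual contract:

```python
import re

EXPECTED_SECTIONS = 5  # placeholder: in practice this comes from the prompt's contract

def validate_structure(response_text: str) -> None:
    """Fail fast if a response doesn't match the structure the prompt promised."""
    # Assumes sections are markdown H2 headings; swap the pattern for your own format.
    found = len(re.findall(r"^## ", response_text, flags=re.MULTILINE))
    if found != EXPECTED_SECTIONS:
        raise ValueError(f"expected {EXPECTED_SECTIONS} sections, got {found}")
```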
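
The quality gate is basically a retry-then-escalate loop around the model call. A rough sketch, assuming you already have some `score_output` function that compares a response against your baseline (a judge prompt, a rubric, whatever you trust); the threshold and the escalation hook are placeholders:

```python
from typing import Callable

def run_with_quality_gate(
    call_model: Callable[[], str],
    score_output: Callable[[str], float],  # your own scoring against the baseline
    baseline: float = 0.8,                 # placeholder threshold
    max_retries: int = 2,
) -> str:
    """Retry while output scores below the baseline, then hand off to a human."""
    output = ""
    for _ in range(max_retries + 1):
        output = call_model()
        if score_output(output) >= baseline:
            return output
    # Escalation is deliberately vague here: open a ticket, ping a channel, etc.
    print("Output stayed below baseline after retries; flagging for human review.")
    return output
```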
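
Version locking doesn't need anything fancier than content hashes pinned in a lockfile next to the prompts. A sketch of that idea, with `prompts.lock.json` as a made-up filename (git tags or a database table work just as well):

```python
import hashlib
import json
from pathlib import Path

LOCKFILE = Path("prompts.lock.json")  # made-up name; any pinned store works

def prompt_hash(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

def freeze(path: Path) -> None:
    """Record the current hash of a prompt file as the shipped version."""
    lock = json.loads(LOCKFILE.read_text()) if LOCKFILE.exists() else {}
    lock[str(path)] = prompt_hash(path)
    LOCKFILE.write_text(json.dumps(lock, indent=2, sort_keys=True))

def is_frozen_version(path: Path) -> bool:
    """True if the prompt on disk still matches what we shipped."""
    lock = json.loads(LOCKFILE.read_text()) if LOCKFILE.exists() else {}
    return lock.get(str(path)) == prompt_hash(path)
```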
None of this is magic—it's just treating prompts with the same discipline as API contracts. I built it because I needed it. If you're running into the same "it worked last month" chaos, the approach is open.
Happy to share the high-level structure if anyone wants it.