At what point did your data start failing you in production?
One pattern keeps showing up across the AI/ML systems we’ve been building and deploying:
Things work fine early on with:
- curated datasets
- synthetic data
- small controlled test sets
But once systems hit real-world usage, a different class of problems shows up:
- edge cases that weren’t in the original data
- distribution shifts that quietly degrade performance
- workflows behaving differently than expected
- gaps in eval coverage that only show up over time
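The "quiet degradation" case is the sneakiest, because nothing errors out. One lightweight way to catch it is comparing the distribution of a feature (or a model score) between your reference data and live traffic. As a minimal sketch, here's a population stability index (PSI) check in plain NumPy; the function name, bin count, and thresholds are illustrative assumptions, not any particular library's API:

```python
import numpy as np

def population_stability_index(expected, actual, bins=10):
    """Rough drift check: compare binned distributions of a feature
    between a reference sample (e.g. training data) and live traffic."""
    # Bin edges come from the reference distribution; live values
    # outside that range are silently dropped in this sketch.
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Floor the proportions to avoid log(0) on empty bins.
    e_pct = np.clip(e_pct, 1e-6, None)
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.default_rng(0)
reference = rng.normal(0, 1, 10_000)
shifted = rng.normal(0.5, 1, 10_000)  # a quiet mean shift

print(population_stability_index(reference, reference[:5_000]))  # near zero
print(population_stability_index(reference, shifted))            # clearly larger
```

A common rule of thumb treats PSI above roughly 0.25 as a significant shift worth investigating, but the right threshold depends on your data, and checks like this only flag *that* the inputs moved, not *why* performance dropped.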
What’s interesting is that we often hit a point where everything looks fine structurally — pipelines run, the old test sets still pass — but performance in the field just isn’t reliable anymore.
For those who’ve run into this:
When did you realize your existing data wasn’t enough?
More importantly:
- what didn’t work when you tried to fix it?
- where did your data still fall short even after expanding it?
Trying to understand where this actually breaks down in practice.