Moved our data quality checks before the INSERT — here's what changed
Same pipeline-breaking pattern for the third time in a row: the pipeline runs clean, then the dashboard is wrong in the morning. Someone upstream changed a field type, a column started going null, or a new field showed up that nobody told us about. The existing checks (dbt tests, Great Expectations) only caught it after the data was already sitting in the warehouse. Two days of bad rows before anyone noticed.
I got tired of it.
We stuck a screening step between extract and load. Basically an API call that looks at the payload before it touches the database. Sends back PASS, WARN, or BLOCK depending on what it finds.
source → screen → PASS → load to warehouse
→ WARN → load + flag
→ BLOCK → dead letter queue
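The routing logic is simple enough to sketch. Everything below is illustrative (the `Verdict` enum, the `screen`/`warehouse`/`dead_letter` interfaces are made-up names, and the real screen is an HTTP call), but the decision tree is the same:

```python
from enum import Enum

class Verdict(Enum):
    PASS = "pass"
    WARN = "warn"
    BLOCK = "block"

def route(batch, screen, warehouse, dead_letter):
    """Screen a batch before it touches the database, then route it.

    `screen` returns a verdict plus the findings that produced it.
    All names here are hypothetical, not the real API.
    """
    verdict, findings = screen(batch)
    if verdict is Verdict.BLOCK:
        dead_letter.put(batch, reason=findings)   # never reaches the warehouse
    elif verdict is Verdict.WARN:
        warehouse.load(batch, flags=findings)     # loaded, but flagged for review
    else:
        warehouse.load(batch)
    return verdict
```

The point of the WARN tier is that most drift isn't worth dropping data over; you load it, flag it, and let a human decide.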
It checks for the usual stuff — null rates spiking, type mismatches (a field that was always numeric now has strings mixed in), schema changes (new fields, missing fields), duplicate rates, outlier counts. 18 checks total, single pass, comes back in under 10ms so it doesn't slow anything down.
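Two of those checks are easy to illustrate. This is a hedged sketch, not the actual implementation: per-field null rate and numeric/string type mix over a batch of records (the real thing runs all 18 checks in one pass over the batch):

```python
from collections import Counter

def null_rate(records, field):
    """Fraction of records where the field is missing or None."""
    if not records:
        return 0.0
    nulls = sum(1 for r in records if r.get(field) is None)
    return nulls / len(records)

def type_mix(records, field):
    """Distribution of Python types seen for a field, ignoring nulls.

    A field that was always numeric coming back as {'int': 0.97, 'str': 0.03}
    is exactly the mixed-type drift worth flagging.
    """
    counts = Counter(
        type(r[field]).__name__
        for r in records
        if r.get(field) is not None
    )
    total = sum(counts.values())
    return {t: n / total for t, n in counts.items()} if total else {}
```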
The part that surprised me: the big wins aren't the obvious failures. Those you'd catch eventually. It's the slow drift. Null rates creeping up 2% per week. A field that's 97% numeric and 3% string. Nobody notices until it's been wrong for a month and your ML model has been training on garbage.
We baseline the schema on first run (SHA-256 fingerprint of field names + types), then compare every batch after that. Null rate baselines use exponential moving average so they adapt gradually. If your data legitimately changes over time, it doesn't keep firing. But if something jumps overnight, it catches it.
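The baseline mechanics can be sketched like this. The fingerprint is SHA-256 over sorted field-name/type pairs as described above; the EMA smoothing factor and spike threshold are hypothetical values, not the real config:

```python
import hashlib

def schema_fingerprint(schema):
    """SHA-256 over sorted (field_name, type_name) pairs.

    schema: dict mapping field name -> type name, e.g. {"amount": "float"}.
    Sorting makes the fingerprint independent of field order.
    """
    canonical = "|".join(f"{name}:{typ}" for name, typ in sorted(schema.items()))
    return hashlib.sha256(canonical.encode()).hexdigest()

class NullRateBaseline:
    """Exponential moving average of a field's null rate.

    alpha and spike_threshold are illustrative, not the production values.
    """
    def __init__(self, alpha=0.1, spike_threshold=0.05):
        self.alpha = alpha
        self.spike_threshold = spike_threshold
        self.ema = None

    def observe(self, rate):
        """Return True if this batch's null rate jumped past the baseline."""
        if self.ema is None:
            self.ema = rate               # first batch sets the baseline
            return False
        spiked = rate - self.ema > self.spike_threshold
        # Gradual drift moves the EMA with it, so slow legitimate
        # change stops firing; an overnight jump still trips the check.
        self.ema = self.alpha * rate + (1 - self.alpha) * self.ema
        return spiked
```

This also makes the tradeoff below concrete: every `observe` call folds the new rate into the EMA, including rates from broken batches, which is how a long-broken field ends up baselined as normal.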
The tradeoff with EMA baselines: if your data has been broken for long enough, the baseline learns the broken state. Haven't fully solved that one yet. Manual reset works but it's not great.
Runs on Cloudflare Workers, so no data ever hits disk. Everything stays in memory, and only aggregate stats get stored (schema fingerprints, null rates, type distributions).
Anyone else doing quality checks pre-storage? Most tooling I've seen is post-load. Curious if there's a pattern I'm missing or if everyone just lives with the "fix it after it breaks" approach.