u/PassionateBuilder-09

Moved our data quality checks before the INSERT — here's what changed

Same pipeline-breaking pattern for the third time in a row: the pipeline runs clean, but the dashboard is wrong in the morning. Someone upstream changed a field type, or a column started going null, or a new field showed up that nobody told us about. The existing checks (dbt tests, Great Expectations) only caught it after the data was already sitting in the warehouse. Two days of bad rows before anyone noticed.

I got tired of it.

We stuck a screening step between extract and load. Basically an API call that looks at the payload before it touches the database. Sends back PASS, WARN, or BLOCK depending on what it finds.

source → screen → PASS  → load to warehouse
                → WARN  → load + flag
                → BLOCK → dead letter queue
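On the caller side, the routing is about this simple — a sketch, where `load` and `deadLetter` are illustrative placeholder names, not our actual code:

```typescript
type Verdict = "PASS" | "WARN" | "BLOCK";

// Hypothetical sinks the pipeline wires in; names are placeholders.
interface Sinks {
  load: (rows: unknown[], flagged?: boolean) => void;
  deadLetter: (rows: unknown[]) => void;
}

function route(verdict: Verdict, batch: unknown[], sinks: Sinks): void {
  switch (verdict) {
    case "PASS":
      sinks.load(batch);         // clean load
      break;
    case "WARN":
      sinks.load(batch, true);   // load anyway, but flag for review
      break;
    case "BLOCK":
      sinks.deadLetter(batch);   // quarantine; never touches the warehouse
      break;
  }
}
```

The point of WARN as a separate path is that marginal batches still land, so dashboards don't go dark, but they're tagged for someone to look at.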

It checks for the usual stuff — null rates spiking, type mismatches (a field that was always numeric now has strings mixed in), schema changes (new fields, missing fields), duplicate rates, outlier counts. 18 checks total, single pass, comes back in under 10ms so it doesn't slow anything down.
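To give a feel for the single-pass part, here's a stripped-down sketch of the per-column accumulation (simplified — the real thing folds all 18 checks into the same loop):

```typescript
interface ColumnStats {
  total: number;
  nulls: number;
  types: Map<string, number>; // runtime type → count
}

// One pass over the batch: tally nulls and runtime types per field.
// Mixed types (e.g. 97% number / 3% string) fall straight out of `types`.
function analyze(rows: Record<string, unknown>[]): Map<string, ColumnStats> {
  const cols = new Map<string, ColumnStats>();
  for (const row of rows) {
    for (const [field, value] of Object.entries(row)) {
      let s = cols.get(field);
      if (!s) {
        s = { total: 0, nulls: 0, types: new Map() };
        cols.set(field, s);
      }
      s.total++;
      if (value === null || value === undefined) {
        s.nulls++;
        continue;
      }
      const t = typeof value;
      s.types.set(t, (s.types.get(t) ?? 0) + 1);
    }
  }
  return cols;
}
```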

The part that surprised me: the big wins aren't the obvious failures. Those you'd catch eventually. It's the slow drift. Null rates creeping up 2% per week. A field that's 97% numeric and 3% string. Nobody notices until it's been wrong for a month and your ML model has been training on garbage.

We baseline the schema on the first run (a SHA-256 fingerprint of field names + types), then compare every batch after that. Null-rate baselines use an exponential moving average so they adapt gradually. If your data legitimately changes over time, it doesn't keep firing. But if something jumps overnight, it catches it.
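A sketch of both pieces, assuming a `name → type` map for the schema (illustrative, not our exact code — on Workers you'd hash with `crypto.subtle.digest` rather than `node:crypto`):

```typescript
import { createHash } from "node:crypto";

// SHA-256 over sorted "name:type" pairs, so field order doesn't matter —
// only an actual schema change (new/missing field, type change) moves the hash.
function schemaFingerprint(fields: Record<string, string>): string {
  const canonical = Object.entries(fields)
    .map(([name, type]) => `${name}:${type}`)
    .sort()
    .join(",");
  return createHash("sha256").update(canonical).digest("hex");
}

// EMA baseline update: alpha controls how fast the baseline adapts.
// Small alpha = slow drift tolerated, sudden jumps still stick out.
function updateBaseline(prev: number, observed: number, alpha = 0.1): number {
  return alpha * observed + (1 - alpha) * prev;
}
```

With `alpha = 0.1`, a sustained shift gets mostly absorbed within a few dozen batches — which is also exactly how a long-broken feed ends up teaching the baseline the broken state.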

The tradeoff with EMA baselines: if your data has been broken for long enough, the baseline learns the broken state. Haven't fully solved that one yet. Manual reset works but it's not great.

It runs on Cloudflare Workers, so no raw data hits disk. Everything stays in memory, and only aggregate stats (schema fingerprints, null rates, type distributions) get stored.

Anyone else doing quality checks pre-storage? Most tooling I've seen is post-load. Curious if there's a pattern I'm missing or if everyone just lives with the "fix it after it breaks" approach.

u/PassionateBuilder-09 — 7 hours ago

Using Cloudflare Workers as a “data firewall” layer — curious if others do something similar

I’ve been experimenting with using Cloudflare Workers as a lightweight “quality gate” in front of data ingestion pipelines, and I’m curious if anyone else here has tried something similar.

The problem I kept running into:
upstream APIs would silently change a field type or start returning unexpected nulls, and my pipeline would ingest it without complaint. The failure wouldn’t show up until hours later in dashboards or ML jobs.

I wanted something that sits before the INSERT and screens every batch in real time.

Workers ended up being a surprisingly good fit because of:

  • sub‑10ms cold starts
  • no filesystem access (nice privacy boundary)
  • KV for fast schema hash caching
  • Durable Objects for consistent baseline state
  • D1 for lightweight control‑plane metadata
  • Queues for overflow / async processing
  • global edge execution close to the data source

The flow looks like:


Source → Worker → PASS/WARN/BLOCK → Pipeline

Inside the Worker I’m doing:

  • deterministic sampling
  • single‑pass column analysis
  • schema fingerprinting
  • drift detection (type changes, null spikes, enum changes, etc.)
  • health scoring
  • returning a simple verdict to the caller
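For the deterministic-sampling step, the shape is roughly this (a sketch — FNV-1a is my illustrative hash choice here, not necessarily what you'd use): hashing a stable row key means the same rows get sampled on every run, so stats stay comparable across batches instead of jittering with random sampling.

```typescript
// 32-bit FNV-1a hash of a string key.
function fnv1a(s: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h;
}

// Keep ~1/rate of the rows, chosen by key hash — deterministic across runs.
function sample<T>(rows: T[], key: (r: T) => string, rate = 10): T[] {
  return rows.filter((r) => fnv1a(key(r)) % rate === 0);
}
```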

Then async updates happen via waitUntil() to DO + D1.
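Put together, the Worker is shaped roughly like this — a hypothetical sketch, where `screen` stands in for the real single-pass analysis and the `waitUntil` body stands in for the actual Durable Object / D1 writes:

```typescript
type Verdict = "PASS" | "WARN" | "BLOCK";

// Stand-in for the real analysis: block empty batches, warn on nulls, else pass.
function screen(rows: Record<string, unknown>[]): Verdict {
  if (rows.length === 0) return "BLOCK";
  const hasNulls = rows.some((r) => Object.values(r).some((v) => v == null));
  return hasNulls ? "WARN" : "PASS";
}

// Worker entry point: the verdict goes back synchronously,
// state updates are deferred past the response via waitUntil().
const worker = {
  async fetch(
    req: Request,
    _env: unknown,
    ctx: { waitUntil(p: Promise<unknown>): void },
  ): Promise<Response> {
    const rows = (await req.json()) as Record<string, unknown>[];
    const verdict = screen(rows);
    ctx.waitUntil(Promise.resolve()); // placeholder for DO + D1 writes
    return Response.json({ verdict });
  },
};
```

The caller never waits on the baseline writes, which is what keeps the screening call itself in the sub-10ms range.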

It’s been effective as a “data firewall” layer, but I’m curious how others approach this.
Has anyone else used Workers for pre‑ingestion validation or data screening?
Or do you handle this kind of thing deeper in the pipeline?

Would love to hear patterns or pitfalls from folks who’ve tried similar architectures.

u/PassionateBuilder-09 — 2 days ago