u/Extension_Key_5970

Built a production ETL pipeline on Kubernetes for my MLOps End-to-End series, sharing the architecture and design thinking
▲ 0

Before training any model, you need clean data. Sounds obvious, but most MLOps content skips straight to the model. So I started my series with what actually comes first: a data pipeline.

Built a complete ETL pipeline that pulls real crypto market data from the Binance API and loads it into PostgreSQL: 2.28 million rows of structured OHLCV data. It runs on Kubernetes with Airflow using the KubernetesExecutor.
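
To make that concrete, here is roughly what the extract/load step can look like with requests + psycopg2. This is a sketch, not the repo code; the table name, columns, and DSN are placeholders (the /api/v3/klines endpoint is Binance's real public klines API).

```python
# Rough sketch of the extract/load step. Table name, columns, and DSN
# are placeholder assumptions, not the actual repo code.
import requests
import psycopg2
from psycopg2.extras import execute_values

def fetch_klines(symbol: str, interval: str = "1m", limit: int = 1000):
    """Pull OHLCV candles from the public Binance klines endpoint."""
    resp = requests.get(
        "https://api.binance.com/api/v3/klines",
        params={"symbol": symbol, "interval": interval, "limit": limit},
        timeout=10,
    )
    resp.raise_for_status()
    # Each kline row: [open_time, open, high, low, close, volume, close_time, ...]
    return [
        (symbol, row[0], row[1], row[2], row[3], row[4], row[5])
        for row in resp.json()
    ]

def load_klines(conn, rows):
    """Bulk-insert one batch of candles into PostgreSQL."""
    with conn.cursor() as cur:
        execute_values(
            cur,
            """INSERT INTO crypto_ohlcv
               (symbol, open_time, open, high, low, close, volume)
               VALUES %s""",
            rows,
        )
    conn.commit()

if __name__ == "__main__":
    conn = psycopg2.connect("dbname=mlops user=etl")  # hypothetical DSN
    load_klines(conn, fetch_klines("BTCUSDT"))
```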

Some design decisions that might be useful:

  • Why store data twice?
  • Why the KubernetesExecutor over the CeleryExecutor?
  • Why parallel extraction?
  • Why pre-flight checks before the ETL starts?
  • Why connection pooling for bulk loads? (rough sketch after this list)
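
On that last point, a pooled bulk load with psycopg2 can look roughly like this. Pool sizes, DSN, and table name are illustrative assumptions, not the repo's actual settings.

```python
# Illustrative only: reuse pooled connections for parallel bulk loads
# instead of opening a fresh connection per batch. Sizes/DSN are assumptions.
from concurrent.futures import ThreadPoolExecutor
from psycopg2.pool import ThreadedConnectionPool
from psycopg2.extras import execute_values

pool = ThreadedConnectionPool(2, 8, "dbname=mlops user=etl")

def bulk_load(rows):
    """Borrow a connection from the pool, load one batch, return it."""
    conn = pool.getconn()
    try:
        with conn.cursor() as cur:
            execute_values(
                cur,
                """INSERT INTO crypto_ohlcv
                   (symbol, open_time, open, high, low, close, volume)
                   VALUES %s""",
                rows,
                page_size=1000,
            )
        conn.commit()
    finally:
        pool.putconn(conn)

batches = []  # in practice, filled by the parallel extract step
with ThreadPoolExecutor(max_workers=4) as ex:
    # Several workers load concurrently without exhausting Postgres connections.
    list(ex.map(bulk_load, batches))
```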

The whole infra deploys with one command, and a Docker Compose option is included for people without a K8s cluster.

I've also recorded a full live walkthrough on https://youtu.be/5HBeVZ7uMlg if you'd like to see it running end to end.

And of course, for the patient readers: Medium

Please let me know what future topics you'd like to see around ML production scenarios.

u/Extension_Key_5970 — 2 days ago
▲ 56

How I approach MLOps system design questions in interviews: sharing the thinking, not just the diagram

Got asked "design a data ingestion pipeline for an ML team that needs daily data from 3 external APIs" in a system design round.

Sharing my approach.

Ask clarifying questions first. Most candidates skip this and start drawing immediately. But every answer below changes the design:

  • JSON vs streaming vs flat files? Changes the entire ingestion layer.
  • 5 GB/day vs 50 GB vs 1 TB? Python + PostgreSQL vs Spark vs full data lake with Delta Lake/Iceberg.
  • Real-time vs daily batch? Kafka + Flink vs a scheduled Airflow DAG. Massive complexity difference.
  • One team vs twenty? Simple DB vs access control, data catalogue, feature store.

I assumed: structured JSON, 5-10 GB/day, daily batch, single team, Kubernetes available.

The pipeline:

3 API sources → Airflow (KubernetesExecutor, one pod per task) → parallel extraction → raw JSON stored in MinIO untouched → transform (clean, cast, validate) → PostgreSQL.
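
Sketched as an actual DAG, the shape is roughly this (Airflow 2.x, task bodies elided, names illustrative; with the KubernetesExecutor each task instance runs in its own pod, and the retry settings connect to the backoff point further down):

```python
# Rough shape of the DAG described above. Task names are illustrative.
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.python import PythonOperator

default_args = {
    "retries": 3,
    "retry_delay": timedelta(minutes=1),
    "retry_exponential_backoff": True,          # delay roughly doubles per retry
    "max_retry_delay": timedelta(minutes=15),   # capped here
}

def extract(source, **context):
    ...  # call the API, write raw JSON to MinIO untouched

def transform(**context):
    ...  # clean, cast, validate from raw storage

def load(**context):
    ...  # bulk-insert into PostgreSQL

with DAG(
    dag_id="ingestion_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args=default_args,
) as dag:
    extracts = [
        PythonOperator(
            task_id=f"extract_{src}",
            python_callable=extract,
            op_kwargs={"source": src},
        )
        for src in ("api_a", "api_b", "api_c")  # the three external APIs
    ]
    t = PythonOperator(task_id="transform", python_callable=transform)
    l = PythonOperator(task_id="load", python_callable=load)
    extracts >> t >> l  # extracts fan out in parallel, then converge
```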

Key pattern: store raw and processed separately. Transform logic has a bug? Fix the code and reprocess from raw, no re-fetching from APIs. When the interviewer asks, "Can you reprocess last month?", you have an answer.
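
The reprocessing path can be as simple as re-reading the raw objects. A hypothetical sketch with the minio client; the bucket name and prefix layout are assumptions, and transform/load stand in for the pipeline's real steps:

```python
# Hypothetical backfill: re-run the transform over raw JSON already in MinIO,
# no API calls needed. Bucket name and prefix layout are assumptions.
import json
from datetime import date, timedelta
from minio import Minio

client = Minio("minio:9000", access_key="...", secret_key="...", secure=False)

def transform(payload):   # stand-in for the real clean/cast/validate step
    return payload

def load(rows):           # stand-in for the idempotent PostgreSQL load
    pass

def reprocess(day: date):
    prefix = f"raw/binance/{day:%Y/%m/%d}/"
    for obj in client.list_objects("etl-raw", prefix=prefix, recursive=True):
        payload = json.loads(client.get_object("etl-raw", obj.object_name).read())
        load(transform(payload))

# "Reprocess last month?" -> just iterate the dates.
for i in range(31):
    reprocess(date(2024, 1, 1) + timedelta(days=i))
```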

Production concerns that matter:

  • Exponential backoff on retries (1 min, 5 min, 15 min)
  • Idempotency: re-running the same date must not create duplicates (upsert, partition overwrite, or staging table merge; upsert sketched after this list)
  • Data quality checks after every load — null counts, row counts, duplicates
  • Backfill support from raw storage
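
For the idempotency point, the PostgreSQL upsert variant is the simplest to show. Table name and the unique key (symbol, open_time) are assumptions for illustration:

```python
# Idempotent load via upsert: re-running the same date overwrites rather than
# duplicates. Table name and unique key (symbol, open_time) are assumptions.
from psycopg2.extras import execute_values

UPSERT_SQL = """
INSERT INTO crypto_ohlcv (symbol, open_time, open, high, low, close, volume)
VALUES %s
ON CONFLICT (symbol, open_time) DO UPDATE SET
    open   = EXCLUDED.open,
    high   = EXCLUDED.high,
    low    = EXCLUDED.low,
    close  = EXCLUDED.close,
    volume = EXCLUDED.volume;
"""

def idempotent_load(conn, rows):
    with conn.cursor() as cur:
        execute_values(cur, UPSERT_SQL, rows)
    conn.commit()
```

Partition overwrite (delete the run's date range, then insert) gives the same re-run guarantee when there is no natural unique key to conflict on.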

Mistakes I have seen (and made):

  • Saying "I would use Kafka" before knowing volume or freshness
  • No raw storage layer = no reprocessing ability
  • Only describing the happy path, never mentioning failures
  • Over-engineering a single-team problem with Spark Streaming and data mesh

Actually built this pipeline on Kubernetes with real Binance API data. Code: github.com/var1914/mlops-boilerplate

Full visual walkthrough on YouTube

u/Extension_Key_5970 — 5 days ago