
Built a production ETL pipeline on Kubernetes for my MLOps End-to-End series; sharing the architecture and design thinking.
Before training any model, you need clean data. Sounds obvious, but most MLOps content skips straight to the model. So I started my series with what actually comes first: a data pipeline.
I built a complete ETL pipeline that pulls real crypto market data from the Binance API and loads it into PostgreSQL: 2.28 million rows of structured OHLCV data. It runs on Kubernetes with Airflow using the KubernetesExecutor.
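Binance's public klines endpoint returns each candle as an array of fields, with prices and volume encoded as strings and timestamps as epoch milliseconds. A minimal sketch of turning that payload into typed OHLCV rows (`parse_klines` is an illustrative helper, not the repo's actual code):

```python
from datetime import datetime, timezone

def parse_klines(raw):
    """Turn a Binance /api/v3/klines payload (list of lists) into OHLCV rows.

    Each kline is [open_time_ms, open, high, low, close, volume, close_time_ms, ...];
    prices and volume arrive as strings, timestamps as epoch milliseconds.
    """
    rows = []
    for k in raw:
        rows.append({
            "ts": datetime.fromtimestamp(k[0] / 1000, tz=timezone.utc),
            "open": float(k[1]),
            "high": float(k[2]),
            "low": float(k[3]),
            "close": float(k[4]),
            "volume": float(k[5]),
        })
    return rows

# One kline shaped like the API response:
sample = [[1700000000000, "37000.1", "37100.0", "36900.5", "37050.2", "12.34", 1700000059999]]
print(parse_klines(sample)[0]["close"])  # 37050.2
```

Doing the type conversion at extraction time means the load step only ever sees clean, typed rows.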
Some design decisions that might be useful:
- Why store the data twice?
- Why KubernetesExecutor over CeleryExecutor?
- Why parallel extraction?
- Why pre-flight checks before the ETL starts?
- Why connection pooling for bulk loads?
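On parallel extraction: API pulls are I/O-bound, so fanning out one request per symbol cuts wall-clock time roughly by the number of workers. A minimal sketch with `ThreadPoolExecutor` (the `fetch_symbol` stub stands in for the real Binance HTTP call; names are illustrative, not from the repo):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch_symbol(symbol):
    # Stand-in for the real HTTP call to the Binance klines endpoint.
    return symbol, [("2024-01-01T00:00", 1.0)]  # (symbol, rows)

def extract_parallel(symbols, max_workers=4):
    """Fetch all symbols concurrently; returns {symbol: rows}."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch_symbol, s): s for s in symbols}
        for fut in as_completed(futures):
            symbol, rows = fut.result()  # re-raises any per-symbol exception
            results[symbol] = rows
    return results

print(sorted(extract_parallel(["BTCUSDT", "ETHUSDT", "SOLUSDT"])))
```

Threads (rather than processes) are the natural fit here because the workers spend almost all their time waiting on the network.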
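On pre-flight checks: failing fast before the DAG touches any data avoids half-finished runs that are painful to clean up. One generic way to structure this (a sketch, not the repo's code) is to run a list of named check callables and abort on the first failure:

```python
def run_preflight(checks):
    """checks: list of (name, callable) pairs; each callable returns True on success."""
    for name, check in checks:
        try:
            ok = check()
        except Exception as exc:
            raise RuntimeError(f"pre-flight check '{name}' raised: {exc}") from exc
        if not ok:
            raise RuntimeError(f"pre-flight check '{name}' failed")
    return True

# In the real pipeline these would ping the Binance API and PostgreSQL;
# the lambdas here are stubs for illustration.
checks = [
    ("api_reachable", lambda: True),
    ("db_reachable", lambda: True),
]
print(run_preflight(checks))  # True
```

Wiring this up as the first Airflow task means every downstream task can assume its dependencies are healthy.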
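On connection pooling and bulk loads: at 2.28 million rows, one INSERT per row is dominated by round-trip overhead, so batched inserts over reused connections are dramatically faster. A sketch of the batching idea using stdlib `sqlite3` as a stand-in (in production this would be PostgreSQL with a psycopg2 connection pool and `execute_values`):

```python
import sqlite3

def load_in_batches(conn, rows, batch_size=1000):
    """Insert rows in batches to amortize per-statement overhead."""
    cur = conn.cursor()
    for i in range(0, len(rows), batch_size):
        cur.executemany(
            "INSERT INTO ohlcv (ts, close) VALUES (?, ?)",
            rows[i:i + batch_size],
        )
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ohlcv (ts TEXT, close REAL)")
rows = [(f"2024-01-01T00:{i:02d}", 100.0 + i) for i in range(60)]
load_in_batches(conn, rows, batch_size=25)
print(conn.execute("SELECT COUNT(*) FROM ohlcv").fetchone()[0])  # 60
```

Committing once per batch (or once per load) rather than per row is where most of the speedup comes from.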
The whole infrastructure deploys with one command. A Docker Compose option is included for anyone without a K8s cluster.
I've also recorded a full live walkthrough at https://youtu.be/5HBeVZ7uMlg if you'd like to see it running end to end.
And for patient readers, there's a written version on Medium.
Let me know which ML production topics you'd like to see covered next.