[Airflow 3.1.8] Postgres lock contention on task_instance with 150+ K8s workers
Hi everyone,
We are running Airflow 3 with the KubernetesExecutor and have hit a scaling bottleneck.
The Problem:
Once we hit ~150 concurrent workers, we see heavy lock contention on the task_instance table.
- It shows up specifically in SELECT ... FOR UPDATE (scheduler) and UPDATE (task state changes) statements.
- DB wait events show a large share of time spent in Lock:transactionid.
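In case it helps compare notes, here is a minimal sketch of one way to sample these waits straight from Postgres. It is an illustration, not our exact tooling: the names `LOCK_WAIT_SQL`, `sample_lock_waits`, and `summarize_waits` are made up, and it assumes a DB-API cursor (for example from psycopg2) connected directly to Postgres rather than through the pooler. In `pg_stat_activity`, Lock:transactionid corresponds to `wait_event_type = 'Lock'` with `wait_event = 'transactionid'`.

```python
from collections import Counter

# Sessions currently waiting on a lock, plus the PIDs blocking them.
# pg_blocking_pids() is available in PostgreSQL 9.6 and later.
LOCK_WAIT_SQL = """
SELECT pid,
       wait_event_type,
       wait_event,
       pg_blocking_pids(pid) AS blocked_by,
       left(query, 80) AS query_head
FROM pg_stat_activity
WHERE wait_event_type = 'Lock'
"""


def sample_lock_waits(cursor):
    """Run the query on any DB-API cursor and return the rows as dicts."""
    cursor.execute(LOCK_WAIT_SQL)
    columns = [col[0] for col in cursor.description]
    return [dict(zip(columns, row)) for row in cursor.fetchall()]


def summarize_waits(rows):
    """Count waiting sessions per 'type:event' label, e.g. 'Lock:transactionid'."""
    return Counter(f"{r['wait_event_type']}:{r['wait_event']}" for r in rows)
```

Running `summarize_waits(sample_lock_waits(cur))` every few seconds during a busy period gives a rough count of sessions stuck per wait event, and the `blocked_by` column shows which backend holds the row they are waiting on.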
Our Setup:
- Airflow 3.1.8
- Postgres + PgBouncer (transaction pooling mode)
- DB CPU/RAM usage is fine; the issue is purely row-level locking.
Has anyone else faced this at scale with Airflow 3? Are there specific scheduler configs or Postgres tuning you’d recommend to reduce this contention?
Thanks!