AI agents for managing Spark jobs at prod scale: are they worth it?
Built an ETL pipeline in Airflow with BigQuery sinks and dbt models. Tests ran fine on synthetic data, but prod load is around 10TB daily, and skewed partitions and joins behave differently at that scale: slots get used up fast, queries slow down, costs go up.
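For context, the skew mitigation I've been doing by hand looks roughly like this. A minimal PySpark sketch of key salting; the table and column names (events, users, user_id) are placeholders, not my real schema:

```
# Minimal PySpark sketch of manual skew mitigation via key salting.
# Table/column names are made up for illustration.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("salted-join").getOrCreate()

NUM_SALTS = 16  # tuned by hand per job, which is exactly the kind of knob I want automated

events = spark.table("events")  # large fact table, skewed on user_id
users = spark.table("users")    # smaller dimension table

# Add a random salt to the skewed side so hot keys spread across partitions...
salted_events = events.withColumn("salt", (F.rand() * NUM_SALTS).cast("int"))

# ...and replicate the other side across all salt values so every pair still matches.
salts = spark.range(NUM_SALTS).select(F.col("id").cast("int").alias("salt"))
salted_users = users.crossJoin(salts)

joined = salted_events.join(salted_users, on=["user_id", "salt"], how="inner")
```

Picking NUM_SALTS per job, per day, as data shifts is the manual tuning loop I'm trying to get out of.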
Running on GCP with multi-region buckets, Pub/Sub, Dataflow, and BigQuery. Tables are partitioned by date and clustered on user ID, but real data arrives with backfills and uneven writes, so partition pruning mostly isn't working.
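Roughly how the tables are set up; a simplified sketch using the google-cloud-bigquery client, with placeholder project/dataset/column names:

```
# Simplified sketch of the table layout; names are placeholders.
from google.cloud import bigquery

client = bigquery.Client()

schema = [
    bigquery.SchemaField("event_date", "DATE"),
    bigquery.SchemaField("user_id", "STRING"),
    bigquery.SchemaField("payload", "STRING"),
]

table = bigquery.Table("my-project.analytics.events", schema=schema)
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY,
    field="event_date",  # pruning only kicks in when queries filter on this column
)
table.clustering_fields = ["user_id"]

client.create_table(table)
# Backfills land in old partitions out of order, so queries that assume "recent
# dates only" end up scanning far more partitions than the layout suggests.
```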
Tried increasing slots and reservations; that just adds cost. Nothing in the current stack adapts to what's happening at runtime, so I started looking at AI agents to manage Spark jobs as a way to handle the detection and adjustment automatically instead of chasing issues manually after they hit prod.
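Right now "detection" means me running queries like this by hand after something blows up. A sketch against BigQuery's INFORMATION_SCHEMA, with a placeholder project/region and an arbitrary slot-hours threshold:

```
# Sketch of the manual detection step I'd want an agent to own:
# flag the last day's slot-heavy jobs. Project/region are placeholders.
from google.cloud import bigquery

client = bigquery.Client()

query = """
SELECT
  job_id,
  user_email,
  total_slot_ms,
  total_bytes_processed
FROM `my-project`.`region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
  AND total_slot_ms > 100 * 3600 * 1000  -- arbitrary cutoff: ~100 slot-hours
ORDER BY total_slot_ms DESC
LIMIT 20
"""

for row in client.query(query).result():
    print(row.job_id, row.total_slot_ms, row.total_bytes_processed)
```

The appeal of an agent is wiring this kind of signal into automatic adjustments instead of a human reading the output and retuning jobs.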
Are AI agents for managing Spark jobs mature enough for a GCP setup at this scale? What's worked for others dealing with the same prod-vs-synthetic-data gap?