u/Ok_Abrocoma_6369

r/sre

AI agents to manage Spark jobs at prod scale, are they worth it?

Built an ETL pipeline in Airflow with BigQuery sinks and dbt models. Tests ran fine on synthetic data. Prod load is around 10TB daily. Skewed partitions and joins behave differently at that scale: slots get used up fast, queries slow down, costs go up.

Running on GCP with multi-region buckets, Pub/Sub, Dataflow, BigQuery. Partitioned by date, clustered on user ID. Real data has backfills and uneven writes, so partition pruning mostly isn't working.
Tried increasing slots and reservations; that just adds cost. Nothing in the current stack adapts to what's happening at runtime. Started looking at AI agents to manage Spark jobs as a way to handle detection and adjustment automatically, instead of chasing issues manually after they hit prod.
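Before reaching for agents, worth noting that Spark 3.x's Adaptive Query Execution already does a chunk of the runtime detection and adjustment (skewed join splitting, partition coalescing) natively. A minimal sketch of the relevant settings; the threshold values are illustrative, not tuned for any particular workload:

```python
# Sketch: Spark 3.x AQE settings that re-plan joins and shuffle partitions
# at runtime based on observed stats, instead of static up-front tuning.
# Values below are illustrative; tune against your own skew profile.
aqe_conf = {
    "spark.sql.adaptive.enabled": "true",
    # split join partitions that are much larger than the median
    "spark.sql.adaptive.skewJoin.enabled": "true",
    "spark.sql.adaptive.skewJoin.skewedPartitionFactor": "5",
    "spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes": "256m",
    # merge tiny shuffle partitions left behind by uneven writes
    "spark.sql.adaptive.coalescePartitions.enabled": "true",
}

def apply_conf(builder, conf):
    """Apply a conf dict to a SparkSession builder (anything with .config)."""
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder

# usage (assuming pyspark is installed):
# from pyspark.sql import SparkSession
# spark = apply_conf(SparkSession.builder.appName("etl"), aqe_conf).getOrCreate()
```

This won't fix BigQuery slot contention, but it covers the "adapts to what's happening at runtime" gap on the Spark side without adding an agent to the loop.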

Are AI agents for managing Spark jobs mature enough for a GCP setup at this scale now? What's worked for others dealing with the same prod-vs-synthetic data gap?

reddit.com
u/Ok_Abrocoma_6369 — 19 hours ago

is there a better way to track schema changes without silently breaking downstream reports?

we have dbt models pushing schema changes to prod pretty regularly but downstream reports and bi dashboards keep breaking silently. no alerts, just find out when someone complains a week later.

current setup is basic git history + dbt docs but that doesn't catch when a column rename or type change nukes a join in some forgotten looker dashboard. tried adding pre-deploy checks with sqlfluff but it's too static, misses runtime impacts.

our team is small, 4 data engs handling 50+ models across prod/staging. leadership wants zero breakage but manually reviewing every pr is killing us.

anyone got a lightweight way to track this, like dbt macros that flag downstream deps, or some schema diff tool that pings slack on breaks? open source preferred since budget sucks. what've you seen work at scale without turning into a full ci nightmare?
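for what it's worth, the "schema diff tool that pings slack" part doesn't need a vendor. a stdlib-only sketch that diffs two column maps (e.g. pulled from dbt's catalog.json on main vs the PR branch) and posts breaks to a Slack incoming webhook; the `diff_schemas` name and the `{table: {column: type}}` shape are my assumptions, not a dbt API:

```python
import json
import urllib.request

def diff_schemas(old, new):
    """Flag breaking changes between two {table: {column: type}} mappings."""
    breaks = []
    for table, old_cols in old.items():
        new_cols = new.get(table)
        if new_cols is None:
            breaks.append(f"table dropped: {table}")
            continue
        for col, typ in old_cols.items():
            if col not in new_cols:
                breaks.append(f"column dropped: {table}.{col}")
            elif new_cols[col] != typ:
                breaks.append(f"type changed: {table}.{col} {typ} -> {new_cols[col]}")
    return breaks

def ping_slack(webhook_url, breaks):
    """POST the breaking changes to a Slack incoming webhook."""
    if not breaks:
        return
    payload = json.dumps({"text": "schema breaks:\n" + "\n".join(breaks)}).encode()
    req = urllib.request.Request(
        webhook_url, data=payload, headers={"Content-Type": "application/json"}
    )
    urllib.request.urlopen(req)
```

run it as a CI step after `dbt docs generate` on both branches; it only catches rename/drop/type breaks, not semantic ones, but that's most of the silent looker failures.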

curious how others avoid this treadmill.

reddit.com
u/Ok_Abrocoma_6369 — 20 hours ago

Best data observability platforms for data quality monitoring, lineage, and pipeline reliability?

We’re reviewing a bunch of vendors in the data reliability software space and trying to narrow things down. Quick thoughts so far:

monte carlo: strong enterprise presence, broad coverage across warehouses and bi tools, very polished but can feel heavy and expensive.

Bigeye: legacy stack support, decent anomaly detection, seems solid for teams not on modern data stack.

elementary: tbh stands out if you're running dbt, and they also seem to have Python support. it's deeply dbt-native and seems easier to operationalize. great visibility into data health, freshness, and lineage without overwhelming onboarding. the AI agents look promising, setup is straightforward, and it feels more aligned with analytics engineering workflows instead of forcing a separate platform mindset.

anomalo: heavy focus on ml based anomaly detection, good for automated insights but may require tuning, and is very enterprise heavy.

metaplane: modern UI, focuses on column level monitoring and anomalies, decent balance between automation and control.

soda: flexible and developer friendly, works well if you want more hands on control.

great expectations: more framework than platform, powerful for custom validations but requires engineering effort to scale properly.

for teams that are dbt heavy and want something opinionated but not bloated, elementary feels less intrusive and more practical compared to some of the bigger enterprise suites.

curious what others would prioritize: full automation and enterprise coverage, or tighter integration and lower operational overhead?

reddit.com
u/Ok_Abrocoma_6369 — 8 days ago