u/Shoddy_5385

▲ 7 r/FinOps

engineering exports a giant CSV, finance asks why AWS is up 14%, engineering scrolls horizontally for 20 mins, nobody walks away with an answer. Familiar?

Tried a Sankey instead. Provider -> Account -> Resource Type -> Team. band width = dollars. You see where money flows in 3 seconds.
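
rough sketch of the data prep, in case it helps. it sums cost over each adjacent pair of levels so every band gets a dollar value. the column names (`provider`, `account`, `resource_type`, `team`, `cost`) and the sample rows are made up for illustration, your billing export will look different. output is a plain list of source/target/value dicts, which is the general shape most Sankey libs want, but check what yours expects.

```python
from collections import defaultdict

# Hypothetical column names; a real billing export will differ.
LEVELS = ["provider", "account", "resource_type", "team"]

def build_sankey_links(rows, levels=LEVELS, cost_key="cost"):
    """Sum cost over each adjacent pair of levels: one link per band."""
    totals = defaultdict(float)
    for row in rows:
        for left, right in zip(levels, levels[1:]):
            # Prefix node names with their level so a team and an account
            # that share a name never collapse into one node.
            source = f"{left}:{row[left]}"
            target = f"{right}:{row[right]}"
            totals[(source, target)] += float(row[cost_key])
    return [
        {"source": s, "target": t, "value": v}
        for (s, t), v in sorted(totals.items())
    ]

# Made-up sample data.
rows = [
    {"provider": "AWS", "account": "prod", "resource_type": "EC2", "team": "search", "cost": 900.0},
    {"provider": "AWS", "account": "prod", "resource_type": "S3", "team": "search", "cost": 60.0},
    {"provider": "AWS", "account": "dev", "resource_type": "EC2", "team": "platform", "cost": 40.0},
]
links = build_sankey_links(rows)
```

side effect of the level prefix: links only ever go from one level to the next, so this particular layout can't produce a cycle.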

What works:

  • eye finds the fat band immediately. tables make every row look equal even when one row is 90% of the bill.
  • month-over-month becomes "which bands got fatter?" and non-engineers can do that.
  • drill-in is a click, not a filter combo.
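
the "which bands got fatter" comparison can also be computed instead of eyeballed. minimal sketch, assuming each month's bands are a list of source/target/value dicts (that shape, the function name and the numbers are my assumptions, not from any particular tool):

```python
def band_deltas(prev_links, curr_links):
    """Return (source, target, delta) per band, biggest increase first."""
    prev = {(l["source"], l["target"]): l["value"] for l in prev_links}
    curr = {(l["source"], l["target"]): l["value"] for l in curr_links}
    deltas = [
        # A band missing from one month counts as 0 for that month.
        (s, t, curr.get((s, t), 0.0) - prev.get((s, t), 0.0))
        for (s, t) in prev.keys() | curr.keys()
    ]
    # Sort by delta descending, then by name so the order is stable.
    return sorted(deltas, key=lambda d: (-d[2], d[0], d[1]))

# Made-up numbers for two consecutive months.
april = [
    {"source": "AWS", "target": "prod", "value": 800.0},
    {"source": "AWS", "target": "dev", "value": 50.0},
]
may = [
    {"source": "AWS", "target": "prod", "value": 960.0},
    {"source": "AWS", "target": "dev", "value": 40.0},
]
deltas = band_deltas(april, may)
```

the top entry is the band to point finance at first.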

What doesn't:

  • bad tagging kills it. 60% untagged = giant grey blob and the CFO notices. Kinda useful tho, forces the tagging convo.
  • a single diagram is a snapshot, so it doesn't show the trend over time. Still need a line chart next to it.
  • harder to export for someone who wants to hand-edit in Excel.
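
on the tagging point: worth knowing how big the grey blob will be before the CFO sees it. small sketch, assuming rows are dicts with a `team` tag and a `cost` field (both names hypothetical) and that a missing, empty or whitespace-only tag counts as untagged:

```python
def untagged_share(rows, tag_key="team", cost_key="cost"):
    """Fraction of total spend whose tag is missing, empty, or blank."""
    total = sum(float(r[cost_key]) for r in rows)
    if total == 0:
        return 0.0  # nothing spent, nothing untagged
    untagged = sum(
        float(r[cost_key])
        for r in rows
        if not (r.get(tag_key) or "").strip()
    )
    return untagged / total

# Made-up sample data: 600 of 1000 dollars has no usable team tag.
rows = [
    {"team": "", "cost": 450.0},
    {"team": None, "cost": 150.0},
    {"team": "search", "cost": 400.0},
]
share = untagged_share(rows)
```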

anyone built one in-house? what library? we ended up on D3 after a few higher-level libs couldn't handle cycles or sub-band labels. and does your finance team actually use it, or just ask for the CSV anyway?

u/Shoddy_5385 — 7 days ago

been reading the pocketos incident and honestly this feels less like an “ai failure” and more like a basic setup issue

  • one api key could delete prod + backups
  • backups in the same environment as prod
  • no confirmation before destructive actions
  • ~30 hours downtime, some data gone

ai made it faster, but this was already one mistake away from happening. feels like a reminder that early stage teams skip boring infra safety until something breaks

curious how other founders think about this:

  • do you actually separate backups from prod early on?
  • do you restrict delete permissions or just move fast?
  • where do you draw the line between speed vs safety?

trying to understand what’s realistic vs overkill at an early stage
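
for the first two questions, one cheap check is to list every key that can delete both prod and backups. sketch below. the `Grant` shape, the resource names and the key ids are all hypothetical, this is not any provider's real permission model, so you'd have to map your own grants into it:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Grant:
    key_id: str    # hypothetical key identifier
    action: str    # e.g. "read", "write", "delete"
    resource: str  # e.g. "prod-db", "backups"

def keys_that_can_wipe_everything(grants, prod="prod-db", backups="backups"):
    """Return ids of keys holding delete on both prod and backups."""
    deletable = {}
    for g in grants:
        if g.action == "delete":
            deletable.setdefault(g.key_id, set()).add(g.resource)
    return sorted(k for k, res in deletable.items() if {prod, backups} <= res)

# Made-up grants for illustration.
grants = [
    Grant("deploy-key", "delete", "prod-db"),
    Grant("deploy-key", "delete", "backups"),
    Grant("backup-key", "write", "backups"),
    Grant("app-key", "delete", "prod-db"),
]
risky = keys_that_can_wipe_everything(grants)
```

anything that shows up in `risky` is a single point of failure for both the data and its recovery copy.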

u/Shoddy_5385 — 9 days ago
▲ 54 r/devops

been reading the pocketos incident and the takes feel off. everyone is focused on “ai agent deleted a startup”, but if you remove the ai part, this is just infra failing hard:

  • one api key had delete access to prod + backups
  • backups were in the same railway env as prod
  • no confirmation step before destructive actions
  • ~30 hours of downtime, some data gone for good

this could’ve been a bad script, leaked key, or someone half-asleep running a command. the agent just did it faster.

the only actually new thing here is the transcript after. it literally says it ignored instructions, guessed instead of checking, then explains what it did. that part is wild.

if you’re running agents on prod right now without hard guardrails, this is on you. what are people actually doing here?

read only by default? scoped creds per task? manual approval for deletes? or just hoping nothing goes wrong?

trying to figure out what’s overkill vs what should already be standard after this.
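
to make the three options concrete, here's a minimal sketch of what they could look like as one gate in front of an agent's tool calls. everything here (`TaskCredential`, `authorize`, the action names) is hypothetical and not tied to any agent framework or cloud API. real enforcement should live in the provider's permission system, not only in app code like this:

```python
from dataclasses import dataclass, field

# Hypothetical action names treated as destructive.
DESTRUCTIVE = {"delete", "drop", "truncate"}

class ApprovalRequired(Exception):
    """Raised when a destructive action has no human approver."""

@dataclass
class TaskCredential:
    task: str
    # Read-only by default; anything else must be granted per task.
    allowed_actions: set = field(default_factory=lambda: {"read"})

def authorize(cred, action, approved_by=None):
    """Allow an action only if it is in scope and, if destructive, approved."""
    if action not in cred.allowed_actions:
        raise PermissionError(f"{cred.task!r} is not scoped for {action!r}")
    if action in DESTRUCTIVE and not approved_by:
        raise ApprovalRequired(f"{action!r} needs a human approver")
    return True

# Usage: a default credential can read but not delete.
reader = TaskCredential(task="summarize-logs")
authorize(reader, "read")

# A cleanup task is scoped for delete but still needs a named approver.
cleanup = TaskCredential(task="purge-staging", allowed_actions={"read", "delete"})
authorize(cleanup, "delete", approved_by="oncall@example.com")
```

calling `authorize(reader, "delete")` raises `PermissionError`, and `authorize(cleanup, "delete")` without an approver raises `ApprovalRequired`.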

u/Shoddy_5385 — 9 days ago