We run a mix of AWS and GCP across a few teams, and every month there's some surprise cost spike from instances or clusters that got scaled up and never came back down.
Right now we rely on basic alerts like CPU thresholds, but that's too late; by the time something triggers, the cost is already there. I'm trying to figure out how to catch this earlier, not just after the fact, but at the point where something is being overprovisioned or scaled incorrectly.
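For context, the closest we've gotten to "earlier" is a daily sweep roughly like the sketch below (Python/boto3 on the AWS side only; the thresholds, lookback window, and function name are just ours, and GCP would need its own equivalent). It flags running instances that have been mostly idle over the last day, which catches creep within hours instead of at the bill, but it still feels reactive:

```python
import datetime
import boto3

cloudwatch = boto3.client("cloudwatch")
ec2 = boto3.client("ec2")

LOOKBACK_HOURS = 24
CPU_IDLE_THRESHOLD = 10.0  # percent; anything averaging below this is suspect


def underutilized_instances():
    """Flag running EC2 instances whose average CPU over the lookback window is low."""
    now = datetime.datetime.utcnow()
    start = now - datetime.timedelta(hours=LOOKBACK_HOURS)
    flagged = []

    reservations = ec2.describe_instances(
        Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
    )["Reservations"]

    for res in reservations:
        for inst in res["Instances"]:
            # Hourly average CPU for this instance over the lookback window
            stats = cloudwatch.get_metric_statistics(
                Namespace="AWS/EC2",
                MetricName="CPUUtilization",
                Dimensions=[{"Name": "InstanceId", "Value": inst["InstanceId"]}],
                StartTime=start,
                EndTime=now,
                Period=3600,
                Statistics=["Average"],
            )
            datapoints = stats["Datapoints"]
            if not datapoints:
                continue
            avg_cpu = sum(d["Average"] for d in datapoints) / len(datapoints)
            if avg_cpu < CPU_IDLE_THRESHOLD:
                flagged.append((inst["InstanceId"], inst["InstanceType"], round(avg_cpu, 1)))

    return flagged


if __name__ == "__main__":
    for instance_id, instance_type, cpu in underutilized_instances():
        print(f"{instance_id} ({instance_type}) averaged {cpu}% CPU over the last {LOOKBACK_HOURS}h")
```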
We looked at a few cost-management tools, but they feel heavy for what we need and don't really address the underlying issue.
What’s actually working for you to catch overprovisioning early without constant manual tracking?