IAM drift keeps recurring… when do you turn a fix into a CI gate vs leave it as a runbook note?
A CI deploy fails due to IAM drift. A senior engineer finds it, fixes it, and closes the ticket.
Weeks later, different service, different engineer, same root cause shows up again.
I'm not asking about tooling or documentation here. Assume monitoring and CI/CD gates already exist and log a lot of this “class” of problems.
What I’m curious about is:
When you moved prevention earlier (CI / deploy / monitoring gates), how did you decide what stays as “incident-time knowledge” vs what gets promoted into a hard pre-deploy check?
For example: if an IAM drift issue is discovered during a deploy, do you treat the fix as:
something you only add to the postmortem/runbook (so the next engineer still has to recognize it), or something that becomes a CI gate like “fail deploy if IAM policy diff ≠ baseline”?
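To make the second option concrete, here is a rough sketch of the kind of gate I have in mind. It is only an illustration, not something we run: it assumes the baseline policy is committed to the repo as JSON and that an earlier pipeline step has already exported the live policy to a file. The names `normalize`, `has_drift`, and `gate` are made up for this example, and it ignores subtleties such as a single string vs. a one-element list in `Action`.

```python
import json
import sys


def normalize(node):
    """Return a canonical form of a policy fragment.

    List order is treated as not significant here (statements, actions,
    resources), so reordering alone is not reported as drift.
    """
    if isinstance(node, dict):
        return {key: normalize(value) for key, value in node.items()}
    if isinstance(node, list):
        items = [normalize(item) for item in node]
        # Sort by serialized form so lists of dicts are comparable too.
        return sorted(items, key=lambda item: json.dumps(item, sort_keys=True))
    return node


def has_drift(baseline, live):
    """True if the live policy differs from the baseline after normalizing."""
    return normalize(baseline) != normalize(live)


def gate(baseline_path, live_path):
    """Hypothetical CI entry point: returns 1 (fail the deploy) on drift."""
    with open(baseline_path) as f:
        baseline = json.load(f)
    with open(live_path) as f:
        live = json.load(f)
    if has_drift(baseline, live):
        print("IAM policy differs from baseline; failing deploy.", file=sys.stderr)
        return 1
    return 0
```

A pipeline step could call something like `sys.exit(gate("baseline.json", "live.json"))` (both file names are placeholders). The alternative is that the same comparison lives in a runbook and only gets run by whoever recognizes the symptom.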