u/MembershipUnited5355

▲ 0 r/devops

IAM drift keeps recurring… when do you turn a fix into a CI gate vs leave it as a runbook note?

A CI deploy fails due to IAM drift. The senior engineer finds it, fixes it, closes the ticket.

Weeks later, different service, different engineer, same root cause shows up again.

Not asking about tooling or documentation here. Assume monitoring and CI/CD gates already exist and log a lot of this “class” of problem.

What I’m curious about is:

When you moved prevention earlier (CI / deploy / monitoring gates), how did you decide what stays as “incident-time knowledge” vs what gets promoted into a hard pre-deploy check?

For example: if an IAM drift issue is discovered during a deploy, do you treat the fix as:

- something you only add to the postmortem/runbook (so the next engineer still has to recognize it), or
- something that becomes a CI gate, like “fail the deploy if the IAM policy diff ≠ baseline”?
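For anyone weighing the second option, a minimal sketch of what that gate could look like. Assumptions: you already export the live policy document to JSON somewhere in the pipeline (e.g. via your IaC tool or the cloud CLI) and keep a committed baseline; the file names and the `policy_drift` helper are made up for illustration.

```python
import json
import sys


def policy_drift(baseline: dict, current: dict) -> list:
    """Return human-readable differences between two IAM policy documents.

    Compares each top-level key (Version, Statement, ...) by its
    canonical JSON form, so key ordering inside the docs doesn't matter.
    """
    diffs = []
    for key in sorted(set(baseline) | set(current)):
        b = json.dumps(baseline.get(key), sort_keys=True)
        c = json.dumps(current.get(key), sort_keys=True)
        if b != c:
            diffs.append(f"{key}: baseline={b} current={c}")
    return diffs


if __name__ == "__main__":
    # Hypothetical usage in CI: python iam_gate.py baseline.json current.json
    with open(sys.argv[1]) as f1, open(sys.argv[2]) as f2:
        drift = policy_drift(json.load(f1), json.load(f2))
    if drift:
        print("IAM drift detected; failing deploy:")
        for d in drift:
            print("  " + d)
        sys.exit(1)  # nonzero exit fails the CI step
    print("IAM policy matches baseline")
```

The point of keeping it this dumb is that the gate encodes the incident-time knowledge (“this diff class bit us before”) rather than relying on the next engineer to recognize the symptom.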

reddit.com
u/MembershipUnited5355 — 3 days ago
▲ 0 r/sre

What’s one concrete change that made repeat incidents cheaper to diagnose instead of re-learning the same root cause each time?

Something I keep noticing after production incidents:

The fix gets merged, the immediate issue is resolved, and everyone moves on.

A few months later, a very similar failure happens again. Different symptoms, same underlying cause. The team ends up re-deriving the same debugging path from scratch because the useful part of the last incident never really became operational knowledge.

Sometimes there’s a runbook, but it explains what happened instead of what to check first next time. Sometimes the context behind a mitigation or alert threshold only exists in someone’s head.

Feels like less of a monitoring/tooling issue and more of a “decision memory” issue.

For teams that are actually good at reducing repeat debugging effort: what concretely changes after an incident? Not asking about tools so much as process, habits, ownership, review steps, escalation flow, etc.

u/MembershipUnited5355 — 3 days ago