What’s one performance metric you wish more teams watched earlier?
Been thinking about this after a rough on-call week.
Everyone watches CPU and memory. What actually tipped us off that something was wrong, though, was queue depth and lock waits. Both had been creeping up for days before anything obvious showed up in the dashboards.
What signals have saved you from a bad production incident? Replication lag, lock waits, queue depth… what else do teams watch that doesn’t get enough attention?