Logs vs traces in production issue resolution, honest opinions?
We had an outage this week that went sideways because logs and traces were telling completely different stories.
Payments service has been flaky with intermittent 5xxs. I spent a while in CloudWatch logs and found what looked like a clear null pointer in validation right before the bank API call. Same pattern we'd seen before. Looked straightforward.
Pushed a small hotfix to handle it. Quick review, deploy went through, things looked stable for 20 minutes.
Then everything started timing out.
Switched to traces and the picture was completely different. Same request IDs showed failures in the downstream bank integration, not even hitting the validation path I had changed.
Turned out our tracing was heavily sampled. Logs had full volume, traces did not, so the two weren't even representing the same traffic. I fixed something that was not actually the issue.
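To make the sampling mismatch concrete, here's a rough sketch with made-up numbers (not our actual config or traffic): with head-based sampling at 1%, similar in spirit to an OpenTelemetry `TraceIdRatioBased` sampler, traces only ever see about 1% of the request IDs the logs see, so a failure pattern that's obvious in logs can be almost invisible in traces.

```python
import random

random.seed(42)

# Made-up illustration: 10,000 requests, 2% fail at the downstream
# bank call. These numbers are hypothetical, not our real traffic.
requests = [
    {"id": i, "failed_at": "bank" if random.random() < 0.02 else None}
    for i in range(10_000)
]

# Logs capture every request; traces are head-sampled at 1%
# (hypothetical rate).
logged = requests
traced = [r for r in requests if random.random() < 0.01]

log_failures = sum(r["failed_at"] == "bank" for r in logged)
trace_failures = sum(r["failed_at"] == "bank" for r in traced)

print(f"logs saw {log_failures} bank failures out of {len(logged)}")
print(f"traces saw {trace_failures} bank failures out of {len(traced)} sampled")
```

With numbers like these, the logs show a couple hundred failures while the sampled traces show a handful, which is exactly why the two signals told different stories for us.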
We ended up chasing the wrong path for hours. Root cause was a timeout mismatch after a cert change on the bank side.
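For anyone curious about the timeout side: the shape of the bug was roughly that the bank's cert change made connection setup slower than our client's connect timeout allowed. A minimal sketch, with entirely made-up durations and a hypothetical `call_succeeds` helper, just to show the mismatch:

```python
# Made-up numbers to show the shape of the bug, not our real settings.
CLIENT_CONNECT_TIMEOUT_S = 0.5   # our side (hypothetical)
BANK_HANDSHAKE_BEFORE_S = 0.2    # before the cert change (hypothetical)
BANK_HANDSHAKE_AFTER_S = 1.2     # after the cert change (hypothetical)

def call_succeeds(handshake_s: float, connect_timeout_s: float) -> bool:
    """A call only gets past connection setup if the TLS handshake
    finishes inside the client's connect timeout."""
    return handshake_s <= connect_timeout_s

print(call_succeeds(BANK_HANDSHAKE_BEFORE_S, CLIENT_CONNECT_TIMEOUT_S))  # True
print(call_succeeds(BANK_HANDSHAKE_AFTER_S, CLIENT_CONNECT_TIMEOUT_S))   # False
```

Once the handshake crossed our connect timeout, every call failed the same way, which is why it looked like a total outage rather than the earlier intermittent 5xxs.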
We have fixed it now, but most of the time lost came from following the wrong signal.
How do others deal with this? When logs and traces disagree, what do you trust first, and how do you validate before acting?