u/DiamondLatter1842

▲ 5 r/sre

Logs vs traces in production issue resolution, honest opinions?

We had an outage this week that went sideways because logs and traces were telling completely different stories.

Payments service has been flaky with intermittent 5xxs. I spent a while in CloudWatch logs and found what looked like a clear null pointer in validation right before the bank API call. Same pattern we'd seen before. Looked straightforward.

Pushed a small hotfix to handle it. Quick review, deploy went through, things looked stable for 20 minutes.

Then everything started timing out.

Switched to traces and the picture was completely different. The same request IDs showed failures in the downstream bank integration; they never even hit the validation path I had changed.

Turned out our tracing was heavily sampled. Logs had full volume, traces did not. So they weren't representing the same traffic. I fixed something that was not actually the issue.
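In hindsight, a quick coverage check before acting would have flagged the mismatch. Here is a rough sketch of that idea in Python, not what we actually ran: it assumes you can export the failing request IDs from logs and the request IDs that have a trace as two plain lists. The names `trace_coverage` and `Coverage` and the sample IDs are made up for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Set


@dataclass
class Coverage:
    traced: Set[str]    # failing request IDs that also have a trace
    untraced: Set[str]  # failing request IDs with no trace (likely sampled out)

    @property
    def ratio(self) -> float:
        """Fraction of failing requests that are visible in tracing."""
        total = len(self.traced) + len(self.untraced)
        return len(self.traced) / total if total else 0.0


def trace_coverage(log_error_ids: Iterable[str],
                   traced_ids: Iterable[str]) -> Coverage:
    """Compare failing request IDs from logs against IDs that have traces."""
    errors = set(log_error_ids)
    traced = set(traced_ids)
    return Coverage(traced=errors & traced, untraced=errors - traced)


# Hypothetical sample data: four failures in logs, only one of them traced.
cov = trace_coverage(
    log_error_ids=["req-1", "req-2", "req-3", "req-4"],
    traced_ids=["req-2", "req-9"],
)
print(f"{cov.ratio:.0%} of failing requests have a trace")
# prints "25% of failing requests have a trace"
```

If the ratio is low, the traces are describing a small slice of the failures, and any conclusion drawn from either signal alone should be treated as a hypothesis rather than a diagnosis.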

We ended up chasing the wrong path for hours. Root cause was a timeout mismatch after a cert change on the bank side.

We have fixed it now, but most of the lost time came from following the wrong signal.

How do others deal with this? When logs and traces disagree, what do you trust first, and how do you validate before acting?

u/DiamondLatter1842 — 1 day ago

We used to do storytime every night without a fight. Somewhere in the last year that completely flipped. Now it's a battle every single time. He pushes back on everything, wants his tablet, picks a fight over which book, stalls until I give up.

I have tried no screens before bed, earlier start times, letting him pick whatever he wants to read. None of it is consistent. Some nights it clicks and we get through a whole chapter. Most nights it is a standoff.

I cut his tablet time and that helped a little but now he is just more wound up by the time we get to reading. Like the frustration carries over.

How do I handle this situation?

u/DiamondLatter1842 — 7 days ago

This is starting to feel like a pattern and I don't know how to break it.

deploy goes out. ci passed, staging clean, diff looked reasonable. prod holds for a bit then something starts behaving wrong. not crashing, not throwing errors, just not doing what it's supposed to do. wrong calculations, unexpected branching, edge cases hitting paths that should never get hit.

The problem is all my observability is pointed at infrastructure. i know when cpu spikes, when memory climbs, when error rates move. i have no visibility into which paths the code actually takes in prod unless i manually add instrumentation, and by then i am adding it after the fact to debug something that already happened. Feels like there's a gap between "the system is healthy" and "the code is behaving correctly." metrics cover the first one. nothing i have covers the second.
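To make the gap concrete, this is roughly what the manual instrumentation looks like: a minimal Python sketch of in-process path counters. Everything here is hypothetical (`mark`, `snapshot`, and the `apply_discount` example are invented for illustration), and a real setup would export the counts to whatever metrics backend is already in place instead of keeping them in memory.

```python
import threading
from collections import Counter

_path_counts: Counter = Counter()
_lock = threading.Lock()


def mark(path_name: str) -> None:
    """Record that a named code path was taken."""
    with _lock:
        _path_counts[path_name] += 1


def snapshot() -> dict:
    """Return a copy of the current path counts."""
    with _lock:
        return dict(_path_counts)


def apply_discount(total_cents: int, is_member: bool) -> int:
    """Hypothetical business function with its branches marked."""
    if total_cents < 0:
        # A path that "should never get hit"; counting it shows when it is.
        mark("apply_discount.negative_total")
        return 0
    if is_member:
        mark("apply_discount.member")
        return total_cents * 90 // 100
    mark("apply_discount.non_member")
    return total_cents


apply_discount(1000, True)
apply_discount(-5, False)
print(snapshot())
```

A nonzero count on a branch like `apply_discount.negative_total` is the kind of signal infrastructure metrics never surface, but it only exists for branches someone thought to mark ahead of time.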

what are you using for this in prod? is this just better tracing or is there a different category of tool that actually shows you what your functions are doing with real traffic?

u/DiamondLatter1842 — 15 days ago