u/AhmedMostafa16

▲ 0 r/sre

Not a promotion. I'm genuinely trying to validate whether this is worth building. Honesty is appreciated.

When an alert fires across multiple services, the room splits before the investigation even begins, with engineers opening different dashboards and coming up with different theories. I'm wondering if there's a better way to eliminate that alignment phase entirely.

The idea: an open source SDK that continuously records events as your services handle requests, before any alert fires. When an alert arrives, whether via PagerDuty, a direct webhook, or a Slack command, it assembles the causal chain from data already captured in the window leading up to the alert. By the time the war room opens, the artifact is ready. Everyone reads the same thing.
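To make the "record first, assemble on alert" part concrete, here is a minimal sketch of a rolling in-memory buffer. Nothing here exists yet: `Event`, `RollingRecorder`, the field names, and the retention and lookback defaults are all hypothetical, chosen only for illustration.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, List, Optional


@dataclass(frozen=True)
class Event:
    # Hypothetical event shape; a real SDK would likely carry more fields.
    event_id: str
    service: str
    timestamp: float           # seconds since epoch
    parent_id: Optional[str]   # event that directly caused this one, if any


class RollingRecorder:
    """Keeps only the most recent `retention_s` seconds of events in memory."""

    def __init__(self, retention_s: float = 300.0) -> None:
        self.retention_s = retention_s
        self._events: Deque[Event] = deque()

    def record(self, event: Event) -> None:
        # Assumption: events arrive in roughly increasing timestamp order.
        self._events.append(event)
        cutoff = event.timestamp - self.retention_s
        while self._events and self._events[0].timestamp < cutoff:
            self._events.popleft()

    def window(self, alert_ts: float, lookback_s: float = 60.0) -> List[Event]:
        """Return events captured in the `lookback_s` seconds up to the alert."""
        start = alert_ts - lookback_s
        return [e for e in self._events if start <= e.timestamp <= alert_ts]
```

When an alert lands, the handler would call `window(alert_ts)` and hand the result to whatever builds the chain. No new data is collected at alert time, which is the whole point.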

For example: a 500 on order-service, traced back through api-gateway ➡️ pricing-service ➡️ inventory-service ➡️ notification-service. Five services, full causal chain, zero gaps, all assembled before the war room opened.

The artifact would be deterministic. Each step names its predecessor explicitly. Every event is recorded at runtime by the SDK. No inference, no probabilistic correlation, no AI slop. If a service wasn't instrumented, that gap appears honestly in the chain, labelled and explained. It shows what it knows and does not speculate about what it doesn't.
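A rough sketch of what "each step names its predecessor" could look like in practice. It assumes every recorded event carries an explicit `parent_id`; the `Event` shape, the `assemble_chain` function, and the gap wording are hypothetical. The walk only follows recorded links, and when a link points at something that was never recorded it stops and says so instead of guessing.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Event:
    # Hypothetical event shape for illustration only.
    event_id: str
    service: str
    parent_id: Optional[str]  # explicit predecessor, set at runtime


def assemble_chain(events: List[Event], failing_id: str) -> List[str]:
    """Walk parent links back from the failing event; return root-to-failure order."""
    by_id: Dict[str, Event] = {e.event_id: e for e in events}
    chain: List[str] = []
    seen = set()
    current: Optional[str] = failing_id
    while current is not None:
        if current in seen:
            # Defensive guard: malformed data must not loop forever.
            chain.append(f"GAP: predecessor loop at event {current}")
            break
        seen.add(current)
        event = by_id.get(current)
        if event is None:
            # The predecessor was named but never recorded: label it, don't infer.
            chain.append(f"GAP: event {current} was not recorded (service not instrumented?)")
            break
        chain.append(event.service)
        current = event.parent_id
    chain.reverse()
    return chain
```

The same input always produces the same chain, because the only operation is a dictionary lookup on an ID that was written down at runtime.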

Installing the SDK on one service would immediately surface the dependencies that service calls but that have no SDK installed. It would observe every outbound call and check whether the target is instrumented. I suspect most teams have at least one dependency they didn't know their services were calling. A coverage scorecard could rank these ghost services by call volume so you know where to instrument next.
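The scorecard itself could be as simple as a counter. In this sketch, `ghost_scorecard` is a hypothetical name, outbound calls are modelled as `(caller, target)` pairs, and the set of instrumented services is assumed to be known already; how the SDK would actually detect instrumentation on the target is left open.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple


def ghost_scorecard(
    outbound_calls: Iterable[Tuple[str, str]],
    instrumented: Set[str],
) -> List[Tuple[str, int]]:
    """Rank uninstrumented call targets by how often they are called."""
    counts = Counter(
        target for _caller, target in outbound_calls if target not in instrumented
    )
    # Highest call volume first; ties broken by name so the output is stable.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


calls = [
    ("order-service", "pricing-service"),
    ("order-service", "legacy-billing"),
    ("order-service", "legacy-billing"),
    ("order-service", "geo-lookup"),
]
scorecard = ghost_scorecard(calls, instrumented={"order-service", "pricing-service"})
```

Here `legacy-billing` and `geo-lookup` are made-up service names; the first would rank above the second because it is called twice.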

For teams on Kubernetes, a cluster operator (one Helm install) could watch pod crashes, OOM kills, evictions, node pressure, HPA scaling, and deployment rollouts, and map them into the same causal chain as your application traces. It would use a read-only ClusterRole and never mutate your resources.

The intent is not to replace Datadog or Grafana. It's to be a precursor. You read this first, then go to your existing tools, knowing exactly what you are looking for.

Does this solve a real pain point for you and your team? What would make you actually adopt something like this?

u/AhmedMostafa16 — 12 days ago