Honestly, I’m done with the current "AI for SRE" hype.
Every time there’s a SEV-1, the routine is the same: hop into the cloud console to grab logs, jump to an observability dashboard for metrics, and then dig through cryptic Jira tickets from months ago just to see if anyone actually wrote a post-mortem the last time this broke.
It feels like we’re still just doing manual legwork, only now we have a chat box on the side. The "AI revolution" feels like a joke when we’re still spending 80% of our time just gathering context during a fire.
Our team started experimenting with a different approach to solve this, and I want to see if we’re overthinking it:
- The "Fetch" Approach: Instead of us hunting for data, we’re testing a model that uses read-only connectors to our K8s/Cloud accounts. The idea is to have the system fetch the relevant metrics, logs, and historical context the moment an incident is reported, before a human even opens the dashboard.
- Audit Trail over "Magic": We’ve moved away from black-box RCAs. We’re forcing the system to show every step—"I checked this metric, it was normal; I correlated this error log with a recent deployment." If it's wrong, you can see exactly where the logic tripped.
- Operational Memory: We’re trying to write every validated RCA back into a semantic network so the system "learns" the specific quirks of our infra, instead of starting from zero every time a node flaps.
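To make the "audit trail over magic" idea concrete, here’s a minimal sketch of how we structure it. Everything here is hypothetical illustration, not our actual code: the class names, the stubbed connectors, and the `judge` heuristic are all made up. The real connectors would be read-only calls into K8s/cloud APIs; the point is just that every fetch gets recorded as an explicit step, so a wrong RCA shows you exactly where the logic tripped.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class AuditStep:
    """One recorded reasoning step: what was checked and what was found."""
    source: str      # e.g. "metrics", "logs", "deploy-history"
    query: str       # what the system asked for
    finding: str     # what it observed
    relevant: bool   # whether it fed into the final RCA

@dataclass
class IncidentContext:
    incident_id: str
    steps: list[AuditStep] = field(default_factory=list)

    def check(self, source: str, query: str, fetch: Callable[[], str],
              judge: Callable[[str], bool]) -> str:
        """Run a read-only fetch, record it as a step, return the raw finding."""
        finding = fetch()
        self.steps.append(AuditStep(source, query, finding, judge(finding)))
        return finding

    def audit_trail(self) -> str:
        """Human-readable trace of every step, relevant or not."""
        return "\n".join(
            f"[{'+' if s.relevant else ' '}] {s.source}: {s.query} -> {s.finding}"
            for s in self.steps
        )

# Usage with stubbed connectors (real ones would hit the K8s/cloud APIs):
ctx = IncidentContext("SEV-1234")
ctx.check("metrics", "p99 latency, api-gateway, last 30m",
          fetch=lambda: "p99 stable at 120ms",
          judge=lambda f: "stable" not in f)
ctx.check("deploy-history", "deploys touching api-gateway, last 2h",
          fetch=lambda: "payments-svc v2.14 rolled out 22m ago",
          judge=lambda f: "rolled out" in f)
print(ctx.audit_trail())
```

The "operational memory" piece is then just persisting validated `IncidentContext` objects (plus the confirmed RCA) into whatever store you use, keyed so future incidents on the same service can pull them back as prior context.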
I’m genuinely curious about your perspective from the trenches:
- Does this "automated context gathering" actually sound like it would save you time, or is it just more noise to filter through?
- And the big one: Would your SecOps team ever even consider giving an AI-native tool read-only access to your telemetry metadata, or is that a non-starter?
We’ve been dogfooding this approach with a small prototype we’re building, and while it works for us, I suspect production environments at scale will break it in ways I haven’t considered. I’d love to hear some brutal technical critiques.