u/Such_Rhubarb8095

24/7 support at scale sounds great until your team hits its limits

Traffic spikes at 3am because teams across Asia are waking up, after-hours tickets start coming in from Europe, and a small team of 12 is expected to keep everything stable without things breaking down. Response times look fine at first, but after a day or two they start slipping because people simply can't sustain constant coverage without breaks.

Leadership talks a lot about decoupling support capacity from headcount, usually followed by some push toward automation or AI that is supposed to absorb the load. In reality it often just shifts the work somewhere else, or adds another layer of things to monitor when something goes wrong.

We've tried scaling with contractors during peaks, but they don't stay consistent when demand is actually high. Self-service tools help in theory, but users still escalate everything the moment it gets even slightly unclear. Dashboards always promise smooth, infinite scale, but the real world still needs people to step in when things break.

How are teams actually handling 24/7 global support at scale without burning out or everything collapsing behind the scenes?

u/Such_Rhubarb8095 — 22 hours ago

We have been absolutely drowning in password reset requests. I am talking 500 a week across our 2,000-person organization. Same template response every single time. So I built what I thought was a clever automation in our ticketing system to detect incoming password resets by keyword and auto-respond with our standard troubleshooting steps and a link to the self-service portal. Seemed foolproof. I tested it on a few tickets in a dev environment and it worked perfectly. Deployed it the next morning feeling pretty good about finally solving a major pain point for my team. We were going to save maybe 30 hours a week of repetitive work.

By 11 AM we were getting angry emails. By noon my manager was pulling me into a call. The automation was matching on any ticket that contained the word "reset". We had finance tickets about password resets in old systems. HR tickets about employees resetting their start dates. Facilities tickets about reset procedures for building access.

The automation was also overwriting the original ticket content with the generic troubleshooting response, so the actual problem statement was getting lost. The support team could not figure out what people needed because the real ticket body was gone. We had to manually restore like 150 tickets yesterday.
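
For anyone wondering what the rule should have looked like, this is roughly the shape of the fix I'm working on now: match a tight phrase on the subject line only, require the IT queue, and post a comment instead of touching the body. Rough Python sketch; the ticket fields and the api calls are made-up placeholders for whatever your ticketing system actually exposes, not our real system.

```python
import re

# Stricter match: "password reset" as a phrase, not any ticket containing "reset".
PASSWORD_RESET = re.compile(r"\bpassword\s+reset\b|\breset\s+(my\s+)?password\b", re.I)

TEMPLATE = (
    "Looks like you need a password reset. Try the self-service portal first: "
    "https://selfservice.example.com/reset\n"
    "If that doesn't work, reply here and a tech will pick this up."
)

def should_auto_respond(ticket: dict) -> bool:
    # Only consider tickets routed to the IT queue, and only match on the
    # subject line, so HR "reset start date" and facilities "badge reset"
    # tickets don't trip it.
    if ticket.get("queue") != "it-service-desk":
        return False
    return bool(PASSWORD_RESET.search(ticket.get("subject", "")))

def handle_new_ticket(ticket: dict, api) -> None:
    if should_auto_respond(ticket):
        # Append a comment; never overwrite the description the user wrote.
        api.add_public_comment(ticket["id"], TEMPLATE)
        api.add_tag(ticket["id"], "auto-response/password-reset")
```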

I had to turn it off within an hour, but the damage was done. We spent all day cleaning up the mess and dealing with pissed-off end users and departments. My manager was surprisingly chill about it but I feel like absolute garbage. I genuinely thought I was helping. Should have tested more carefully instead of rushing it.

Anyone else done something this stupid with automation?

u/Such_Rhubarb8095 — 9 days ago

So upper management comes in last month all excited about this 98% automated AI triage thing. Intelligent agents that categorize, prioritize, and route tickets instantly based on intent and urgency. Sounds great on paper, right? No more manual gatekeeper role, just pure efficiency. I was like sure, let's give it a shot, anything to cut down on the endless password reset spam.

Week one, tickets drop 70 percent. Team high-fives all around. Week two, we notice the pattern. The AI is ruthlessly triaging everything to self-service or closing low-priority stuff outright. Users freak out because their emergency Wi-Fi issue gets auto-closed as a known problem with a link to reboot instructions. Meanwhile real fires like prod outages sit in limbo because the AI deems them medium based on keywords.

Now half our day is undoing AI mistakes. False positives on urgency routing critical CRM bugs to tier 1 bots that just spit back generic KB articles. Tier 1 agents twiddling their thumbs because the AI swallowed all the simple stuff but left them the edge cases they aren't trained for. And the best part: the 2% that needs humans is all the weird shit that breaks SLAs because no one touches it for days.
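
What I keep sketching out for management is something with actual guardrails: let the model suggest, only auto-deflect when it's confident and the category is on an allowlist, and never let it downgrade anything that smells like an outage. Rough Python sketch; classify() and the field names are placeholders, not the vendor's real API.

```python
# Guardrails around an AI triage suggestion. classify() stands in for whatever
# the vendor's model returns; all field names here are made up.

AUTO_CLOSE_ALLOWED = {"password_reset", "how_to", "license_request"}
NEVER_DOWNGRADE = ("outage", "down", "prod", "production", "data loss")

def triage(ticket: dict, classify) -> dict:
    result = classify(ticket["subject"] + "\n" + ticket["body"])
    text = (ticket["subject"] + " " + ticket["body"]).lower()

    # Anything that smells like an outage goes straight to a human at high
    # priority, no matter what urgency the model assigned.
    if any(word in text for word in NEVER_DOWNGRADE):
        return {"action": "route_to_human", "priority": "high", "reason": "outage keyword"}

    # Only auto-deflect when the model is confident AND the category is one
    # we've explicitly decided is safe to send to self-service.
    if result["confidence"] >= 0.9 and result["category"] in AUTO_CLOSE_ALLOWED:
        return {"action": "send_self_service", "category": result["category"]}

    # Everything else: the model can suggest a category and priority, but a
    # human confirms before the ticket is closed or deprioritized.
    return {"action": "route_to_human", "suggested": result}
```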

Eliminating the gatekeeper sounded smart until you realize the AI apparently makes payroll decisions too. My queue is now AI rejects and escalations from pissed-off users who hate talking to bots. Feels like we traded a human bottleneck for a silicon one that lies about priorities.

Anyone running 98% AI triage without wanting to unplug it all?

u/Such_Rhubarb8095 — 10 days ago

Been dealing with this at work and it's driving me nuts. We run scans every week with one of the big-name tools, get flooded with high CVSS scores, patch what we can, but then bam, something critical slips through and we get hit. Last month it was a vuln nobody prioritized because it wasn't a top score, but attackers already had exploits ready.
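
One thing I've started playing with is cross-referencing the scanner's CVE export against what is actually being exploited instead of sorting by CVSS alone. Rough Python sketch; the scan export format is made up, and the URLs are the public CISA KEV and FIRST EPSS feeds as far as I know.

```python
import json
import urllib.request

# Cross-reference scan findings against actively exploited CVEs.
# The scan_results format is invented; swap in whatever your scanner exports.
KEV_URL = "https://www.cisa.gov/sites/default/files/feeds/known_exploited_vulnerabilities.json"
EPSS_URL = "https://api.first.org/data/v1/epss?cve={cve}"

def load_kev_cves() -> set:
    # CVEs in CISA's Known Exploited Vulnerabilities catalog.
    with urllib.request.urlopen(KEV_URL) as resp:
        kev = json.load(resp)
    return {item["cveID"] for item in kev.get("vulnerabilities", [])}

def epss_score(cve: str) -> float:
    # Probability-of-exploitation score from FIRST's EPSS API (0.0 if unknown).
    with urllib.request.urlopen(EPSS_URL.format(cve=cve)) as resp:
        data = json.load(resp)
    rows = data.get("data", [])
    return float(rows[0]["epss"]) if rows else 0.0

def prioritize(scan_results: list[dict]) -> list[dict]:
    kev = load_kev_cves()
    for finding in scan_results:
        finding["in_kev"] = finding["cve"] in kev
        finding["epss"] = epss_score(finding["cve"])
    # Known-exploited first, then exploit likelihood, CVSS only as a tiebreaker.
    return sorted(scan_results,
                  key=lambda f: (f["in_kev"], f["epss"], f.get("cvss", 0)),
                  reverse=True)

# Example: prioritize([{"cve": "CVE-2024-1234", "cvss": 7.5, "host": "web01"}])
```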

Makes me wonder if we're relying too much on scores and not thinking enough about whether something is actually being targeted. Anyone else seeing this? What's actually working for you to catch the stuff that matters before it's too late, switching tools or changing the process?

u/Such_Rhubarb8095 — 15 days ago

Hi y'all, so we're an internal IT team supporting 520 users across 8 departments (finance, ops, support, etc.) with 480 endpoints total. Current setup: NinjaOne for monitoring and patching.

Separate ticketing system (about 135 tickets/week on average)

Reality lately:

130-150 tickets/week, spikes to 180 during patch cycles.

Patch rollouts overlapping with live user issues almost every time.

Same device triggering multiple alerts (CPU, patch failure, service issue) with no clear priority.

Tickets being created manually even when alerts already exist in NinjaOne.

Users reporting issues 10-20 mins before anything shows up in monitoring.

The biggest issue is fragmentation. Real example from yesterday (finance team laptop): a patch fails during rollout and gets flagged in NinjaOne. Twelve minutes later the user submits a ticket ("app not opening"). The tech picks it up, has to check the ticket system, then jump into NinjaOne, then remote in, and meanwhile another alert fires for the same device (service restart failed).

So now it's one issue, multiple alerts, one ticket, and three different places to piece it together. Multiply that by 140 tickets/week and it becomes pure context switching all day. We've tried tightening alert thresholds, scheduling patch windows differently, and improving ticket tagging. It helps a bit, but doesn't solve the core issue: everything lives in separate layers. We've even looked into consolidating tools (platforms like Kaseya came up internally), but I'm not convinced switching wouldn't just replace one type of friction with another. At this point it feels less like a tooling issue and more like the workflow itself isn't built for this scale.
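
One thing we're prototyping in the meantime (no idea yet if it holds up) is a tiny correlation layer in front of ticket creation: alerts for the same device inside a time window get attached to one ticket instead of spawning new ones. Rough Python sketch; the payload fields and the ticket_api calls are placeholders, not real NinjaOne or ticketing-system APIs.

```python
import time

# Naive in-memory correlation: alerts for the same device within
# CORRELATION_WINDOW seconds are appended to one ticket instead of
# creating new ones. Field names and ticket_api calls are placeholders.
CORRELATION_WINDOW = 15 * 60  # 15 minutes

# device_id -> (ticket_id, timestamp of first alert in the window)
open_incidents: dict[str, tuple[str, float]] = {}

def handle_alert(alert: dict, ticket_api) -> str:
    device = alert["device_id"]
    now = time.time()

    existing = open_incidents.get(device)
    if existing and now - existing[1] < CORRELATION_WINDOW:
        # Same device, same window: attach to the existing ticket.
        ticket_id = existing[0]
        ticket_api.append_to_ticket(ticket_id, f"[{alert['type']}] {alert['message']}")
    else:
        # New incident: open one ticket that all related alerts hang off.
        ticket_id = ticket_api.create_ticket(
            subject=f"{device}: {alert['type']}",
            body=alert["message"],
            tags=["auto-correlated"],
        )
        open_incidents[device] = (ticket_id, now)
    return ticket_id
```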

How are other enterprise teams handling this???

u/Such_Rhubarb8095 — 17 days ago