u/Heavy_Banana_1360

▲ 0 r/golang

CVE reduction gone wrong: 2GB container images deployed and audited in production

Our security team decided to tackle our CVE backlog by building minimal container images. Minimal ended up meaning strip everything, then add it all back when builds started failing. We shipped 2GB images to production last month.

A compliance auditor showed up yesterday for a routine check and asked why our container images were the size of small VMs. I had to explain to our CTO why our CVE reduction effort tripled our deployment bandwidth usage and made our security posture look worse on paper than before we started.

We didn't catch it ourselves because everything worked. Images deployed, services ran, CVE numbers went down. Nobody checked actual image size because that wasn't the metric we were watching. The debug utilities and build dependencies that crept back in during troubleshooting just stayed there.

Pull times went from 2 minutes to 8. That showed up in deploy metrics but we blamed the registry.

The thing I keep coming back to is that we had no automated check on image composition after the build. CVE count was the only signal we were watching and it told us we were fine.

Has anyone actually solved the image composition validation problem in CI? Something that catches bloat before it gets to production, not just CVE count.
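
To make it concrete, here's roughly the kind of gate I'm picturing (a sketch, not something we actually run): a small Go program the pipeline calls after the build that fails if the image blows past a size budget. The image tag argument and the 300 MiB budget are made up for illustration, and it assumes the docker CLI is available on the runner; it leans on docker image inspect's --format flag to get the size in bytes.

```go
// imagegate: fail the CI step if the built image exceeds a size budget.
// Sketch only; the budget and usage are hypothetical.
package main

import (
	"fmt"
	"os"
	"os/exec"
	"strconv"
	"strings"
)

const maxBytes = 300 * 1024 * 1024 // hypothetical 300 MiB budget

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: imagegate <image:tag>")
		os.Exit(2)
	}
	image := os.Args[1]

	// docker image inspect prints the image size in bytes with this format.
	out, err := exec.Command("docker", "image", "inspect", "--format", "{{.Size}}", image).Output()
	if err != nil {
		fmt.Fprintf(os.Stderr, "inspect failed: %v\n", err)
		os.Exit(1)
	}

	size, err := strconv.ParseInt(strings.TrimSpace(string(out)), 10, 64)
	if err != nil {
		fmt.Fprintf(os.Stderr, "could not parse size %q: %v\n", out, err)
		os.Exit(1)
	}

	if size > maxBytes {
		fmt.Fprintf(os.Stderr, "%s is %d bytes, over the %d byte budget\n", image, size, maxBytes)
		os.Exit(1)
	}
	fmt.Printf("%s is %d bytes, within budget\n", image, size)
}
```

Size alone obviously isn't composition, which is why I'm also curious whether people wire something like dive or an SBOM diff into the same stage to flag packages that shouldn't be in the final layer.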

reddit.com
u/Heavy_Banana_1360 — 1 day ago

Just caused a 2-hour production outage because our alerts are total garbage and I trusted them.

We have a monitoring setup with Datadog and PagerDuty that's supposed to catch everything, but it's so flooded with noise from every little blip that nobody pays attention anymore. Alerts don't help, they just create noise, like everyone says, but I thought I was smarter.

Today during a deploy I see the usual flood of low-priority pings about CPU spikes on some noncritical services. I glance at them, think "oh, standard alert storm," ignore them, and proceed with the rollout. The database connection pool starts acting weird, but it's buried under 50 other yellow warnings about latency blips from a promo traffic spike. No critical fires, no red alerts, just the normal chaos.

A few minutes later everything grinds to a halt. Production database fully wedged because the deploy flipped a config that exhausted the pool entirely. Users screaming, orders failing, payments down across three regions. The whole team wakes up in panic mode digging through logs while the alert backlog is thousands deep.

Turns out the one alert that mattered was throttled and demoted because we cranked sensitivities way down last month to stop the 3 AM firehoses. I literally watched the deploy metric climb toward doom and dismissed it as noise. Two hours to roll back manually because the auto rollback got silenced too in the noise reduction. Boss is furious but understanding-ish since it's a team problem, but I feel like an idiot. We lost real revenue and trust.

How do you even fix alert fatigue when it's this bad? Anyone else triggered a disaster by ignoring the spam? Please tell me I'm not alone and give advice before I quit.

reddit.com
u/Heavy_Banana_1360 — 7 days ago

Feels like my team is stuck doing “busy work” instead of actual IT

I've been in IT close to 9 years, and right now we're supporting a ~520-employee company across finance, ops, and customer support. We have a decent setup, good budget, modern tools, nothing is fundamentally broken, but day-to-day work has slowly turned into something I don't recognize anymore.

Most of our time is spent on:

Clearing ticket queues (120-150 per week)

Fixing the same recurring issues across departments

Chasing access requests and permission mismatches

Responding to alerts that don't really tell us what's wrong

We still talk about bigger projects in meetings (improving infrastructure, tightening security, optimizing workflows), but no one actually has time to execute. Even our strongest engineers are stuck doing repetitive support work just to keep things moving. One of my teammates said last week, "I didn't get into IT to reset passwords and chase tickets all day," and honestly that hit. It's not the hours that are killing motivation, it's the lack of meaningful work.

For those in larger orgs, how do you actually free up your team to focus on real engineering again?

reddit.com
u/Heavy_Banana_1360 — 9 days ago