u/CompetitiveStage5901

We upgraded four GKE clusters from 1.27 to 1.28 two weeks ago. No workload changes, no node pool changes, same namespace structure. Our network egress bill jumped 40% across all four clusters overnight.

Digging into the billing export, I see the "Network Internet Egress from Americas to Americas" SKU up 35% and "Network Inter Region Egress" up 50%. But nothing changed in our service mesh or ingress controllers.

Checked the usual suspects: north-south traffic through LoadBalancer services looks flat. No new external endpoints. VPC Flow Logs show the same source/destination pairs as before.

Then I noticed something: GKE 1.28 enables Container Network Interface (CNI) managed node prefixes by default on new node pools. Our node pools weren't new, but the upgrade might have rolled the feature onto them anyway. That feature can add extra control plane communication over the network interface, which might be getting billed as egress even within the same VPC.

I'm also looking at kube-proxy mode – 1.28 defaults to iptables, but if you were on ipvs before, the migration could change packet pathing.

Anyone else seeing this? Is there a Prometheus metric (maybe comparing container_network_transmit_bytes_total against the billing data) that would prove this is control plane overhead? I'd rather not rebuild all four clusters just to test the node prefix theory.
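Here's roughly what I've been running to get a workload-side number to line up against the bill – a rough sketch, assuming a reachable Prometheus scraping cAdvisor (the URL is a placeholder, and the label names depend on your scrape config):

```python
# Sum 24h of pod-level transmit bytes per namespace from Prometheus,
# to compare against billed egress bytes for the same window.
# PROM_URL is a placeholder; the "namespace" label comes from cAdvisor.
import requests

PROM_URL = "http://prometheus.example.internal:9090/api/v1/query"
QUERY = (
    'sum by (namespace) ('
    'increase(container_network_transmit_bytes_total{interface!="lo"}[24h])'
    ')'
)

resp = requests.get(PROM_URL, params={"query": QUERY}, timeout=30)
resp.raise_for_status()
for series in resp.json()["data"]["result"]:
    ns = series["metric"].get("namespace", "<none>")
    gib = float(series["value"][1]) / 2**30
    print(f"{ns}: {gib:.1f} GiB transmitted in 24h")
```

If these totals are flat across the upgrade while the billed bytes aren't, the delta is traffic the pods never see (control plane, node agents) rather than workload egress.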

u/CompetitiveStage5901 — 14 days ago

An ALB is $0.0225 per hour. But leave 50 idle ALBs running for a year and that's $9,855 just "running", and we didn't do anything with them.

Caught this recently in our annual WAR (Well-Architected Review). A dev team created an ALB for testing, forgot about it, and it sat there routing zero traffic for 8 months. It was billed the whole time.

To catch these, we used the ActiveConnectionCount and RequestCount CloudWatch metrics. The engineers running the WAR built a quick Lambda that runs weekly, checks every ALB over the last 7 days, and if both metrics sum to <10, tags the ALB as "zombie" and emails the owner (rough sketch below).
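The core of it looks roughly like this – a sketch, not our exact script (the "zombie" tag key and the <10 threshold are just what we picked; the email step is left out):

```python
# Weekly zombie-ALB check: if RequestCount + ActiveConnectionCount over the
# last 7 days sum to <10, tag the ALB as "zombie".
import boto3
from datetime import datetime, timedelta, timezone

elbv2 = boto3.client("elbv2")
cloudwatch = boto3.client("cloudwatch")

end = datetime.now(timezone.utc)
start = end - timedelta(days=7)

def metric_sum(lb_arn, metric_name):
    # The CloudWatch dimension is the ARN suffix: app/<name>/<id>
    dimension = lb_arn.split("loadbalancer/")[1]
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/ApplicationELB",
        MetricName=metric_name,
        Dimensions=[{"Name": "LoadBalancer", "Value": dimension}],
        StartTime=start,
        EndTime=end,
        Period=86400,
        Statistics=["Sum"],
    )
    return sum(p["Sum"] for p in resp["Datapoints"])

def lambda_handler(event, context):
    zombies = []
    for page in elbv2.get_paginator("describe_load_balancers").paginate():
        for lb in page["LoadBalancers"]:
            if lb["Type"] != "application":
                continue
            arn = lb["LoadBalancerArn"]
            total = (metric_sum(arn, "RequestCount")
                     + metric_sum(arn, "ActiveConnectionCount"))
            if total < 10:
                elbv2.add_tags(ResourceArns=[arn],
                               Tags=[{"Key": "zombie", "Value": "true"}])
                zombies.append(lb["LoadBalancerName"])
    return {"zombies": zombies}  # the real script emails owners from here
```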

While you're at it, you can also take a look at NLB ActiveFlowCount.

Don't forget Classic ELBs. Older accounts have them tucked away, and they don't even show up in the Load Balancers page by default.

Funny thing is, it was a 60-odd-line Python script that saved $12k in the first month.

u/CompetitiveStage5901 — 14 days ago
▲ 7 r/FinOps

We have a single NAT gateway shared across 20 dev namespaces in EKS. Also a single EKS control plane (obviously). The NAT gateway costs $0.045/GB processed plus the hourly fee. The control plane is $0.10/hr.

Right now we just split it equally across all teams. But one team does 80% of the data transfer through NAT. Another team runs only two pods and barely touches it. The equal split feels unfair, but tracking actual usage per pod or per namespace through VPC Flow Logs and tagging is a nightmare.
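To put numbers on how lopsided the equal split is (made-up figures, same shape as ours):

```python
# Toy example: equal split vs usage-weighted split of a shared NAT bill.
nat_monthly_cost = 1000.0  # hourly fee + per-GB processing, combined
gb_by_team = {"team-a": 8000, "team-b": 1500, "team-c": 500}  # GB via NAT

total_gb = sum(gb_by_team.values())
for team, gb in gb_by_team.items():
    equal = nat_monthly_cost / len(gb_by_team)
    weighted = nat_monthly_cost * gb / total_gb
    print(f"{team}: ${equal:.0f}/mo equal vs ${weighted:.0f}/mo usage-weighted")
```

The 80% team pays $333 under the equal split and $800 under the weighted one; everyone else is subsidizing them.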

I tried using VPC Flow Logs + Athena to attribute NAT traffic by source private IP, then map IP to namespace. It works, but the queries are slow and expensive, and it doesn't handle the control plane cost at all.
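For reference, the attribution query is basically this – a sketch with placeholder names (the table follows the AWS example flow-logs DDL; the NAT ENI id and results bucket are placeholders):

```python
# Bytes per source private IP through the NAT gateway ENI, last 7 days.
# Column names (interfaceid, sourceaddress, numbytes, starttime) follow the
# AWS example Athena DDL for flow logs; yours may differ.
import boto3

athena = boto3.client("athena")

QUERY = """
SELECT sourceaddress, SUM(numbytes) AS total_bytes
FROM vpc_flow_logs
WHERE interfaceid = 'eni-0123456789abcdef0'
  AND from_unixtime(starttime) > current_timestamp - INTERVAL '7' DAY
GROUP BY sourceaddress
ORDER BY total_bytes DESC
"""

resp = athena.start_query_execution(
    QueryString=QUERY,
    QueryExecutionContext={"Database": "default"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/natgw/"},
)
# Poll get_query_execution until SUCCEEDED, fetch with get_query_results,
# then join sourceaddress -> pod IP -> namespace (kubectl get pods -A -o wide).
print(resp["QueryExecutionId"])
```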

What's everyone else doing? Do you just accept shared costs as overhead? Or do you have a clean way to charge back per team for things that aren't naturally tagged?

u/CompetitiveStage5901 — 14 days ago
▲ 28 r/FinOps

Not kidding. I ran a script that lists every EC2 instance with its average CPU over the last 30 days. Found 23 instances under 5%. The oldest: a t2.micro running for 14 months at 0.2% CPU. It was a forgotten VPN jumpbox.
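The scan is nothing fancy – roughly this, assuming default credentials (basic monitoring is fine at a one-day period):

```python
# List running EC2 instances with average CPUUtilization under 5% over 30 days.
import boto3
from datetime import datetime, timedelta, timezone

ec2 = boto3.client("ec2")
cloudwatch = boto3.client("cloudwatch")

end = datetime.now(timezone.utc)
start = end - timedelta(days=30)

for page in ec2.get_paginator("describe_instances").paginate(
    Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
):
    for res in page["Reservations"]:
        for inst in res["Instances"]:
            stats = cloudwatch.get_metric_statistics(
                Namespace="AWS/EC2",
                MetricName="CPUUtilization",
                Dimensions=[{"Name": "InstanceId", "Value": inst["InstanceId"]}],
                StartTime=start,
                EndTime=end,
                Period=86400,  # one datapoint per day
                Statistics=["Average"],
            )
            points = stats["Datapoints"]
            if not points:
                continue
            avg = sum(p["Average"] for p in points) / len(points)
            if avg < 5.0:
                print(f"{inst['InstanceId']} ({inst['InstanceType']}): "
                      f"{avg:.1f}% avg CPU over 30 days")
```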

Then I checked unattached EBS volumes. 87 of them. Some from terminated instances that were deleted 2 years ago.

Then RDS snapshots older than 60 days. 400+.
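The EBS and snapshot checks are the same flavor (manual snapshots only – automated ones expire on their own):

```python
# Unattached EBS volumes ("available") and manual RDS snapshots >60 days old.
import boto3
from datetime import datetime, timedelta, timezone

ec2 = boto3.client("ec2")
rds = boto3.client("rds")
cutoff = datetime.now(timezone.utc) - timedelta(days=60)

for page in ec2.get_paginator("describe_volumes").paginate(
    Filters=[{"Name": "status", "Values": ["available"]}]
):
    for vol in page["Volumes"]:
        print(f"unattached: {vol['VolumeId']} ({vol['Size']} GiB)")

for page in rds.get_paginator("describe_db_snapshots").paginate(
    SnapshotType="manual"
):
    for snap in page["DBSnapshots"]:
        created = snap.get("SnapshotCreateTime")
        if created and created < cutoff:
            print(f"stale snapshot: {snap['DBSnapshotIdentifier']} "
                  f"({created:%Y-%m-%d})")
```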

None of this showed up in our monthly cost review because everyone was looking at the big numbers: EC2 total, RDS total. No one drilled into the tail waste.

Wrote a 50-line Python script using boto3 to tag everything obsolete and send a Slack webhook. Took 2 hours. Automated it weekly.
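The tag-and-notify part is the boring bit – something like this, with a placeholder webhook URL:

```python
# Tag flagged resources and post a summary to Slack via an incoming webhook.
import json
import urllib.request
import boto3

ec2 = boto3.client("ec2")
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/T000/B000/XXXX"  # placeholder

def tag_and_notify(resource_ids, reason):
    # create_tags works for instances, volumes, EBS snapshots, AMIs, etc.
    ec2.create_tags(
        Resources=resource_ids,
        Tags=[{"Key": "obsolete", "Value": reason}],
    )
    payload = {"text": f"Tagged {len(resource_ids)} resources as obsolete "
                       f"({reason}): {', '.join(resource_ids)}"}
    req = urllib.request.Request(
        SLACK_WEBHOOK_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)

tag_and_notify(["i-0abc123def4567890"], "idle-30d")
```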

Now we save ~$16k/month. Literally just turning off and deleting stuff no one needed.

The lesson: before you buy Savings Plans or commit to anything, hunt the low-hanging zombie resources. They're everywhere.

u/CompetitiveStage5901 — 14 days ago