u/samehmeh

Liveness probes sharing the cpu-bound thread pool keep killing your pods

Here's a fictional story:

An SRE at a B2B SaaS company watched the same payment service get killed every ninety seconds for two days. Liveness probes timing out. CPU pinned at 100 percent processing a backlog. Kubernetes did exactly what it was told: restart the pod, which dropped the in-flight work, which made the backlog worse. Classic restart loop dressed up as a health problem.

The team tuned timeoutSeconds up. Then failureThreshold. Then initialDelaySeconds. The probe still failed, because it was riding the same CPU-bound worker pool as the real traffic. Under load, the probe always loses.

Here's the pattern I see people miss.

  1. Probes are not free. A liveness probe is just an HTTP call your app has to answer. If the handler shares the thread pool, goroutine pool, or event loop with the work that's saturating CPU, the probe gets queued behind real requests. Bumping the timeout doesn't fix queueing, it just delays the inevitable kill.
  2. Split the probe onto a dedicated lightweight handler. Different port, different listener, different executor. In Go that means a second http.Server on its own goroutine (a sketch follows the probe config below). In Java that means a separate Jetty connector with its own thread pool. In Node, a worker thread or at minimum a handler that does zero awaitable work. The probe should answer in single-digit milliseconds even when the main app is buried.

# main API on 8080, liveness on 8081 with its own server
livenessProbe:
  httpGet:
    path: /healthz
    port: 8081
  periodSeconds: 10
  timeoutSeconds: 1
  failureThreshold: 3
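
A minimal Go sketch of that second server, assuming net/http and log are imported (the port and variable names are illustrative):

// second http.Server just for probes: its own port, its own goroutine,
// so it never queues behind the saturated main worker pool
health := http.NewServeMux()
health.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
    w.WriteHeader(http.StatusOK)
})
probeSrv := &http.Server{Addr: ":8081", Handler: health}
go func() { log.Fatal(probeSrv.ListenAndServe()) }()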

  3. Liveness and readiness do different jobs. Liveness answers "is this process wedged, restart me." Readiness answers "should I receive traffic right now." A pod chewing through a backlog is alive but not ready. Mark it NotReady so the Service stops sending new requests, but keep liveness green so Kubernetes doesn't murder it mid-batch. Two probes, two endpoints, two questions.
  4. Liveness should test the process, not its dependencies. If /healthz calls Redis, Postgres, and three downstream APIs, you've built a network outage detector that uses pod restarts as its alert mechanism. When the database hiccups, every replica fails liveness at once and the cluster restarts your entire fleet. The liveness handler should return 200 if the process can serve a trivial request. That's it.

// liveness: dumb and fast
mux.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
    w.WriteHeader(http.StatusOK)
})

// readiness: actual dependency checks
mux.HandleFunc("/ready", func(w http.ResponseWriter, r *http.Request) {
    if db.Ping() != nil || queueDepth() > maxBacklog {
        w.WriteHeader(http.StatusServiceUnavailable)
        return
    }
    w.WriteHeader(http.StatusOK)
})

  5. Watch for restart loops in your metrics, not your logs. kube_pod_container_status_restarts_total going up while container_cpu_cfs_throttled_seconds_total is also climbing is the signature of this exact bug. If you see both, your probe is competing with your work for CPU and losing. Don't tune timeouts. Move the probe.
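
A rough PromQL pair for spotting it (the window sizes are arbitrary, adjust to your scrape interval):

increase(kube_pod_container_status_restarts_total[30m]) > 0
rate(container_cpu_cfs_throttled_seconds_total[5m]) > 0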

Done right: liveness is a separate listener on a separate port doing nothing but returning 200. Readiness is where the dependency logic lives. Restarts become rare events that mean something is actually broken, not a Tuesday afternoon.

Curious whether anyone has war stories from the "we made liveness check the database" era.

u/samehmeh — 2 days ago

Stop treating devops as a prerequisite for mlops

A junior engineer at a fintech told her manager she wanted to do MLOps, so she spent four months grinding Terraform, Ansible, and Kubernetes before touching a single model. When the platform team finally handed her a training pipeline, she froze. She knew how to provision a GPU node but had never run mlflow run or shipped a model artifact end to end.

This is the trap. People treat MLOps like a sequel to DevOps, like you have to finish one before you unlock the other. You don't. The DevOps subset that actually matters for MLOps is small, and you can pick it up next to the ML work instead of before it.

Here's what I see people get wrong.

  1. Treating DevOps as a prerequisite gate.

Four months of Terraform modules and Ansible roles will not teach you what a feature store is, why training-serving skew eats models alive, or how to version a dataset. You can be world-class at provisioning a GKE cluster and still have no idea how to get a model from a Jupyter notebook into a registry. The gate is imaginary. Walk through it.

  2. Skipping the ML side entirely and calling infra work "MLOps."

The opposite failure mode. Someone wires up Argo Workflows, slaps an mlflow container into it, and writes "MLOps Engineer" on LinkedIn. Then a data scientist asks why the model drifts after two weeks in production and the conversation stops. If you cannot reason about data drift, label leakage, or why your offline metrics disagree with online metrics, you are doing platform work with ML-flavored YAML. Real MLOps requires understanding what the pipeline is moving, not just that it moves.

  3. Learning tools in isolation instead of shipping one pipeline end to end.

The fastest way to find your gaps is to build the smallest possible real pipeline:

data ingest -> training job -> model registry -> serving endpoint -> monitoring

Pick boring tools. dvc or plain S3 versioning for data. A scikit-learn or PyTorch model trained in a container. mlflow for tracking and registry. FastAPI behind a Docker image for serving. Prometheus scraping inference latency and prediction distributions. Run it locally with docker-compose first, then move one piece at a time to a real cluster.
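
A minimal docker-compose sketch of that local-first stack; the mlflow image is the official one, while ./serving (your FastAPI Dockerfile) and prometheus.yml are assumed to exist in your repo:

# docker-compose.yml
services:
  mlflow:
    image: ghcr.io/mlflow/mlflow:latest
    command: mlflow server --host 0.0.0.0 --port 5000
    ports: ["5000:5000"]
  serving:
    build: ./serving                 # hypothetical FastAPI app wrapping the model
    environment:
      MLFLOW_TRACKING_URI: http://mlflow:5000
    ports: ["8000:8000"]
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports: ["9090:9090"]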

The minute you try to ship that, the DevOps subset you actually need shows up on its own. You will need a CI job that builds the training image. You will need a way to promote a model from staging to prod without copy-pasting artifacts. You will need infra as code so the serving deployment is reproducible. None of that needs four months of upfront study. It needs one pipeline that is bothering you.

  4. Confusing depth of tooling with depth of understanding.

Knowing every flag in kubectl does not make you better at MLOps. Knowing why your training job OOMs at batch size 64 but not 32, why your tf.data pipeline starves the GPU, why your model's p99 latency spikes when the feature service cold-starts, that makes you better at MLOps. Tools are interchangeable. The reasoning is not.

  5. Waiting to feel "ready."

Nobody feels ready. The data scientists you will work with do not feel ready to debug a Kubernetes networking issue. The platform engineers do not feel ready to explain why a model's AUC dropped. MLOps is the seam between two fields where everyone is slightly out of their depth, and the people who get good at it are the ones who shipped something ugly first and cleaned it up after.

Done right: pick one model, one dataset, one serving target, and get it running end to end this month. The DevOps gaps will name themselves as you go, and you will learn them in the context where they matter instead of in the abstract.

What is the smallest pipeline you have actually shipped end to end, and where did it break first?

u/samehmeh — 2 days ago

Golden paths should translate kubernetes errors at the boundary

A junior engineer at a fintech tried to ship a service through the company's golden path. The deploy failed. The platform spit back a forty-line Kubernetes event chain about admission webhooks, CEL evaluation, and a missing label selector. Three hours later, a senior on the platform team translated it: the pod template was missing one annotation.

That's not a developer skill issue. That's a platform bug.

Here's what I see people get wrong about golden paths. They expose the raw cluster errors, call it transparency, and assume the developer will figure it out. Real golden paths catch those stack traces at the boundary and rewrite them into something a human can act on.

A few things that separate a real golden path from a thin wrapper:

  1. Translate errors at the boundary, not in Confluence.

If your validating webhook rejects a deploy because a team label is missing, the developer should never see the webhook name. The platform should catch the rejection and surface:

Error: deployment.yaml is missing the required `team` label.
Add it under spec.template.metadata.labels and re-run `platform deploy`.

Not:

admission webhook "vpod.kb.io" denied the request: ValidatingAdmissionPolicy 
'require-team-label' with binding 'require-team-label-binding' denied request: 
expression 'has(object.spec.template.metadata.labels.team)' evaluated to false

The second one is correct. It's also useless to a service developer who has never read a CEL expression in their life.

  2. Build an error catalog, treat it like product copy.

Every rejection your platform can produce should map to a short, actionable message. Keep it in code, not a wiki. Something like:

type PlatformError struct {
    Message, Fix, Docs string
}

var errorMap = map[string]PlatformError{
    "require-team-label": {
        Message: "Missing `team` label on deployment.",
        Fix:     "Add `team: <your-team>` under spec.template.metadata.labels.",
        Docs:    "platform.internal/errors/team-label",
    },
}

When the webhook fires, the CLI looks up the policy name and prints the friendly version. Raw event stays in the logs for the platform team. Developers get the one line they need.
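
A sketch of that lookup, assuming the policy name can be pulled out of the raw apiserver message with a regexp; decodeAdmissionError is a made-up name, and "fmt" and "regexp" are assumed imported:

var policyNameRe = regexp.MustCompile(`ValidatingAdmissionPolicy '([^']+)'`)

// decodeAdmissionError rewrites a raw rejection into the friendly version,
// falling back to the raw text when the policy isn't in the catalog.
func decodeAdmissionError(raw string) string {
    m := policyNameRe.FindStringSubmatch(raw)
    if m == nil {
        return raw
    }
    pe, ok := errorMap[m[1]]
    if !ok {
        return raw
    }
    return fmt.Sprintf("Error: %s\nFix: %s\nDocs: %s", pe.Message, pe.Fix, pe.Docs)
}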

  3. Validate locally before the cluster ever sees it.

Half of those forty-line errors should never reach kube-apiserver. Run the same OPA, Kyverno, or CEL policies in `platform validate` so the developer gets the rewritten error in two seconds at their laptop, not three minutes into a CI job.

$ platform validate
✗ deployment.yaml:12 missing required label `team`
  fix: add `team: payments` under spec.template.metadata.labels

Same rules, same source of truth, faster feedback loop.

  4. Measure time-to-decode, not just deploy success rate.

Track how long it takes a developer to go from a failed deploy to a fixed deploy. If the median is over ten minutes, your error messages are the bottleneck, not your pipeline. Sit with a developer for an afternoon and watch them hit a real failure. The list of things to fix writes itself.

  5. Stop calling raw errors "transparency."

Exposing the underlying primitives is fine for the platform team's debug mode. It is not a feature for application developers. Transparency is "here is exactly what to do." Dumping a Kubernetes event chain is the opposite: it pushes your job onto someone who does not have the context to do it.

Done right, a golden path looks boring from the outside. Deploy works, or you get one sentence telling you what to change. The forty-line stack traces still exist, they just live in the platform team's logs where they belong.

What error messages from your internal platform have you had to decode this month, and which one would you rewrite first?

u/samehmeh — 3 days ago

1. Provider configs inside modules

# modules/vpc/main.tf, don't do this
provider "aws" {
  region = "us-east-1"
  assume_role { role_arn = "arn:aws:iam::123:role/foo" }
}

Works in one account, breaks the second you point it at another. Modules should declare required_providers and let the root pass them in via providers = { aws = aws.workload }. A module shouldn't know what account it's running against.
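
The fix, sketched; the workload alias and module layout are illustrative:

# modules/vpc/versions.tf
terraform {
  required_providers {
    aws = { source = "hashicorp/aws" }
  }
}

# root module
provider "aws" {
  alias  = "workload"
  region = var.region
}

module "vpc" {
  source    = "./modules/vpc"
  providers = { aws = aws.workload }
}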

2. count on resources with identity

resource "aws_iam_user" "this" {  
  count = length(var.users)                                                                                                                                                     
  name  = var.users[count.index]
}                                                                                                                                                                               

Remove a user from the middle of the list and every user after it gets destroyed and recreated, because the indexes shift. Use for_each over a set/map, keyed by name, and reorder freely:

resource "aws_iam_user" "this" {                                                                                                                                              
  for_each = toset(var.users)                
  name     = each.key                                                                                                                                                           
}                                 

count is only safe for the var.enabled ? 1 : 0 toggle pattern.

3. Outputs that return whole resource objects

output "bucket" { value = aws_s3_bucket.this }                                                                                                                                

Feels convenient. Six months in you can't change anything on that bucket without breaking three downstream modules pulling random attributes. The output block is a contract.
Export attributes, not resources:

output "bucket_arn" { value = aws_s3_bucket.this.arn }
output "bucket_id"  { value = aws_s3_bucket.this.id }                                                                                                                           

4. Implicit cross-module deps

Module A creates an IAM role. Module B attaches a policy to it by hardcoded name. On the first apply, B sometimes runs before A finishes. A retry works, so nobody investigates.

Terraform builds the graph from references. No reference = no edge = ordering by luck. Pass the ARN through as a real input, or add depends_on = [module.iam] on the consumer.
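
A sketch of the explicit version; the module and output names are illustrative:

module "iam" {
  source = "./modules/iam"
}

module "app" {
  source   = "./modules/app"
  role_arn = module.iam.role_arn   # the reference creates the graph edge
}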

Done right: thin modules, explicit input/output contracts, no ambient provider assumptions, identity-keyed resources, explicit cross-module deps.

All in the docs. Review catches syntax, misses design.

What other module-level stuff have you hit that only showed up in env #2 or #3?

u/samehmeh — 10 days ago

Most Terraform disasters I have seen trace back to one decision made in week one.

State file boundaries.

One state per environment sounds right when you are starting out. But once your setup grows, that single state becomes too large a blast radius.

One state per account, per region, per logical stack is what survives year three.

Here is why: blast radius.

Last year I watched a team destroy their staging Kubernetes cluster by accident. They ran terraform destroy in the wrong directory with credentials that had access to too much. The same state file covered RDS, EKS, and Route53.

Everything was gone.

Restore from backup took 14 hours.

The fix is not being more careful. The fix is making the careless mistake cost less.

Split your state so a bad apply in sandbox cannot touch prod.

Pin your backend bucket per account, not one shared bucket with key prefixes. Use separate IAM roles so the sandbox pipeline literally cannot write to the prod state bucket.

Directory layout that enforces this:

terraform/
  prod/
    us-east-1/
      networking/
      compute/
      data/
  sandbox/
    us-east-1/
      networking/
      compute/
      data/

Each leaf directory is a separate root module with its own state. Each account has its own S3 backend. The sandbox CI role has no access to prod buckets.
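
Each leaf gets its own backend block, something like this; the bucket and table names are made up:

# terraform/prod/us-east-1/networking/backend.tf
terraform {
  backend "s3" {
    bucket         = "tf-state-prod-123456789012"   # one bucket per account
    key            = "us-east-1/networking/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "tf-lock-prod"                 # state locking
  }
}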

Terraform workspaces solve a different problem. They create separate state files, but they usually share the same backend configuration and do not give you strong access isolation by themselves.

They are not a replacement for separate accounts, separate state backends, and separate IAM roles.

State isolation is the cheapest insurance you will ever buy. It costs an extra 10 minutes of setup and saves you from the 14-hour restore window.

How do you split your Terraform state across environments?

u/samehmeh — 15 days ago