r/platformengineering

i feel like the "Golden Path" was built for people way smarter than me lol

my company just rolled out this big internal platform and it’s supposed to be "self-service," but i feel like i'm failing at it.

every time my PR fails to build, the error message is like 10 pages of k8s events and helm chart errors. i try to fix it myself because i don't want to be the guy who is always pinging the platform team on slack, but i end up spending 4 hours getting nowhere before i finally give up and ask for help.

is it supposed to be this hard to figure out why a build failed? i feel like a burden to the platform team. do your juniors actually self-serve their way out of broken pipelines, or are you guys also stuck answering "why did my build fail" questions all day?

i want to get better but the logs feel like they're written in another language

reddit.com
u/Beneficial-Minute142 — 7 days ago

Reality check: am I building something useful, or just complicated trash and wasting my time?

I am a DevOps engineer at a small company; all our infrastructure is on-prem, and we use GitLab, ArgoCD, and Kubernetes. About 3 months ago I started building what was supposed to be a small Lens-style Kubernetes dashboard in Go + React: Kubernetes UI, SSO, maybe some basic cluster operations. I thought it would be a week of work.

It has grown into something much broader, and I would like a reality check from people who have actually built or operated internal platforms.

The current product is a Go + React monolith:

  • Go/Gin API serving a React SPA
  • Postgres for product data
  • Redis/event bus for fan-out and async work
  • background job worker
  • Kubernetes dynamic/typed clients and informer caches
  • plugin-style integrations for GitOps, SCM, metrics, dashboards, and auth
  • one separate controller-runtime binary for platform-owned controllers

It is not a microservices architecture. The idea is still one cohesive product surface, not a collection of small services.
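
To give the "plugin-style integrations" bullet some shape, the adapter seams look roughly like this (heavily simplified; names are illustrative, not the real code):

package adapters

import "context"

// SyncStatus is what the product layer sees, regardless of the backing tool.
type SyncStatus struct {
    Synced   bool
    Revision string
}

// GitOpsAdapter abstracts ArgoCD today, anything ArgoCD-shaped later.
type GitOpsAdapter interface {
    SyncStatus(ctx context.Context, app string) (SyncStatus, error)
    TriggerSync(ctx context.Context, app string) error
}

// SCMAdapter abstracts GitLab today.
type SCMAdapter interface {
    OpenMergeRequest(ctx context.Context, repo, branch, title string) (url string, err error)
}

Product code talks only to these interfaces; the GitLab/ArgoCD specifics live behind them, which is what the move toward neutral adapters below refers to.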

What it can do today, or is close to doing:

  • Kubernetes resource browsing/editing, logs, events, exec, CRD discovery, and YAML/template helpers
  • application and service modeling/catalog
  • service discovery from running Kubernetes resources, including grouping, review, curation, and promotion into the service model
  • onboarding flows for new services, currently more of a hardcoded golden path than a generic workflow engine
  • deployment targets, traits, releases, promotions, and basic pipeline visibility
  • GitOps/SCM integration, currently GitLab/ArgoCD-shaped but being moved toward neutral adapter interfaces
  • reliability scorecards, SLI checks, incident tracking, and service health views
  • access request workflows and product/Kubernetes RBAC separation
  • host automation through an Ansible-style subsystem
  • audit/security work, including tenant scoping, idempotency, webhook replay protection, provenance, and better state-transition rules
  • a cloud shell feature implemented as a Kubernetes CRD plus controller-runtime reconciler; each user gets a pod with a custom image, PVC-backed home, and imported kubeconfig (rough type sketch below)
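
For flavor, the cloud-shell CRD types are shaped roughly like this (kubebuilder-style; field names simplified from the real thing):

package v1alpha1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// CloudShell is per-user shell intent; the reconciler turns it into a pod.
type CloudShell struct {
    metav1.TypeMeta   `json:",inline"`
    metav1.ObjectMeta `json:"metadata,omitempty"`

    Spec   CloudShellSpec   `json:"spec"`
    Status CloudShellStatus `json:"status,omitempty"`
}

type CloudShellSpec struct {
    User             string `json:"user"`             // one shell per user
    Image            string `json:"image"`            // custom shell image
    HomePVC          string `json:"homePVC"`          // PVC-backed home directory
    KubeconfigSecret string `json:"kubeconfigSecret"` // imported kubeconfig, mounted into the pod
}

type CloudShellStatus struct {
    Phase   string `json:"phase,omitempty"` // Pending | Ready | Failed
    PodName string `json:"podName,omitempty"`
}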

The product direction is to make this feel less like "many admin tabs" and more like a platform control plane that answers:

  • What applications and services do we own?
  • Where are they running?
  • What changed, who changed it, and why?
  • What needs review or approval?
  • Is the runtime healthy?
  • How do developers safely create, operate, and troubleshoot services?
  • Which external systems are backing each workflow?

Planned / in-progress areas:

  • stronger tenant isolation and auditability
  • stable service identity that survives repo moves, cluster moves, and rediscovery
  • vendor-neutral adapters for SCM, GitOps, CI/CD, registry, metrics, dashboards, automation, secrets, and ticketing
  • renaming the Ansible surface into broader Host Management
  • host inventory via SSH/Ansible-style discovery
  • host role discovery with operator confirmation
  • database fleet monitoring
  • pipeline/job classification and better pipeline views
  • analytics and reliability dashboards
  • a cleaner modular-monolith structure so future extraction is possible

The architectural question I am wrestling with is whether the next step should be to treat the platform more like a Kubernetes-native control plane.

The rough long-term architecture would be:

  • Postgres remains the store for queryable product data: audit logs, releases, workflow history, scan runs, incidents, dashboards, and search/list views.
  • Kubernetes CRDs become the source of truth for desired/current state that needs reconciliation: applications, services, deployment targets, environments, hosts, database targets, pipelines, promotions, etc.
  • Controllers reconcile those CRDs and write status.
  • Projectors watch CRDs and write current-state projections back into Postgres for the UI/API.
  • Event consumers write append-only facts into Postgres for audit/history/provenance.
  • Managed clusters are reached remotely via kubeconfigs; platform CRDs are not installed into every workload cluster.
  • Crossplane could be the default adapter for managed cloud resources like databases, object stores, IAM, Kafka, etc., but not a hard requirement for on-prem.

The part that feels right to me: a lot of the domain is naturally desired-state/reconciliation-shaped. Onboarding, discovery, host inventory, database monitoring, deployment targets, drift, promotions, and cloud resources all have some "observe current state, compare to desired state, converge, record status" flavor.
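
Concretely, the reconcile shape I keep reaching for looks like this (controller-runtime; DeploymentTarget and the observe/converge helpers are illustrative, not lifted from the codebase):

import (
    "context"
    "time"

    ctrl "sigs.k8s.io/controller-runtime"
    "sigs.k8s.io/controller-runtime/pkg/client"
)

type DeploymentTargetReconciler struct {
    client.Client
}

func (r *DeploymentTargetReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    var target platformv1.DeploymentTarget // hypothetical CRD type
    if err := r.Get(ctx, req.NamespacedName, &target); err != nil {
        return ctrl.Result{}, client.IgnoreNotFound(err)
    }

    // Observe: read current state from the managed cluster via its kubeconfig.
    current, err := r.observe(ctx, &target)
    if err != nil {
        return ctrl.Result{}, err
    }

    // Compare + converge: push the managed cluster toward the desired spec.
    if err := r.converge(ctx, &target, current); err != nil {
        return ctrl.Result{}, err
    }

    // Record: write status; a projector watching this CRD mirrors it into Postgres for the UI.
    target.Status.Phase = "Ready"
    if err := r.Status().Update(ctx, &target); err != nil {
        return ctrl.Result{}, err
    }
    return ctrl.Result{RequeueAfter: 5 * time.Minute}, nil
}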

The part that worries me: this could also become an over-engineered dashboard with CRDs, controllers, projections, Crossplane, and a lot of machinery before the product proves enough value.

So I would love a blunt reality check:

  • Does this sound like a useful platform product direction, or am I solving a problem that does not really exist?
  • Which parts would you prioritize first if the goal is value for real platform teams?
  • Are CRDs/controllers a good eventual fit here, or should they stay limited to a few internal features?
  • Would you expose public intent CRDs to users/GitOps, or keep CRDs purely internal behind the HTTP API?
  • Is Crossplane integration useful, or does it add too much operational burden?
  • What would make you immediately skeptical of a platform like this?

I am not looking for encouragement. If this sounds like the wrong direction, I would rather hear that now. For context, I am currently the sole architect/developer working on this at 50% capacity. The codebase is around 95k lines of Go and 30k lines of TypeScript. More than half of the Go code and all of the TypeScript code was written by AI.

EDIT — update on "why not just use X?"

A few alternatives keep coming up. Quick take on each:

Headlamp / Lens / OpenLens. Kubernetes resource browsers. Excellent at that. Not an IDP — no catalog, no onboarding, no GitOps integration, no SRE/drift/audit workflows, no host inventory beyond what's in the cluster. My K8s browsing surface is roughly Headlamp-equivalent, but that's maybe 15% of what I'm building.

Backstage. This is the hardest comparison and the one I think about most.

Caveat up front: I have never run Backstage in production. My exposure is ~30 minutes hands-on plus docs. So everything below is how I read its design, not lived experience. Correct me if I'm wrong on any of this — that's part of why I'm posting.

How I read the distinction in one line: Backstage looks like a portal/plugin framework. I'm trying to build an opinionated operational platform/control plane with batteries included.

Concretely, from the docs:

  • Backstage's identity reads as portal + plugin host. Mine is a control plane that ships the workflows themselves natively — onboarding, promotion, drift detection, reliability scorecards, governance, host automation — rather than as plugins you bring.
  • Backstage's backend is TypeScript/Node. Mine is Go. That matters more than language religion: client-go, controller-runtime, informer caches, CRD codegen, and direct Kubernetes type registration are first-class in Go. Runtime-aware catalog entries — showing live K8s state, drift, health, recent operational actions next to a service — feel materially easier when the backend speaks Kubernetes natively (tiny informer sketch after this list). The cloud-shell feature I mentioned (CRD + controller-runtime reconciler) fits existing Go patterns; I'd expect it to be a heavier lift in a Node backend.
  • Backstage looks catalog-centric. Mine is workflow-centric. Catalog is one surface among several; the destination is "from intent to deployed-and-operable service in one continuous flow," not "register and link your existing services." This framing in particular I'd love pushback on — if Backstage is actually closer to workflow-centric than its docs suggested to me, that changes the comparison.
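
To make the runtime-aware-catalog point in the second bullet concrete, the Go side is a few lines of client-go (catalog.SetRuntimeState is a made-up stand-in for my projection layer):

import (
    "context"
    "time"

    appsv1 "k8s.io/api/apps/v1"
    "k8s.io/client-go/informers"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/cache"
)

// watchDeployments streams live Deployment state into catalog entries.
func watchDeployments(ctx context.Context, clientset kubernetes.Interface) {
    factory := informers.NewSharedInformerFactory(clientset, 30*time.Second)
    inf := factory.Apps().V1().Deployments().Informer()
    inf.AddEventHandler(cache.ResourceEventHandlerFuncs{
        UpdateFunc: func(_, newObj interface{}) {
            d := newObj.(*appsv1.Deployment)
            desired := int32(0)
            if d.Spec.Replicas != nil {
                desired = *d.Spec.Replicas
            }
            // Project ready/desired replicas next to the service's catalog entry.
            catalog.SetRuntimeState(d.Namespace, d.Name, d.Status.ReadyReplicas, desired)
        },
    })
    factory.Start(ctx.Done())
    factory.WaitForCacheSync(ctx.Done())
}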

If you've already invested in Backstage and it's working for you, my project isn't for you. Who it's for is something I'm still figuring out — I started by building for my own org and walked into a much bigger surface than I expected. Whether the audience generalizes beyond my own situation is part of what I'm asking in this thread.

Port / Humanitec / Cycloid. Closer in product shape than Headlamp or Backstage. Mostly SaaS/managed; mine is self-hosted, on-prem-friendly (my org runs everything in-cluster, behind a corporate boundary, no internet egress required). Direct K8s ops surface is much deeper. Less convenient than SaaS; more controllable, and you own the data.

Crossplane. A substrate, not an alternative. The plan is to use Crossplane as the default adapter for cloud-managed resources and wrap it with the UX, workflows, catalog, and governance Crossplane deliberately doesn't ship. Deployments that don't use cloud resources don't need Crossplane installed.

What I'm actually trying to solve. In my org and most orgs I've worked in, the workflow that matters — go from "we want service X in env Y" to "deployed, healthy, observable, governed" — is fragmented across GitLab + ArgoCD + Grafana + ad-hoc scripts + Confluence spaces + runbooks + an Excel sheet of who owns what + email templates. Backstage helps the discovery side, Headlamp helps the cluster-ops side. Nothing I've found takes responsibility for the workflow continuity, on-prem, with a Go backend that speaks Kubernetes natively.

That's the bet. Whether it's worth making is the actual question this thread is for.

reddit.com
u/Legitimate-Crazy-298 — 4 days ago

Are AI coding agents creating a new platform problem inside engineering orgs?

I’m trying to understand how larger engineering teams are handling the operational side of AI coding tools.

A lot of teams seem to be adopting Copilot, Cursor, Claude Code, internal agents, etc., but I’m curious what happens after the first wave of adoption:

- Who decides which tools are allowed?

- How do you control repo/app access?

- How do you manage shared context, prompts, rules, and coding standards?

- Are teams tracking output quality, security issues, cost, or model usage?

- Does security/compliance care yet?

- Is this owned by platform engineering, DevEx, security, or individual teams?

I’m exploring whether there’s a real need for an “AI engineering control plane” for engineering orgs, or whether this is still too early / already solved internally.

For people at teams of 20+ engineers using AI coding tools: what’s actually painful here?

reddit.com
u/Huge-Advertising-951 — 5 days ago
▲ 5 r/platformengineering+1 crossposts

Golden paths should translate kubernetes errors at the boundary

A junior engineer at a fintech tried to ship a service through the company's golden path. The deploy failed. The platform spit back a forty-line Kubernetes event chain about admission webhooks, CEL evaluation, and a missing label selector. Three hours later, a senior on the platform team translated it: the pod template was missing one annotation.

That's not a developer skill issue. That's a platform bug.

Here's what I see people get wrong about golden paths. They expose the raw cluster errors, call it transparency, and assume the developer will figure it out. Real golden paths catch those stack traces at the boundary and rewrite them into something a human can act on.

A few things that separate a real golden path from a thin wrapper:

  1. Translate errors at the boundary, not in Confluence.

If your validating webhook rejects a deploy because a team label is missing, the developer should never see the webhook name. The platform should catch the rejection and surface:

Error: deployment.yaml is missing the required `team` label.
Add it under spec.template.metadata.labels and re-run `platform deploy`.

Not:

admission webhook "vpod.kb.io" denied the request: ValidatingAdmissionPolicy 
'require-team-label' with binding 'require-team-label-binding' denied request: 
expression 'has(object.spec.template.metadata.labels.team)' evaluated to false

The second one is correct. It's also useless to a service developer who has never read a CEL expression in their life.

  2. Build an error catalog, treat it like product copy.

Every rejection your platform can produce should map to a short, actionable message. Keep it in code, not a wiki. Something like:

// Keyed by ValidatingAdmissionPolicy name; one entry per rejection the platform can produce.
type PlatformError struct {
    Message string // what went wrong, in plain language
    Fix     string // the exact change to make
    Docs    string // short link for the long version
}

var errorMap = map[string]PlatformError{
    "require-team-label": {
        Message: "Missing `team` label on deployment.",
        Fix:     "Add `team: <your-team>` under spec.template.metadata.labels.",
        Docs:    "platform.internal/errors/team-label",
    },
}

When the webhook fires, the CLI looks up the policy name and prints the friendly version. Raw event stays in the logs for the platform team. Developers get the one line they need.
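
The lookup itself is small. A sketch, reusing errorMap above and assuming the ValidatingAdmissionPolicy message format shown earlier (the regex is naive; match however your webhook actually formats rejections):

import (
    "fmt"
    "regexp"
)

var policyRe = regexp.MustCompile(`ValidatingAdmissionPolicy '([^']+)'`)

// decode maps a raw admission rejection to its catalog entry, falling back to raw.
func decode(raw string) string {
    if m := policyRe.FindStringSubmatch(raw); m != nil {
        if e, ok := errorMap[m[1]]; ok {
            return fmt.Sprintf("Error: %s\n  fix: %s\n  docs: %s", e.Message, e.Fix, e.Docs)
        }
    }
    // Unknown rejection: show it raw and log it so the catalog grows an entry.
    return raw
}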

  3. Validate locally before the cluster ever sees it.

Half the forty-line errors should never reach kube-apiserver. Run the same OPA, Kyverno, or CEL policies in `platform validate` so the developer gets the rewritten error in two seconds on their laptop, not three minutes into a CI job.

$ platform validate
✗ deployment.yaml:12 missing required label `team`
  fix: add `team: payments` under spec.template.metadata.labels

Same rules, same source of truth, faster feedback loop.
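
If your policies are plain CEL, the exact same expression can run client-side with cel-go, no cluster round-trip (a sketch, assuming the manifest is already decoded into a map):

import "github.com/google/cel-go/cel"

// checkTeamLabel evaluates the same CEL expression the cluster policy uses.
func checkTeamLabel(obj map[string]interface{}) (bool, error) {
    env, err := cel.NewEnv(cel.Variable("object", cel.DynType))
    if err != nil {
        return false, err
    }
    ast, iss := env.Compile(`has(object.spec.template.metadata.labels.team)`)
    if iss.Err() != nil {
        return false, iss.Err()
    }
    prg, err := env.Program(ast)
    if err != nil {
        return false, err
    }
    out, _, err := prg.Eval(map[string]interface{}{"object": obj})
    if err != nil {
        return false, err
    }
    return out.Value() == true, nil
}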

  4. Measure time-to-decode, not just deploy success rate.

Track how long it takes a developer to go from a failed deploy to a fixed deploy. If the median is over ten minutes, your error messages are the bottleneck, not your pipeline. Sit with a developer for an afternoon and watch them hit a real failure. The list of things to fix writes itself.

  5. Stop calling raw errors "transparency."

Exposing the underlying primitives is fine for the platform team's debug mode. It is not a feature for application developers. Transparency is "here is exactly what to do." Dumping a Kubernetes event chain is the opposite: it pushes your job onto someone who does not have the context to do it.

Done right, a golden path looks boring from the outside. Deploy works, or you get one sentence telling you what to change. The forty-line stack traces still exist; they just live in the platform team's logs where they belong.

What error messages from your internal platform have you had to decode this month, and which one would you rewrite first?

reddit.com
u/samehmeh — 3 days ago
▲ 12 r/platformengineering+2 crossposts

Burn - K8s cost waste by namespace and pod. Just kubectl, no deploy

Found this as a lightweight alternative to OpenCost. I didn't want to deploy anything into the cluster, just get quick insights into where the money is going. It runs locally via kubectl, pulls real pricing from AWS/Azure/GCP, and breaks down costs by namespace and pod.

github.com
u/tcpud — 11 hours ago