r/devops

▲ 18 r/devops

How much of your Terraform, CloudFormation, Bicep etc is actually being written by AI agents in prod?

Context for why I'm asking: I maintain a CLI tool in the IaC space and just shipped a major release that assumes agents are now the primary caller (e.g. predicate flags so the agent doesn't compose jq | python | wc pipelines, output format that strips JSON's redundant field names) rather than humans at a terminal. Before I keep building in that direction, I want to sanity-check with this sub: is "agents writing IaC in prod" actually a thing yet, or am I betting on a future that's still a year out?

u/alikhajeh1 — 9 hours ago

▲ 113 r/devops

How are you actually upskilling to survive the shift from traditional DevOps to Platform Eng / MLOps?

Hey everyone,
I’m currently a Cloud/DevOps engineer. With AI rapidly automating things like boilerplate YAML, standard CI/CD pipelines, and basic log analysis, I'm trying to be proactive about my next career move.
For those already adapting:
Where do you see traditional DevOps going over the next few years?
What do you think is the most reliable, high-demand career shift adjacent to DevOps right now? (e.g., Platform Engineering, MLOps, DevSecOps?)
Would love to hear your thoughts on where to focus my upskilling. Thanks!

u/Fantastic-Leg-5806 — 12 hours ago

DevOps career advice

Hi there,

My name is Cooper, and I’m currently building my path toward becoming a DevOps Engineer.

I’m studying through self-learning programs such as Harvard University CS50x and the IBM DevOps & Software Engineering program.

My roadmap also includes preparing for the Certified Kubernetes Administrator (CKA) certification and the AWS Certified DevOps Engineer – Professional certification.

I’m focusing on building real projects on GitHub to gain practical experience in Linux, Docker, CI/CD, cloud, automation, and Kubernetes.

From your experience, do candidates with strong GitHub projects and certifications still have a real chance to compete in DevOps without a traditional computer science degree?

I’d really appreciate your honest opinion and any advice you can share.

Thank you for your time.

u/hamzabouk20 — 13 hours ago

What does your WLB look like?

I work for a company and the company is way too big to operate the way that we do.

Our entire release process basically hinges on a group of 4-5 platform engineers monitoring the e2e release process, which takes place in a variety of regions across the globe.

One team member often has to stay up, multiple times a week, from 8PM to 4AM when things are really bad, and for less time when things go as planned.

To me, this is absolutely insane. They might catch up on sleep the next day, but people are always sick or always out and they have no time to actually work on the platform.

I would never agree to it; I'll quit this job as soon as they ask me to take part in that process.

What do you all have for off-hours expectations?

EDIT:

To anyone who is going to comment on the poor release process: to maybe save yourself the effort, everyone knows it sucks. Everyone knows it can be improved. The company has put effort into improving it, but as soon as they start, they get yanked in a different direction and it ceases to be the priority.

Our company is over 5000 employees, over 1000 engineers. It's going to be a slow process to get them to change, and right now they're basically just running on the backs of pure good will from this small team of platform engineers.

u/ninetofivedev — 12 hours ago

▲ 69 r/devops

The real cost of EU cloud vs hyperscalers

I was surprised that EU cloud providers compete well on price with hyperscalers so I decided to do some deeper research and honestly the gap is surprisingly big.

Disclaimer: I'm the author of the blog post and founder of Cirran

u/mpuchala — 16 hours ago

FinOps tools like Vantage/CloudHealth show the storage waste, but engineers still have to fix it manually. How are you handling this?

Hey everyone,

We’ve been told to cut our AWS bill by around 20% this quarter, so we started looking at the usual stuff.

We set up Vantage, also looked at CloudHealth, and they’re pretty good at showing the obvious waste: idle EC2, unattached Elastic IPs, old snapshots, oversized instances, etc.

That part is fine.

The annoying part is EBS.

The tools are flagging terabytes of overprovisioned storage across live stateful workloads. They’re not wrong either. A lot of these volumes are clearly bigger than they need to be.

But once you ask engineering to actually shrink them, the whole thing gets stuck.

And I get why. The usual process is still basically:

create a smaller volume
format/partition it
rsync or snapshot/migrate
plan a maintenance window
stop services
swap mounts
test everything
hope nothing breaks

So now we have a nice dashboard telling us exactly how much money we’re wasting, but no one really wants to own the risk of fixing it manually.

Is everyone else just accepting this as part of the AWS tax, or have you found a better way to bridge the gap between FinOps visibility and actual remediation?

I’ve seen tools like Datafy trying to handle the block storage side more directly, but I’m still skeptical of anything that touches live storage automatically.

Curious what people here are using in practice.

u/RougeRavageDear — 8 hours ago

I am stuck in the secret-zero rabbit hole (HashiCorp Vault/OpenBAO)

My company's secret management is a mess, so I am trying to set up OpenBAO for secret management.

At first it looked like a good idea, because I thought it would protect against an attacker who gains shell access to the server (he could not read .env files or /proc/<pid>/environ etc.).

But when I dug a little deeper into it, I couldn't see what its benefit is. Every method of implementing auto-unseal turns out to be not really more secure than .env files:

OpenBAO with auto-unseal via transit: an attacker who gains access to the transit bao can get everything he wants, and you have to keep a token in the main bao to log in to the transit one, which comes down to the security level of .env files.
OpenBAO with KMS and IAM (Azure, AWS...): an attacker who gains access to the server can query the IP 169.254.169.254 and access the master key.
OpenBAO with KMS and static KMS credentials: same problem as with .env files.
OpenBAO with a static key (seal "static"): equivalent to .env files.
OpenBAO with HSM: equivalent to .env files, because any attacker with access to the server and the PIN can get the key.

Shamir's secret sharing is more secure than all of these (depending on where and how each person stores their share), but it is not suited for CI/CD etc.

What are your thoughts on this? Is it possible to set up a secret management system with zero secrets, or something that is as secure and production-ready?

u/redaben_ — 16 hours ago

▲ 8 r/devops+1 crossposts

New to AWS/devops, what to focus on?

Hi,
I’m a backend dev with 3+ yoe.

I got a job with a small fintech startup (4 devs) where we would have to wear several hats.

They are going to prod next month and they will hire a DevOps/security consultant to help out during the next three months.

I have been told I will shadow him with the idea I will own that part but the main responsibility will be backend development with Java.

The infra stack is AWS (EC2, S3, ECR, CodeDeploy) some terraform, grafana, Prometheus, etc

I’m new to AWS. I have used ECS, CloudFormation and some other stuff in a side project, but it was running on LocalStack.

Given the vast amount of resources available for AWS, any recommendations for getting up to speed? (I will join in two weeks)

Thanks

u/voreno87 — 16 hours ago

Do AI agents need a new kind of work environment to become truly useful in production?

Most agent infrastructure focuses on the harness: tool calls, planning loops, retries, evals, approvals, tracing, guardrails, and memory. But I’m not sure that is enough for agents to become truly productive inside organizations.

Coding agents work better partly because software already has a production environment: repos, files, tests, CI, diffs, PRs, reviews, deployment, rollback, and ownership. The agent can operate inside a world where work has state, verification, and a path to being accepted.

Most business work does not have that. It is spread across Slack, docs, tickets, email, dashboards, meetings, and people’s heads. The harness can execute the agent loop, but it does not necessarily define the work contract: where state lives, what can be changed, what evidence is required, who approves, how artifacts are versioned, and who owns the final result.

Do agents only need better harnesses, or do they need AI-native production environments where the work itself becomes explicit, durable, reviewable, and accountable?

u/TimkiP__ — 16 hours ago

▲ 44 r/devops+1 crossposts

Graduating this year and want to start DevOps/Cloud Engineering — where should I begin?

Hey everyone, I’m graduating this year and I want to build my career in DevOps/Cloud Engineering. Right now I’m learning Python basics and trying to understand what roadmap I should follow next.

I’m confused whether I should:

Learn from YouTube/free resources first

Join an online/offline course later

Focus on Cloud (AWS/Azure/GCP) directly or first build strong fundamentals

Can anyone suggest:

A good beginner roadmap for DevOps/Cloud

Best YouTube channels/playlists to follow

Platforms/courses that are actually worth it

Skills I should focus on first (Linux, Networking, Docker, Git, AWS, etc.)

I’d really appreciate advice from people already working in DevOps/Cloud. Thanks!

u/Nearby-Pickle1684 — 1 day ago

▲ 1 r/devops+1 crossposts

How I built CloudOps Assistant — a Slack bot that analyzes cloud infrastructure through conversation

I was tired of bouncing across 5–6 AWS consoles for routine ops on my own infra, so I tried wiring an AWS MCP server straight into a Slack bot. "Just an LLM with tools" — easy, right?

It broke in three ways that are probably pretty common once MCP leaves a single-developer setup.

Single-session design. The MCP server is built around one credential set per process. As soon as the bot needs to handle more than one identity — multiple users, or even one person juggling several AWS accounts and roles — you're either leaking permissions or serializing everything behind a single credential.
Slack's response window vs. real analysis time. Useful queries ("which ECS service drove the cost spike this week?") take 20–60s and multiple tool calls. Slack times out long before the LLM is done.
One-shot tool calls aren't enough. Almost every useful query was a chain: list resources → filter → fetch metrics → correlate. The model needs to loop until it decides it has the answer, not stop after the first tool returns.

So I rewired it.

- Per-identity MCP proxy. Each identity gets an isolated subprocess where its STS AssumeRole credentials are injected. Pooled, not one-per-request, so cold starts don't kill UX.

- SQS between Slack and the worker. Slack ack returns immediately; the worker processes async and posts back into the thread. Timeouts stop being a thing.

- Agent loop, not single tool call. The LLM keeps calling tools (Cost Explorer → CloudWatch → tag lookups → IAM) until it claims it's done. Bounded by max-iterations and a budget.
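The bounded agent loop in the last bullet can be sketched roughly as below. This is a minimal illustration, not the bot's actual code: `Step`, `run_agent`, `scripted_model`, the tool names, and the cost numbers are all hypothetical, and the "model" is a scripted stand-in for a real LLM call.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Step:
    """One decision from the model: call a tool, or finish with an answer."""
    tool: Optional[str] = None       # tool to call, if any
    args: dict = field(default_factory=dict)
    answer: Optional[str] = None     # set when the model claims it is done
    cost: float = 0.0                # hypothetical per-step spend


def run_agent(
    decide: Callable[[List[dict]], Step],
    tools: Dict[str, Callable[..., object]],
    max_iterations: int = 8,
    budget: float = 1.0,
) -> str:
    """Loop until the model answers or a bound (iterations, budget) is hit."""
    history: List[dict] = []
    spent = 0.0
    for _ in range(max_iterations):
        step = decide(history)
        spent += step.cost
        if spent > budget:
            return "stopped: budget exhausted"
        if step.answer is not None:
            return step.answer
        result = tools[step.tool](**step.args)
        history.append({"tool": step.tool, "args": step.args, "result": result})
    return "stopped: max iterations reached"


def scripted_model(history: List[dict]) -> Step:
    """Stand-in for the LLM: list resources, fetch a metric, then answer."""
    if len(history) == 0:
        return Step(tool="list_services", cost=0.1)
    if len(history) == 1:
        service = history[0]["result"][0]
        return Step(tool="get_cost", args={"service": service}, cost=0.1)
    return Step(answer=f"{history[0]['result'][0]} cost {history[1]['result']}", cost=0.1)


# Hypothetical tools; a real setup would route these to the MCP server.
TOOLS = {
    "list_services": lambda: ["checkout-api"],
    "get_cost": lambda service: 42.0,
}

print(run_agent(scripted_model, TOOLS))
```

Both bounds return a plain message instead of raising, so the worker can still post something useful back into the Slack thread when the loop is cut short.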

Cost spike investigations, "find anything publicly exposed", and "what caused yesterday's RDS CPU spike" are all answerable from Slack now, without opening a console.

Honestly the LLM was the easy part. The interesting work was the permission boundary and execution flow around it.

Curious how others have handled credential isolation when putting LLM agents in front of cloud infra — a proxy-per-identity feels heavy but I haven't found a cleaner pattern.
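For anyone picturing the "pooled, not one-per-request" part, here is a rough sketch of a pool keyed by identity with least-recently-used eviction. It is an assumption about how such a pool could look, not the post's implementation: `IdentityPool`, `start`, and `stop` are hypothetical names, and in a real setup `start` would spawn the MCP subprocess with that identity's temporary credentials while `stop` would terminate it.

```python
from collections import OrderedDict
from typing import Callable, Generic, TypeVar

W = TypeVar("W")


class IdentityPool(Generic[W]):
    """Keep at most max_size workers, one per identity, evicting the least recently used."""

    def __init__(self, start: Callable[[str], W], stop: Callable[[W], None], max_size: int = 4):
        self._start = start          # creates a worker for an identity
        self._stop = stop            # tears a worker down on eviction
        self._max_size = max_size
        self._workers: "OrderedDict[str, W]" = OrderedDict()

    def get(self, identity: str) -> W:
        if identity in self._workers:
            self._workers.move_to_end(identity)             # mark as most recently used
            return self._workers[identity]
        if len(self._workers) >= self._max_size:
            _, oldest = self._workers.popitem(last=False)   # evict the LRU worker
            self._stop(oldest)
        worker = self._start(identity)
        self._workers[identity] = worker
        return worker


# Demo with string stand-ins instead of real subprocesses.
started, stopped = [], []
pool = IdentityPool(
    start=lambda identity: started.append(identity) or f"worker:{identity}",
    stop=stopped.append,
    max_size=2,
)
pool.get("alice")
pool.get("alice")   # reused, no second start
```

This version is not thread-safe and never expires credentials; a worker serving concurrent Slack requests would need a lock around `get` and a refresh path for expiring STS sessions.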

u/basejb — 1 day ago

How extensively do you use the install-* actions?

Hey everyone!

In the context of all the recent GitHub Actions compromises, I'm strongly reducing the number of different actions we use in my company.

What's your take on the install-* actions, like install-poetry, setup-terraform, setup-trivy etc.? Otherwise, do you manually install them with curl commands? Or use tools like mise-en-place?

What are your strategies to reduce third-party exposure?

Cheers!

u/Juloblairot — 1 day ago

Job switch guidance

I have 4.6 years of experience in DevOps and currently work at TCS with a salary below 5 LPA. I have been actively trying to switch jobs for the past three months but have received only three interview opportunities so far. I am looking for serious guidance to improve my chances of switching jobs.

u/Fun-Jello8158 — 1 day ago

▲ 13 r/devops

Ephemeral Environment

Really looking for advice or tips on how others have handled this setup.

We need to spin up ephemeral environments whenever a release PR is opened (specifically from UAT → master). Our goal is to run end-to-end tests in these environments as part of release validation, support manual testing and nightly runs, and clean up after the tests.

Our current stack looks like this:
Jenkins
Kubernetes
ArgoCD
Kustomize
Gitea

One major constraint is that I don’t have full cluster access; we’re restricted to a single namespace.
Has anyone implemented something similar under these limitations? How did you structure your ephemeral environments, especially with ArgoCD + limited Kubernetes permissions?
Any patterns, tooling approaches, or lessons learned would really help.

u/Kitchen_Hornet1943 — 1 day ago

▲ 10 r/devops

Looking for someone to learn Kubernetes, Terraform, GCP

Hey everyone,

I'm looking for someone to learn and improve DevOps skills with. I've been learning things like Kubernetes, Python, and GCP, and I think it would be great to have someone to study with, share knowledge, and maybe build some small projects together.

A little bit about me. I’ve been working as a DevOps Engineer in one company for around 3 years. I feel a bit stuck right now, not because of laziness, but mostly because there aren’t many opportunities to grow in my current job. Most of the time I work on CI/CD pipelines, write some Ansible playbooks, connect Spring applications with tools like Graylog and Prometheus, and at the end of the process we mainly use Docker.

I'd really like to grow more in Kubernetes, ArgoCD, and GCP. I'm also looking for ideas on how to create more real-world scenarios and practical projects to improve my skills. If you have suggestions about what I should focus on or learn next, I'd really appreciate it.

My goal is to go from zero to hero in modern DevOps/cloud technologies.

If anyone is interested in learning together, feel free to comment or send me a DM.

u/DrissPl — 1 day ago

What bottlenecks still prevent teams from achieving much higher productivity across projects?

In theory, integrating Slack, Notion, Linear, GitHub, Google Drive, CRM, email, and databases should create a shared work environment with sustained context.
In practice, what bottlenecks still prevent teams from achieving much higher productivity across projects?

u/CompetitiveAdagio331 — 1 day ago

What does an actual DevOps engineer do?

Rn, I'm doing DevOps. But GODDAMN it's boring. I create a pipeline, have it build, obfuscate, orchestrate stuff, bla bla, but when it's time to run/test my work, it's almost 2 hrs per run. While it's running, I do nothing. Yes, I do study while doing nothing, but it gets boring. If only my table was at a better location, I would've run Netflix (without earbuds). Is this normal in the DevOps field? Btw I'm in a small software company, so the workload on my end is mostly light, medium at best.

u/konkon_322 — 2 days ago

Weekly Self Promotion Thread

Hey r/devops, welcome to our weekly self-promotion thread!

Feel free to use this thread to promote any projects, ideas, or any repos you're wanting to share. Please keep in mind that we ask you to stay friendly, civil, and adhere to the subreddit rules!

u/AutoModerator — 2 days ago

▲ 38 r/devops

Career pivot from bare metal infra to DevOps

Hi, I'm in my first real IT role, as an infrastructure engineer at a hosting company. Before this I was more on the telecom and hardware side, so the past couple of months have been a steep learning curve. I've picked up a lot: managing large fleets of bare-metal servers, virtualization, setting up monitoring for infra (Telegraf, Grafana), Ansible automation, and some security tooling. But mainly with the help of AI tools.

What I'm missing: Kubernetes (zero experience), CI/CD pipelines, cloud platforms (AWS/Azure), and Terraform. Basically everything the "DevOps" job market seems to want.

Some days I feel like I'm growing fast. Other days I feel drained because there's a lot to absorb. Just want to know if I'm headed in the right direction or wasting time.

Anyone made a similar transition? What would you prioritize first?

u/Repulsive_Island20 — 2 days ago

Multi-agent AI code review: 17 of 18 findings false. Lessons from burning credits

I've been personally experimenting with multi-agent code review: separate specialist agents for security, implementation, testing, and architecture, running on every PR I open. On one PR last month, 18 findings were reported and only 1 survived manual verification.

The failure modes were consistent:

Agents flagged files that weren't in the diff at all.
"code does X" claims without quoting the offending line.
Functions read in isolation, missing upstream guards.
Pre-existing patterns flagged as PR-introduced.

What stuck: a triage step before dispatching specialists. It pulls the + line list from gh pr diff, builds a small context bundle (changed files, repo conventions, analogous code), and picks which specialists to run based on PR size and surface. Docs-only? Skip security and architecture. Multi-file change touching auth? All four. Specialists must then quote the offending line for any finding they raise; if they can't, it's dropped before reaching me.
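A rough sketch of that triage and quote check, under stated assumptions: the diff parser is deliberately simplified, the doc suffixes, path hints, and default specialist choice are hypothetical, and findings are assumed to arrive as dicts with `file` and `quote` keys. The sample diff and findings are made up for illustration.

```python
import re
from typing import Dict, List, Set

DOC_SUFFIXES = (".md", ".rst", ".txt")         # assumption: what counts as docs
SENSITIVE_HINTS = ("auth", "login", "token")   # assumption: paths that warrant security review


def added_lines(diff_text: str) -> Dict[str, Set[str]]:
    """Map each changed file to the stripped lines a unified diff adds to it."""
    added: Dict[str, Set[str]] = {}
    current = None
    prev = ""
    for line in diff_text.splitlines():
        if line.startswith("+++ ") and prev.startswith("--- "):
            path = line[4:].strip()
            current = None if path == "/dev/null" else re.sub(r"^b/", "", path)
            if current is not None:
                added.setdefault(current, set())
        elif line.startswith("+") and current is not None:
            added[current].add(line[1:].strip())
        prev = line
    return added


def pick_specialists(files: List[str]) -> List[str]:
    """Choose reviewers from PR surface: docs-only skips security and architecture."""
    chosen = ["implementation", "testing"]
    if files and all(f.endswith(DOC_SUFFIXES) for f in files):
        return chosen
    if any(hint in f.lower() for f in files for hint in SENSITIVE_HINTS):
        chosen.append("security")
    if len(files) > 1:
        chosen.append("architecture")
    return chosen


def keep_quoted(findings: List[dict], added: Dict[str, Set[str]]) -> List[dict]:
    """Drop findings whose quoted line is not among the lines the PR adds to that file."""
    kept = []
    for finding in findings:
        quote = (finding.get("quote") or "").strip()
        if quote and quote in added.get(finding.get("file", ""), set()):
            kept.append(finding)
    return kept


SAMPLE_DIFF = "\n".join([
    "diff --git a/app/auth.py b/app/auth.py",
    "--- a/app/auth.py",
    "+++ b/app/auth.py",
    "@@ -1,2 +1,3 @@",
    " def check(user):",
    '+    token = request.args["token"]',
    "     return True",
])

FINDINGS = [
    {"file": "app/auth.py", "quote": 'token = request.args["token"]', "claim": "unvalidated input"},
    {"file": "app/db.py", "quote": "cursor.execute(sql)", "claim": "file not in the diff"},
    {"file": "app/auth.py", "quote": "", "claim": "no quoted line"},
    {"file": "app/auth.py", "quote": "return True", "claim": "pre-existing line"},
]

added = added_lines(SAMPLE_DIFF)
survivors = keep_quoted(FINDINGS, added)
```

The exact-match quote check covers three of the four failure modes listed above (files outside the diff, unquoted claims, pre-existing patterns); missing upstream guards still need the context bundle, since a correctly quoted line can be safe in context.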

I haven't wired this into CI yet, but the triage step would slot in cleanly as a pre-merge check.

If you're running AI review on real PRs, how are you bounding the false-positive rate? Per-agent diff scope, post-hoc filtering, or something else?

u/alejandro_such — 1 day ago