r/cicd

▲ 12 r/cicd+2 crossposts

What CI looks like at PostHog in a week: 575K jobs, 33M tests

tl;dr: PostHog is ~100 engineers pushing constantly to a monorepo. In one week they ran 575,894 CI jobs, processed 1.18 billion log lines, and ran 33 million tests. We continuously debug their CI with an agent.

Flaky tests were annoying before AI, and now those flakes can block teams from shipping or cutting a release. But AI can also help fix this, because it makes deep root-cause analysis possible to automate at scale.

mendral.com
u/samalba42 — 1 day ago
▲ 7 r/cicd+1 crossposts

When env vars leak, where do you control blast radius? (Vercel incident)

A lot of recent incidents don’t start in CI/CD — but leaked environment variables seem like a fast way to amplify damage through builds. Curious how folks here think about containing that blast radius at build time, not just preventing the initial leak.
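
For concreteness, one build-time control I keep coming back to (sketched for GitHub Actions; names are illustrative, not from the incident) is scoping secrets to the single job that actually needs them instead of exporting them workflow-wide:

jobs:
  build:
    runs-on: ubuntu-latest
    # no secrets in scope here: a compromised dependency or build script
    # has nothing to exfiltrate during install/build
    steps:
      - uses: actions/checkout@v4
      - run: npm ci && npm run build
  deploy:
    needs: build
    runs-on: ubuntu-latest
    environment: production   # environment-scoped secrets, optional required reviewers
    steps:
      - run: ./scripts/deploy.sh
        env:
          DEPLOY_TOKEN: ${{ secrets.DEPLOY_TOKEN }}   # exposed to this one step only

That doesn't prevent the initial leak, but it shrinks what a poisoned build step can reach.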

reddit.com
u/Ok-Barracuda5306 — 12 days ago
▲ 10 r/cicd+1 crossposts

[OpenSource] GitHub Action that auto-commits .env.example and fails the PR if you forgot to document a new env var

Keeping .env.example in sync with actual code usage is a manual chore that everyone forgets, so I released envsniff to treat documenting env vars as a build requirement.

Why use it?

  • Multi-language support: Scans JS, Go, Python, and even Shell scripts.
  • Zero Config: The default setup finds most standard usage patterns.
  • Auto-remediation: You can set commit: true to let the Action maintain the example file for you.

# check out the repo first so the Action can scan code and commit .env.example back
- uses: actions/checkout@v4
- uses: harish124/envsniff@v0.1.0
  with:
    fail-on-drift: true   # fail the PR if a new env var isn't documented
    commit: true          # let the Action commit the regenerated example file

Check it out here: https://github.com/harish124/envsniff

Please drop a star on GitHub!

u/Outrageous_Ranger812 — 14 days ago
▲ 3 r/cicd

I have a question about CI/CD and an ADO project

We are a small team working on a project at company X, and I am involved in everything related to infrastructure. My CI/CD question: should I create YAML files in each repository, or should I create a master repository that takes care of them all? Or maybe a hybrid approach?

Currently I've created YAML files for the CI and CD pipelines, but the logic across multiple repositories is similar, and I am just copy-pasting it from repository to repository.

And I suspect that, most of the time, logic in one repository will also be needed in the other repositories in the future.
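
To make the hybrid idea concrete, here's roughly what I'm picturing, assuming Azure DevOps template syntax and a central repo I'd call pipeline-templates (the names are placeholders):

# azure-pipelines.yml in each application repository
resources:
  repositories:
    - repository: templates
      type: git
      name: MyProject/pipeline-templates   # hypothetical shared repo
      ref: refs/heads/main

# all shared CI/CD logic lives once, in the templates repo
extends:
  template: ci-cd.yml@templates
  parameters:
    serviceName: service-a

Each repository would keep only this thin pointer file, and shared changes would happen in one place. Is that the right direction?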

reddit.com
u/A-N-D11 — 11 days ago
▲ 10 r/cicd

AI Impact on DevOps and CI/CD!

My organization recently gave us access to Codex, Claude, and Gemini Pro to try out and evaluate for daily workflows on both the engineering and DevOps sides. A couple of weeks in, here is my take as a DevOps engineer:

  1. Codex - Amazing at long-running tasks. Handles huge context decently (when working with multiple repos). Skills come in handy when you want to offload mundane stuff.

  2. Gemini - Great at doing research, grasping errors from screenshots, and working with the Google ecosystem. I have mostly used it with Google's Antigravity IDE, and sadly it's not the best out there. The agent often fails with errors under high load and needs to be nudged again and again. The auto-completion, though, works amazingly well, even across files.

  3. Claude - Great capabilities, unmatched results. Writes very clean and modular code. Claude in Chrome is amazing for troubleshooting pipelines running in GitHub and GitLab (waiting for the official plugins to move to GA). Only limitation - the amount of tokens it burns is insane.

As a DevOps engineer, one of my primary duties is to build CI/CD pipelines. That used to take me a couple of hours and can now be completed in minutes (developed, tested, shipped) using AI tools.

My questions are -
1. How deep is AI adoption in your org in a high-trust domain such as CI/CD?
2. As a DevOps engineer, how do you keep yourself ahead of the curve so that you are not replaced by AI someday?

reddit.com
u/BusyPair0609 — 16 days ago
▲ 2 r/cicd

We ran a Terraform audit on an Azure environment — found 3 issues causing pipeline failures

Recently worked through a Terraform + CI/CD setup in Azure that looked solid on the surface, but had some hidden problems that explained recurring pipeline failures.

The biggest issues:

  1. Unmanaged state across environments

Dev and prod were drifting because state wasn't centralized.

  2. Module inconsistency

Same resources defined slightly differently across repos — hard to maintain and debug.

  3. Pipelines failing under concurrency

No controls in place → race conditions during deployments (see the sketch after this list).
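
On the concurrency failures, the baseline control is serializing applies per environment or branch at the pipeline level. A minimal sketch, assuming GitHub Actions:

# queue Terraform applies instead of letting them race
concurrency:
  group: terraform-apply-${{ github.ref }}
  cancel-in-progress: false   # wait for the in-flight apply rather than cancel it

Pairing that with a locking remote state backend (the azurerm backend locks via blob leases) addresses the races at both layers.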

Curious — how are others handling:

• Terraform state management across environments?

• Preventing drift in multi-team setups?

Would love to hear what’s working (or not working) for you.

reddit.com
u/jkb0751 — 10 days ago
▲ 1 r/cicd

Gave an LLM an SQL interface to our CI logs. Here's what broke first.

Disclosure up front: I'm a co-founder at Mendral (YC W26). We build an agent that debugs CI failures. Not a pitch, sharing what we learned. Mods can take it down if it doesn't fit.

We run around 1.5B CI log lines and 700K jobs per week through ClickHouse for our agent to query. It writes its own SQL, no predefined tool API. The LLM-on-logs angle is covered to death. The CI-specific parts are what I haven't seen discussed much.

1) GitHub's rate limit is the thing that kills you.

15K requests per hour per App installation. Sounds generous until you're continuously polling workflow runs, jobs, steps, and logs across dozens of active repos, while the agent itself also needs to hit the API to pull PR diffs, post comments, and open PRs. A single big commit can spawn hundreds of parallel jobs, each producing logs you need to fetch.

Early on we'd burst, hit the ceiling, fall 30+ minutes behind, and the agent would be reasoning about stale data. Useless if an engineer is staring at a red build right now.

Fix was boring. Cap ingestion at ~3 req/s steady and use durable execution (we're on Inngest) so when we hit the limit we read X-RateLimit-Reset, add 10% jitter, and suspend the workflow with full state checkpointed. When the window resets, execution picks up at the exact API call it left off on, so there's no retry logic, no dedup, no idempotency work. The rate limit becomes a pause button. P95 ingestion delay is under 5 minutes, usually seconds.
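
In code, the pattern is small. One way to express it, sketched with Inngest's TypeScript SDK (event names and endpoint are illustrative, not our actual pipeline):

import { Inngest, RetryAfterError } from "inngest";

const inngest = new Inngest({ id: "ci-ingestion" });

export const ingestRunJobs = inngest.createFunction(
  { id: "ingest-run-jobs" },
  { event: "ci/run.observed" },
  async ({ event, step }) => {
    const jobs = await step.run("fetch-jobs", async () => {
      const r = await fetch(
        `https://api.github.com/repos/${event.data.repo}/actions/runs/${event.data.runId}/jobs`,
        { headers: { Authorization: `Bearer ${process.env.GITHUB_TOKEN}` } }
      );
      if (r.status === 403 || r.status === 429) {
        // GitHub tells us when the window resets; suspend until then, +10% jitter
        const resetMs = Number(r.headers.get("x-ratelimit-reset")) * 1000;
        const delay = Math.max(resetMs - Date.now(), 1_000) * 1.1;
        throw new RetryAfterError("GitHub rate limit hit", delay);
      }
      return r.json();
    });
    // ...persist `jobs` to ClickHouse in a later step.run(...)
  }
);

RetryAfterError makes the runtime checkpoint the function and re-run just this step after the delay; completed steps are replayed from saved state, which is why there's no dedup or idempotency work on our side.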

2) Raw SQL beat a constrained tool API by a wide margin.

We started with the usual get_failure_rate(workflow, days), get_logs(job_id), etc. It capped the agent at questions we'd thought of. Switching to raw SQL against a documented schema unlocked investigations we never scripted. Recent models write good ClickHouse SQL because there's a huge amount of it in training data. Median investigation across 52K queries is 4 queries, 335K rows scanned, ~110ms per raw-log query.
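
For flavor, a typical agent-written query (table and column names here are illustrative, not our actual schema), this one hunting retry flakes:

-- jobs that both failed and passed for the same commit in the last 7 days
SELECT
    commit_sha,
    job_name,
    countIf(conclusion = 'failure') AS fails,
    countIf(conclusion = 'success') AS passes
FROM ci_jobs
WHERE started_at > now() - INTERVAL 7 DAY
GROUP BY commit_sha, job_name
HAVING fails > 0 AND passes > 0
ORDER BY fails DESC
LIMIT 50

No tool API we designed up front would have included "group by commit, find jobs with mixed conclusions."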

3) Denormalize everything. Columnar storage eats the repetition.

Every log line in our table carries 48 columns of run-level metadata: commit SHA, author, branch, PR title, workflow name, job name, runner info, timestamps. In a row store this is insane. In ClickHouse with ZSTD, commit_message compresses 301:1 because every log line in a run shares the same value. The whole table lands at ~21 bytes per log line on disk including all 48 columns. The real win isn't the disk savings, it's that the agent can filter by any column without a join. When it asks "show me failures on this runner label, in the last 14 days, where the PR author is X," there's no join to plan around.
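
Concretely, the shape of the table, trimmed from 48 columns to a few (names illustrative):

CREATE TABLE ci_log_lines
(
    event_time     DateTime CODEC(Delta, ZSTD),
    commit_sha     FixedString(40),
    commit_message String CODEC(ZSTD(3)),   -- identical across a run: ~300:1
    pr_author      LowCardinality(String),
    branch         LowCardinality(String),
    workflow_name  LowCardinality(String),
    job_name       LowCardinality(String),
    runner_label   LowCardinality(String),
    line           String CODEC(ZSTD(3))
)
ENGINE = MergeTree
ORDER BY (workflow_name, job_name, event_time);

LowCardinality dictionary-encodes the repeated metadata, so the 48 columns of run-level context cost almost nothing on disk, and every agent filter stays a single-table predicate.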

What I'm curious to hear from this sub:

- Anyone running an ingestion layer against GitHub Actions (or Buildkite, CircleCI) that has to share API budget with other consumers? How are you splitting it? We ended up keeping ~4K req/hour headroom for the agent and tuning ingestion under 3 req/s. Trial and error.

- Anyone using columnar stores (ClickHouse, DuckDB, Druid) for CI observability specifically, vs general log platforms (Loki, Elastic)? Tradeoffs?

Longer writeup with the query latency histogram and the rate limit graphs is here if you want detail: https://www.mendral.com/blog/llms-are-good-at-sql

u/samalba42 — 15 days ago
▲ 3 r/cicd+1 crossposts

API testing without maintaining test code - looking for beta testers

Hey folks,

I've been building QAPIR (https://app.qapir.io), a tool that generates API test scenarios automatically from API docs or an OpenAPI spec.

The idea is to reduce the amount of test code and setup usually needed for backend testing. You paste a link to API docs (or upload an OpenAPI spec), and in a couple of minutes it generates a working baseline test suite with validations, environment variables/secrets, and chained calls.

Tests can be edited in a simple YAML format or through a UI editor.

Right now it's focused on REST APIs, but I'm planning to add things like:

  • CI integrations (GitHub / GitLab)
  • more protocols (GraphQL, WebSockets, gRPC)
  • additional test steps (DB/cache queries, event queues, webhook testing, HTTP mocks)

It's very early, and I'm looking for a few SDETs, developers, and QA engineers willing to try it for free and give honest feedback.

If you're doing API testing and are curious to try it on a real service, I'd really appreciate your thoughts.

Link:
https://app.qapir.io

Thanks!

u/Distinct-Lemon-2720 — 10 days ago
▲ 1 r/cicd

How do you share node_modules across CI stages in an Nx monorepo without Nx Cloud?

Hi everyone,

I'm currently working as an intern, and one of my tasks is to rebuild/improve our frontend CI/CD pipeline.

We are using an Nx monorepo, and as many of you probably know, caching can become a real bottleneck.

The main issue is node_modules, which is around 3 GB. Right now, every stage/job in the pipeline has to download the cache again, and since we have 8 jobs, this adds huge overhead.
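
To make it concrete, the pattern I'm describing, sketched in GitLab CI syntax (simplified; job names are illustrative):

# every job restores the same ~3 GB cache keyed on the lockfile
default:
  cache:
    key:
      files:
        - package-lock.json
    paths:
      - node_modules/
    policy: pull        # read-only for most jobs, but still a full download each time

install:
  stage: .pre
  script:
    - npm ci
  cache:
    key:
      files:
        - package-lock.json
    paths:
      - node_modules/
    policy: pull-push   # only this job uploads

build:
  stage: build
  script:
    - npx nx run-many -t build

Even with pull-only policies, every job still spends time downloading and unpacking the same 3 GB before it can do anything.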

I’m trying to figure out if anyone has already faced this kind of problem and found an efficient solution without using Nx Cloud.

More specifically:

- How do you handle sharing such a large node_modules dependency between stages/jobs?

- Is there a better approach than forcing each job to restore the same cache?

- Do you use artifacts, Docker layers, custom images, or another workaround?

I’d really appreciate any feedback, best practices, or real-world experiences.

Thanks!

reddit.com
u/brahim_- — 4 days ago