u/sszz01

▲ 2 r/soc2

what evidence does your auditor actually want for a billing bug fix? PR + CI, or something more?

going through SOC2 Type II and got a specific ask from our auditor that caught us off guard. we had a billing bug in prod. fixed it. had PR approval and CI passing. auditor came back and asked for evidence that the fix was actually tested against the original crash. not just that tests passed in general, but something showing "here's the crash, here's the test that reproduces it, here's proof the fix makes it pass."

is that a normal ask, or is our auditor being unusually strict? how are you generating that evidence right now? manually writing it up per incident, or is there tooling that handles it?

specifically asking about billing/payment code; the auditor seemed to care more about those paths than anything else.

u/sszz01 — 3 days ago

what does your SOC2 CC8.1 evidence actually look like for a production billing fix?

going through this with a client and got stuck on something specific. the auditor asked for evidence that a billing bug fix was tested against the actual crash. not just PR approval and CI passing, but something that says "here's the crash, here's the test that reproduces it, here's proof the fix works."

how are you handling this in practice? are teams writing this up manually? is there tooling that generates it? or is PR + CI usually enough for most auditors?

specifically asking about billing/payment code where auditors seem to care more than usual.

u/sszz01 — 3 days ago

how are you satisfying PCI DSS 6.3.2 for production bug fixes? what does your testing evidence actually look like?

trying to pin down what the evidence looks like in practice for production bug fixes specifically.

for planned features it's pretty clear. you write tests, ci runs them, you have the artifact. but for production incidents where you're patching billing or payment code under pressure, the evidence trail often looks like: sentry alert, hotfix branch, pr approval, merge, deploy. no specific documentation that the fix was tested against the original crash.

when your auditor asks "show me how you tested this fix" for a production payment bug, what are you actually showing them? is pr approval + ci passing enough? do you need something that specifically demonstrates the root cause was reproduced and resolved?

asking because i'm trying to build something that automates the artifact generation for exactly this scenario - deterministic crash reproduction in a sandbox + structured evidence output mapped to pci control IDs - but i want to understand if auditors actually care about this or if i'm overengineering it.
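
for concreteness, here's roughly the artifact shape i'm sketching. every field name, url, commit hash, and control mapping below is my own invention - nothing here comes from the pci spec or an auditor:

    # rough sketch of a per-incident evidence artifact. field names,
    # urls, and commit hashes are all invented placeholders.
    import json
    from datetime import datetime, timezone

    artifact = {
        "controls": ["PCI DSS 6.3.2", "SOC2 CC8.1"],
        "incident": {
            "sentry_issue": "https://sentry.io/organizations/acme/issues/12345/",
            "exception": "decimal.InvalidOperation",
            "first_seen": "2026-01-08T14:02:11Z",
        },
        "reproduction": {
            "test_file": "tests/regressions/test_billing_12345.py",
            "fails_at_commit": "abc1234",   # pre-fix: crash reproduces
            "passes_at_commit": "def5678",  # post-fix: same test passes
        },
        "generated_at": datetime.now(timezone.utc).isoformat(),
    }
    print(json.dumps(artifact, indent=2))

the idea being that the two commits plus the test file capture the crash -> repro -> fix chain in a single document an auditor can read.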

u/sszz01 — 4 days ago

what does your SOC2 change management evidence actually look like for a production bug fix?

going through soc2 type II and got stuck on a specific question from our auditor that i wasn't expecting.

we had a billing bug in prod last quarter. found it, fixed it, deployed it. but when our auditor asked for evidence that the fix was tested before deployment, and specifically that the fix addressed the root cause, we kind of froze.

we had a PR with review approvals. we had ci passing. but we didn't have anything that said "here is the crash that happened in production, here is the test that reproduces it, here is proof the fix makes that test pass." auditors apparently want something closer to that for PCI DSS 6.3.2 and SOC2 CC8.1.

so how are you handling this in practice? are you manually writing up a repro + remediation doc for every prod bug? is there tooling that generates it? does your auditor actually care about this level of detail or is PR approval + CI passing good enough?

specifically for billing/payment-touching code, our auditor seemed to care more than i expected. curious if others have run into this or if i just have an unusually strict audit firm.

got annoyed enough that i started looking into automating the artifact part. there's an approach where you pull the sentry event, reproduce the crash deterministically in a sandbox, and output a structured artifact that maps to pci/soc2 control IDs. still figuring out if this is actually what auditors want or if it's overkill.
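
the "proof" half can be as dumb as running the same repro test at the pre-fix and post-fix commits and recording both outcomes. everything below (commits, paths) is a made-up placeholder:

    # hypothetical: demonstrate the crash -> repro test -> fix chain by
    # running one test before and after the fix. commits and test path
    # are placeholders.
    import subprocess

    def test_passes_at(commit: str, test_path: str) -> bool:
        subprocess.run(["git", "checkout", commit], check=True)
        result = subprocess.run(["python", "-m", "pytest", "-x", test_path])
        return result.returncode == 0

    TEST = "tests/regressions/test_billing_12345.py"
    evidence = {
        "test": TEST,
        "fails_before_fix": not test_passes_at("abc1234", TEST),
        "passes_after_fix": test_passes_at("def5678", TEST),
    }
    print(evidence)  # both True = the chain the auditor asked for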

u/sszz01 — 4 days ago
▲ 0 r/sre

have you ever pushed a fix and realized days later it didn't actually fix anything

honest question because this has happened to me more than once.

you push a fix for an incident, things go quiet, you assume it worked. then like 3 days later the same error comes back, and it turns out you patched the wrong code path or only handled one of the inputs that was actually breaking. now you're explaining it in the post-mortem.

how do you actually verify a fix is the right one before you ship it? some teams write a failing test first, fix it, watch it pass. some just deploy and watch dashboards. some have a staging env that catches it. some just hope.

curious what your actual flow looks like. have you ever shipped a fix that turned out not to actually fix the bug? how did you find out - alert firing again, user complaint, metric drift, or something else?

i honestly got annoyed enough about this that i started building something to make the verification step automatic. paste a sentry url (or any traceback), it grabs the frame state at the crash and runs that state against your branch in a docker sandbox, gives a yes/no on whether the bug still reproduces. still figuring out if anyone else cares or if it's just me.
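
fwiw the verification step itself isn't much code. roughly this, where the image name and paths are placeholders and the image is assumed to have pytest plus your app deps baked in:

    # sketch: run the generated repro test inside a container against the
    # checked-out branch. a failing test = the bug still reproduces.
    import subprocess

    def still_reproduces(repo_dir: str, test_path: str) -> bool:
        result = subprocess.run(
            ["docker", "run", "--rm",
             "-v", f"{repo_dir}:/app", "-w", "/app",
             "myapp-test:latest",  # placeholder image with deps installed
             "python", "-m", "pytest", "-x", test_path],
            capture_output=True,
        )
        # simplification: any nonzero exit counts as "still broken"
        return result.returncode != 0

    print(still_reproduces(".", "tests/repro_sentry_12345.py"))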

does this match anything you deal with on call, or is watching dashboards for a few days good enough?

u/sszz01 — 4 days ago

Do devs on your team actually write repro tests before fixing prod bugs, or do you have to fight them on it?

Curious how QA people deal with this, because from the dev side, I never know what the actual expectation is.

When a prod bug comes in, do the devs on your team reproduce it as a failing test before they fix it? Or do they just read the trace, push a fix, and call it done?

I've been on teams where nobody writes repro tests unless QA explicitly asks for one. And I've been on teams where it's just expected. I've never seen a consistent standard across companies, though.

Asking because I keep spending 30-45 mins manually writing repro tests from Sentry traces, and I'm wondering if that's normal or if I'm just bad at it. Also wondering if QA even cares whether the repro test exists, or if they just care that the bug is fixed and doesn't come back.

Does your team have an actual process for this, or does everyone just wing it?

u/sszz01 — 5 days ago

Do you write a repro test before fixing a prod bug or just push the fix?

When something breaks in prod, what does your actual process look like? I always end up in this loop - read the Sentry trace, try to reproduce it locally, get the inputs slightly wrong, fix the test, run it again, finally get it reproducing, then actually fix the bug. It takes 30-45 mins just on the repro before I've even touched the real problem.

I've talked to a bunch of devs and everyone does it differently. Some write the failing test first, some just read the trace and push, some deploy and watch monitors.

Curious what people actually do vs what they think they should do, especially on anything critical like billing or auth where a bad fix is worse than leaving the bug in.

How long does writing a repro test take you?

u/sszz01 — 5 days ago
▲ 1 r/django

how do you actually handle prod bugs? do you write a repro test or just fix and deploy?

honest question because i've gone back and forth on this myself.

when sentry fires do you actually reproduce it locally as a failing test before touching anything, or do you just read the trace, understand what broke and push the fix?

i always end up spending like 30-45 mins just getting the repro right. reconstructing the state, getting deps working in the test, running it, realizing the inputs are slightly off, running it again. by the time it actually reproduces i've lost the whole debugging flow.

got annoyed enough that i started building something to automate it. grabs the frame locals from sentry, generates a pytest, runs it in docker against your branch. still figuring out if this is actually useful to other people or just my own problem.
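
the generation half is roughly the sketch below. it assumes the sdk captured frame locals (the payload shape is the sentry sdk event schema), and the emitted test is a draft to edit, since frame locals aren't necessarily the function's kwargs:

    # sketch: saved sentry event payload (event.json) -> pytest skeleton.
    # assumes the sdk captured local variables at the crash frame. the
    # locals come through as serialized reprs, so treat the output as a
    # draft, not a faithful replay.
    import json

    with open("event.json") as f:
        event = json.load(f)

    exc = event["exception"]["values"][-1]    # innermost exception
    frame = exc["stacktrace"]["frames"][-1]   # crash frame
    crash_vars = frame.get("vars", {})

    test_src = (
        "import pytest\n"
        f"from {frame['module']} import {frame['function']}\n\n"
        f"def test_repro_{event['event_id'][:8]}():\n"
        f"    with pytest.raises({exc['type']}):\n"
        f"        {frame['function']}(**{crash_vars!r})\n"
    )
    print(test_src)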

how long does it take you to write a repro test from a sentry trace? do you even bother or just push and monitor? has skipping it ever come back to bite you?

u/sszz01 — 5 days ago

spent way too long writing repro tests manually so i automated it. does anyone actually do this or is it just me?

so every time prod broke i'd spend like 30-45 mins just getting to the point where i could reproduce the bug locally. read the sentry trace, guess at the state, write a test, run it, wrong inputs, fix it, run again. by the time it was actually reproducing i'd already lost half my debugging flow.

got annoyed enough that i built something to do it automatically. paste the sentry URL, it grabs the stack trace and the actual variable state from the crash frame, generates a pytest, runs it in docker, tells you if it still reproduces on your branch or you already fixed it.
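
the fetch step is just sentry's latest-event-for-an-issue endpoint, roughly like this. token and org are placeholders, so double-check the path against your sentry's api docs:

    # sketch: issue url -> latest event json via sentry's "retrieve the
    # latest event for an issue" endpoint. token and org are placeholders.
    import os
    import re
    import requests

    def latest_event(issue_url: str) -> dict:
        issue_id = re.search(r"/issues/(\d+)", issue_url).group(1)
        resp = requests.get(
            f"https://sentry.io/api/0/issues/{issue_id}/events/latest/",
            headers={"Authorization": f"Bearer {os.environ['SENTRY_TOKEN']}"},
            timeout=30,
        )
        resp.raise_for_status()
        return resp.json()

    event = latest_event("https://sentry.io/organizations/acme/issues/12345/")
    print(event.get("title"))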

hasn't really been tested outside my own projects yet, so genuinely no idea if this is useful to anyone else or if it just solves my specific flavor of pain.

curious though. when something breaks in prod, what does your actual process look like? do you bother reproducing it locally before you fix it, or do you just read the trace and push? and has that ever burned you, like deploying a fix that didn't actually fix it and finding out the hard way?

also if you do write repro tests manually, how long does it usually take you? trying to figure out if 30-45 mins is just me being slow or if that's normal

u/sszz01 — 5 days ago

I built a tool that turns a Sentry URL into a failing pytest. Want honest feedback on whether this is useful

I was working on backend code the other day and kept running into the same thing. Every time a production bug hit, I'd spend 30-45 minutes doing the same loop - read the Sentry trace, manually reconstruct the state, write a pytest, run it, realize I got the inputs slightly wrong, fix the test, run it again. By the time I had a reproducing test I'd burned nearly as much time on the repro as on the actual fix.

So I started building something to automate it.

The idea is that you paste a Sentry issue URL, it pulls the stack trace and frame locals, synthesizes a failing pytest that reproduces the exact crash, runs it in a Docker sandbox against your current branch, and tells you "still reproduces" or "your branch fixed it."

The part I think actually matters is the frame locals. It captures the exact production state at the crash frame and replays it, so the test is asserting against what actually broke in prod, not a guess at what might break. It works with any Python traceback too; Sentry is just the cleanest input.
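
To make "replays the frame locals" concrete, a synthesized test comes out looking roughly like the sketch below. Every name and value in it is an invented example, not output from a real run:

    # Roughly what a synthesized repro test looks like. CRASH_LOCALS is
    # the captured production state at the crash frame; all names and
    # values here are illustrative.
    from decimal import Decimal, InvalidOperation

    import pytest

    # Captured at the crash frame in prod (illustrative values).
    CRASH_LOCALS = {
        "total": Decimal("49.00"),
        "days_used": 0,
        "days_in_period": 0,
    }

    def prorate(total, days_used, days_in_period):
        # Stand-in for the real function that crashed.
        return total * days_used / days_in_period

    def test_sentry_ab12cd34_reproduces_crash():
        # Passes while the bug exists; fails once a fix guards 0/0.
        with pytest.raises(InvalidOperation):
            prorate(**CRASH_LOCALS)

The incident id baked into the test name is deliberate, so the repro stays greppable from the post-mortem later.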

Before I go further with this, two honest questions:

  1. Do you actually write a local repro test before fixing a production bug, or do you read the trace, understand it, fix it, and deploy?
  2. If this worked reliably and saved you that 30-45 minutes, would you pay for it or is this only useful if it's free?

Just trying to figure out if I'm solving a real problem or one I invented for myself. If this matches something you deal with, I'd genuinely like to hear how you currently handle it.

u/sszz01 — 5 days ago