u/Alternative_One_4804

Spent the last few months wiring Claude Code into the rest of my dev workflow, not just the editor. So it picks up tickets, writes code, runs a review pass, and puts up an MR for me to look at. Plus a persistent knowledge layer so it doesn't start every session from zero.

If I had to pick one thing I got right: the agent is not what runs the workflow. Plain Python handles the mechanical stuff (API calls, git, tests). The agent only gets invoked when something actually requires judgment.

I started by letting the agent do everything. It read the configs, made API calls through tool use, managed git, kept track of its own state. Slow, expensive, and it would lose its place every so often. Once I split the mechanical work out into plain Python, the workflow phases got about 10x faster and failures actually became debuggable.
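
To make "mechanical" concrete: the Python side is just ordinary scripting around git and the test runner. A rough sketch of that half (paths and naming are made up, and the real version has more error handling):

```python
import subprocess
from pathlib import Path

REPO = Path.home() / "work" / "my-service"  # stand-in repo path

def create_worktree(ticket_id: str) -> Path:
    """Deterministic git plumbing: one branch + worktree per ticket, no agent involved."""
    branch = f"feature/{ticket_id}"
    worktree = REPO.parent / f"{REPO.name}-{ticket_id}"
    subprocess.run(
        ["git", "-C", str(REPO), "worktree", "add", "-b", branch, str(worktree)],
        check=True,
    )
    return worktree

def run_tests(worktree: Path) -> tuple[bool, str]:
    """Run the suite and hand back (passed, output) for the validation phase."""
    result = subprocess.run(
        ["pytest", "-q"], cwd=worktree, capture_output=True, text=True
    )
    return result.returncode == 0, result.stdout + result.stderr
```

None of this needs a model, and when it breaks the traceback points at the exact line.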

The flow for a single ticket now goes like this:

  1. Orchestrator (Python): fetch the ticket, search a local knowledge wiki for related decisions, set up a worktree, assemble a context brief for the agent.

  2. Claude Code: takes the brief, writes the code.

  3. Validation (Python + a separate review agent): tests, lint, code review pass. If anything fails, hand it back to the agent, retry up to three times.

  4. Ship (Python): write a proposal into a dashboard, wait for me to approve it, then push and open the MR.

The agent only runs in step 2 and the retry loop in step 3. Everything else is deterministic.
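
Stitched together, the orchestrator is close to a linear script with one bounded loop in the middle. A sketch of the shape, not the real code: `jira`, `wiki`, `agent`, and `dashboard` are stand-ins for my clients, and helpers like `build_brief`, `validate`, and `push_and_open_mr` are elided.

```python
MAX_RETRIES = 3

def process_ticket(ticket_id: str) -> None:
    # Phase 1 (Python): gather everything the agent will need up front.
    ticket = jira.fetch(ticket_id)                    # stand-in ticket client
    decisions = wiki.search(ticket.summary)           # related past decisions
    worktree = create_worktree(ticket_id)             # git plumbing from earlier
    brief = build_brief(ticket, decisions, worktree)  # plain string assembly

    # Phases 2-3: the only place the agent runs, inside a bounded retry loop.
    feedback, report = None, ""
    for _ in range(MAX_RETRIES):
        agent.implement(brief, feedback=feedback, cwd=worktree)  # Claude Code call
        ok, report = validate(worktree)               # tests + lint + review agent
        if ok:
            break
        feedback = report                             # hand failures back verbatim
    else:
        dashboard.flag_for_human(ticket_id, report)   # three strikes, escalate to me
        return

    # Phase 4 (Python): nothing irreversible happens without my approval.
    proposal = dashboard.create_proposal(ticket_id, worktree)
    if dashboard.wait_for_approval(proposal):
        push_and_open_mr(worktree, ticket)
```

The for/else handles the "retry up to three times, then stop" part; everything around the agent call is plain control flow I can put a breakpoint in.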

A few governance choices that ended up mattering more than I expected:

- The system never executes an irreversible action (merge, close ticket, send a message) without explicit human approval. It creates a proposal that I have to click to approve.

- A separate review agent, configured with no edit or write permissions, runs the code review pass. Splitting review and implementation into two isolated contexts caught a class of issues the implementation agent kept missing on its own.

- The wiki tags facts as verified, inferred, or human-provided. Without that tagging, agents end up treating their own past hallucinations as truth.
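
The tagging itself is nothing clever; every wiki entry just carries its provenance. Roughly this shape (field names are illustrative, not a real schema):

```python
from dataclasses import dataclass
from enum import Enum

class Provenance(Enum):
    VERIFIED = "verified"        # checked against code, tests, or docs
    INFERRED = "inferred"        # the agent's own conclusion, treat with suspicion
    HUMAN = "human-provided"     # stated by a person, taken at face value

@dataclass
class WikiFact:
    text: str
    provenance: Provenance
    source: str                  # e.g. a commit hash, ticket ID, or session date

# When the orchestrator assembles a context brief, inferred facts get flagged so
# the agent doesn't quote its own past guesses back at me as established truth.
```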

Things I'm still wrestling with:

- Anything spanning multiple repos. The agent loses coherence across services.

- Tickets that are too vague. The output looks plausible but is often wrong.

- Over-engineering. It adds error handling and abstractions for hypothetical needs.

- Long-running sessions. Earlier context falls out of effective attention.

Would love feedback, especially from people who have built something in this space. What did you keep, what did you throw out, and where do my decisions look wrong to you?

u/Alternative_One_4804 — 10 days ago

I've been thinking about and building a setup that uses Claude Code across my whole dev workflow. Tickets, code review, MRs, persistent knowledge between sessions. Not just inside the editor.

Here's what I figured out, including the things I got wrong first.

The main idea, in plain words: don't let the LLM run the workflow. Use plain Python for the mechanical stuff (API calls, git, tests) and only call the agent when there's actually a judgment call to make.

The first version did the opposite. The agent ran everything: read 200-line config files, made API calls through tool use, managed git, tracked its own progress. It was slow, ate tokens, and would skip steps. Moving the mechanical stuff to plain Python made phases 10x faster and made failures actually debuggable.

The flow now looks like this for a single ticket:

  1. Python orchestrator: pull the Jira ticket, search the wiki, create the worktree, build a context brief.

  2. Claude Code: read the brief and write the code.

  3. Python + a separate review agent: run tests, lint, dispatch a code review pass. Loop back to the agent if anything fails (max 3 times).

  4. Python: create a proposal in a dashboard. I approve manually. Then the orchestrator pushes and creates the MR.

The agent is only invoked in step 2 and the retry loop in step 3. Everything else is deterministic.
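
For step 2 and the review pass I run Claude Code headless from the orchestrator instead of interactively. Something like the sketch below; the flag names are from memory of the CLI's non-interactive mode, so treat them as assumptions and check `claude --help` before copying:

```python
import subprocess

def implement(brief: str, worktree: str) -> str:
    """Implementation agent: full tool access, but only inside its own worktree."""
    result = subprocess.run(
        ["claude", "-p", brief],   # -p / --print: non-interactive mode (assumed flag)
        cwd=worktree, capture_output=True, text=True,
    )
    return result.stdout

def review(diff: str, worktree: str) -> str:
    """Review agent: same binary, restricted to read-only tools (flag name assumed)."""
    prompt = f"Review this diff for correctness, missed edge cases, and style:\n{diff}"
    result = subprocess.run(
        ["claude", "-p", prompt, "--allowedTools", "Read,Grep,Glob"],
        cwd=worktree, capture_output=True, text=True,
    )
    return result.stdout
```

Running them as separate processes is also what keeps the review context isolated: the reviewer only ever sees the diff and the repo, never the implementation agent's conversation.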

Stuff I haven't figured out yet:

- Cross-repo features. The agent loses the thread when something spans services.

- Vague tickets. It produces reasonable-looking but wrong code from ambiguous specs.

- Scope creep. It adds error handling and abstractions for things nobody asked for.

- Long sessions. Earlier context falls out of attention.

Full writeup with the architecture diagrams, governance model, knowledge layer, and the failure case that taught me the most:

https://pixari.dev/ai-assisted-product-engineering/

Would appreciate feedback, especially from anyone who has built something similar.
