u/dupa1234s

How do you make agents run for hours, and what architectures are actually agent-friendly?

#deep-dive #vibe-coder-issues

This is mostly aimed at vibe coders who can't or don't want to guide an agent every 10 minutes.

My two biggest questions are:

  1. How do you actually make a coding agent keep working for at least 1 hour, ideally 8–20 hours, without constantly telling it to continue?
  2. What language/framework/architecture is actually agent-friendly for a local app that integrates many existing technologies and has a lot of real-time-ish flows?

The first question is the immediate practical one.

How on earth do people make these agents keep running?

Unless I write some script that watches the terminal and keeps sending:

«continue unless you are fully done; if you are fully done, say DONE as your last word»

or unless I build some server hook / automation loop around the agent, it just keeps stopping. It finishes when I do not want it to finish. It reports halfway through the plan. It asks for input when there is nothing useful for me to evaluate yet.
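
For reference, the kind of continuation loop described above could look roughly like this. It is only a sketch: `send` is a hypothetical stand-in for whatever actually delivers a prompt to the agent and returns its reply (terminal automation, a hook, an API call), not any real tool's API.

```python
from typing import Callable

CONTINUE_PROMPT = (
    "continue unless you are fully done; "
    "if you are fully done, say DONE as your last word"
)


def is_done(reply: str) -> bool:
    """True when the reply's last word is DONE (trailing punctuation ignored)."""
    words = reply.split()
    return bool(words) and words[-1].strip(".,!?:;\"'") == "DONE"


def run_until_done(
    send: Callable[[str], str], first_prompt: str, max_turns: int = 200
) -> int:
    """Keep nudging the agent until it says DONE or the turn budget runs out.

    `send` is hypothetical: it delivers one prompt and returns the reply text.
    Returns the number of turns used.
    """
    reply = send(first_prompt)
    turns = 1
    while not is_done(reply) and turns < max_turns:
        reply = send(CONTINUE_PROMPT)
        turns += 1
    return turns
```

The `max_turns` cap is there so the loop cannot run forever if the agent never says DONE.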

So I’m asking very practically: what are people doing right now to make agents actually work for long stretches?

The second question is about architecture.

I’m trying to figure out what kinds of architectures are actually good for AI-maintained local applications, especially systems that may eventually reach tens of thousands of lines and coordinate multiple local components/processes.

I thought an event-driven architecture might be good for this. I tried going in that direction with NATS-style communication. But my current impression is that agents are not good at it. Maybe I did something wrong, but it felt like the agent became terrible at reasoning about the system once everything was happening through events.

If the agent has to understand the system by reading event logs, tracing IDs, and reconstructing causality from a stream of messages, that feels like a bad fit. Maybe this is just not agent-friendly, at least not for a solo/vibe-coded local application.

So the deeper question is:

«What architecture makes an AI agent unusually good at maintaining and extending the project?»

Not what architecture is theoretically elegant. Not what architecture is optimal for a senior engineering team. What architecture is actually easiest for the model to reason about, test, debug, and extend?

The rough workflow I want is:

  1. Put the model on extra-high thinking.
  2. Give it a messy pile of project material: old specs, notes, partial repos, failed ideas, design thoughts, todos, architecture sketches, etc.
  3. Make it spend serious effort organizing that into a usable knowledge base.
  4. I review/correct that knowledge base.
  5. Then make it spend serious effort writing the implementation plan.
  6. I review/correct the plan.
  7. Then make it execute for a long stretch in a sandbox without constantly stopping and asking me to say “continue.”

Roughly:

«1 hour knowledge organization
1 hour implementation planning
20 hours execution»

The exact numbers are not the point. The point is depth and continuity.

I do not want the model to spend 5 minutes writing a plan, 10 minutes coding, and then report “done.”

The first problem is messy context.

If I give an LLM a bunch of files, old specs, old ideas, and previous attempts, it often treats everything as if it were all written today and equally valid. But half the material may be obsolete, contradicted, abandoned, experimental, or from a failed attempt.

The model does not magically know the status of each piece of knowledge.

So I feel like there needs to be an explicit intermediate stage: not coding, not planning, but knowledge organization.

Something like:

- current requirement
- old requirement
- obsolete idea
- failed attempt
- unresolved question
- architectural constraint
- implementation detail
- still-useful note
- contradicted by later note
- needs user confirmation

Then I can correct the knowledge map before the model starts planning.
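
As a rough sketch, the status list above could be stored as data so that every claim carries an explicit label. The field names and the choice of which statuses count as safe to plan from are assumptions for illustration, not an existing tool.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    CURRENT_REQUIREMENT = "current requirement"
    OLD_REQUIREMENT = "old requirement"
    OBSOLETE_IDEA = "obsolete idea"
    FAILED_ATTEMPT = "failed attempt"
    UNRESOLVED_QUESTION = "unresolved question"
    ARCHITECTURAL_CONSTRAINT = "architectural constraint"
    IMPLEMENTATION_DETAIL = "implementation detail"
    STILL_USEFUL_NOTE = "still-useful note"
    CONTRADICTED = "contradicted by later note"
    NEEDS_CONFIRMATION = "needs user confirmation"


# Assumption: this grouping is an illustrative guess, not a rule.
PLANNABLE = {
    Status.CURRENT_REQUIREMENT,
    Status.ARCHITECTURAL_CONSTRAINT,
    Status.IMPLEMENTATION_DETAIL,
    Status.STILL_USEFUL_NOTE,
}
NEEDS_USER = {Status.UNRESOLVED_QUESTION, Status.NEEDS_CONFIRMATION}


@dataclass
class KnowledgeItem:
    source: str     # file or note the claim came from
    claim: str      # one-sentence summary of the claim
    status: Status  # label the user can correct during review


def split_for_review(items):
    """Return (items the planner may use, items the user must resolve first)."""
    usable = [i for i in items if i.status in PLANNABLE]
    questions = [i for i in items if i.status in NEEDS_USER]
    return usable, questions
```

Anything in neither group (obsolete, failed, contradicted, old) stays in the map for history but is kept out of the planning context.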

That seems much more useful than dumping 50 files into context and hoping the model “gets it.”

Is anyone using tools/workflows that actually do this well?

The second problem is shallow plan mode.

A lot of current “plan mode” workflows feel shallow. The model asks two or three questions, writes a short plan, and then acts like it has enough alignment.

But that is not what I want.

I want the model to actually spend real effort thinking through the system before writing code.

People always say some version of:

«5 minutes of planning saves an hour of work.»

Fine. Has anyone actually made that real with LLM coding agents?

Because right now a lot of agent planning feels like a formality. It asks a few questions, writes a plan, and then immediately wants to start coding. Or it keeps rewriting the whole plan over and over instead of thinking deeply first and then writing a stable plan.

Maybe the missing workflow is not just “plan mode.” Maybe it is something like:

«plan the planning → organize the knowledge → ask real questions → write the implementation plan → execute until the plan is actually complete»

The third problem is premature reporting.

This is probably my biggest issue.

The model writes an implementation plan. I review the implementation plan. Then it starts implementing. Then it stops halfway and reports back.

Why?

If I already reviewed the implementation plan, why does it need me to keep saying “continue implementing the plan”?

If it has not hit a fundamental blocker, if the plan has not become invalid, and if there is nothing genuinely useful for me to evaluate yet, why is it reporting at all?

A lot of completion reports are basically just the implementation plan rewritten in past tense:

«I added X.
I implemented Y.
I updated Z.»

That is not useful to me.

As a vibe coder, I do not want to inspect a pile of changed files. I do not want a past-tense summary of the plan. I do not want a fake checkpoint that exists only because the agent decided to stop.

What I want is one of these:

  1. A working thing I can actually run.
  2. A clear presentation layer that shows me something tangible.
  3. Exact instructions for how to test it and what to look for.
  4. A genuinely important question that changes the plan.
  5. A real blocker that prevents progress.
  6. Or, if none of those apply, just keep executing.

If the current work is still mostly mocks, scaffolding, internal wiring, or abstract architecture, then there may be nothing useful for me to evaluate yet.

In that case, why stop?

Why not finish the planned implementation first, then let me test and evaluate when there is actually something to evaluate?

Whose time is more precious: mine, or the agent’s?

I am not saying the agent should never stop. It should stop if:

- the plan is fundamentally wrong
- a major architectural decision is needed
- a blocker cannot be resolved
- it has something real and testable to show
- continuing would obviously waste a lot of work

But if it is just stopping because it completed “some steps,” that feels useless.
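
Written as an explicit check, the stop conditions above might look something like this. The `Progress` fields are hypothetical names; how an agent would actually fill them in is left open.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Progress:
    plan_invalid: bool = False            # the plan is fundamentally wrong
    needs_major_decision: bool = False    # a major architectural decision is needed
    unresolved_blocker: bool = False      # a blocker cannot be resolved
    has_testable_result: bool = False     # something real and testable to show
    continuing_wastes_work: bool = False  # continuing would obviously waste work
    steps_completed: int = 0              # deliberately NOT a reason to stop


def reason_to_stop(p: Progress) -> Optional[str]:
    """Return a short reason if the agent should report back, else None."""
    if p.plan_invalid:
        return "the plan is fundamentally wrong"
    if p.needs_major_decision:
        return "a major architectural decision is needed"
    if p.unresolved_blocker:
        return "a blocker cannot be resolved"
    if p.has_testable_result:
        return "something real and testable to show"
    if p.continuing_wastes_work:
        return "continuing would waste a lot of work"
    return None  # completing "some steps" alone is not a reason to report
```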

The fourth problem is making agents actually work for long stretches.

How are people actually spending their token budgets productively?

With some subscriptions and API setups, the amount of possible usage is huge. But in practice, I find it hard to spend it well because the agent keeps stopping, asking for input, or producing reports that do not help.

How do you make an agent execute for one hour, eight hours, or overnight?

Can you actually do this in a useful way right now?

Do you use scripts that automatically send continuation prompts? Do you use hooks? Do you run agents inside some kind of supervisor process? Do you use a specific tool that already solves this? Or is the answer simply that current agents cannot really do this yet without external automation?

I have tried or looked into OpenCode, OpenClaw, Gemini, Claude, Codex, Pi, and a bunch of Kanban-board-style workflows.

My current impression is that OpenCode with Docker sandboxes is one of the more practical setups. Terminal UIs feel more reliable to me than a lot of GUI agent setups, and Docker sandboxes feel like a decent practical compromise, especially on Windows if you do not want to deal with a full WSL workflow. Not saying WSL is bad, and obviously sandbox security is its own topic, but Docker sandboxes feel convenient.

I have not deeply tried the “agents roleplay an organization” style of workflow. Maybe I should before judging it. But from the outside, I worry that a lot of multi-agent setups become corporate roleplay: workers praising each other, moving cards around, doing shallow reviews, and spending my money on simulated middle management.

Is there a recommended setup that actually achieves the goal?

Not roleplay. Not card movement. Not fake review loops.

Actual useful long-running work.

The fifth problem is language/framework choice.

For AI-heavy coding, I’m starting to think one of the most important constraints is:

«Is the model actually good at working with this language, framework, and project structure?»

For normal engineering, you might pick something because it is technically optimal, elegant, fast, scalable, or theoretically clean.

But if the main implementer/maintainer is an LLM, model proficiency becomes a first-class constraint.

A boring, widely represented stack may beat a technically superior stack if the model is much better at writing, debugging, testing, and extending it.

This seems especially important for vibe coders. If the agent is eventually supposed to handle tens of thousands of lines, I care less about what is theoretically elegant and more about what the model can reliably modify without causing cascading breakage.

Are there good benchmarks or practical community knowledge on which languages/frameworks current models handle best?

The sixth problem is architecture.

I’m trying to figure out what kinds of architectures are actually good for AI-maintained local applications, especially systems that may eventually reach tens of thousands of lines and coordinate multiple local components/processes.

At first, it is tempting to optimize for extensibility:

- make everything swappable
- make everything modular
- make it easy to add new components
- make components communicate through clean boundaries

But I’m starting to think extensibility matters less than maintainability at the beginning.

The first priority is making the thing actually possible to reason about, test, repair, and expand without every change breaking ten other things.

So maybe the default should be:

- clear component boundaries
- explicit interfaces
- boring communication patterns
- deterministic tests where possible
- mocks at boundaries
- real pressure points represented in tests
- replace one mocked component at a time with a real component
- every component can be tested in isolation

Basically: make the architecture agent-legible before making it powerful.
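
A small sketch of what "explicit interfaces" and "mocks at boundaries" could mean in practice. The `Transcriber` and `CommandRouter` components are hypothetical examples chosen only for illustration; they are not part of any real project.

```python
from typing import Protocol


class Transcriber(Protocol):
    """Explicit interface at a component boundary (hypothetical component)."""

    def transcribe(self, audio: bytes) -> str: ...


class MockTranscriber:
    """Deterministic stand-in used until the real component is swapped in."""

    def __init__(self, canned: str) -> None:
        self.canned = canned

    def transcribe(self, audio: bytes) -> str:
        return self.canned


class CommandRouter:
    """Component under test; it depends only on the interface, not an implementation."""

    def __init__(self, transcriber: Transcriber) -> None:
        self.transcriber = transcriber

    def handle(self, audio: bytes) -> str:
        text = self.transcriber.transcribe(audio).strip().lower()
        if text.startswith("open "):
            return "OPEN:" + text[len("open "):]
        return "UNKNOWN"
```

Because `CommandRouter` only sees the interface, it can be tested in isolation with the mock, and the mock can later be replaced by a real transcriber without touching the router or its tests.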

A folder structure template is not enough. I’m more interested in reusable architecture templates where the component communication, boundaries, testing strategy, and failure modes are already thought through.

Do repos like this exist?

Not just:

«here is a folder layout»

but more like:

«here is a healthy skeleton for building a local multi-component application that an agent can keep extending without turning it into spaghetti»

The seventh problem is orchestration.

Do Kanban boards, orchestrator/worker setups, and multi-agent systems actually help with this?

A static task board seems limited because after task 3 is done, task 8 may no longer make sense. Someone has to re-evaluate the plan. The agent needs to manage its own work, not just move tasks from “todo” to “done.”
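
One way to picture the difference from a static board is a loop that re-evaluates the remaining tasks after every completed one. This is only a sketch: `run` and `replan` are hypothetical hooks, and in a real setup `replan` would presumably be another model call.

```python
from typing import Callable, List


def execute_with_replanning(
    tasks: List[str],
    run: Callable[[str], str],
    replan: Callable[[List[str], List[str]], List[str]],
    max_steps: int = 1000,
) -> List[str]:
    """Run tasks in order, letting `replan` revise the remaining list each time.

    `run` executes one task and returns a result note.
    `replan` receives (results so far, remaining tasks) and returns the new
    remaining list, so it can drop, reorder, or add tasks.
    `max_steps` guards against a replanner that keeps adding work forever.
    """
    done: List[str] = []
    remaining = list(tasks)
    while remaining and len(done) < max_steps:
        task = remaining.pop(0)
        done.append(run(task))
        remaining = replan(done, remaining)
    return done
```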

Maybe persistent sub-agents/workers would help. For example:

- one worker owns tests
- one worker owns architecture
- one worker owns a subsystem
- one worker owns documentation/knowledge state

But that can also become useless roleplay if it is not grounded in real artifacts.

Has anyone found a multi-agent workflow that actually works for this kind of long execution?

The eighth problem is whether my preferred approach is even optimal.

Maybe this workflow:

«organize sources → plan deeply → execute for a long stretch»

is worse than:

«run multiple worktrees/agents in parallel with different constraints → compare implementations → keep the best ideas»

That might be a better way to spend a large token budget.

But it also creates another problem: now I have to review multiple implementations, fix multiple broken versions enough to compare them, and give slightly different instructions to each branch.

Has anyone compared these approaches in practice?

  1. One deep workflow that spends a lot of effort organizing knowledge, planning, and then executing for a long stretch.
  2. Multiple parallel worktrees/agents generating competing implementations that you compare afterward.

Which one actually works better for non-trivial projects?

My questions:

  1. How do you make coding agents keep working for 8–20 hours without constantly telling them to continue?
  2. Are there tools/workflows that first organize a messy project knowledge base before planning?
  3. Are there serious AI planning workflows that go deeper than current shallow “plan mode”?
  4. How do you stop agents from reporting halfway through the plan unless there is something actually worth showing?
  5. What languages/frameworks are currently most agent-friendly in practice?
  6. What architectures are actually good for AI-maintained local applications with many flows/components?
  7. Are event-driven/message-based architectures just a bad fit for AI-maintained projects, or am I using them wrong?
  8. Are there reusable architecture templates that define healthy component communication, not just folder structure?
  9. Is it better to run one deep workflow, or multiple parallel worktrees/agents and compare outputs?
  10. What does your actual overnight or long-running AI coding workflow look like?

I am not asking for hype, future predictions, or emotional takes.

I’m asking this in the most practical way possible.

Maybe my framing is wrong. Maybe the real bottleneck is somewhere else. If so, criticize the premise.

I mostly want to know what people are actually doing right now that works.

Sorry for AI-generating this, but I made sure to review it a bunch of times.

reddit.com
u/dupa1234s — 11 hours ago

reddit.com
u/dupa1234s — 11 hours ago

How do you make agents run for hours, and what architectures are actually agent-friendly?#deep-dive #vibe-coder-issues

This is mostly aimed at vibe coders who are unable to or don't want to guide agent every 10 minutes.

My two biggest questions are:

  1. How do you actually make a coding agent keep working for at least 1 hour, ideally 8–20 hours without constantly telling it to continue?
  2. What language/framework/architecture is actually agent-friendly for a local app that integrates many existing technologies and has a lot of real-time-ish flows?

The first question is the immediate practical one.

How on earth do people make these agents keep running?

Unless I write some script that watches the terminal and keeps sending:

«continue unless you are fully done; if you are fully done, say DONE as your last word»

or unless I build some server hook / automation loop around the agent, it just keeps stopping. It finishes when I do not want it to finish. It reports halfway through the plan. It asks for input when there is nothing useful for me to evaluate yet.

So I’m asking very practically: what are people doing right now to make agents actually work for long stretches?

The second question is about architecture.

I’m trying to figure out what kinds of architectures are actually good for AI-maintained local applications, especially systems that may eventually reach tens of thousands of lines and coordinate multiple local components/processes.

I thought an event-driven architecture might be good for this. I tried going in that direction with NATS-style communication. But my current impression is that agents are not good at it. Maybe I did something wrong, but it felt like the agent became terrible at reasoning about the system once everything was happening through events.

If the agent has to understand the system by reading event logs, tracing IDs, and reconstructing causality from a stream of messages, that feels like a bad fit. Maybe this is just not agent-friendly, at least not for a solo/vibe-coded local application.

So the deeper question is:

«What architecture makes an AI agent unusually good at maintaining and extending the project?»

Not what architecture is theoretically elegant. Not what architecture is optimal for a senior engineering team. What architecture is actually easiest for the model to reason about, test, debug, and extend?

The rough workflow I want is:

  1. Put the model on extra-high thinking.
  2. Give it a messy pile of project material: old specs, notes, partial repos, failed ideas, design thoughts, todos, architecture sketches, etc.
  3. Make it spend serious effort organizing that into a usable knowledge base.
  4. I review/correct that knowledge base.
  5. Then make it spend serious effort writing the implementation plan.
  6. I review/correct the plan.
  7. Then make it execute for a long stretch in a sandbox without constantly stopping and asking me to say “continue.”

Roughly:

«1 hour knowledge organization
1 hour implementation planning
20 hours execution»

The exact numbers are not the point. The point is depth and continuity.

I do not want the model to spend 5 minutes writing a plan, 10 minutes coding, and then report “done.”

The first problem is messy context.

If I give an LLM a bunch of files, old specs, old ideas, and previous attempts, it often treats everything as if it was written today and is equally valid. But half the material may be obsolete, contradicted, abandoned, experimental, or from a failed attempt.

The model does not magically know the status of each piece of knowledge.

So I feel like there needs to be an explicit intermediate stage: not coding, not planning, but knowledge organization.

Something like:

- current requirement
- old requirement
- obsolete idea
- failed attempt
- unresolved question
- architectural constraint
- implementation detail
- still-useful note
- contradicted by later note
- needs user confirmation

Then I can correct the knowledge map before the model starts planning.
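A minimal sketch of what such a knowledge map could look like as data, assuming each piece of material is reduced to one claim with one status. The `KnowledgeItem` type, the `PLANNABLE` and `ASK_USER` groupings, and the sample file names are illustrative assumptions, not an existing tool's format.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    CURRENT_REQUIREMENT = "current requirement"
    OLD_REQUIREMENT = "old requirement"
    OBSOLETE_IDEA = "obsolete idea"
    FAILED_ATTEMPT = "failed attempt"
    UNRESOLVED_QUESTION = "unresolved question"
    ARCHITECTURAL_CONSTRAINT = "architectural constraint"
    IMPLEMENTATION_DETAIL = "implementation detail"
    STILL_USEFUL_NOTE = "still-useful note"
    CONTRADICTED = "contradicted by later note"
    NEEDS_CONFIRMATION = "needs user confirmation"


# Assumption: statuses the planner may treat as valid input.
PLANNABLE = {
    Status.CURRENT_REQUIREMENT,
    Status.ARCHITECTURAL_CONSTRAINT,
    Status.IMPLEMENTATION_DETAIL,
    Status.STILL_USEFUL_NOTE,
}
# Assumption: statuses that go back to the user before planning starts.
ASK_USER = {Status.UNRESOLVED_QUESTION, Status.NEEDS_CONFIRMATION}


@dataclass
class KnowledgeItem:
    source: str     # file or note the claim came from
    claim: str      # one-sentence summary of the claim
    status: Status


def split_for_planning(items: list[KnowledgeItem]):
    """Return (items the planner may use, items the user must review first)."""
    usable = [i for i in items if i.status in PLANNABLE]
    review = [i for i in items if i.status in ASK_USER]
    return usable, review


items = [
    KnowledgeItem("spec_v1.md", "Sync over HTTP polling", Status.OLD_REQUIREMENT),
    KnowledgeItem("notes.txt", "All components run locally", Status.ARCHITECTURAL_CONSTRAINT),
    KnowledgeItem("todo.md", "Should settings be per-user?", Status.NEEDS_CONFIRMATION),
]
usable, review = split_for_planning(items)
print([i.claim for i in usable])
print([i.claim for i in review])
```

The point of the explicit status is that obsolete and failed material stays on record but never reaches the planner as if it were a current requirement.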

That seems much more useful than dumping 50 files into context and hoping the model “gets it.”

Is anyone using tools/workflows that actually do this well?

The second problem is shallow plan mode.

A lot of current “plan mode” workflows feel shallow. The model asks two or three questions, writes a short plan, and then acts like it has enough alignment.

But that is not what I want.

I want the model to actually spend real effort thinking through the system before writing code.

People always say some version of:

«5 minutes of planning saves an hour of work.»

Fine. Has anyone actually made that real with LLM coding agents?

Because right now a lot of agent planning feels like a formality. It asks a few questions, writes a plan, and then immediately wants to start coding. Or it keeps rewriting the whole plan over and over instead of thinking deeply first and then writing a stable plan.

Maybe the missing workflow is not just “plan mode.” Maybe it is something like:

«plan the planning → organize the knowledge → ask real questions → write the implementation plan → execute until the plan is actually complete»

The third problem is premature reporting.

This is probably my biggest issue.

The model writes an implementation plan. I review the implementation plan. Then it starts implementing. Then it stops halfway and reports back.

Why?

If I already reviewed the implementation plan, why does it need me to keep saying “continue implementing the plan”?

If it has not hit a fundamental blocker, if the plan has not become invalid, and if there is nothing genuinely useful for me to evaluate yet, why is it reporting at all?

A lot of completion reports are basically just the implementation plan rewritten in past tense:

«I added X.
I implemented Y.
I updated Z.»

That is not useful to me.

As a vibe coder, I do not want to inspect a pile of changed files. I do not want a past-tense summary of the plan. I do not want a fake checkpoint that exists only because the agent decided to stop.

What I want is one of these:

  1. A working thing I can actually run.
  2. A clear presentation layer that shows me something tangible.
  3. Exact instructions for how to test it and what to look for.
  4. A genuinely important question that changes the plan.
  5. A real blocker that prevents progress.
  6. Or, if none of those apply, just keep executing.

If the current work is still mostly mocks, scaffolding, internal wiring, or abstract architecture, then there may be nothing useful for me to evaluate yet.

In that case, why stop?

Why not finish the planned implementation first, then let me test and evaluate when there is actually something to evaluate?

Whose time is more precious: mine, or the agent’s?

I am not saying the agent should never stop. It should stop if:

- the plan is fundamentally wrong
- a major architectural decision is needed
- a blocker cannot be resolved
- it has something real and testable to show
- continuing would obviously waste a lot of work

But if it is just stopping because it completed “some steps,” that feels useless.
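The stop rule above can be written down as an explicit policy rather than left to the agent's mood. This is only a sketch: the `StopReason` names and the idea that the agent reports a reason plus step counts are assumptions, not a feature of any particular tool.

```python
from enum import Enum
from typing import Optional


class StopReason(Enum):
    PLAN_INVALID = "the plan is fundamentally wrong"
    DECISION_NEEDED = "a major architectural decision is needed"
    BLOCKED = "a blocker cannot be resolved"
    DEMO_READY = "something real and testable to show"
    WASTEFUL = "continuing would obviously waste a lot of work"


def should_interrupt_user(
    reason: Optional[StopReason], steps_done: int, steps_total: int
) -> bool:
    """Interrupt only for a listed reason or when the whole plan is complete."""
    if reason is not None:
        return True
    return steps_done >= steps_total


# Finishing "some steps" without a listed reason is not grounds for a report.
print(should_interrupt_user(None, 3, 10))
print(should_interrupt_user(StopReason.BLOCKED, 3, 10))
print(should_interrupt_user(None, 10, 10))
```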

The fourth problem is making agents actually work for long stretches.

How are people actually spending their token budgets productively?

With some subscriptions and API setups, the amount of possible usage is huge. But in practice, I find it hard to spend it well because the agent keeps stopping, asking for input, or producing reports that do not help.

How do you make an agent execute for one hour, eight hours, or overnight?

Can you actually do this in a useful way right now?

Do you use scripts that automatically send continuation prompts? Do you use hooks? Do you run agents inside some kind of supervisor process? Do you use a specific tool that already solves this? Or is the answer simply that current agents cannot really do this yet without external automation?
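For reference, the "script that sends continuation prompts" option is roughly the loop below. It is a minimal sketch under stated assumptions: `send` is a hypothetical callable that delivers one prompt to whatever agent is in use and returns its reply as text, and the turn and repeat limits are arbitrary defaults. It does not use any real agent CLI or API.

```python
from typing import Callable, Optional

CONTINUE_PROMPT = (
    "continue unless you are fully done; "
    "if you are fully done, say DONE as your last word"
)


def run_until_done(
    send: Callable[[str], str],   # hypothetical: prompt in, agent reply out
    first_prompt: str,
    max_turns: int = 200,
    max_repeats: int = 3,
) -> tuple[str, int]:
    """Keep prompting until the reply ends with DONE or a safety limit trips.

    Returns (outcome, turns_used); outcome is "done", "stuck", or "limit".
    """
    prompt = first_prompt
    last_reply: Optional[str] = None
    repeats = 0
    for turn in range(1, max_turns + 1):
        reply = send(prompt).strip()
        words = reply.split()
        if words and words[-1].strip(".!\"'") == "DONE":
            return "done", turn
        # Identical replies in a row suggest a loop; stop instead of burning tokens.
        repeats = repeats + 1 if reply == last_reply else 0
        if repeats >= max_repeats:
            return "stuck", turn
        last_reply = reply
        prompt = CONTINUE_PROMPT
    return "limit", max_turns


def fake_agent() -> Callable[[str], str]:
    # Scripted stand-in for a real agent, used only to exercise the loop.
    replies = iter(["step 1 finished", "step 2 finished", "all tests pass. DONE"])
    return lambda prompt: next(replies)


print(run_until_done(fake_agent(), "implement the plan"))
print(run_until_done(lambda prompt: "done", "implement the plan"))
```

The repeat check matters as much as the DONE check: without it, an agent that gets stuck repeating itself will happily run until the turn limit.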

I have tried or looked into OpenCode, OpenClaw, Gemini, Claude, Codex, Pi, and a bunch of Kanban-board-style workflows.

My current impression is that OpenCode with Docker sandboxes is one of the more practical setups. Terminal UIs feel more reliable to me than a lot of GUI agent setups, and Docker sandboxes feel like a decent practical compromise, especially on Windows if you do not want to deal with a full WSL workflow. Not saying WSL is bad, and obviously sandbox security is its own topic, but Docker sandboxes feel convenient.

I have not deeply tried the “agents roleplay an organization” style of workflow. Maybe I should before judging it. But from the outside, I worry that a lot of multi-agent setups become corporate roleplay: workers praising each other, moving cards around, doing shallow reviews, and spending my money on simulated middle management.

Is there a recommended setup that actually achieves the goal?

Not roleplay. Not card movement. Not fake review loops.

Actual useful long-running work.

The fifth problem is language/framework choice.

For AI-heavy coding, I’m starting to think one of the most important constraints is:

«Is the model actually good at working with this language, framework, and project structure?»

For normal engineering, you might pick something because it is technically optimal, elegant, fast, scalable, or theoretically clean.

But if the main implementer/maintainer is an LLM, model proficiency becomes a first-class constraint.

A boring, widely represented stack may beat a technically superior stack if the model is much better at writing, debugging, testing, and extending it.

This seems especially important for vibe coders. If the agent is eventually supposed to handle tens of thousands of lines, I care less about what is theoretically elegant and more about what the model can reliably modify without causing cascading breakage.

Are there good benchmarks or practical community knowledge on which languages/frameworks current models handle best?

The sixth problem is architecture.

I’m trying to figure out what kinds of architectures are actually good for AI-maintained local applications, especially systems that may eventually reach tens of thousands of lines and coordinate multiple local components/processes.

At first, it is tempting to optimize for extensibility:

- make everything swappable
- make everything modular
- make it easy to add new components
- make components communicate through clean boundaries

But I’m starting to think extensibility matters less than maintainability at the beginning.

The first priority is making the thing actually possible to reason about, test, repair, and expand without every change breaking ten other things.

So maybe the default should be:

- clear component boundaries
- explicit interfaces
- boring communication patterns
- deterministic tests where possible
- mocks at boundaries
- real pressure points represented in tests
- replace one mocked component at a time with a real component
- every component can be tested in isolation

Basically: make the architecture agent-legible before making it powerful.
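A small sketch of what "explicit interfaces, mocks at boundaries, every component testable in isolation" can look like in practice. The `Transcriber` and `NoteTaker` components are invented purely for illustration; the pattern is the point, not the domain.

```python
from typing import Protocol


class Transcriber(Protocol):
    """Explicit boundary: anything that turns audio bytes into text."""

    def transcribe(self, audio: bytes) -> str: ...


class MockTranscriber:
    """Deterministic stand-in used until the real component is wired in."""

    def __init__(self, canned: str) -> None:
        self.canned = canned
        self.calls = 0

    def transcribe(self, audio: bytes) -> str:
        self.calls += 1
        return self.canned


class NoteTaker:
    """Component under test; it depends only on the Transcriber interface."""

    def __init__(self, transcriber: Transcriber) -> None:
        self.transcriber = transcriber
        self.notes: list[str] = []

    def add_recording(self, audio: bytes) -> str:
        if not audio:
            raise ValueError("empty recording")   # failure mode pinned by the test
        text = self.transcriber.transcribe(audio).strip()
        self.notes.append(text)
        return text


def test_note_taker_in_isolation() -> None:
    mock = MockTranscriber("  buy milk ")
    taker = NoteTaker(mock)
    assert taker.add_recording(b"\x00\x01") == "buy milk"
    assert taker.notes == ["buy milk"] and mock.calls == 1
    try:
        taker.add_recording(b"")
    except ValueError:
        pass
    else:
        raise AssertionError("empty audio should be rejected")


test_note_taker_in_isolation()
```

Swapping `MockTranscriber` for a real implementation later changes one constructor argument, and the isolated test keeps passing regardless of what the real component does.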

A folder structure template is not enough. I’m more interested in reusable architecture templates where the component communication, boundaries, testing strategy, and failure modes are already thought through.

Do repos like this exist?

Not just:

«here is a folder layout»

but more like:

«here is a healthy skeleton for building a local multi-component application that an agent can keep extending without turning it into spaghetti»

The seventh problem is orchestration.

Do Kanban boards, orchestrator/worker setups, and multi-agent systems actually help with this?

A static task board seems limited because after task 3 is done, task 8 may no longer make sense. Someone has to re-evaluate the plan. The agent needs to manage its own work, not just move tasks from “todo” to “done.”
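The difference between a static board and a self-managed plan can be sketched as a loop that re-evaluates the remaining tasks after every completed one. Everything here is hypothetical: `run_task` and `replan` stand in for real agent calls, and the task names are made up.

```python
from typing import Callable

Replanner = Callable[[list[str], list[str], str], list[str]]


def execute_with_replanning(
    plan: list[str],
    run_task: Callable[[str], str],
    replan: Replanner,
    max_steps: int = 100,
) -> list[str]:
    """Run tasks in order, letting `replan` rewrite the remaining list after each one."""
    done: list[str] = []
    remaining = list(plan)
    while remaining and len(done) < max_steps:
        task = remaining.pop(0)
        result = run_task(task)
        done.append(task)
        remaining = replan(done, remaining, result)
    return done


def run_task(task: str) -> str:
    # Stand-in for real agent work; only one task yields a plan-changing result.
    return "chose sqlite" if task == "choose storage backend" else "ok"


def replan(done: list[str], remaining: list[str], result: str) -> list[str]:
    # A later task that no longer makes sense gets replaced instead of executed.
    if result == "chose sqlite":
        return ["write sqlite adapter" if t == "write redis adapter" else t
                for t in remaining]
    return remaining


plan = ["define storage interface", "choose storage backend",
        "write redis adapter", "write storage tests"]
print(execute_with_replanning(plan, run_task, replan))
```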

Maybe persistent sub-agents/workers would help. For example:

- one worker owns tests
- one worker owns architecture
- one worker owns a subsystem
- one worker owns documentation/knowledge state

But that can also become useless roleplay if it is not grounded in real artifacts.

Has anyone found a multi-agent workflow that actually works for this kind of long execution?

The eighth problem is whether my preferred approach is even optimal.

Maybe this workflow:

«organize sources → plan deeply → execute for a long stretch»

is worse than:

«run multiple worktrees/agents in parallel with different constraints → compare implementations → keep the best ideas»

That might be a better way to spend a large token budget.

But it also creates another problem: now I have to review multiple implementations, fix multiple broken versions enough to compare them, and give slightly different instructions to each branch.
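For the parallel option, the setup cost is mostly one `git worktree add` per variant. The sketch below only builds the commands and pairs each with its instructions; it executes nothing. The `attempt/<name>` branch naming, the sibling-directory layout, and the variant descriptions are assumptions chosen for illustration.

```python
from pathlib import Path


def worktree_plan(
    repo: Path, base: str, variants: dict[str, str]
) -> list[tuple[list[str], str]]:
    """Build one `git worktree add` command per variant, paired with its instructions.

    Nothing is executed here; pass each command to subprocess.run when ready.
    """
    plan = []
    for name, instructions in sorted(variants.items()):
        path = repo.parent / f"{repo.name}-{name}"   # sibling directory per attempt
        cmd = ["git", "-C", str(repo), "worktree", "add",
               "-b", f"attempt/{name}", str(path), base]
        plan.append((cmd, instructions))
    return plan


variants = {
    "plain-calls": "Use direct function calls between components.",
    "message-bus": "Route all component communication through a message bus.",
}
for cmd, instructions in worktree_plan(Path("/work/app"), "main", variants):
    print(" ".join(cmd), "#", instructions)
```

This covers only the mechanical part. The review cost described above, comparing and repairing several implementations, is not something a script removes.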

Has anyone compared these approaches in practice?

  1. One deep workflow that spends a lot of effort organizing knowledge, planning, and then executing for a long stretch.
  2. Multiple parallel worktrees/agents generating competing implementations that you compare afterward.

Which one actually works better for non-trivial projects?

My questions:

  1. How do you make coding agents keep working for 8–20 hours without constantly telling them to continue?
  2. Are there tools/workflows that first organize a messy project knowledge base before planning?
  3. Are there serious AI planning workflows that go deeper than current shallow “plan mode”?
  4. How do you stop agents from reporting halfway through the plan unless there is something actually worth showing?
  5. What languages/frameworks are currently most agent-friendly in practice?
  6. What architectures are actually good for AI-maintained local applications with many flows/components?
  7. Are event-driven/message-based architectures just a bad fit for AI-maintained projects, or am I using them wrong?
  8. Are there reusable architecture templates that define healthy component communication, not just folder structure?
  9. Is it better to run one deep workflow, or multiple parallel worktrees/agents and compare outputs?
  10. What does your actual overnight or long-running AI coding workflow look like?

I am not asking for hype, future predictions, or emotional takes.

I’m asking this in the most practical way possible.

Maybe my framing is wrong. Maybe the real bottleneck is somewhere else. If so, criticize the premise.

I mostly want to know what people are actually doing right now that works.

Sorry for AI-generating this, but I made sure to review it a bunch of times.

reddit.com
u/dupa1234s — 12 hours ago

Need help: Goal: TUI + server. I tried Codex CLI, Gemini CLI, Claude Code, OpenCode, Pi, and OpenClaw, but none are reliable.

I’m looking for something like what Codex App Server is trying to do.

For example:

codex app-server --listen ws://127.0.0.1:17345

codex --remote ws://127.0.0.1:17345

The thing I want is not just “an agent in a terminal” and not just “an API.”

I want both at the same time:

  1. a real TUI from the tool/provider

  2. a server I can talk to programmatically

The reason this matters is that the TUI already handles a lot of things reasonably well. I don’t want to rebuild the whole client myself just to make a custom UI or some extra automation around it.

What I want is to keep the provider/tool’s TUI for the stuff it already does, while also being able to talk to the same backend/server from my own code. For example, send calls to sessions, control or inspect sessions, build my own UI around it, or automate parts of the workflow.
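As a rough picture of the model being asked for, here is an in-memory sketch where two clients are attached to the same session: one standing in for the TUI, one for custom code. `SessionHub` and its methods are entirely hypothetical and do not describe the Codex App Server protocol or any other tool's API. A real version would sit behind a WebSocket or similar transport.

```python
from collections import defaultdict
from typing import Callable

Listener = Callable[[str, str], None]   # (session_id, message) -> None


class SessionHub:
    """Hypothetical server-side core: one session, many attached clients."""

    def __init__(self) -> None:
        self._listeners: dict[str, list[Listener]] = defaultdict(list)
        self._history: dict[str, list[str]] = defaultdict(list)

    def attach(self, session_id: str, listener: Listener) -> None:
        self._listeners[session_id].append(listener)

    def send(self, session_id: str, message: str) -> None:
        # Record the message, then fan it out to every client on this session.
        self._history[session_id].append(message)
        for listener in self._listeners[session_id]:
            listener(session_id, message)

    def inspect(self, session_id: str) -> list[str]:
        return list(self._history[session_id])


hub = SessionHub()
tui_screen: list[str] = []   # stands in for the provider's TUI
script_log: list[str] = []   # stands in for custom automation code
hub.attach("s1", lambda sid, msg: tui_screen.append(msg))
hub.attach("s1", lambda sid, msg: script_log.append(msg))

hub.send("s1", "run the test suite")   # sent programmatically, visible to both
print(tui_screen, script_log, hub.inspect("s1"))
```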

A nice side effect is that when the TUI and my own code are connected to the same session, changes show up immediately in the TUI too. That is not the main requirement, but it is a useful part of the model.

I tried a bunch of tools and I keep running into blockers:

- OpenCode: compaction is broken for me. After compaction it can get stuck looping forever. I’ve seen it spam “done” in the chat for hours if left running, burning through tokens.

- Codex: compaction also seems broken on my setup. I keep getting errors, and then I’m forced to start new sessions. That kills the workflow.

- OpenClaw: too much overhead. It can take around a minute just to respond to something basic like “hi.”

- Gemini CLI and Claude Code: as far as I know, they don’t expose this kind of server. So I’d have to build one myself, unless there is already some reliable open-source server layer they connect to.

- Pi / other tools: I still haven’t found something that gives me this TUI + server setup in a way that feels reliable.

The specific bugs above are not really the whole point. The point is that each option I’ve tried fails on the thing I actually need: a reliable terminal UI plus a server interface I can build around.

Ideally I’d prefer Codex, or one tool that can combine multiple providers. Support for Codex/OpenAI, Gemini, and Claude would be a big priority. OAuth support matters too; I’d much rather use OAuth than API keys.

Does anything currently do this reliably?

reddit.com
u/dupa1234s — 8 days ago