I’ve been thinking about a failure mode that keeps showing up in agent systems.

Agents can already take actions:

1. write code
2. send emails
3. run commands

Most safeguards today seem to live in:

1. prompts (“don’t do X”), or
2. tool restrictions (limit what can be called).

But once an agent decides to act and reaches a capable tool, there often isn’t a clear “final checkpoint” before execution.

That raises a question: where should control actually live?

Some approaches focus on making the model more reliable (prompting, alignment, etc.). Others rely on system design (workflows, permissions, structured environments).

But I’m wondering if there’s a missing layer: something that explicitly decides whether an action is allowed to execute — after it’s proposed, but before it runs.

For example: instead of directly executing, an agent emits an “action proposal”, and another component decides whether it can proceed.

Curious how people here are handling this in practice:

1. Do you rely mostly on tool-level restrictions?
2. Do you have an approval step for high-risk actions?
3. Or is this overengineering?
Happy to share what I’ve been experimenting with if it’s useful.
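The proposal/gate split described in the post, written out as an added example (this is not code from the post or from any particular framework). The agent never calls a tool directly: it emits an `ActionProposal`, a policy function returns a verdict (`ALLOW`, `DENY`, or `REQUIRE_APPROVAL`), and an `ExecutionGate` is the only component holding references to the real tools. The tool names, the workspace prefix, the internal mail domain and the command deny-list are constructed values for illustration, not a recommended policy; a real deployment would load its own from configuration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Callable
import shlex


class Decision(Enum):
    ALLOW = "allow"
    DENY = "deny"
    REQUIRE_APPROVAL = "require_approval"


@dataclass(frozen=True)
class ActionProposal:
    """What the agent emits instead of calling a tool directly."""
    tool: str
    args: dict[str, Any]
    rationale: str = ""


@dataclass
class Verdict:
    decision: Decision
    reason: str


# Constructed policy values for this example only (assumptions, not a real policy).
ALLOWED_TOOLS = {"write_file", "send_email", "run_command"}
WORKSPACE_PREFIX = "/workspace/"
INTERNAL_EMAIL_DOMAIN = "example.com"
DESTRUCTIVE_COMMANDS = {"rm", "dd", "mkfs", "shutdown", "curl", "wget"}
SHELL_METACHARS = ";|&`$><"


def evaluate(p: ActionProposal) -> Verdict:
    """Policy check run after the action is proposed and before it executes."""
    if p.tool not in ALLOWED_TOOLS:
        return Verdict(Decision.DENY, f"unknown tool {p.tool!r}")

    if p.tool == "write_file":
        path = str(p.args.get("path", ""))
        # Reject writes outside the workspace, including '..' traversal segments.
        if not path.startswith(WORKSPACE_PREFIX) or ".." in path.split("/"):
            return Verdict(Decision.DENY, f"path {path!r} is outside the workspace")
        return Verdict(Decision.ALLOW, "write inside workspace")

    if p.tool == "send_email":
        to = str(p.args.get("to", ""))
        domain = to.rpartition("@")[2].lower()
        if domain != INTERNAL_EMAIL_DOMAIN:
            return Verdict(Decision.REQUIRE_APPROVAL, f"external recipient {to!r}")
        return Verdict(Decision.ALLOW, "internal recipient")

    # run_command: compound commands and known-destructive binaries need a human.
    cmd = str(p.args.get("command", ""))
    if any(c in cmd for c in SHELL_METACHARS):
        return Verdict(Decision.REQUIRE_APPROVAL, "compound or redirected command")
    try:
        argv = shlex.split(cmd)
    except ValueError:
        return Verdict(Decision.DENY, "unparseable command")
    if not argv:
        return Verdict(Decision.DENY, "empty command")
    hit = [t for t in argv if t in DESTRUCTIVE_COMMANDS]
    if hit:
        return Verdict(Decision.REQUIRE_APPROVAL, f"destructive command {hit[0]!r}")
    return Verdict(Decision.ALLOW, "command looks benign")


class ExecutionGate:
    """The only object that holds the real tool callables; the agent only sees submit()."""

    def __init__(self, tools: dict[str, Callable[..., Any]],
                 approver: Callable[[ActionProposal, Verdict], bool]):
        self.tools = tools
        self.approver = approver  # human or second system consulted for REQUIRE_APPROVAL
        self.audit_log: list[tuple[ActionProposal, Verdict, str]] = []

    def submit(self, p: ActionProposal) -> tuple[str, Any]:
        verdict = evaluate(p)
        outcome = verdict.decision.value
        result = None
        if verdict.decision is Decision.REQUIRE_APPROVAL:
            outcome = "approved" if self.approver(p, verdict) else "rejected_by_approver"
        if outcome in ("allow", "approved"):
            result = self.tools[p.tool](**p.args)
            outcome += ":executed"
        self.audit_log.append((p, verdict, outcome))  # every proposal is recorded
        return outcome, result


def demo_tools() -> dict[str, Callable[..., Any]]:
    """Stand-in tools that only describe what they would do (constructed for the example)."""
    return {
        "write_file": lambda path, content: f"wrote {len(content)} bytes to {path}",
        "send_email": lambda to, body: f"sent mail to {to}",
        "run_command": lambda command: f"ran {command!r}",
    }


if __name__ == "__main__":
    gate = ExecutionGate(demo_tools(), approver=lambda p, v: False)
    for prop in (
        ActionProposal("write_file", {"path": "/workspace/app.py", "content": "print(1)"}),
        ActionProposal("write_file", {"path": "/workspace/../etc/passwd", "content": "x"}),
        ActionProposal("send_email", {"to": "someone@outside.org", "body": "hi"}),
        ActionProposal("run_command", {"command": "rm -rf /tmp/build"}),
    ):
        print(prop.tool, gate.submit(prop))
```

The point of the structure is that the policy and the tool registry are separate from the model: a prompt-level instruction can be talked out of, but a proposal that `evaluate` denies never reaches a callable, and every proposal lands in `audit_log` whether or not it ran, which is what you want for later detection and review.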