After 18 months of building, we're open-sourcing our entire production AI agent stack. Here's what's actually in it. If anyone wants to see how it works, happy to share a demo.
Hey everyone 👋
18 months ago we started building internal tooling because nothing in the market covered what we actually needed: a full production loop for AI agents, not just one piece of it.
Tracking without evaluating means that something is wrong. If you don't simulate the evaluation, you'll only find out when you release. If you don't have a feedback process, optimization is just changing prompts and hope that it works. Guardrails put on after the event miss the most important failures.
So we built the full loop. And in a few days, all of it goes open source.
Self host it. Extend it. Ship AI that improves itself.
What's actually shipping:
traceAI: OpenTelemetry-native tracing for 22+ Python and 8+ TypeScript frameworks. Your traces, your backend, no lock-in.
ai-evaluation: 70+ metrics: hallucination, factual accuracy, relevance, safety, compliance. Every scoring function is in the repo. Read it, modify it, run it in CI/CD.
simulate-sdk: Synthetic test conversations at scale for voice and chat agents. Your agent works on 10 test cases. simulate-sdk throws 500 adversarial ones at it before users do.
agent-opt: Feeds failed eval cases into a prompt optimization loop and re-evaluates the output against those exact failures. Closes the gap between "we found a problem" and "we fixed it."
Protect: Real-time input and output guardrails across content moderation, bias detection, prompt injection, and PII compliance. Text, image, and audio.
futureagi-sdk: One interface that connects all of the above.
Not a community edition. Same code running behind the platform.
Three questions for the devs here, we would like to know:
- When your AI agent fails in production, how long does it take you to find which step caused it, the retrieval, the prompt, the tool call, or the model output?
- Have you ever shipped a prompt change that improved one metric but quietly broke something else downstream, and only caught it after users hit it?
- If you self-host your eval pipeline inside your own VPC, what's the biggest operational issue: maintaining the infra, keeping metrics updated, or getting the rest of the team to actually run evals before deploying?
DM if you want early access or want to see a specific part of the stack in action before the public release.