u/Future_AGI

▲ 4 r/dev

After 18 months of building, we're open-sourcing our entire production AI agent stack. Here's what's actually in it. If anyone wants to see how it works, happy to share a demo.

Hey everyone 👋

18 months ago we started building internal tooling because nothing in the market covered what we actually needed: a full production loop for AI agents, not just one piece of it.

Tracking without evaluating means that something is wrong. If you don't simulate the evaluation, you'll only find out when you release. If you don't have a feedback process, optimization is just changing prompts and hope that it works. Guardrails put on after the event miss the most important failures.

So we built the full loop. And in a few days, all of it goes open source.

Self host it. Extend it. Ship AI that improves itself.

What's actually shipping:

traceAI: OpenTelemetry-native tracing for 22+ Python and 8+ TypeScript frameworks. Your traces, your backend, no lock-in.

ai-evaluation: 70+ metrics: hallucination, factual accuracy, relevance, safety, compliance. Every scoring function is in the repo. Read it, modify it, run it in CI/CD.

simulate-sdk: Synthetic test conversations at scale for voice and chat agents. Your agent works on 10 test cases. simulate-sdk throws 500 adversarial ones at it before users do.

agent-opt: Feeds failed eval cases into a prompt optimization loop and re-evaluates the output against those exact failures. Closes the gap between "we found a problem" and "we fixed it."

Protect: Real-time input and output guardrails across content moderation, bias detection, prompt injection, and PII compliance. Text, image, and audio.

futureagi-sdk: One interface that connects all of the above.

Not a community edition. Same code running behind the platform.

Three questions for the devs here, we would like to know:

  • When your AI agent fails in production, how long does it take you to find which step caused it, the retrieval, the prompt, the tool call, or the model output?
  • Have you ever shipped a prompt change that improved one metric but quietly broke something else downstream, and only caught it after users hit it?
  • If you self-host your eval pipeline inside your own VPC, what's the biggest operational issue: maintaining the infra, keeping metrics updated, or getting the rest of the team to actually run evals before deploying?

DM if you want early access or want to see a specific part of the stack in action before the public release.

reddit.com
u/Future_AGI — 4 hours ago

We're open-sourcing our entire production AI stack in a few days after months of building it. Here's what's in it and why we made this call. If anyone wants to see how it works, happy to share a demo.

Hey everyone 👋

A few weeks back we were talking internally about a problem we kept seeing: teams building AI agents in production have no single open-source layer that covers the full lifecycle. Tracing here. Evaluation there. Guardrails somewhere else. No project closes the full loop from simulation to observability.

So we decided to open-source everything we've built at Future AGI.

Not a community edition with features stripped out. The same code running behind the platform.

Quick recap of what's shipping:

futureagi-sdk: Connects tracing, evaluation, guardrails, and prompt management in one interface.

traceAI: OpenTelemetry-native instrumentation for 22+ Python and 8+ TypeScript AI frameworks. Traces plug into any OTel-compatible backend you already run: Jaeger, Datadog, your own collector. You own your observability pipeline.

ai-evaluation: 70+ metrics covering hallucination detection, factual accuracy, relevance, safety, and compliance. Every scoring function is readable and modifiable. Run it locally, in CI/CD, or at scale. When your compliance team asks how hallucination detection works, you point them to the source file.

simulate-sdk: Generates synthetic test conversations with varied personas, intents, and adversarial inputs for voice and chat agents. Manual QA can't cover the failure surface area at scale.

agent-opt: Takes failed evaluation cases, generates improved prompt candidates, and re-evaluates them against those exact failures. Optimization without eval data is guessing.

Protect: Real-time guardrail layer screening inputs and outputs across content moderation, bias detection, prompt injection, and PII compliance across text, image, and audio.

Who it's built for:

  • AI/ML engineers shipping agents to production who need step-level visibility, not just token-level logs
  • Teams running LangChain, LlamaIndex, OpenAI, or any of the 22+ supported frameworks who are tired of building custom tracing wrappers
  • Healthcare, finance, and government teams that can't send evaluation data to third-party servers and need everything running inside their own VPC
  • Platform and DevOps engineers who want OTel-compatible traces that plug into Jaeger, Datadog, or their existing collector without vendor lock-in
  • Startups and indie builders who need production-grade eval infrastructure without a six-figure SaaS contract

Few questions:

  • What's your biggest frustration with current open-source AI observability tools?
  • If you run evals, are you using a self-hosted library or a managed platform, and what pushed you that direction?
  • For those who've dealt with GPL-3.0 components inside enterprise codebases, how did your legal team handle it?

DM if you want early access or want to see how any specific piece works before the public release.

reddit.com
u/Future_AGI — 1 day ago

How do you actually know if Opus 4.7 is better for your specific agent use case?

Anthropic shipped Opus 4.7 yesterday. The headline numbers are real: 64.3% on SWE-bench Pro (up from 53.4%), best-in-class on MCP-Atlas at 77.3% for multi-tool orchestration, 14% improvement on multi-step agentic reasoning, and one-third fewer tool errors across workflows.

Those are meaningful numbers. The problem is, they measure Anthropic's test distribution, not yours.

Where the benchmark story gets complicated:

BrowseComp dropped 4.4 points compared to Opus 4.6. That is a clear regression on research-heavy and web-browsing agentic workflows. If your agent does deep multi-step research, Opus 4.7 is not a straight upgrade. If your agent routes across multiple tools in a single workflow, MCP-Atlas at 77.3% suggests it probably is.

The point is that no single benchmark answers the question for your specific use case.

The real question teams skip:

Most teams switch models based on release notes or community buzz, run a few manual test cases, and ship. That works until a regression shows up in production two weeks later, at which point you're reading logs and guessing whether the new model or a prompt change caused it.

The gap is not access to a better model. It's a systematic way to measure whether the new model is actually better for your workload before you switch.

What a real evaluation looks like before switching:

  • Run your last 100 production outputs through a hallucination metric against your ground truth. If Opus 4.7 scores better on your data, the benchmark improvement is real for your use case. If it doesn't, it isn't.
  • Measure tool call success rate on your actual tool schemas, not a generic coding task. Opus 4.7's one-third fewer tool errors claim is meaningful only if it holds on your tool definitions.
  • Run the same inputs through both models on your worst-performing edge cases. If the failure rate drops, switch. If it doesn't, the benchmark improvement happened somewhere else.

These are not complicated to set up. They just require treating model evaluation the same way you treat any other code change: measure before you ship.

So, we built ai-evaluation specifically for this: run 70+ metrics including hallucination detection, tool call accuracy, and factual grounding directly against your production outputs so a model switch decision is based on your data, not Anthropic's benchmarks.

A few questions for people who have already tested Opus 4.7 on real workloads:

  • Did the benchmark improvement show up on your actual agent tasks, or did you see a different pattern?
  • For those running research-heavy agents, did you notice the BrowseComp regression in practice?
  • Are you running evals before switching models, or testing in production and rolling back if something breaks?
reddit.com
u/Future_AGI — 4 days ago