u/Business-Question-20

AgentOS (another one!) is an open-source TypeScript runtime for AI agents that remember, adapt, and collaborate.

GitHub: https://github.com/framersai/agentos
Real demos: https://agentos.sh/#live-demo
Benchmarks / blog post: https://agentos.sh/en/blog/agentos-memory-sota-longmemeval/
Benchmark harness: https://github.com/framersai/agentos-bench
Docs: https://docs.agentos.sh
npm: npm install @framers/agentos

To me, most agent frameworks feel like prompt chains wearing a trench coat: they make agentic workflows easier to wire up, but they don't actually encourage agentic behavior, or what I call emergent agentic behavior. That's what you'd hope to see from AI NPCs in immersive games that make choices that surprise you, or from AI chatbots / companions that truly understand and learn.

Other frameworks do call tools, and they can retrieve docs with high accuracy in RAG, but the tool surface is usually fixed, the team roster is fixed, and the agent starts every task half-amnesiac unless you stuff the whole world into context.

AgentOS is our attempt at making the runtime itself more adaptive and intelligent.

What’s distinctive:

  • Memory + RAG: persistent cognitive memory modeled on neuroscience, with Ebbinghaus-style decay, retrieval-induced forgetting, reconsolidation, source-confidence decay, and metacognitive "feeling-of-knowing." The idea is not to save every chat log forever: memories should fade, strengthen, conflict, and get reshaped when recalled.
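To make the decay idea concrete, here's a minimal sketch (names like `MemoryTrace`, `retention`, and `recall` are illustrative, not the AgentOS API): retention falls exponentially with time since the last recall, and each recall reconsolidates the trace by raising its stability, so it decays more slowly next time.

```typescript
// Hypothetical sketch of Ebbinghaus-style decay with reconsolidation.
interface MemoryTrace {
  content: string;
  stability: number;    // hours until retention falls to 1/e
  lastRecallMs: number; // epoch millis of the last recall
}

// Retention in [0, 1]: exp(-elapsed / stability), the classic forgetting curve.
function retention(m: MemoryTrace, nowMs: number): number {
  const hoursElapsed = (nowMs - m.lastRecallMs) / 3_600_000;
  return Math.exp(-hoursElapsed / m.stability);
}

// Reconsolidation: recalling strengthens the trace, so the same memory
// fades more slowly after each successful retrieval.
function recall(m: MemoryTrace, nowMs: number): MemoryTrace {
  return { ...m, stability: m.stability * 1.5, lastRecallMs: nowMs };
}
```

A retrieval layer can then rank candidates by `similarity * retention(...)`, which is one simple way "forgetting" shows up in practice: stale, never-recalled memories sink in the ranking instead of being hard-deleted.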

  • Runtime-generated tools: when no existing tool fits the task, an agent can write a TypeScript function, describe its input/output schema with Zod, pass it through an LLM judge, and run it in a hardened node:vm sandbox. Approved tools join the catalog for the rest of the session. The first creation costs real tokens; reuse costs almost nothing. This is what makes genuinely emergent decision-making possible.
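A bare-bones sketch of the sandbox part, using only Node's built-in `node:vm` (the real runtime adds schema validation, judging, and more hardening; `runSandboxedTool` and the `tool()` convention here are illustrative assumptions):

```typescript
import * as vm from "node:vm";

// Run agent-generated source in an isolated context: no require, no
// process, no host scope. The source is expected to define `tool(input)`.
function runSandboxedTool(source: string, input: unknown): unknown {
  // A null-prototype global keeps the sandbox free of inherited helpers.
  const context = vm.createContext(Object.create(null));
  vm.runInContext(source, context, { timeout: 1000 });
  // Invoke inside the same context so the timeout also bounds execution.
  (context as any).input = input;
  return vm.runInContext("tool(input)", context, { timeout: 1000 });
}

// e.g. source an LLM might generate after being asked for an adder:
const generated = `function tool(input) { return input.a + input.b; }`;
const result = runSandboxedTool(generated, { a: 2, b: 3 }); // → 5
```

Note `node:vm` alone is not a security boundary; it's the isolation primitive that a hardened sandbox (plus the judge step) builds on.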

  • Specialist spawning: when a multi-agent team hits a subtask nobody covers, the manager can call spawn_specialist. A separate judge reviews the proposed agent spec, and if approved, the specialist joins the live roster on the next turn.
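The spawn-and-judge flow reduces to a small gate. This is an illustrative sketch, not the AgentOS API (`AgentSpec`, `spawnSpecialist`, and the judge signature are assumptions): the manager proposes a spec, a separate judge reviews it, and only approved, non-duplicate specialists join the roster.

```typescript
// Hypothetical shape of a proposed specialist.
interface AgentSpec {
  role: string;
  systemPrompt: string;
}

// The judge is its own reviewer (in AgentOS, an LLM; here, any predicate).
type Judge = (spec: AgentSpec) => boolean;

// Returns the roster for the next turn: unchanged on rejection or
// duplication, extended with the new specialist on approval.
function spawnSpecialist(
  roster: AgentSpec[],
  proposal: AgentSpec,
  judge: Judge
): AgentSpec[] {
  const duplicate = roster.some((a) => a.role === proposal.role);
  if (duplicate || !judge(proposal)) return roster;
  return [...roster, proposal];
}
```

The key design point is that the proposer and the approver are different components, so a manager can't unilaterally expand its own team.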

  • Optional personality vectors: HEXACO traits can bias retrieval, routing, and decision-making for personalization/simulation. Same prompt, same agent, different trait vector maps to a different decision sequence. This is optional; workflow agents may not need it.
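One simple way a trait vector can bias retrieval (again an illustrative sketch, not the AgentOS implementation; `traitBiasedScore` and the per-memory `affinity` vector are assumptions): blend the base similarity score with the dot product between the agent's HEXACO vector and how strongly each candidate memory resonates with each trait.

```typescript
// The six HEXACO dimensions, each in [0, 1].
type Hexaco = {
  honesty: number; emotionality: number; extraversion: number;
  agreeableness: number; conscientiousness: number; openness: number;
};

// Reweight a retrieval score by trait alignment. Same memory, same base
// similarity: a different trait vector shifts the final ranking.
function traitBiasedScore(
  similarity: number, // base vector-similarity score
  affinity: Hexaco,   // how strongly this memory resonates per trait
  traits: Hexaco,     // the agent's personality vector
  biasWeight = 0.2
): number {
  const keys = Object.keys(traits) as (keyof Hexaco)[];
  const dot = keys.reduce((sum, k) => sum + traits[k] * affinity[k], 0);
  return similarity + biasWeight * (dot / keys.length);
}
```

With `biasWeight = 0`, this degrades gracefully to plain similarity ranking, which is why the feature can stay optional for workflow agents.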

  • Multimodal RAG: text/image/audio/video ingestion, 7 vector backends (Pinecone, Weaviate, SQLite, etc.), multiple retrieval strategies, GraphRAG, and built-in document loaders.

  • Multi-agent orchestration: sequential, parallel, debate, review loop, hierarchical, and graph/DAG execution under one agency() factory, with streaming guardrails (like PII redaction), HITL gates, structured Zod output, and per-agent cost tracking.
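The graph/DAG mode in particular boils down to a small scheduler. This is a toy sketch of the pattern, not the `agency()` API (node shapes and names are assumptions): each step runs once all of its dependencies resolve, and independent branches run concurrently for free via promise memoization.

```typescript
type Step = (inputs: string[]) => Promise<string>;

interface DagNode {
  deps: string[]; // names of upstream nodes whose outputs feed this one
  run: Step;
}

// Execute a DAG of steps: memoize each node's promise so shared
// dependencies run exactly once, and let Promise.all fan out branches.
async function runDag(
  graph: Record<string, DagNode>
): Promise<Record<string, string>> {
  const started = new Map<string, Promise<string>>();
  const start = (name: string): Promise<string> => {
    let p = started.get(name);
    if (!p) {
      const node = graph[name];
      p = Promise.all(node.deps.map(start)).then((ins) => node.run(ins));
      started.set(name, p);
    }
    return p;
  };
  const out: Record<string, string> = {};
  await Promise.all(
    Object.keys(graph).map(async (n) => { out[n] = await start(n); })
  );
  return out;
}
```

Sequential and parallel modes are degenerate cases of this (a chain of single deps, or a set of dep-free nodes), which is why one factory can cover all the topologies.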

Benchmarks:

  • LongMemEval-S: 85.6% with gpt-4o reader, full N=500, 10k bootstrap CIs, per-case run JSONs.
  • That is 0.4 percentage points behind the published 86.0% from EmergenceMem Internal, a proprietary SaaS with a claimed SOTA. AgentOS is open-source and free under Apache 2.0.
  • It is +1.4 points above Mastra OM’s published gpt-4o result of 84.23%.
  • LongMemEval-M: 70.2% on the harder ~1.5M-token / ~500-session variant, making AgentOS the only open-source framework with a published score above 65%.
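On the "10k bootstrap CIs" point, this is the standard percentile-bootstrap recipe over per-case pass/fail results (a sketch of the general technique; the harness's exact method may differ): resample the N cases with replacement 10,000 times and take the empirical 2.5th/97.5th percentiles of the resampled accuracies.

```typescript
// Percentile bootstrap CI for accuracy over per-case results
// (1 = correct, 0 = wrong). Returns [lower, upper] at level 1 - alpha.
function bootstrapCI(
  results: number[],
  resamples = 10_000,
  alpha = 0.05
): [number, number] {
  const n = results.length;
  const means: number[] = [];
  for (let i = 0; i < resamples; i++) {
    let sum = 0;
    // Draw n cases with replacement and record the resampled accuracy.
    for (let j = 0; j < n; j++) sum += results[(Math.random() * n) | 0];
    means.push(sum / n);
  }
  means.sort((a, b) => a - b);
  const lower = means[Math.floor((alpha / 2) * resamples)];
  const upper = means[Math.floor((1 - alpha / 2) * resamples)];
  return [lower, upper];
}
```

At N=500, a 95% CI on an 85.6% score is roughly ±3 points wide, which is exactly why per-case JSONs and CIs matter when two frameworks are 0.4 points apart.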

I’m not claiming overall benchmark SOTA. Mastra’s newer gpt-5-mini result is much higher. But we do publish latency, costs, and everything needed to reproduce the numbers in the accompanying open-source repo at https://github.com/framersai/agentos-bench.

Attached image is from a real run of one of the examples in the repo. The team starts with a researcher and a writer. The prompt asks for a security audit, which neither covers. The manager creates a new specialist at runtime. The LLM judge approves the spec. The spawned team produces the briefing on the right.

Feedback welcome, especially on whether the README and examples make the idea clear!