u/Tricky_School_4613

▲ 19 r/AI_developers+13 crossposts

How do you actually test a voice AI agent without calling it yourself every time?

So we've been working on a voice bot that handles customer calls and honestly the testing part has been brutal. We were literally calling the thing ourselves to check if it broke after every change.

Eventually we just wrote a framework that synthesizes fake caller audio, pipes it into the agent, and checks whether the response is sane: latency, hallucinations, interruption handling, etc. It runs locally against a SQLite db, no cloud stuff.
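To make that loop concrete, here's a rough sketch of the idea. This is not the actual decibench code or API. `synth_caller_audio`, `fake_agent`, `run_case`, `open_db`, and the `results` table are all made-up names, the "caller audio" is just a sine tone standing in for real TTS, and the sanity check is a simple keyword match plus a latency budget.

```python
import math
import sqlite3
import struct
import time


def synth_caller_audio(duration_s=0.5, freq_hz=440.0, rate=8000):
    """Stand-in for TTS: a sine tone as 16-bit little-endian PCM."""
    n = int(duration_s * rate)
    samples = (int(12000 * math.sin(2 * math.pi * freq_hz * i / rate)) for i in range(n))
    return b"".join(struct.pack("<h", s) for s in samples)


def fake_agent(audio: bytes) -> str:
    """Hypothetical agent under test; swap in a real call to your agent."""
    return "Sure, I can help with your order." if audio else ""


def open_db(path=":memory:"):
    """Open (or create) the local results store."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS results "
        "(name TEXT, latency_s REAL, reply TEXT, passed INTEGER)"
    )
    return db


def run_case(db, name, agent, audio, max_latency_s=1.0, must_contain="help"):
    """Send audio to the agent, time it, check the reply, record the result."""
    start = time.perf_counter()
    reply = agent(audio)
    latency = time.perf_counter() - start
    passed = latency <= max_latency_s and must_contain in reply.lower()
    db.execute(
        "INSERT INTO results VALUES (?, ?, ?, ?)",
        (name, latency, reply, int(passed)),
    )
    db.commit()
    return passed


if __name__ == "__main__":
    db = open_db()
    print(run_case(db, "greeting", fake_agent, synth_caller_audio()))
```

A real check would replace the keyword match with something smarter (an LLM judge, for example), but the shape stays the same: generate input, call the agent, measure, store.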

It connects over WebSockets, can mock Twilio streams, and works with ElevenLabs and Vapi agents too. You can also plug in Ollama as the judge so the whole thing runs offline.
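For anyone wondering what "mocking a Twilio stream" can look like, here's a simplified sketch (again, not decibench's implementation). It assumes the JSON message shape I understand Twilio Media Streams to use: a `start` event, `media` events carrying base64 mu-law audio at 8 kHz, then `stop`. Real sessions carry more fields (sequence numbers, a `connected` event, etc.), so check the Twilio docs before relying on it. `twilio_media_frames` and the `MZ-test` stream ID are made up for illustration.

```python
import base64
import json


def twilio_media_frames(mulaw_audio: bytes, stream_sid="MZ-test", chunk_bytes=160):
    """Yield JSON text frames shaped like a (simplified) Media Streams session.

    160 bytes of 8 kHz mu-law audio is 20 ms, one byte per sample.
    """
    yield json.dumps({
        "event": "start",
        "streamSid": stream_sid,
        "start": {
            "mediaFormat": {
                "encoding": "audio/x-mulaw",
                "sampleRate": 8000,
                "channels": 1,
            }
        },
    })
    for offset in range(0, len(mulaw_audio), chunk_bytes):
        chunk = mulaw_audio[offset:offset + chunk_bytes]
        yield json.dumps({
            "event": "media",
            "streamSid": stream_sid,
            "media": {"payload": base64.b64encode(chunk).decode("ascii")},
        })
    yield json.dumps({"event": "stop", "streamSid": stream_sid})
```

Each yielded string would be sent as one text message over the WebSocket to the agent's endpoint.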

We open sourced it: https://github.com/unforkopensource-org/decibench

Curious how others here handle this. Are you just vibing and hoping production doesn't break, or is there a better workflow I'm missing?

u/Tricky_School_4613 — 12 hours ago
▲ 1 r/AZURE

Why is voice agent testing still so manual?

Been working on voice agents for some time now, and one thing honestly feels very ignored: testing.

We have frameworks for prompts, observability, workflows, telephony, etc., but when it comes to actually stress testing agents across interruptions, accents, latency, rage users, silence, bad network, tool failure, retries, context drift… most teams are still doing it manually or with basic scripts.
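To show why manual testing doesn't scale here, this is a small sketch of how quickly those dimensions multiply. The `Scenario` class, the dimension values, and `build_matrix` are illustrative assumptions, not taken from any existing tool.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Scenario:
    accent: str
    caller_mood: str
    network: str
    interrupts: bool


# Hypothetical values; a real suite would pick these from production traffic.
ACCENTS = ["us", "indian", "scottish"]
MOODS = ["calm", "angry", "silent"]
NETWORKS = ["clean", "packet_loss", "high_jitter"]


def build_matrix():
    """Cross every dimension to get the full list of test scenarios."""
    return [
        Scenario(accent, mood, network, interrupts)
        for accent, mood, network, interrupts in product(
            ACCENTS, MOODS, NETWORKS, [False, True]
        )
    ]


if __name__ == "__main__":
    print(len(build_matrix()))  # 3 * 3 * 3 * 2 = 54
```

Even with three values per dimension that's 54 calls per change, before tool failures, retries, or context drift are added.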

Feels weird that in 2026 we still don’t have a proper automated benchmarking/testing layer for conversational agents the way traditional software does.

Curious how others here are handling this at scale, especially for outbound calling and production QA.

reddit.com
u/Tricky_School_4613 — 5 days ago

▲ 2 r/nocode

Why is voice agent testing still so manual?

reddit.com
u/Tricky_School_4613 — 5 days ago

▲ 1 r/ollama

Why is voice agent testing still so manual?

reddit.com
u/Tricky_School_4613 — 5 days ago

▲ 4 r/Rag

Why is voice agent testing still so manual?

reddit.com
u/Tricky_School_4613 — 5 days ago

