u/CutZealousideal9132

We stopped optimizing our LLM stack manually — it optimizes itself now

Three months ago we were manually picking which model to use for each task. Testing prompts, comparing outputs, switching providers. It worked but it did not scale.

So we built a feedback loop. Every request gets traced with input, output, model, tokens, cost, latency, and a quality score. The router clusters similar requests using embeddings and learns which model actually performs best for each cluster. Not based on benchmarks. Based on real production results.
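
In sketch form, the routing step looks something like this (illustrative Python, not OpenTracy's actual code; the centroids and per-cluster stats come from the trace store):

```python
import numpy as np

# Illustrative sketch of embedding-based routing: assign each request to its
# nearest cluster centroid, then pick the cheapest model whose observed
# quality in that cluster clears a bar. Not the real implementation.
class ClusterRouter:
    def __init__(self, centroids, stats, default_model="gpt-5.1"):
        self.centroids = np.asarray(centroids)  # (k, d) cluster centers from past traces
        self.stats = stats                      # {cluster: {model: (avg_quality, avg_cost)}}
        self.default_model = default_model

    def route(self, embedding, min_quality=0.9):
        dists = np.linalg.norm(self.centroids - np.asarray(embedding), axis=1)
        cluster = int(np.argmin(dists))
        candidates = self.stats.get(cluster, {})
        good = [(cost, model) for model, (q, cost) in candidates.items() if q >= min_quality]
        if not good:
            return self.default_model  # no evidence yet: fall back to the frontier model
        return min(good)[1]            # cheapest model that is good enough
```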

After three weeks of traces we had enough validated data to fine-tune a 7B on our workloads. It took over classification, tagging, and summarization. 95% agreement with GPT-5.1 at 2% of the cost.

The part that surprised us: in month 3 we changed nothing and the bill still dropped another 12%. The router had more data points, made better decisions, and the fine-tuned model kept improving as we fed it more validated traces.

Hallucination detection runs on every response. Bad outputs get flagged automatically and become negative examples in the next training round. Good outputs become positive training data.
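
The feedback step is basically a label assignment. A sketch with made-up field names:

```python
# Sketch of the feedback step: a production trace becomes a labeled training
# example. Field names ("accepted", "quality", "hallucination_flag") are
# illustrative, not a real schema.
def label_trace(trace):
    if trace["hallucination_flag"]:
        return {"prompt": trace["input"], "completion": trace["output"], "label": "negative"}
    if trace["accepted"] and trace["quality"] >= 0.9:
        return {"prompt": trace["input"], "completion": trace["output"], "label": "positive"}
    return None  # ambiguous traces stay out of the training set
```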

The system compounds. More traffic means more traces. More traces mean better routing and better training data. Better training data means better models, and better models mean lower cost per request.

Month 1: $420/mo. Month 2: $73/mo. Month 4: still dropping.

Anyone else building self-improving loops into their AI stack?

u/CutZealousideal9132 — 4 days ago

Most teams write prompts, ship them, and never look at the data again. We started tracing every single prompt in production with input, output, cost, latency, and a quality score.

After three weeks we had 50k validated request-response pairs. Outputs that users accepted, quality scores above threshold, no hallucinations flagged.
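
Concretely, "validated" was a filter like this (sketch; field names and the 0.9 threshold are illustrative), exported in the usual chat JSONL format for fine-tuning:

```python
import json

# Sketch of exporting validated traces in the common chat-format JSONL used
# for fine-tuning. Field names and the quality threshold are assumptions.
def export_training_set(traces, path, min_quality=0.9):
    kept = 0
    with open(path, "w") as f:
        for t in traces:
            if t["accepted"] and t["quality"] >= min_quality and not t["hallucination_flag"]:
                f.write(json.dumps({"messages": [
                    {"role": "user", "content": t["input"]},
                    {"role": "assistant", "content": t["output"]},
                ]}) + "\n")
                kept += 1
    return kept
```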

Used that dataset to fine-tune a 7B on our specific workloads. Classification, tagging, summarization. The fine-tuned model now handles 80% of traffic at 2% of GPT-5.1's cost with a 95% agreement rate.

The loop keeps going. New traces feed the next training round. Flagged hallucinations become negative examples. The router learns which prompts need frontier models and which ones the 7B handles fine.

u/CutZealousideal9132 — 7 days ago

Four AI features in our SaaS. Set up cost tracking per feature and compared it with adoption rates.

Auto-tagging: $89/mo, 12% adoption. Classification: almost free after rerouting to a smaller model, 94% adoption. Summarization: $248/mo, moderate usage.

We were planning a fifth feature, an auto-report generator: long outputs on GPT-5.1, daily, per account. Estimated $500/mo. Based on our data, maybe 10-15% adoption.

Killed it. Improved the features people actually use instead. Moved summarization to a cheaper provider, $16/mo. Total bill went from $420/mo to $73/mo.

Cost per feature vs adoption rate is the simplest way to decide what to build next.
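
In sketch form, with this post's numbers (the two vague values are assumed placeholders):

```python
# Cost per unit of adoption, using the numbers from this post. "Almost free"
# and "moderate usage" are vague, so those two values are assumed placeholders.
features = {
    "auto_tagging":   {"cost_mo": 89,  "adoption": 0.12},
    "classification": {"cost_mo": 2,   "adoption": 0.94},  # "almost free": assumed $2
    "summarization":  {"cost_mo": 248, "adoption": 0.40},  # "moderate": assumed 40%
}
ranked = sorted(features.items(),
                key=lambda kv: kv[1]["cost_mo"] / kv[1]["adoption"], reverse=True)
for name, f in ranked:  # worst value-for-money first
    print(f"{name}: ${f['cost_mo'] / f['adoption']:.0f}/mo per unit of adoption")
```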

Anyone else using cost data to drive product decisions?

u/CutZealousideal9132 — 8 days ago

We run a B2B SaaS with AI features. Three months ago we set up a self-improving loop.

Month 1: rerouted simple tasks to cheaper models. Bill from $420/mo to $234/mo.

Month 2: fine-tuned a 7B on production traces. Took over 80% of traffic at 2% of GPT-5.1 cost. Bill to $73/mo.

Month 3: changed nothing. Bill dropped another 12% on its own.

How it works: every request gets traced with cost, latency, and quality score. The router clusters similar requests using embeddings and learns which model handles each type best. Good outputs become training data for the next fine-tuning round. Bad outputs flagged by hallucination detection become negative examples.
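
The "learns" part can be as simple as a running average per (cluster, model) pair. A sketch, not the actual implementation:

```python
from collections import defaultdict

# Sketch: exponential moving averages of quality and cost per (cluster, model).
# With every trace, the view of "who handles this cluster best" sharpens.
ALPHA = 0.05  # assumed smoothing factor
stats = defaultdict(lambda: {"quality": 0.0, "cost": 0.0, "n": 0})

def update(cluster_id, model, quality, cost):
    s = stats[(cluster_id, model)]
    if s["n"] == 0:  # first observation seeds the averages
        s["quality"], s["cost"] = quality, cost
    else:
        s["quality"] += ALPHA * (quality - s["quality"])
        s["cost"] += ALPHA * (cost - s["cost"])
    s["n"] += 1
```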

More traffic means more data. More data means better routing and better models. Better models mean lower cost. It compounds.

For growth this matters because AI margin improves over time instead of staying flat. Every user interaction makes the system smarter and cheaper.

Anyone else building self-improving AI into their product?

u/CutZealousideal9132 — 9 days ago

We run four AI features across 13 providers. After three months of tracing every request, here is what we found.

OpenAI: best overall on complex reasoning, but the worst latency spikes during peak hours. Some calls hit 8+ seconds from congestion alone.

Anthropic: most consistent at following system prompts. Haiku outperformed GPT-5.1 on classification at a fraction of the cost.

DeepSeek: matched GPT-5.1 on summarization quality. $16/mo vs $248/mo. Better latency too.

Groq: fastest for simple tasks. Sub-100ms on classification. Great for latency-sensitive workloads.

No single provider wins everything. Routing each task to the best provider dropped our bill from $420/mo to $73/mo with zero user-facing outages.

Anyone else running multi-provider?

u/CutZealousideal9132 — 9 days ago
▲ 2 r/CLI

Most LLM setups are static. Pick a model, hardcode it, cost never changes. We built something that learns from its own traffic.

The loop is simple. Every request gets logged with model, tokens, cost, latency, and a quality score. The router clusters similar requests using embeddings and tracks which model performs best per cluster using real production data.

After a few weeks you have enough validated traces to fine-tune a smaller model on your specific workloads. We trained a 7B that now handles 80% of traffic at 2% of GPT-5.1 cost.

Auto-evaluation checks every response for hallucinations. Flagged outputs become negative examples for the next training round.

The system compounds. More traffic, more traces. More traces, better routing. Better routing, lower cost. Month 1 was $420. Month 2 was $73. Month 3 dropped another 12% without us changing anything.

Self-hosted, MIT licensed, works with any provider that exposes a REST API.

https://github.com/OpenTracy/OpenTracy

u/CutZealousideal9132 — 10 days ago

We were spending $420/mo on LLM APIs. Every feature on GPT-5.1, no visibility into costs, no fallback when providers went down.

Built OpenTracy to fix that. It is a reverse proxy between your app and any LLM provider. You change one URL and it handles everything.
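
The integration change, sketched with the OpenAI Python client (the URL below is a placeholder; see the repo for the real endpoint):

```python
from openai import OpenAI

# Placeholder base URL: point the client at the proxy instead of the provider.
# This shows the pattern, not OpenTracy's documented endpoint.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="unused-behind-proxy")

resp = client.chat.completions.create(
    model="gpt-5.1",  # the proxy decides what this request actually hits
    messages=[{"role": "user", "content": "Tag this ticket: 'refund not received'"}],
)
print(resp.choices[0].message.content)
```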

What makes it different: it improves itself over time. Every request gets traced. After a few weeks of production data, the router learns which models work best for which tasks. You can export those traces as training data to fine-tune smaller models on your specific workloads.

Our results after 3 months: 80% of traffic now runs on a fine-tuned 7B at 2% of GPT-5.1 cost. Summarization from $248/mo to $16/mo. Total bill from $420/mo to $73/mo and still dropping.

Features: smart routing by task complexity, auto-fallback across 13+ providers, full tracing with cost per feature, hallucination detection, model distillation pipeline.

Stack: Go, ClickHouse, React. Self-hosted, MIT licensed.

https://github.com/OpenTracy/OpenTracy

Looking for feedback from anyone running LLM workloads in production.

u/CutZealousideal9132 — 10 days ago

Most AI setups are static. Pick a model, deploy, cost stays the same forever. We wanted something that gets better over time using its own data.

Every API call gets traced. Input, output, model, tokens, cost, latency, quality score. After a few weeks we had thousands of validated request-response pairs.

Used those traces to fine-tune a 7B model on our specific workloads. It now handles 80% of traffic at 2% of GPT-5.1 cost. Quality matches because the training data came from outputs users had already accepted.

The router learns too. Clusters similar requests using embeddings, routes based on real production performance. Every week more data, better decisions.

Auto-evaluation runs on every response. Flags hallucinations before users see them. Flagged instances feed back as negative examples in the next training round.

The loop: traces feed evaluation, evaluation feeds routing, routing feeds distillation, distilled models handle more traffic cheaper, new traces restart the cycle.
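
As a nightly job, the whole cycle is a few lines of orchestration. A sketch where every helper is a stand-in for a real pipeline stage:

```python
# The loop as one nightly job. Every helper is a trivial stand-in so the shape
# runs end to end; a real pipeline does actual work at each stage.
def evaluate(trace):
    return 0.95, False  # stand-in auto-eval: (quality score, hallucination flag)

def nightly_cycle(traces, min_quality=0.9, round_size=10_000):
    for t in traces:
        t["quality"], t["flagged"] = evaluate(t)  # evaluation scores fresh traces
    # routing stats would update from these scores here (omitted in the sketch)
    dataset = [t for t in traces if t["quality"] >= min_quality and not t["flagged"]]
    if len(dataset) >= round_size:  # assumed threshold for a distillation round
        pass  # export JSONL and kick off fine-tuning out of band
    return dataset

nightly_cycle([{"input": "classify: refund request", "output": "billing"}])
```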

Bill went from $420/mo to $73/mo and keeps trending down.

Open source, self-hosted, MIT licensed.

https://github.com/OpenTracy/OpenTracy

u/CutZealousideal9132 — 11 days ago
▲ 8 r/SaaS

We run a B2B SaaS with AI features. Users started reporting that summaries were "making things up." Maybe 1 in 50 calls, but enough to erode trust fast.

The scary part: our monitoring showed nothing wrong. Status 200, latency normal, tokens normal. A hallucinated response looks identical to a good one in every standard dashboard.

We added automated evaluation on every response. A lightweight check that flags when the model states something not in the input context. Every request now gets a quality score alongside cost and latency.
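
A crude version of that check fits in a few lines. This sketch uses plain token overlap; a production check would use an NLI model or an LLM judge:

```python
import re

def grounded_score(context: str, response: str) -> float:
    """Crude groundedness check: fraction of response sentences whose content
    words mostly appear in the input context. A sketch only; real systems
    usually use an NLI model or a second LLM as judge."""
    ctx_words = set(re.findall(r"[a-z0-9]+", context.lower()))
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", response) if s.strip()]
    if not sentences:
        return 1.0
    ok = 0
    for s in sentences:
        words = set(re.findall(r"[a-z0-9]+", s.lower()))
        if not words or len(words & ctx_words) / len(words) >= 0.6:
            ok += 1
    return ok / len(sentences)

# Flag anything below an assumed threshold for review and negative examples.
if grounded_score("Q3 revenue was $2M.", "Revenue was $2M. Profit doubled.") < 0.8:
    print("flagged: response may state facts not in context")
```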

Two surprises. First, most hallucinations came from one feature where the context was too long; the model lost track of the relevant info. Shortening it dropped hallucinations to near zero. Second, when we moved classification to a smaller model, hallucinations on that task went down. Smaller models overthink less on simple tasks.

If you are only monitoring latency and error rates, you are missing the quality layer. A response can be fast, cheap, and completely wrong.

u/CutZealousideal9132 — 13 days ago

I was running everything on GPT-5.1 for months. Never questioned it because outputs were good.

Logged every API call for 30 days and categorized each by complexity. 62% were simple tasks: classification, yes/no decisions, short extractions. We were paying GPT-5.1's $10/1M output tokens for work where a $0.25/1M model gives the same answer.

Summarization was $248/mo alone. Tested the same prompts on a cheaper provider. Identical output. $16/mo.

Only about 20% of calls genuinely needed GPT-5.1: multi-step reasoning and long-context chat.
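
The math on the simple-task slice alone, using the post's prices and an assumed token volume:

```python
# Worked example with the numbers above. The monthly token volume is an
# assumption; the per-token prices are the ones quoted in the post.
simple_share  = 0.62          # fraction of calls that were simple tasks
tokens_out_mo = 20_000_000    # assumed monthly output tokens across all calls
gpt51_price   = 10.00 / 1e6   # $/output token
small_price   = 0.25 / 1e6

before = tokens_out_mo * gpt51_price
after  = tokens_out_mo * ((1 - simple_share) * gpt51_price + simple_share * small_price)
print(f"${before:.0f}/mo -> ${after:.0f}/mo on output tokens alone")
# -> $200/mo -> $79/mo: most of the bill was simple tasks on a frontier model
```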

Bill went from $420/mo to $73/mo. Zero prompt changes.

Anyone else audited their prompt complexity vs model cost?

u/CutZealousideal9132 — 15 days ago

Running AI features for B2B clients. Bill hit $420/mo on GPT-5.1 and I had no idea which feature was eating the budget.

Logged every API call for 30 days. Found that 62% were simple tasks like classification and tagging. GPT-5.1 at $10/1M output tokens for stuff a $0.25/1M model handles fine. Summarization alone was $248/mo, switched provider, same quality, $16/mo.

After rerouting by complexity, bill dropped to $73/mo. Same product, no prompt changes.

Anyone here tracking costs per feature or just checking the total bill each month?

u/CutZealousideal9132 — 15 days ago

I build AI automations for B2B clients. For months I was sending everything through GPT-5.1 without questioning it. Bill was $420/mo.

Finally logged every call for 30 days. Model, tokens, cost, latency, all tagged by feature.

60% of calls were simple classification and yes/no decisions. GPT-5.1 charges $10/1M output tokens for that. A smaller model at $0.25/1M handles it the same. Summarization was $248/mo alone, switched provider, same quality, $16/mo. Only 20% of traffic actually needed a frontier model.

Bill dropped from $420 to $73/mo. No prompt changes, no quality loss.

If you are working with LLM APIs and have not broken down costs by use case yet, even two weeks of data will surprise you.

Happy to share the tracking setup if anyone is interested.
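
A minimal version looks roughly like this (sketch with the OpenAI Python client and a CSV sink; the price table is a placeholder):

```python
import csv
import time
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Assumed (input, output) $/token: output price is the one quoted above,
# input price is a placeholder. Fill in real numbers for your models.
PRICES = {"gpt-5.1": (1.25 / 1e6, 10.00 / 1e6)}

def traced_call(feature: str, model: str, messages: list) -> str:
    """Wrap every LLM call: log feature, model, tokens, cost, latency to CSV."""
    t0 = time.time()
    resp = client.chat.completions.create(model=model, messages=messages)
    latency = time.time() - t0
    pin, pout = PRICES.get(model, (0.0, 0.0))
    cost = resp.usage.prompt_tokens * pin + resp.usage.completion_tokens * pout
    with open("llm_calls.csv", "a", newline="") as f:
        csv.writer(f).writerow([feature, model, resp.usage.prompt_tokens,
                                resp.usage.completion_tokens,
                                f"{cost:.6f}", f"{latency:.2f}"])
    return resp.choices[0].message.content
```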

u/CutZealousideal9132 — 16 days ago
▲ 21 r/IndianTechCentral (+6 crossposts)

We run a B2B SaaS with four AI features on GPT-5.1. Classification, summarization, chat, and auto-tagging. Bill was $420/mo.

Two months ago I started logging every API call by feature. Turned out summarization alone was $248/mo and classification was using GPT-5.1 at $10/1M output tokens for simple yes/no tasks.

Moved summarization to a cheaper provider ($16/mo, same quality) and rerouted classification to a smaller model. Bill dropped to $73/mo.

We use an open source gateway called OpenTracy for the routing and tracing. It logs every call with model, cost, and latency per feature. https://github.com/OpenTracy/OpenTracy

Anyone else doing per-feature cost tracking on their AI stack? Curious how others are handling this.

u/CutZealousideal9132 — 4 days ago