u/Brilliant-Mulberry55


Was struggling with massive “thinking” delays in my iOS AI app.

Root cause wasn’t the network or streaming. It was the model processing a huge block of search context before it produced the first token.

Fix:

  • Replaced the OpenRouter web plugin with direct Brave + Tavily search routing
  • Limited results to 4–5 sources and trimmed each snippet
  • Injected sources as a structured block instead of raw page text (first sketch below)
  • Added caching + simple heuristic routing so not every query even hits search (second sketch below)
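Rough idea of the source limiting + snippet trimming + structured injection step. This is a minimal sketch, not my production code; the `SearchResult` fields and the tag format are assumptions, not an actual Brave/Tavily response schema.

```swift
import Foundation

// Hypothetical search result shape; field names are assumptions,
// not a real Brave or Tavily response schema.
struct SearchResult {
    let title: String
    let url: String
    let snippet: String
}

/// Keeps the top N sources, trims each snippet, and injects them as a
/// structured block instead of dumping raw page text into the prompt.
func buildSearchContext(from results: [SearchResult],
                        maxSources: Int = 5,
                        maxSnippetLength: Int = 400) -> String {
    let blocks = results.prefix(maxSources).enumerated().map { (index, result) -> String in
        let snippet = String(result.snippet.prefix(maxSnippetLength))
        return "[\(index + 1)] \(result.title)\nURL: \(result.url)\nSummary: \(snippet)"
    }
    return """
    <search_results>
    \(blocks.joined(separator: "\n\n"))
    </search_results>
    Answer using only the numbered sources above.
    """
}
```

The point is that the model gets a short, predictable block it can start reasoning over immediately instead of tens of thousands of tokens of scraped HTML.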
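And the caching + heuristic routing piece, sketched out. The trigger words and the cache TTL here are placeholder assumptions, not my actual rules:

```swift
import Foundation

// Sketch of caching + heuristic routing: skip search entirely for queries
// that don't look like they need fresh info, and reuse recent contexts.
final class SearchGate {
    private var cache: [String: (context: String, storedAt: Date)] = [:]
    private let ttl: TimeInterval = 15 * 60  // reuse a cached context for 15 minutes

    /// Cheap heuristic: only pay for a web search when the query looks like
    /// it needs fresh or factual info; everything else goes straight to the model.
    func needsWebSearch(_ query: String) -> Bool {
        let triggers = ["latest", "today", "news", "price", "who is", "release date"]
        let lowered = query.lowercased()
        return triggers.contains { lowered.contains($0) }
    }

    func cachedContext(for query: String) -> String? {
        guard let hit = cache[query.lowercased()],
              Date().timeIntervalSince(hit.storedAt) < ttl else { return nil }
        return hit.context  // cache hit: skip the search round trip entirely
    }

    func store(_ context: String, for query: String) {
        cache[query.lowercased()] = (context, Date())
    }
}
```

Most of the TTFT win came from simply not searching when the query didn't need it.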

Result:

  • TTFT (time to first token): ~40s → ~3–5s
  • Lower token cost
  • Better output quality

Also using summary + buffer memory to keep conversation context small (sketch below).
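Minimal sketch of the summary + buffer memory idea: keep the last few turns verbatim and fold older turns into a rolling summary. The summarize step would normally be an LLM call; here it's left as an injected closure, and the buffer size is just an example value.

```swift
import Foundation

struct ChatTurn {
    let role: String   // "user" or "assistant"
    let text: String
}

struct ConversationMemory {
    private(set) var summary: String = ""
    private(set) var buffer: [ChatTurn] = []
    let maxBufferTurns = 8  // verbatim window; everything older gets summarized

    mutating func append(_ turn: ChatTurn,
                         summarize: (String, [ChatTurn]) -> String) {
        buffer.append(turn)
        if buffer.count > maxBufferTurns {
            let overflow = Array(buffer.prefix(buffer.count - maxBufferTurns))
            summary = summarize(summary, overflow)  // compress old turns into the summary
            buffer.removeFirst(overflow.count)
        }
    }

    /// What actually goes into the prompt: one short summary plus the recent turns.
    func promptContext() -> String {
        let recent = buffer.map { "\($0.role): \($0.text)" }.joined(separator: "\n")
        return summary.isEmpty
            ? recent
            : "Conversation summary:\n\(summary)\n\nRecent turns:\n\(recent)"
    }
}
```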

Curious how others are handling:

  • search + context injection
  • reducing TTFT without hurting quality

Shipped this in my app if you want to see it in action.
