
Was struggling with massive “thinking” delays in my iOS AI app.
Root cause wasn’t network or streaming. It was the model processing a huge block of search context before answering.
Fix:
- Replaced OpenRouter web plugin with Brave + Tavily routing
- Limited sources (4–5), trimmed snippets
- Structured prompt injection instead of raw text
- Added caching + simple heuristic routing
Result:
- ~40s → ~3–5s TTFT (time to first token)

- Lower token cost
- Better output quality
Also using summary + buffer memory to keep conversation context bounded.
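The summary + buffer idea can be sketched like this (a minimal toy version, assuming the common pattern: keep the last N turns verbatim, fold older turns into a running summary — the real thing would summarize with a cheap model call, which is stubbed out here):

```python
BUFFER_TURNS = 6  # assumed: recent turns kept verbatim

class Memory:
    def __init__(self) -> None:
        self.summary = ""
        self.buffer: list[str] = []

    def add_turn(self, turn: str) -> None:
        self.buffer.append(turn)
        if len(self.buffer) > BUFFER_TURNS:
            oldest = self.buffer.pop(0)
            # A real implementation would summarize `oldest` with a
            # cheap model; truncation is a placeholder for the sketch.
            self.summary += " " + oldest[:80]

    def context(self) -> str:
        # Prompt context = compact summary of old turns + full recent turns.
        parts = []
        if self.summary:
            parts.append("Summary of earlier conversation:" + self.summary)
        parts.extend(self.buffer)
        return "\n".join(parts)
```

This keeps prompt size roughly constant no matter how long the conversation runs, which is what keeps TTFT stable over time.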
Curious how others are handling:
- search + context injection
- reducing TTFT without hurting quality
Shipped this in my app if you want to see it in action.
u/Brilliant-Mulberry55 — 9 days ago