
Was struggling with massive “thinking” delays in my iOS AI app.
Root cause wasn’t network or streaming. It was the model processing a huge block of search context before answering.
Fix:
- Replaced OpenRouter web plugin with Brave + Tavily routing
- Limited sources (4–5), trimmed snippets
- Structured prompt injection instead of raw text
- Added caching + simple heuristic routing
Result:
- ~40s → ~3–5s TTFT (time to first token)

- Lower token cost
- Better output quality
Also using summary + buffer memory to keep conversation context bounded.
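The summary + buffer idea can be sketched like this (a minimal toy version, assuming the common pattern: keep the last N turns verbatim, fold older turns into a running summary — the real thing would summarize with a cheap model call, which is stubbed out here):

```python
BUFFER_TURNS = 6  # assumed: recent turns kept verbatim

class Memory:
    def __init__(self) -> None:
        self.summary = ""
        self.buffer: list[str] = []

    def add_turn(self, turn: str) -> None:
        self.buffer.append(turn)
        if len(self.buffer) > BUFFER_TURNS:
            oldest = self.buffer.pop(0)
            # A real implementation would summarize `oldest` with a
            # cheap model; truncation is a placeholder for the sketch.
            self.summary += " " + oldest[:80]

    def context(self) -> str:
        # Prompt context = compact summary of old turns + full recent turns.
        parts = []
        if self.summary:
            parts.append("Summary of earlier conversation:" + self.summary)
        parts.extend(self.buffer)
        return "\n".join(parts)
```

This keeps prompt size roughly constant no matter how long the conversation runs, which is what keeps TTFT stable over time.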
Curious how others are handling:
- search + context injection
- reducing TTFT without hurting quality
Shipped this in my app if you want to see it in action.
u/Brilliant-Mulberry55 — 9 days ago