u/Cosmicdev_058

▲ 16 r/LLMDevs

Hey everyone 👋

Was reading the DeepSeek v4 docs this morning and noticed they kept prefill support for chat completions. For anyone who has not used it: prefill lets you pass an assistant message with prefix=True, and the model continues from your prefix instead of generating its own opener.

Their example forces the model into a Python code block by passing "```python\n" (the opening fence) as the assistant prefix and setting stop=["```"] so generation ends at the closing fence. The model has no choice but to start with Python code: no preamble, no "sure, here is the code," just the function. That alone solves half the structured output problems I deal with on production agents.
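For anyone who wants the concrete shape of that call, here is roughly what it looks like through the OpenAI SDK, adapted from the linked guide (prefix completion lives on DeepSeek's beta base URL):

```python
from openai import OpenAI

# Per the linked guide, prefix completion requires the beta endpoint.
client = OpenAI(
    api_key="<DEEPSEEK_API_KEY>",
    base_url="https://api.deepseek.com/beta",
)

messages = [
    {"role": "user", "content": "Write a quicksort function."},
    # The prefix: the model must continue from inside the opening code fence.
    {"role": "assistant", "content": "```python\n", "prefix": True},
]

response = client.chat.completions.create(
    model="deepseek-chat",
    messages=messages,
    stop=["```"],  # end generation at the closing fence
)
print(response.choices[0].message.content)  # pure Python, no preamble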

The reason this matters more than it sounds: most of the major providers quietly dropped this capability over the last year. OpenAI never had it on chat completions. Anthropic had it, then made it harder to use. Google's Gemini API has nothing equivalent. The pattern is clear: providers would rather you go through their structured output APIs, which are easier for them to monetize and limit.

Prefill is the most reliable way I have found to constrain model behavior for agent loops where you need exact format compliance. JSON schemas help, function calling helps, but prefill is the only mechanism that removes the problem of generating the opening tokens entirely: the opener is yours, so the model never gets to write one.
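As a concrete example of that constraint use case: seeding the assistant turn with an opening brace means the first sampled token is already inside a JSON object. A minimal sketch against the same beta endpoint (the prompt and field names are just illustrative):

```python
from openai import OpenAI

client = OpenAI(api_key="<DEEPSEEK_API_KEY>", base_url="https://api.deepseek.com/beta")

# Seed the assistant turn with "{" so there is no room for a
# conversational preamble before the JSON payload.
messages = [
    {"role": "user", "content": "Extract order_id and city from: 'Order 8812 shipped to Berlin.' Reply as JSON."},
    {"role": "assistant", "content": "{", "prefix": True},
]
response = client.chat.completions.create(
    model="deepseek-chat",
    messages=messages,
    stop=["}"],  # fine for a flat object; drop this if the schema nests
)
print("{" + response.choices[0].message.content + "}")  # re-attach the prefix and the stop token
```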

Anyone else been working around the loss of prefill on other providers? Curious what the workaround patterns look like, beyond "ask the model nicely and hope it follows instructions."

More here: https://api-docs.deepseek.com/guides/chat_prefix_completion

u/Cosmicdev_058 — 15 days ago

Hey everyone 👋

I shipped a reference implementation of a LangGraph agent that answers questions spanning a structured orders database and a corpus of operational PDFs, and I open-sourced the whole thing.

The driving query is the kind of question an ops analyst would actually ask: "how did Margherita pizza perform in 2024 across cities, and what allergens does it contain?" Sales numbers live in a SQLite orders table, allergens live inside the Menu Book PDF. The agent has to decide which tool to call when, and how to merge the results into a single answer with sources.

🔗 Repo: https://github.com/orq-ai/orq-langgraph-demo

https://preview.redd.it/k1tu6czy2pxg1.png?width=2048&format=png&auto=webp&s=11213f621e95a95e696e4a02ef77f09d42bfe1bb

Here is what is inside:

🧠 LangGraph topology with a safety check, intent routing, the tool loop, and a clarification path for vague questions (minimal sketch after this list)

📚 Hybrid Knowledge Base over six operational PDFs, exposed as three typed tools using the content_and_artifact pattern, which is what lets the Chainlit UI render PDF previews opened to the cited page (second sketch below the list)

🔁 AI Router in front of every LLM call: swap providers with a single env var change, and the trace shows the exact model that served each run

🧪 Four-scorer eval pipeline with 15 test cases across SQL-only, doc-only, and mixed scenarios: one local Python scorer for tool accuracy plus three LLM judges, wired into GitHub Actions to block merges on regression

🔍 Two tracing backends side by side, a callback handler and an OpenTelemetry exporter; switch with one env var depending on whether you already run a collector

🏗️ Two implementations of the same agent in one repo, code-first LangGraph and a Studio-first managed Agent, talking to the same Knowledge Base and the same model
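Since the routing is the interesting part, here is roughly how that topology maps onto LangGraph primitives. This is a minimal sketch with placeholder node bodies and my own node names, not the repo's actual graph:

```python
from typing import Annotated, Literal, TypedDict

from langgraph.graph import END, START, StateGraph
from langgraph.graph.message import add_messages


class AgentState(TypedDict):
    messages: Annotated[list, add_messages]
    intent: str


# Placeholder nodes: in the real repo these wrap LLM calls and tool execution.
def safety_check(state: AgentState) -> dict:
    return {}  # refuse or annotate unsafe input here


def route_intent(state: AgentState) -> dict:
    return {"intent": "answer"}  # an LLM classifier would set "answer" or "clarify"


def agent(state: AgentState) -> dict:
    return {}  # LLM turn that may emit tool calls


def tools(state: AgentState) -> dict:
    return {}  # execute the SQL / doc tools the agent requested


def clarify(state: AgentState) -> dict:
    return {}  # ask a follow-up question instead of answering


def after_routing(state: AgentState) -> Literal["agent", "clarify"]:
    return "clarify" if state["intent"] == "clarify" else "agent"


def after_agent(state: AgentState) -> Literal["tools", "done"]:
    return "done"  # real version: "tools" while the LLM keeps requesting them


builder = StateGraph(AgentState)
for name, fn in [("safety_check", safety_check), ("route_intent", route_intent),
                 ("agent", agent), ("tools", tools), ("clarify", clarify)]:
    builder.add_node(name, fn)

builder.add_edge(START, "safety_check")
builder.add_edge("safety_check", "route_intent")
builder.add_conditional_edges("route_intent", after_routing)
builder.add_conditional_edges("agent", after_agent, {"tools": "tools", "done": END})
builder.add_edge("tools", "agent")  # the tool loop
builder.add_edge("clarify", END)

graph = builder.compile()
```

The nice property of this shape is that clarification is just another node, so vague questions never reach the tool loop.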

https://preview.redd.it/zxjjdk2y2pxg1.png?width=2048&format=png&auto=webp&s=2f0f2f32c6d6706d91bd9712cdc0e851660ac2d0
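For anyone who has not used the content_and_artifact pattern mentioned above: it is a LangChain tool response format where the tool returns a (content, artifact) pair, with the content string going back to the model and the artifact riding alongside for downstream consumers like the UI. A minimal sketch, with a made-up tool name and payload (the repo's real tools will differ):

```python
from langchain_core.tools import tool


@tool(response_format="content_and_artifact")
def search_menu_book(query: str) -> tuple[str, dict]:
    """Search the Menu Book PDF for a passage relevant to the query."""
    # Hypothetical retrieval result: the repo's actual KB lookup differs.
    passage, page = "Margherita: contains gluten (wheat) and milk.", 12
    content = f"{passage} (Menu Book, p. {page})"       # what the LLM reads
    artifact = {"file": "menu_book.pdf", "page": page}  # what the UI reads
    return content, artifact
```

When the tool runs inside the graph, the artifact ends up on ToolMessage.artifact, which is what a Chainlit handler can read to open the PDF preview at the cited page.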

The thing that surprised me most was how much of the work lived outside the graph. The topology took a day to land. The eval pipeline, the tracing decision, the prompt versioning, and the guardrail wiring took the next two weeks.
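On the eval side, the local tool-accuracy scorer is conceptually the cheap one: compare the tools the agent actually called against what the test case expects. A generic sketch, not the repo's scorer, with invented tool names:

```python
def tool_accuracy(expected: list[str], actual: list[str]) -> float:
    """Fraction of expected tool calls that were actually made, order-insensitive."""
    if not expected:
        return 1.0 if not actual else 0.0
    return sum(1 for name in expected if name in actual) / len(expected)


# A mixed SQL + doc test case should hit both tools.
case = {"expected_tools": ["query_orders_db", "search_menu_book"]}
observed = ["query_orders_db", "search_menu_book"]
assert tool_accuracy(case["expected_tools"], observed) == 1.0
```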

Anyone running similar hybrid agents in prod: what broke first for you, retrieval quality or eval signal?

I work on this at Orq AI, fwiw.
