
How do you keep an MCP server's output reproducible when the upstream metadata is mutable?
If an agent generates a 200-citation literature review today, can it produce identical output a week from now? Most citation tools can't promise that, and getting mine to was more work than I expected.
Background: I built a citation MCP server — resolves DOI, PMID, PMCID, ISBN, arXiv, ADS, WHO IRIS; formats in Vancouver, AMA, APA, IEEE, CSE plus 10k CSL styles; exports BibTeX, RIS, CSV, EndNote XML. The tool-call shape was the easy part. Reproducibility is the part I keep coming back to.
The reason most generators can't do it: Crossref's metadata is mutable - titles get re-cased, author names corrected, abstracts back-filled, ORCID IDs added. The fallback chain used when an identifier isn't in the primary source isn't documented anywhere I could find, so you don't actually know whether your citation came from Crossref, DataCite, or doi.org's content negotiation. CSL style files update without semver. "Best-effort" fields silently appear or disappear between runs. For a chat-style tool that's fine. For an agent producing a bibliography that someone has to defend in peer review or a clinical audit, it's a correctness problem: a reviewer can't reproduce what got submitted.
What I ended up doing was emitting an x-scholar-transform-version header on every response (currently 2026-05-04). I bump it whenever normalisation, formatter output, or the resolver chain changes in a way that would alter the byte-for-byte output for the same input. Agents that care about reproducibility pin against it.
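On the agent side, pinning is just a header check. A minimal sketch in TypeScript, assuming a hypothetical /format endpoint and query shape (the header name and current value are the real ones):

```typescript
const PINNED_VERSION = "2026-05-04";

async function formatCitation(doi: string): Promise<string> {
  // Hypothetical endpoint and query parameters; substitute whatever tool call
  // your MCP client actually issues.
  const res = await fetch(
    `https://scholar.example/format?doi=${encodeURIComponent(doi)}&style=vancouver`
  );
  const version = res.headers.get("x-scholar-transform-version");
  if (version !== PINNED_VERSION) {
    // Fail loudly instead of silently accepting output that may no longer be
    // byte-identical to the run the bibliography was built with.
    throw new Error(
      `transform version drift: expected ${PINNED_VERSION}, got ${version}`
    );
  }
  return res.text();
}
```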
The actual resolver chain is published at /.well-known/sources.json — primary plus fallback hosts per identifier type, mirrored against the live code. DOI is Crossref then DataCite then doi.org; ISBN is OpenLibrary then Google Books; PMID and PMCID are NCBI; arXiv, ADS, WHO IRIS are direct. The chain is fixed-order, first non-empty wins. No quality scoring, no "best of N." Quality scoring across sources is great for chat but a nightmare for reproducibility because the scoring inputs themselves drift.
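In code terms, the resolution step is roughly the following sketch: the chain mirrors what sources.json publishes, while the fetchFrom helper is a stand-in for the real per-source lookups, not the actual implementation.

```typescript
// Fixed-order chain per identifier type; first non-empty result wins,
// no scoring across sources.
const RESOLVER_CHAIN: Record<string, string[]> = {
  doi:        ["crossref", "datacite", "doi.org"],
  isbn:       ["openlibrary", "google-books"],
  pmid:       ["ncbi"],
  pmcid:      ["ncbi"],
  arxiv:      ["arxiv"],
  ads:        ["ads"],
  "who-iris": ["who-iris"],
};

async function resolve(
  idType: string,
  id: string,
  fetchFrom: (source: string, id: string) => Promise<Record<string, unknown> | null>
): Promise<Record<string, unknown> | null> {
  for (const source of RESOLVER_CHAIN[idType] ?? []) {
    const record = await fetchFrom(source, id);
    if (record !== null) return record; // first non-empty wins
  }
  return null;
}
```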
There's also a /verification page with a copy-paste curl kit so anyone can spot-check the determinism claim and the provenance headers without taking my word for it. That one was a direct response to evaluator feedback that determinism claims aren't worth much unless they're independently verifiable.
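The kit itself is curl, but the check it performs boils down to something like this, sketched in TypeScript to keep the examples in one language; the URL is a placeholder for whichever formatted-citation request you want to verify:

```typescript
// Request the same formatted citation twice and compare the bodies
// byte-for-byte, printing the transform version alongside the result.
async function spotCheck(url: string): Promise<boolean> {
  const [a, b] = await Promise.all([fetch(url), fetch(url)]);
  const [bodyA, bodyB] = await Promise.all([a.text(), b.text()]);
  console.log("transform version:", a.headers.get("x-scholar-transform-version"));
  console.log("byte-identical:", bodyA === bodyB);
  return bodyA === bodyB;
}
```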
Honest about what this doesn't fix: CSL styles can drift across engine versions; transform_version covers the engine, but only if you actually pin to it. It doesn't help if a retracted paper gets silently corrected upstream, but that's the point: a bibliography should reproduce what you submitted, not what's true today. Retraction status lives behind a separate endpoint with no determinism promise. The server itself is a thin MCP shim over a hosted REST API, so if you need fully local, this isn't it.
The package is scholar-sidekick-mcp on npm. Genuinely curious how other people are handling drift in MCP servers that wrap mutable upstream data; it feels under-discussed given how much downstream agent output depends on it.