u/Awkward_Attention810

FastAPI middleware for semantic caching of LLM responses (Apache 2.0)

I built fastapi-semcache, a semantic caching middleware for FastAPI that lets you cache LLM-like endpoints with minimal refactoring. It's my first open-source project, and I'd love feedback and suggestions.

from fastapi import FastAPI
from semanticcache import SemanticCache, SemanticCacheMiddleware
# fastapi_semcache is available as an import alias

app = FastAPI()

# drop-in middleware
cache = SemanticCache()
app.add_middleware(SemanticCacheMiddleware, cache=cache)

Example (two semantically similar prompts resolve to the same cached response):

POST "How to add middleware in FastAPI?" -> id: gen-1778608076-lExjok7dakqTQ7TGAvr1 (MISS)
POST "How do you register middleware in FastAPI?" -> id: gen-1778608076-lExjok7dakqTQ7TGAvr1 (HIT)

It uses pgvector for similarity search and can optionally use Redis to store responses.
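
Wiring both backends up looks roughly like this; the keyword argument names below (pgvector_dsn, redis_url, similarity_threshold) are illustrative guesses rather than the confirmed API, so check the README for the exact names:

# Illustrative configuration sketch -- keyword names are placeholders;
# see the README for the actual arguments.
from semanticcache import SemanticCache

cache = SemanticCache(
    pgvector_dsn="postgresql://user:pass@localhost:5432/semcache",  # embedding store + similarity search
    redis_url="redis://localhost:6379/0",  # optional: keep cached responses in Redis
    similarity_threshold=0.9,              # minimum similarity to count as a HIT
)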

Main features:

  • async-first
  • no LangChain dependency
  • configurable similarity thresholds
  • optional two-step thresholding (top-k candidate retrieval followed by a second, stricter threshold; see the sketch after this list)
  • optional 429 circuit breaker
  • tenant isolation
  • fail-open behaviour (a cache error never blocks the request)
  • optional streaming support for LLM responses on cache misses (synthetic streaming for cache hits is not implemented yet)
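
The two-step thresholding works like this: a looser first threshold pulls the top-k nearest candidates out of pgvector, then a stricter second threshold decides whether the best candidate actually counts as a hit. A sketch, again with placeholder keyword names rather than the confirmed API:

# Two-step thresholding sketch -- keyword names are placeholders.
cache = SemanticCache(
    top_k=5,                   # step 1: retrieve the 5 nearest candidates
    candidate_threshold=0.80,  # step 1: loose cutoff for entering the candidate set
    hit_threshold=0.92,        # step 2: strict cutoff the best candidate must pass
)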

Supports OpenAI, HuggingFace, Voyage, and Ollama embeddings out of the box (Cohere support planned). You can plug in your own embedding logic by subclassing BaseEmbedder.
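
A custom embedder would look something like the sketch below; BaseEmbedder is the real extension point, but the method name and signature shown are illustrative, so match whatever the base class actually defines:

# Custom embedder sketch -- BaseEmbedder is the documented extension point,
# but the embed() name/signature and the embedder= kwarg are illustrative.
from semanticcache import BaseEmbedder

class SentenceTransformerEmbedder(BaseEmbedder):
    def __init__(self) -> None:
        from sentence_transformers import SentenceTransformer
        self.model = SentenceTransformer("all-MiniLM-L6-v2")

    async def embed(self, text: str) -> list[float]:
        # encode() is synchronous; fine for a sketch, but offload it
        # to a thread pool in real async code.
        return self.model.encode(text).tolist()

cache = SemanticCache(embedder=SentenceTransformerEmbedder())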

pip install fastapi-semcache

GitHub: https://github.com/axm1647/fastapi-semcache

Feel free to ask any questions!
