u/Awkward_Attention810

FastAPI middleware for semantic caching of LLM responses (Apache 2.0)

I built fastapi-semcache, a semantic caching middleware for FastAPI that lets you cache LLM-like endpoints with minimal refactoring. It's my first open-source project, and I'd love feedback and suggestions.

from fastapi import FastAPI
from semanticcache import SemanticCache, SemanticCacheMiddleware
# fastapi_semcache is available as an import alias

app = FastAPI()

# drop-in middleware
cache = SemanticCache()
app.add_middleware(SemanticCacheMiddleware, cache=cache)

Example (two semantically similar prompts resolve to the same cached response):

POST "How to add middleware in FastAPI?" -> id: gen-1778608076-lExjok7dakqTQ7TGAvr1 (MISS)
POST "How do you register middleware in FastAPI?" -> id: gen-1778608076-lExjok7dakqTQ7TGAvr1 (HIT)

It uses pgvector for similarity search and can optionally use Redis to store responses.
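
Wiring both backends up looks roughly like this; the keyword argument names below (pgvector_dsn, redis_url, similarity_threshold) are illustrative guesses rather than the confirmed API, so check the README for the exact names:

# Illustrative configuration sketch -- keyword names are placeholders;
# see the README for the actual arguments.
from semanticcache import SemanticCache

cache = SemanticCache(
    pgvector_dsn="postgresql://user:pass@localhost:5432/semcache",  # embedding store + similarity search
    redis_url="redis://localhost:6379/0",  # optional: keep cached responses in Redis
    similarity_threshold=0.9,              # minimum similarity to count as a HIT
)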

Main features:

  • async-first
  • no LangChain dependency
  • configurable similarity thresholds
  • optional two-step thresholding (top-k candidate retrieval followed by a second, stricter threshold; see the sketch after this list)
  • optional 429 circuit breaker
  • tenant isolation
  • fail-open behaviour (a cache error never blocks the request)
  • optional streaming support for LLM responses on cache misses (synthetic streaming for cache hits is not implemented yet)
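
The two-step thresholding works like this: a looser first threshold pulls the top-k nearest candidates out of pgvector, then a stricter second threshold decides whether the best candidate actually counts as a hit. A sketch, again with placeholder keyword names rather than the confirmed API:

# Two-step thresholding sketch -- keyword names are placeholders.
cache = SemanticCache(
    top_k=5,                   # step 1: retrieve the 5 nearest candidates
    candidate_threshold=0.80,  # step 1: loose cutoff for entering the candidate set
    hit_threshold=0.92,        # step 2: strict cutoff the best candidate must pass
)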

Supports OpenAI, HuggingFace, Voyage, and Ollama embeddings out of the box (Cohere support planned). You can plug in your own embedding logic by subclassing BaseEmbedder.
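
A custom embedder would look something like the sketch below; BaseEmbedder is the real extension point, but the method name and signature shown are illustrative, so match whatever the base class actually defines:

# Custom embedder sketch -- BaseEmbedder is the documented extension point,
# but the embed() name/signature and the embedder= kwarg are illustrative.
from semanticcache import BaseEmbedder

class SentenceTransformerEmbedder(BaseEmbedder):
    def __init__(self) -> None:
        from sentence_transformers import SentenceTransformer
        self.model = SentenceTransformer("all-MiniLM-L6-v2")

    async def embed(self, text: str) -> list[float]:
        # encode() is synchronous; fine for a sketch, but offload it
        # to a thread pool in real async code.
        return self.model.encode(text).tolist()

cache = SemanticCache(embedder=SentenceTransformerEmbedder())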

pip install fastapi-semcache

GitHub: https://github.com/axm1647/fastapi-semcache

Feel free to ask any questions!
