
Is a cognitive‑inspired two‑tier memory system for LLM agents viable?
I’ve been working on a memory library for LLM agents that tries to control context size by splitting memory into short‑term and long‑term stores (I’m running on limited hardware, so context size is my main concern). It’s not another RAG pipeline; it’s a stateful, resource‑aware system that manages memory across two tiers using pluggable vector storage and indexing:
* **Short‑Term Memory (STM)**: volatile, fast, with FIFO eviction and pluggable vector indexes (HNSW, FAISS, brute‑force). Stores raw conversation traces, tool calls, etc.
* **Long‑Term Memory (LTM)**: persistent, distilled knowledge. Low‑saliency traces are periodically consolidated (e.g., concatenation or LLM summarization) into knowledge items and moved to LTM.
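The two‑tier flow above can be sketched roughly as follows. This is a minimal illustration, not the library's actual API: the class and method names (`TwoTierMemory`, `add_trace`, `consolidate`) are hypothetical, STM is just a FIFO deque, and consolidation is naive concatenation standing in for LLM summarization.

```python
from collections import deque

class TwoTierMemory:
    """Illustrative sketch of the STM/LTM split (names are hypothetical)."""

    def __init__(self, stm_capacity=128):
        self.stm = deque()           # volatile short-term store, FIFO order
        self.stm_capacity = stm_capacity
        self.ltm = []                # persistent long-term store of knowledge items

    def add_trace(self, trace):
        """Append a raw trace to STM, evicting the oldest entry when full."""
        self.stm.append(trace)
        if len(self.stm) > self.stm_capacity:
            self.stm.popleft()       # FIFO eviction

    def consolidate(self, low_saliency_traces, summarize=" ".join):
        """Distill low-saliency traces into one LTM item.

        Here `summarize` is plain concatenation; in the real system it could
        be an LLM summarization call.
        """
        item = summarize(low_saliency_traces)
        self.ltm.append(item)
        for t in low_saliency_traces:   # drop the distilled traces from STM
            if t in self.stm:
                self.stm.remove(t)
        return item
```

In a real implementation STM entries would also carry embeddings for the pluggable vector index (HNSW, FAISS, brute‑force), which this sketch omits.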
**Saliency scoring** uses a weighted RIF model (Recency, Importance, Frequency). The system monitors resource pressure (e.g., RAM/VRAM) and triggers consolidation automatically when pressure exceeds a threshold (e.g., 85%).
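To make the RIF idea concrete, here is one way a weighted saliency score and the pressure trigger could look. Everything here is an assumption for illustration: the weights, the exponential recency decay, the saturating frequency squash, and the function names are all hypothetical, not the library's actual formulas.

```python
import math
import time

def rif_saliency(last_access_ts, importance, frequency, now=None,
                 w_recency=0.5, w_importance=0.3, w_frequency=0.2,
                 half_life_s=3600.0):
    """Weighted Recency/Importance/Frequency score (weights are illustrative).

    - Recency decays exponentially with a configurable half-life.
    - Importance is assumed to be pre-normalized to [0, 1].
    - Frequency (raw hit count) is squashed so repeated hits saturate.
    """
    now = time.time() if now is None else now
    recency = 0.5 ** ((now - last_access_ts) / half_life_s)
    freq = 1.0 - math.exp(-frequency)   # saturating squash of the hit count
    return w_recency * recency + w_importance * importance + w_frequency * freq

def should_consolidate(used_bytes, total_bytes, threshold=0.85):
    """Trigger consolidation when resource pressure exceeds the threshold."""
    return used_bytes / total_bytes > threshold
```

Traces scoring below some cutoff would then be handed to the consolidation step when `should_consolidate` fires.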
What I’m unsure about:
* Does this approach already exist in a mature library? (I’ve seen MemGPT and Zep, but they seem more focused on summarization or sliding windows.)
* Is saliency‑based consolidation actually useful, or is simple FIFO + time‑based summarization enough?
* Are there known pitfalls with using HNSW for STM (e.g., high update frequency, deletions)?
* Would you use something like this?
Thanks!
Source:
It was originally written in Java, and I am working on porting it to Python.