u/utilitron — 1 day ago

Is a cognitive‑inspired two‑tier memory system for LLM agents viable?

I’ve been working on a memory library for LLM agents that tries to control context size by splitting memory into a short‑term and a long‑term store (I’m running on limited hardware, so context size is a main concern). It’s not another RAG pipeline; it’s a stateful, resource‑aware system that manages memory across two tiers using pluggable vector storage and indexing:

* **Short‑Term Memory (STM)**: volatile, fast, with FIFO eviction and pluggable vector indexes (HNSW, FAISS, brute‑force). Stores raw conversation traces, tool calls, etc.

* **Long‑Term Memory (LTM)**: persistent, distilled knowledge. Low‑saliency traces are periodically consolidated (e.g., concatenation or LLM summarization) into knowledge items and moved to LTM.
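To make the two‑tier flow concrete, here is a minimal sketch of the STM/LTM interaction. All names (`TwoTierMemory`, `add_trace`, `consolidate`) and the capacity/threshold defaults are hypothetical, not the library's actual API, and plain string concatenation stands in for LLM summarization:

```python
from collections import deque

class TwoTierMemory:
    """Sketch of a two-tier store: STM is a bounded FIFO queue;
    consolidation distills low-saliency traces into one LTM item."""

    def __init__(self, stm_capacity=128, saliency_threshold=0.5):
        self.stm = deque()                 # volatile: (text, saliency) pairs
        self.ltm = []                      # persistent: distilled knowledge items
        self.stm_capacity = stm_capacity
        self.saliency_threshold = saliency_threshold

    def add_trace(self, text, saliency):
        """Append a raw trace; FIFO-evict the oldest entries over capacity."""
        self.stm.append((text, saliency))
        while len(self.stm) > self.stm_capacity:
            self.stm.popleft()             # oldest entry falls out first

    def consolidate(self):
        """Move low-saliency traces into LTM as a single knowledge item.
        Concatenation here is a stand-in for LLM summarization."""
        low = [(t, s) for t, s in self.stm if s < self.saliency_threshold]
        if not low:
            return
        for pair in low:
            self.stm.remove(pair)
        self.ltm.append(" | ".join(t for t, _ in low))
```

In a real implementation the eviction and consolidation steps would also update the STM's vector index, which is where the HNSW deletion question below comes in.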

**Saliency scoring** uses a weighted RIF model (Recency, Importance, Frequency). The system monitors resource pressure (e.g., RAM/VRAM) and triggers consolidation automatically when pressure exceeds a threshold (e.g., 85%).
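A weighted RIF score and a pressure trigger could look roughly like the sketch below. The weights, the exponential recency decay, the saturating frequency term, and the function names are all my assumptions; the post only specifies "weighted Recency/Importance/Frequency" and an 85% pressure threshold:

```python
import math
import time

# Placeholder weights -- the actual weighting is not specified.
W_RECENCY, W_IMPORTANCE, W_FREQUENCY = 0.5, 0.3, 0.2

def rif_saliency(last_access, importance, access_count,
                 now=None, half_life=3600.0):
    """Weighted RIF score in [0, 1]: exponential recency decay,
    caller-supplied importance, and a frequency term that saturates."""
    now = time.time() if now is None else now
    recency = math.exp(-max(0.0, now - last_access) / half_life)
    frequency = access_count / (access_count + 1.0)  # approaches 1
    return (W_RECENCY * recency
            + W_IMPORTANCE * importance
            + W_FREQUENCY * frequency)

def should_consolidate(memory_used_fraction, threshold=0.85):
    """Trigger consolidation when resource pressure exceeds the threshold
    (the fraction could come from e.g. psutil.virtual_memory())."""
    return memory_used_fraction > threshold
```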

What I’m unsure about:

  1. Does this approach already exist in a mature library? (I’ve seen MemGPT, Zep, but they seem more focused on summarization or sliding windows.)

  2. Is the saliency‑based consolidation actually useful, or is simple FIFO + time‑based summarization enough?

  3. Are there known pitfalls with using HNSW for STM (e.g., high update frequency, deletions)?

  4. Would you use something like this?

Thanks!

Source:

It was originally written in Java, and I’m working on porting it to Python.

Python https://github.com/Utilitron/VecMem

Java https://github.com/Utilitron/VectorMemory
