r/hermesagent

How much do you guys spend on AI costs monthly

I have been trying to set up Hermes but i only have my local model which is not great for my use case so I have been thinking of using deepseek. But wondering how much you guys spend on AI tokens per month

reddit.com
u/ReadingHopeful2152 β€” 4 hours ago

Why β€œDon’t use Opus for everything?”

Basically it’s the most common β€œtip” in this subreddit. But I wanna know is there a legitimate reason to use cheaper models like qwen or kimi, except for the obvious cost issue?

reddit.com
u/Soundpulse99 β€” 4 hours ago

How to remove 'Hermes Agent' and all the dependencies it installed on Windows ?

  1. Hermes installed it's own python, vs studio, chocolatey etc. didn't use what I had already installed. How to remove those.
  2. Also I want to re-install it on non-system disk
  3. How to have it not install it's own dependencies ?

https://github.com/nousresearch/hermes-agent

u/Healthy-News5375 β€” 1 hour ago

Jailbreak AI Model

Alright, gpt 5.5 isn't cooperating with the godmode skill or any work arounds to delegate work it deams top high risk or against other company TOS.

So what is the next best model that can actually ignore those guardrails but still be powerful like gpt 5.5?

Trying to do things like:

Evading bot detection systems

Masquerading as real humans to bypass platform rules

Large-scale fake engagement

Account farming

Ad fraud

Scraping behind authentication without permission

Circumventing CAPTCHAs or anti-abuse systems

Fingerprint spoofing intended to deceive platforms

Stealth automation specifically designed to avoid detection

reddit.com
u/Pepe_The_Citizen β€” 5 hours ago
β–² 12 r/hermesagent

Please critique my planned Hermes Agent setup

After briefly experimenting with Hermes Agent on my Macbook (with a local Gemma 4 model), I'm about to set up a more robust instance on a VPS. I'd appreciate your input on my config - both the choices I already pinned down and the ones I'm still deciding on (such as the specific LLMs).

For context: My use cases are still pretty open ended and exploratory, and will evolve over time. While I have a lot of experience with coding agents (Claude Code in particular) and I've built my own agentic pipelines, I'm fairly new to more general, always-on agents like Hermes. Never used OpenClaw etc. I envision using Hermes for extended research tasks (say into YouTube stats and patterns, app opportunities, gamedev, various niches, etc.) as well as daily briefings, using both my calendar and Todoist as well as current news. I want Hermes to be able to generate and share markdown reports and potentially other files (images, videos, PDFs) as well. I'd like to be able to forward emails to Hermes to have it act on them. I'm less likely to use it for coding directly, but will reassess this later.

Below is a high level list of my planned config, with some more details below:

  • VPS: Hetzner CPX22 (2 VCPU, 4 GB RAM, 80 GB SSD
  • LLM Provider: OpenRouter
  • LLMs: TBD, see below
  • Memory Provider: ByteRover
  • Web Search Provider: TBD; Tavily or Firecrawl
  • Long Term Data Storage: Obsidian Vault and Google Drive (details below)
  • Messaging: Telegram

More details on some of these choices and open questions below.

VPS:

Hetzner because of competitive pricing. (I've mostly used DigitalOcean in the past, but their 4 GB instance is 2.5x the cost.)

I believe CPX22 with 4 GB RAM and 2 VCPUs is the right sweet spot for my needs?

LLMs:

This will likely require some experimentation. For custom apps, I've mostly used different flavors of Gemini (e.g. Gemini 2.5 Pro, Flash, and Flash Lite, depending on use case). So that's definitely a contender. Flash might be a good default model.

DeepSeek V4 also seems attractive, primarily because of the low cost.

Open to open source models like Qwen, Gemma 4, or Kimi 2 as well.

I'll read through more of the recommendations in this subreddit, but let me know if you have any particular recommendations for combos that have worked well for you.

Memory:

After researching the officially supported options, I landed on ByteRover. I like the file based approach and git semantics, as well as the tiered search. At least on paper, it seems more than suitable for my needs. I'd just use the local setup, with backup to Github.

I considered Hindsight, as the idea of a knowledge graph sounds compelling, and I've had great results with Postgres and pgvector for my own apps. But realistically, this is overkill for my needs right now.

Web Search:

Firecrawl and Tavily seem like the most popular options. Tavily seems to have the more generous free tier, but Firecrawl seems more commonly suggested. Any thoughts on the trade-offs here? Any alternative recommendations?

Data Storage:

A combination of Obsidian and Google Drive.

I already use Obsidian as my personal note taking app. I would set up a separate vault for Hermes Agent and sync this to a Github repo. I envision using this for more detailed, longer term data. Things like research reports, daily briefings, etc.

I want to explore ByteRover's "swarm" feature as well. It sounds like it can perform federated searches across its own memory and my Obsidian vault, which sounds compelling.

Google Drive is already my main cloud storage for personal and business related files. I would give Hermes read-only access to specific folders that might be needed for certain tasks. I would only give it write access to a dedicated "Hermes" folder.

I realize there are several ways to set up Google Drive support. I lean towards using the official Google Workspace CLI; see below.

Google Workspace:

I would create a dedicated Hermes account under my existing Google Workspace domain. That way, I can cleanly provision access to Google Drive, Email, Calendar, etc.

The official Google Workspace CLI sounds like the cleanest solution. That way, I can not only access Google Drive, but also Email etc. The CLI comes with agent skills that should make the Hermes integration pretty straightforward and robust. I should even be able to leverage ModelArmor to scan incoming emails to prevent prompt injection.

Other Integrations:

Telegram for messaging. (Perhaps Discord in the future.)

Todoist; haven't looked into plugins / MCPs / APIs yet.

Please let me know if you have any feedback or suggestions for improvement on this config. Thanks! Looking forward to getting deeper into Hermes Agent and uncovering more use cases over time. πŸ˜„

(Edit: Added a section for Web Search.)

u/digitalhobbit β€” 9 hours ago

Best Local LLM for 24GB of VRAM?

Ive got a 7900xtx with 24gb vram

and want to run hermes with a local llm or using the local llm for the 90-99% of "easier tasks" and routing the hard tasks to a model like kimi k2.6

does somebody have a similiar hardware setup and some tips on what model to choose and how to optimize the hermes setup

how are you guys doing it and what are some general considerations/tips from you?

thanks

reddit.com
u/Material-Mention6696 β€” 8 hours ago

What can Hermes actually do for efficient management, marketing, and business applications?

I’m tired of seeing nonsense and hype around Hermes. What can it actually do in real-world applications related to efficient management, marketing, and business?

I’m looking for practical examples, real use cases, workflows, automation, productivity gains, decision-making support, client management, marketing operations, or anything genuinely useful beyond the exaggerated claims.

reddit.com
u/Sufficient-Mood-4442 β€” 6 hours ago

Struggling with cron jobs, SOS

Hi all, I’ve been using agents for about a month now and everything has been fantastic apart from cron jobs.

I was mostly sold on the dream of having an agent work behind the scenes, constantly coming up with new ideas, questions to ask, etc. but I must be setting it up wrong.

For example, I want my agent to passively do research for my short film. I’d love for it to suggest set design / props based on the time period, or suggest new ideas, ask clarifying questions randomly, but for the most part it will just repeat the same message to me over and over and barely escape the initial prompt.

Does anyone have a good YouTube video or tip for creating cron jobs that really let the agent come to life and be innovative?

reddit.com
u/Malakhaiii β€” 8 hours ago
β–² 33 r/hermesagent

Is this common?

I just setup hermes agent and use kimi-k2.6 from ollama cloud. It thinks a lot πŸ₯² compared to OC, this gives much better results and already embed the skills without me asking it. But the process really push my anxiety πŸ˜‚

u/Funny-Comfortable858 β€” 13 hours ago

Built an agent memory upgrade β€” went from flat context to a 5-tier hybrid arch. Thoughts?

I've been running an AI coding agent (Hermes Agent, open source) and hit the usual wall: the built-in persistent memory is capped at ~2,200 chars. Fine for quick facts, useless for anything that scales.

So I layered it up into a 5-tier hybrid memory architecture:

Tier 1 β€” RAM (Built-in, ~2.2K chars)

Short-term working memory. Pipe-syntax compression, TTL flags per entry. Only the hottest ~15 facts live here.

Tier 2a β€” Vector DB (Chroma, 7,600+ embeddings)

Semantic search over everything the agent has ever stored. Good for fuzzy recall ("that thing about the dental case").

Tier 2b β€” Knowledge Graph (triple store, 18 relations, 25 entities)

This is what I just added. Structured relations: user β†’ lives_in β†’ city, project β†’ port β†’ 8223, bot β†’ developed_for β†’ person. KG queries are instant and precise β€” no embedding drift.

Tier 3 β€” Full-text session DB (SQLite FTS5)

Every message ever sent, searchable with boolean queries. Auto-saved, no management needed.

Tier 4 β€” Obsidian vault (user-readable markdown)

Mirror of key facts in editable .md files. The human can read/edit without any tooling.

New hybrid search cascade:

KG query β†’ FTS5 session search β†’ Vector search β†’ Obsidian read

Plus a session diary β€” after every complex session, the agent writes an AAAK-compressed log entry to the vector DB for cross-session context.

---

What's your take on this? Overkill? Missing something obvious? Anybody else running a multi-tier memory setup for their agents? Would love to hear what's working (or not) for others.

reddit.com
u/sokomania β€” 8 hours ago

Anyone have Hermes agent wired up for local LLM's using oMLX or llama-swap?

I don't really see how Hermes agent switches models, can someone point me in the right direction?

I have llama swap serving for llama-cpp-turboquant on a windows machine and the oMLX is Mac.

reddit.com
u/DanGTG β€” 14 hours ago
β–² 24 r/hermesagent

actually best hermes agent vps hosting ?

I suppose no one is under the Mac Mini hype here anymore. I would like to create a community curated list of best hosts for Hermes to filter the most reliable ones because there are so many of them now. I've seen people buying some, having issues and then no support provided by these small hosts, this definitely has to stop.

If you experienced any good or bad ones and have any recommendations, suggest them in the comments and they will be added into the list.

Maybe also recommend any features that make them stand out.

PS: here's the list based on comments https://github.com/devpals399/best_hermes_vps_providers feel free to contribute

reddit.com
u/FunThen4634 β€” 19 hours ago
β–² 20 r/hermesagent

A comprehensive method to brutally reduce your Agentic AI token cost by at least 95%, aka a summary of current token reduction method. Running it for only 15$/month

The core concept

1.Organize your bootstrapping files in a tree-like structure. Let LLM indexing information, rather than load all of the agent/tool/skill markdown at once.

Just like how wo store a billion people's reddit ID in our disk and we won't use a list to store and retrieve it.(That's how agent frame work is doing with bootstrapping and system prompting)

We use a B-tree. We reduce the complexity from O(n) to O(log(n)). Same for LLM and make LLM get necessary information

2.Use AI to compress and compact bootstrapping files, and make these files indexing detailed knowledge

3.Layer your models: Use a very lightweight but long context capable model as the primary model. It will be able to capture your idea and understand your intention. Only switch to expensive SOTA models when you really have to deal with intelligence-heavy mission (You are overthrowing theory of general relativity), manually, or let your small model decide

4.Bootstrapping bypassed: Write a simple python script to direct send message to LLM. Ask your light weight model to take the minimum necessary context and information out from your huge chat history. And then send the important stuff to openrouter/GPT/Claude.

5.Using openclaw console commands and server terminal.

/new, /compact,/status,/usage full

Make good use of this commands to reduce token cost and get an idea of how fast you are burning your cash

Directly controlling your server, typing openclaw gateway restart yourself, instead of let AI do it to reduce related token cost

6.CPU-fying task. If most of your day is repeating certain logic, turn it into a python script. If you tend to anaylize stock using a specific technical approach, convert it into a python script or ask your light weight LLM to convert it for you.

Turn your repetitive tasks into crons, this take tasks from GPU to CPU and it will save you hundreds of $ per month.

If you want to analysis a data sheet, you just don't send the sheet to LLM, you ask your LLM to write a python pandas code to get the result

  1. Deal with heartbeat. Reduce its frequency.

Unless you are feeling deeply depressed, lack friends, or have an intense craving for new notifications from instant messaging apps, you should reduce the frequency of the "Heartbeat" functionβ€”or even disable it entirely. Heartbeat proactively pings the LLM using specific logic to prompt a response; however, this response does not actually help you resolve any substantive issues. You are a geek, most of your time is in front of a monitor so you are already always with your agent.

You might even consider offloading Heartbeat tasks to the CPU.

For instance, if you want to scan an agent's metadata to check for the presence of plaintext keys or tokens, there is clearly no need to have an LLM perform this check every half hourβ€”a process that is both costly and insecure. You can resolve this entirely using regular expressions; the CPU can instantly scan all documents within the agent's directory and pinpoint any files containing leaked secrets.

Note: This is for openclaw but it also applies to any agent framework based on system prompt engineering.

You can simply ask your agent to read this document for you, and he will be able to understand it....You don't need to worry at all. Your agent will explain it to you how to use this method

Note: If you want to read it please switch to markdown version

# OpenClaw Token Optimization Techniques - Complete Analysis


> **Author**: User A Β· Agent-X Β 
> **Last Updated**: 2026-05-19 Β 
> **Applicable Version**: OpenClaw 2026.4.23+ Β 
> **Target Audience**: Technical personnel interested in AI Agent / LLM cost optimization Β 


---


## Table of Contents


- [OpenClaw Token Optimization Techniques - Complete Analysis](
#openclaw-token-optimization-techniques---complete-analysis
)
Β  - [Table of Contents](
#table-of-contents
)
Β  - [Part 1: Principles - Hidden Costs of System Prompts](
#part-1-principles---hidden-costs-of-system-prompts
)
Β  Β  - [Bootstrap File Loading Mechanism](
#bootstrap-file-loading-mechanism
)
Β  Β  - [Context Window and Compaction Mechanism](
#context-window-and-compaction-mechanism
)
Β  Β  - [Fixed Overhead per New Session](
#fixed-overhead-per-new-session
)
Β  - [Part 2: Testing - Bootstrap File Quantitative Analysis](
#part-2-testing---bootstrap-file-quantitative-analysis
)
Β  Β  - [File Volume Before Optimization](
#file-volume-before-optimization
)
Β  Β  - [File Volume After Optimization](
#file-volume-after-optimization
)
Β  Β  - [Cumulative Consumption Comparison by Usage Scenario](
#cumulative-consumption-comparison-by-usage-scenario
)
Β  - [Part 3: Optimization - Seven Core Techniques](
#part-3-optimization---seven-core-techniques
)
Β  Β  - [1. Tree-Structured Document Architecture (Old: Single File β†’ New: Multi-Layer Index)](
#1-tree-structured-document-architecture-old-single-file--new-multi-layer-index
)
Β  Β  Β  - [Optimization Principle](
#optimization-principle
)
Β  Β  Β  - [Measured Data](
#measured-data
)
Β  Β  Β  - [Cost Savings (Monthly)](
#cost-savings-monthly
)
Β  Β  Β  - [Resource Consumption Changes](
#resource-consumption-changes
)
Β  Β  - [2. AI Auto-Compression (Compaction)](
#2-ai-auto-compression-compaction
)
Β  Β  Β  - [Optimization Principle](
#optimization-principle-1
)
Β  Β  Β  - [Measured Comparison](
#measured-comparison
)
Β  Β  Β  - [Cost Savings](
#cost-savings
)
Β  Β  Β  - [Resource Consumption Changes](
#resource-consumption-changes-1
)
Β  Β  - [3. Local Model Management of Lightweight Tasks (QMD / Ollama)](
#3-local-model-management-of-lightweight-tasks-qmd--ollama
)
Β  Β  Β  - [Optimization Principle](
#optimization-principle-2
)
Β  Β  Β  - [QMD Application](
#qmd-application
)
Β  Β  Β  - [Measured Data](
#measured-data-1
)
Β  Β  Β  - [Cost Savings](
#cost-savings-1
)
Β  Β  Β  - [Resource Consumption Changes](
#resource-consumption-changes-2
)
Β  Β  - [4. Direct Script-to-API Calls, Bypassing Bootstrap](
#4-direct-script-to-api-calls-bypassing-bootstrap
)
Β  Β  Β  - [Optimization Principle](
#optimization-principle-3
)
Β  Β  Β  - [Measured Data](
#measured-data-2
)
Β  Β  Β  - [Resource Consumption Changes](
#resource-consumption-changes-3
)
Β  Β  - [5. Console Commands Replace LLM Conversation](
#5-console-commands-replace-llm-conversation
)
Β  Β  Β  - [Optimization Principle](
#optimization-principle-4
)
Β  Β  Β  - [Practical Application](
#practical-application
)
Β  Β  Β  - [Resource Consumption Changes](
#resource-consumption-changes-4
)
Β  Β  - [6. Daily Logic CPU-fication (Python Cron Direct Push)](
#6-daily-logic-cpu-fication-python-cron-direct-push
)
Β  Β  Β  - [Optimization Principle](
#optimization-principle-5
)
Β  Β  Β  - [Implemented CPU-fied Tasks](
#implemented-cpu-fied-tasks
)
Β  Β  Β  - [Measured Comparison](
#measured-comparison-1
)
Β  Β  Β  - [Technical Implementation](
#technical-implementation
)
Β  Β  Β  - [Resource Consumption Changes](
#resource-consumption-changes-5
)
Β  Β  - [7. Intelligent Demands Pulled Back from LLM to CPU (Heartbeat Checklist-ification)](
#7-intelligent-demands-pulled-back-from-llm-to-cpu-heartbeat-checklist-ification
)
Β  Β  Β  - [Optimization Principle](
#optimization-principle-6
)
Β  Β  Β  - [Transformation Comparison](
#transformation-comparison
)
Β  Β  Β  - [Measured Data](
#measured-data-3
)
Β  Β  Β  - [Cost Savings](
#cost-savings-2
)
Β  Β  Β  - [Resource Consumption Changes](
#resource-consumption-changes-6
)
Β  - [Comprehensive Benefit Assessment](
#comprehensive-benefit-assessment
)
Β  Β  - [Monthly Cost Comparison Summary](
#monthly-cost-comparison-summary
)
Β  Β  - [Annualized Comparison](
#annualized-comparison
)
Β  Β  - [Beyond Just Saving Money](
#beyond-just-saving-money
)
Β  - [Appendix 1: Model Pricing Reference](
#appendix-1-model-pricing-reference
)
Β  - [Appendix 2: Vectorization of Skill Descriptors](
#appendix-2-vectorization-of-skill-descriptors
)
Β  - [Conclusion](
#conclusion
)


---


## Part 1: Principles - Hidden Costs of System Prompts


### Bootstrap File Loading Mechanism


Each time `/new` or `/reset` is executed to create a new session, the OpenClaw runtime automatically loads the following content as **System Prompt + Startup Context**:


| File | Loading Method | Purpose |
|------|----------------|---------|
| `AGENTS.md` | System Prompt Injection | Agent behavior instruction tree |
| `SOUL.md` | System Prompt Injection | Personality definition |
| `USER.md` | System Prompt Injection | User information |
| `HEARTBEAT.md` | System Prompt Injection | Scheduled task checklist |
| `TOOLS.md` | System Prompt Injection | Local tool configuration |
| `MEMORY.md` | Startup Context | Long-term memory |
| `memory/*.md` (past 2 days) | Startup Context | Daily work logs (≀2800 characters) |


These files are **not visible in the conversation history**, but **consume actual context window**. Every LLM inference must process this content.


### Context Window and Compaction Mechanism


OpenClaw's compaction mechanism uses a `mode: safeguard` strategy:


- **Trigger Condition**: Automatically triggered when conversation history + bootstrap approach the context limit
- **Compression Method**: Generate summaries of early conversations, retain recent details
- **Problem**: If the bootstrap file itself is large, less space remains for actual conversations, compaction triggers more frequently, and each compaction consumes tokens


### Fixed Overhead per New Session


Using the default model MiniMax M2.7 (200K context window) as an example:


> **Before Optimization**: bootstrap ~25,000 bytes β‰ˆ ~6,250 tokens Β 
> **After Optimization**: bootstrap ~8,300 bytes β‰ˆ ~2,075 tokens Β 
> 
> Each session startup saves **~4,175 tokens**, not including subsequent chain effects of compaction in conversations.


The same principle applies to models like DeepSeek V3.2 (200K context). If your daily usage involves frequent `/new` / `/reset` (e.g., task switching, context cleanup), the savings double.


---


## Part 2: Testing - Bootstrap File Quantitative Analysis


> All data below are based on actual file measurements. Sensitive content has been anonymized: usernames β†’ "User A", Agent names β†’ "Agent-X".


### File Volume Before Optimization


| File | Lines | Bytes | Estimated Tokens | Main Content |
|------|-------|-------|------------------|--------------|
| AGENTS.md | ~300 | ~12,000 | ~3,000 | Behavior rules, skill index, memory rules, quick decisions all mixed |
| MEMORY.md | ~200 | ~8,000 | ~2,000 | Holdings info, built systems, technical architecture, user goals |
| SOUL.md | 36 | 1,673 | ~418 | Personality definition |
| USER.md | 11 | 278 | ~70 | Username/timezone/preferences |
| TOOLS.md | 34 | 827 | ~207 | Search toolchain, local configuration |
| HEARTBEAT.md | 28 | 1,681 | ~420 | Heartbeat checklist |
| **Total** | **~609** | **~24,459** | **~6,115** | |


### File Volume After Optimization


| File | Lines | Bytes | Estimated Tokens | Change |
|------|-------|-------|------------------|--------|
| AGENTS.md | 56 | 2,278 | ~570 | ⬇️ **-81%** |
| MEMORY.md | 62 | 1,589 | ~397 | ⬇️ **-80%** |
| SOUL.md | 36 | 1,673 | ~418 | β€” |
| USER.md | 11 | 278 | ~70 | β€” |
| TOOLS.md | 34 | 827 | ~207 | β€” |
| HEARTBEAT.md | 28 | 1,681 | ~420 | β€” |
| **Total** | **~227** | **~8,326** | **~2,082** | ⬇️ **-66%** |


> Extracted detailed rules moved to `docs/` subdirectory (5 files, 9,452 bytes total), loaded on-demand by LLM via `read` tool, no longer injected with bootstrap.


### Cumulative Consumption Comparison by Usage Scenario


Assuming typical usage patterns:
- **Daily Conversation**: 10 rounds per day, average 500 tokens input + 200 tokens output per round
- **Lightweight Tasks**: 2 tasks per day, 3,000 tokens context each
- **Session Rebuild**: ~3 times per day `/new` or `/reset`


**Monthly Consumption Before Optimization**:
```
Daily conversation: 10 Γ— 700 = 7,000 tokens
Daily tasks: 2 Γ— 3,000 = 6,000 tokens
Bootstrap loading (Γ—3): 3 Γ— 6,115 = 18,345 tokens
────────────────────────────
Daily total: 31,345 tokens
Monthly total: 31,345 Γ— 30 β‰ˆ 940,350 tokens β‰ˆ 0.94M tokens
```


**Monthly Consumption After Optimization**:
```
Daily conversation: 10 Γ— 700 = 7,000 tokens
Daily tasks: 2 Γ— 3,000 = 6,000 tokens Β 
Bootstrap loading (Γ—3): 3 Γ— 2,082 = 6,246 tokens
────────────────────────────
Daily total: 19,246 tokens
Monthly total: 19,246 Γ— 30 β‰ˆ 577,380 tokens β‰ˆ 0.58M tokens
```


**Bootstrap optimization alone saves ~0.36M tokens/month**. Combined with other optimization techniques, total savings far exceed this number.


---


## Part 3: Optimization - Seven Core Techniques


### 1. Tree-Structured Document Architecture (Old: Single File β†’ New: Multi-Layer Index)


#### Optimization Principle


The more content crammed into an AI Agent's system prompt, the more it must "read before acting" for each inference. Traditional approaches mix all rules, indexes, and memories into one large file (e.g., AGENTS.md with 300 lines), requiring the LLM to process all 300 lines before thinking about your problem.


**Solution**: Shrink AGENTS.md and MEMORY.md to index files (<60 lines), split detailed rules by module into `docs/` subdirectory. The LLM sees only the index at startup and reads specific documents on demand.


```
workspace-qqclaw/
β”œβ”€β”€ AGENTS.md Β  Β  Β  Β  Β (56 lines) ← Top-level index, contains document tree
β”œβ”€β”€ MEMORY.md Β  Β  Β  Β  Β (62 lines) ← Summary memory
β”œβ”€β”€ docs/
β”‚ Β  β”œβ”€β”€ OPENROUTER.md Β (68 lines)
β”‚ Β  β”œβ”€β”€ WEB-SEARCH.md Β (43 lines)
β”‚ Β  β”œβ”€β”€ MEMORY-SYSTEM.md (64 lines)
β”‚ Β  β”œβ”€β”€ TRADE-MONITOR.md (97 lines)
β”‚ Β  └── MULTI-SEARCH.md Β (94 lines)
```


#### Measured Data


| Metric | Before Optimization | After Optimization | Savings |
|--------|---------------------|-------------------|---------|
| AGENTS.md tokens | ~3,000 | ~570 | **81%** |
| MEMORY.md tokens | ~2,000 | ~397 | **80%** |
| Bootstrap Total | ~6,115 | ~2,082 | **66%** |
| Total Documentation | ~24,459 bytes | ~17,778 bytes (incl. docs/) | Comparable lines, structural optimization |


#### Cost Savings (Monthly)


Based on Sonnet ($9/MT average) calculation:


```
Before: 6,115 tokens Γ— 3 sessions/day Γ— 30 days = 550,350 tokens Γ— $9/MT = $4.95/month (bootstrap only)
After: 2,082 tokens Γ— 3 sessions/day Γ— 30 days = 187,380 tokens Γ— $9/MT = $1.68/month (bootstrap only)
Savings: $0.27/month
```


Seems small? But this is just the bootstrap single-point saving. The real payoff: **smaller system prompts mean each conversation processes ~4,000 fewer tokens, compaction triggers less frequently, and more conversation rounds fit in the window.**


#### Resource Consumption Changes


| Resource | Change |
|----------|--------|
| Context per-round inference | ⬇️ ~4,000 tokens |
| Compaction trigger frequency | ⬇️ Delayed trigger (longer effective conversation) |
| LLM response latency | ⬇️ Slight decrease (reduced prompt processing) |
| Documentation maintainability | ⬆️ Improved (modular, changes don't affect others) |


---


### 2. AI Auto-Compression (Compaction)


#### Optimization Principle


OpenClaw's compaction mechanism is like "automatic conversation history summarization":


- When conversation history + system prompt approach the context limit, early conversations are compressed into summaries
- In `mode: safeguard`, the system automatically triggers at the safety boundary
- Compressed content is preserved in summary form, freeing space for new conversations


**Why does this save tokens?** If your context window is 200K tokens, without compression, each inference must process the full conversation history (possibly 50K-100K tokens). After compression, early history becomes a 1K-2K summary, and new conversations only need 10K-30K tokens.


#### Measured Comparison


| Scenario | No Compression | With Compression (safeguard) |
|----------|----------------|------------------------------|
| Context after 100 rounds | ~120,000 tokens | ~25,000 tokens |
| Per-round token consumption | ~1,200 (full) | ~600 (summary + new round) |
| LLM failure critical point | ~170 rounds | Theoretically infinite |


#### Cost Savings


Based on Haiku long conversation scenario:


```
No compression: 100 rounds Γ— 1,200 tokens/round = 120,000 tokens/day Γ— $9/MT Γ— 30 days = $32.4/month
With compression: 100 rounds Γ— 600 tokens/round = 60,000 tokens/day Γ— $9/MT Γ— 30 days = $16.2/month
Savings: $1.33/month
```


Calculated per day with 100 conversation rounds. Combined with other optimizations, total savings are even more impressive.


#### Resource Consumption Changes


| Resource | Change |
|----------|--------|
| GPU (inference burden) | ⬇️ Prompt processing reduced ~50% |
| Context utilization | ⬆️ Improved (more effective conversations) |
| Summary quality | ⚠️ Summaries may lose details (safeguard mode is conservative, low risk) |


---


### 3. Local Model Management of Lightweight Tasks (QMD / Ollama)


#### Optimization Principle


Not all tasks require large models. OpenClaw supports a tiered model strategy:


| Task Type | Model | Cost | Notes |
|-----------|-------|------|-------|
| Heartbeat Detection | Ollama qwen2.5:3b | **$0** | Local CPU inference |
| Security Audit | Ollama qwen2.5:3b | **$0** | Scans every 5 minutes |
| Memory Retrieval | QMD 2.1.0 | **$0** | Local semantic search |
| Complex Conversation | MiniMax / DeepSeek | Paid | Only complex tasks via cloud |


#### QMD Application


QMD (v2.1.0) is a local embedded vector retrieval engine for semantic search in the memory system:


- **Builtin**: SQLite + FTS5 full-text search (has latency)
- **QMD**: Standalone sidecar process, local vector search (zero latency, zero API cost)


```
QMD Search Process:
memory_search(query) β†’ QMD sidecar β†’ local embedding β†’ return Top-K results
No external API calls throughout, token consumption = 0
```


#### Measured Data


| Metric | Before | After | Savings |
|--------|--------|-------|---------|
| Heartbeat Detection (per 30min) | ~200 tokens/call cloud | 0 tokens (qwen local) | 100% |
| Security Audit (per 5min) | ~500 tokens/call cloud | 0 tokens (qwen local) | 100% |
| Memory Retrieval | ~1,500 tokens/call (cloud semantic search) | 0 tokens (QMD local) | 100% |


#### Cost Savings
Based On GPT5.4 Nano
```
Heartbeat Detection: 48 calls/day Γ— 200 tokens Γ— $0.73/MT Γ— 30 days = $0.21/month
Security Audit: 288 calls/day Γ— 500 tokens Γ— $0.73/MT Γ— 30 days = $3.20/month
Memory Retrieval: 10 calls/day Γ— 1,500 tokens Γ— $0.73/MT Γ— 30 days = $0.33/month
────────────────────────────────────────────────────────
Total Savings: ~$3.74/month
```


#### Resource Consumption Changes


| Resource | Change |
|----------|--------|
| CPU Usage | ⬆️ Slight increase (qwen 3b + QMD inference) |
| GPU Usage | ⬇️ Significant decrease (high-frequency tasks offline) |
| Network I/O | ⬇️ Fewer API calls |
| Response Speed | ⬆️ Faster local inference (no network latency) |


---


### 4. Direct Script-to-API Calls, Bypassing Bootstrap


#### Optimization Principle


Traditional approach: Have the LLM read workspace context β†’ analyze problem β†’ return result. But many repetitive tasks (like portfolio analysis, market briefs) can **directly call APIs with Python scripts**, bypassing the LLM bootstrap process.


```
Traditional Path (wasting tokens):
cron trigger β†’ LLM load bootstrap(2K tokens) β†’ understand task(500 tokens) β†’ curl API β†’ return(800 tokens)


Optimized Path (zero bootstrap):
cron trigger β†’ Python direct API call β†’ format output β†’ QQ push
Never passes through LLM


or


You use a medium level model like Minimax 2.7 token plan with 10$ monothly. And you ask your agent to rely on Minimax 2.7(Or local LLM). When you need to solve a complicated logic problem or difficult task, you ask your agent on Minimax to take only the necessary text out and send it to openrouter's bigger LLMs, thus bypassing the bootstarp
and the entire history context.
```


**Key Script**: Write a python sciprt such as "ask_openrouter.py",
and put it in your agent's workspace. When you feel the problem is complex and it is better to solve it with bigger and expensive models, you ask your agent to use this script to use openrouter's big models.


```python
# Minimal OpenRouter call β€” no workspace pollution
# Send pure requests directly, don't load AGENTS/SOUL/MEMORY etc.
```


Similarly, `ask_openrouter_search.py` bypasses the LLM for direct web searches. (Openrouter allows you to add a :online suffix to any model to enable web search)


#### Measured Data


Using "portfolio deep analysis" task as example:


| Path | Tokens per Task | Cost (Sonnet) |
|------|-----------------|----------------|
| Via LLM channel(even without any history) | ~9,000 tokens | $0.081 |
| Python direct call OpenRouter | ~1,200 tokens (API input/output only) | $0.01 |
| **Savings** | **87%** | β€” |


With 2 analysis tasks daily: **$0.007 Γ— 30 = $0.21/month** (this single task isn't expensive, but the pattern scales to all cron tasks)
If you are using 10 cron to do the news analysis,you save 2.1$/month


#### Resource Consumption Changes


| Resource | Change |
|----------|--------|
| GPU / LLM Inference | ⬇️ Bypass bootstrap, reduce ~3,300 tokens/task |
| Network Calls | β†’ Flat (still need API calls) |
| Maintainability | ⬆️ Python scripts more controllable than prompt engineering |


---


### 5. Console Commands Replace LLM Conversation


#### Optimization Principle


Many operations don't need LLM involvement at all. For example, restarting services, checking status, running known scripts β€” use shell's `exec` tool directly, no need for the LLM to "understand" and then "act".


PS: This part requires user's intervention


```
User: "Restart openclaw"
LLM method: Load bootstrap β†’ understand intent β†’ generate command β†’ execute β†’ return result
Β  Β  Β  Β  Β  Β  (~3,000 tokens wasted)


exec method: Directly execute openclaw gateway restart
Β  Β  Β  Β  Β  Β  Β (0 tokens)
```


#### Practical Application


| Scenario | LLM Channel | exec Direct | Savings |
|----------|-------------|-------------|---------|
| Restart Service | ~3,000 tokens | 0 tokens | 100% |
| Check Service Status | ~3,500 tokens | 0 tokens | 100% |
| Run Monitoring Script | ~2,000 tokens | 0 tokens | 100% |
| View Logs | ~2,500 tokens | 0 tokens | 100% |


Based on GPT5.4 nano:
With 5 such operations per day as example: **~14,000 tokens Γ— $0.74/MT Γ— 30 = $0.31/month**


#### Resource Consumption Changes


| Resource | Change |
|----------|--------|
| LLM Inference | ⬇️ Zero LLM for daily maintenance |
| Response Speed | ⬆️ Immediate execution (no LLM processing latency) |
| Error Rate | ⬇️ Deterministic execution (no hallucination risk) |


---


### 6. Daily Logic CPU-fication (Python Cron Direct Push)


#### Optimization Principle


If high-frequency scheduled tasks (like market monitoring, price pushes) go through LLM each time, token consumption is astronomical. Correct approach:


```
Python script β†’ fetch data β†’ conditional judgment β†’ push directly via QQ/notification channel
Never goes through LLM, GPU untouched
```


#### Implemented CPU-fied Tasks


| Task | Frequency | Method | Savings |
|------|-----------|--------|---------|
| πŸ“Š Intraday Monitoring | Every 10 min | Python `intraday_watch.py` direct IM push | 100% |
| πŸͺ™ BTC/ETH Monitoring | Every 15 min | Python `price_monitor.py` direct IM push | 100% |
| 🌀️ Airticket Check | Every 2 hours | Python `airticket_monitor.py` direct IM push | 100% |
| 🌑️ Weather Forecast | 2x per day | Python `weather_monitor.py` direct IM push | 100% |
| πŸ” Security Scan | Every 30 min | qwen2.5:3b local scan | 100% |


#### Measured Comparison


If these 5 tasks went through LLM (based on Sonnet):


```
Intraday: 144 calls/day Γ— 1,500 tokens = 216,000 tokens/day
BTC Monitoring: 96 calls/day Γ— 1,200 tokens = 115,200 tokens/day
METAR: 8 calls/day Γ— 1,000 tokens = 8,000 tokens/day
Weather: 2 calls/day Γ— 1,000 tokens = 2,000 tokens/day
Security Scan: 288 calls/day Γ— 500 tokens = 144,000 tokens/day
──────────────────────────────────────────────
Daily: 485,200 tokens β†’ 14.6M/month
Monthly Cost: 14.6M Γ— $9/MT = $131/month
```


**Actual Monthly Cost: $0** (all CPU-fied, zero LLM consumption)


#### Technical Implementation


```python
# intraday_check.py Core Logic
# 1. Fetch market data (LongBridge API)
# 2. Calculate volatility
# 3. Conditional judgment: index >1.2% or stock >2%
# 4. subprocess Popen direct IM push (8s timeout to prevent hang)
# 5. Zero LLM tokens throughout
```


#### Resource Consumption Changes


| Resource | Change |
|----------|--------|
| LLM API Calls | ⬇️ Reduce ~500 calls/day |
| CPU Usage | ⬆️ Python script polling cost |
| Real-time Performance | ⬆️ No need to wait for LLM processing |
| Reliability | ⬆️ Deterministic logic (no hallucinations) |


---


### 7. Intelligent Demands Pulled Back from LLM to CPU (Heartbeat Checklist-ification)


#### Optimization Principle


Heartbeat is OpenClaw's "heartbeat" mechanism β€” periodically triggers specific tasks. But if heartbeat content is written as vague natural language, the LLM must "understand" it each time.


**Checklist-ification Transformation**: Convert heartbeat prompts into structured execution checklists, run with the lightest model (qwen2.5:3b, locally free), only do simple status confirmation.


#### Transformation Comparison


Before Transformation (traditional heartbeat):
```
"You are a security audit expert, please check system configurations..."
β†’ LLM (MiniMax) must understand this prompt β†’ inference β†’ execution
β†’ ~500 tokens/call
```


After Transformation (checklist-ified heartbeat):
```yaml
heartbeat:
Β  model: ollama/qwen2.5:3b Β  Β  Β  Β # Local free model
Β  lightContext: true Β  Β  Β  Β  Β  Β  Β  # Don't load full workspace
Β  prompt: "Execute steps: 1.Read cron file 2.Check key leaks 3.Output HEARTBEAT_OK"
# If output == HEARTBEAT_OK β†’ nothing happens
# If output != HEARTBEAT_OK β†’ push to user
```


#### Measured Data


| Metric | Before | After | Savings |
|--------|--------|-------|---------|
| Heartbeat Model | GPT Nano ($0.75/MT) | qwen local ($0) | 100% |
| Tokens per Step | ~500 | ~200 (but free) | 100% cost |
| Context Mode | full (2K bootstrap) | lightContext (no bootstrap) | Additional 2K savings |
| Security Audit | MiniMax ($0.75/MT) | qwen local ($0) | 100% |


#### Cost Savings
Based on GPT5.4 Nano
```
Heartbeat Detection: 48 calls/day Γ— (200+2000) tokens Γ— $0.74/MT Γ— 30 = $2.34/month
Security Audit: 288 calls/day Γ— (500+2000) tokens Γ— $0.74/MT Γ— 30 = $12.00/month
──────────────────────────────────────────────────────────────
Total Savings: ~$14.34/month
```


> This is just single-agent user savings. Enterprise deployments (multiple agents, multiple workspaces) multiply this by the number of agents.


#### Resource Consumption Changes


| Resource | Change |
|----------|--------|
| API Cost | ⬇️ Two highest-frequency tasks completely offline |
| CPU Usage | ⬆️ qwen2.5:3b continuous running (~3GB RAM) |
| Security | ⬆️ Security audit independent of external APIs (data stays server-side) |
| Response Latency | ⬆️ Faster (local model inference <50ms) |


---


## Comprehensive Benefit Assessment


### Monthly Cost Comparison Summary
Based on mixed use of Sonnet, GPT 5.4 Nano...
| Optimization | Before | After | Monthly Savings |
|--------------|--------|-------|-----------------|
| β‘  Tree-Structured Documents | $0.41 | $0.14 | $0.27 |
| β‘‘ Compaction | $2.66 (no compression long-term session) | $1.33 | $1.33 |
| β‘’ Local Models (QMD + Ollama) | $3.74 | $0 | $3.74 |
| β‘£ Direct Script API 10 crons | $2.4 | $0.3 | $2.1 |
| β‘€ exec Replacing LLM | $0.31 | $0 | $0.31 |
| β‘₯ CPU-fying cron Tasks | $131 | $0 | $131 |
| ⑦ Heartbeat Checklist-ification | $14.34 | $0 | $14.34 |
| **Total** | **$154.86** | **$1.77** | **$153.09/month** |


> Above is conservative single-user usage estimation. In actual use, conversation patterns vary, savings fluctuate, but trends are consistent.


### Annualized Comparison


| Metric | Before Optimization | After Optimization |
|--------|---------------------|-------------------|
| Annual API Cost | ~$1829 | ~$200 |
| Annual Savings | β€” | **~$1629** |
| Carbon Reduction (est.) | β€” | ~95% API calls reduction |


### Beyond Just Saving Money


| Dimension | Benefit |
|-----------|---------|
| ⚑ **Response Speed** | High-frequency tasks from LLM polling β†’ CPU direct push, latency from seconds to milliseconds |
| πŸ”’ **Privacy & Security** | Memory retrieval and data audit localized, no need to upload to third-party APIs |
| πŸ›‘οΈ **Stability** | CPU tasks independent of API availability, won't break due to API downtime |
| πŸ“ **Maintainability** | Rule files modularized, changing one piece doesn't affect others |
| πŸ§ͺ **Testability** | Python scripts can be unit tested, LLM prompts rely only on "feelings" |


---


## Appendix 1: Model Pricing Reference


| Model | Provider | Input $/MT | Output $/MT | Average $/MT |
|-------|----------|-----------|-----------|-------------|
| MiniMax M2.7(fixed number monthly if token plan ) | MiniMax API | $0.279 | $1.20 | $0.74 |
| DeepSeek V3.2 | OpenRouter | $0.252 | $0.378 | $0.315 |
| DeepSeek V4 Flash | OpenRouter | $0.112 | $0.224 | $0.168 |
| DeepSeek V4 Pro | OpenRouter | $0.435 | $0.87 | $0.6525 |
| Gemini Flash 3 | OpenRouter | $0.25 | $1.5 | $0.875 |
| GPT-5.4 Nano | OpenRouter | $0.2 | $1.25 | $0.7 |
| GPT-5.5 Pro | OpenRouter | $30 | $180 | $105 |
| Claude Opus 4.7 | OpenRouter | $30 | $150 | $90 |
| Claude Sonnet 4.6 | OpenRouter | $3 | $15 | $9 |
| Grok 4.3 | OpenRouter | $1.25 | $2.5 | $1.875|
| Qwen2.5:3b | Ollama Local | $0 | $0 | $0 |


---


## Appendix 2: Vectorization of Skill Descriptors


If you install too many skills in openclaw, all skill descriptors will appear in your agent's system prompt. You can have your Agent install a RAG module, combined with openclaw message hooks, to intercept messages before sending to LLM, vectorize them, compare with local skill vector chunks, and only pull relevant skill portions into the system prompt. This can save you tens of thousands to hundreds of thousands of tokens.


---


## Conclusion


The core idea of seven optimization techniques can be summarized in one sentence:


> **Transform LLM from "all-purpose butler" to "expert advisor" β€” CPU-fy daily operations, let complex reasoning go to large models.**


This is not just a cost-saving strategy, but an architectural philosophy: AI Agent intelligence should be layered β€” high-frequency low-complexity tasks processed locally on CPU, low-frequency high-complexity tasks processed by cloud large models. This ensures both response speed and privacy while making every API dollar count.


---


*This document is based on OpenClaw 2026.4.23 testing. Data compiled on 2026-05-19 with sensitive information anonymized. Model prices based on current announcement at that time.*
reddit.com
u/dxzzzzzz β€” 15 hours ago

Can anyone help me to understand why hermes doesn't wait for /approve message? Mattermost, but I'm guessing it's every messenger behavior

u/zhandouminzu β€” 13 hours ago

Test driving Hermes what am I missing?

I have done my fair share of AI work, custom setups with vercel sdk, N8N, windmill, custom nestjs

Have Claude openai perplexity accounts and subs to open router and deepseek. Done a lot of skill and MCP creations

I was missing a setup that runs autonomous and does hard grunt work for me.

Test driving Hermes but a bit disappointed, I am a fan of opensource and appropriate this project but I am still missing all the hype.

- the Kanban setup I like the idea, but can't delete a task? No dropdown to select a profile?

- weird setup with profiles which don't really act as subagents?

- the UI sucks, really hard to read fonts and colors?

- no way to expand the logs per Kanban task properly? I am scrolling in a tiny window

- I like the gateway thing, but using telegram for a while.. it doesn't fit the purpose, shouldn't this project have a dedicated (web) app for managing the setup?

Within 10 minutes I had quirks/bugs and patches which don't survive updates

Watched some YT videos but it doesn't really seem to click for me in the ergonomics department. I feel like I am missing something obvious. I know exactly what I want but it seems a different thinking paradigm.

Again not hating, hope to get some tips or closure haha.

reddit.com
u/krimpenrik β€” 14 hours ago

Using Gemini and Claude Subscriptions on Hermes

Noob on Hermes here.

I have Google AI Pro and Claude Pro subscriptions and I wonder if I can use them on Hermes without paying for extra API fees.

I searched a bit and apparently there is no way to use Claude subscription on Hermes since it only uses API which falls under extra usage.

I have been using Gemini models as well with API from Google AI Studio so I have to pay extra for it. When I check my setup I see two options for Google;

Google AI Studio (Gemini models β€” native Gemini API) ← currently active

Google Gemini via OAuth + Code Assist (free tier supported; no API key needed)

My understanding was that second one uses my subscription and first one uses API key apparently which I am currently using. Is that correct at all?

reddit.com
u/helioserebuss β€” 17 hours ago
β–² 4 r/hermesagent+3 crossposts

For people who don't want to set up or manage Openclaw

Hey everyone β€” we’re building Zynth, a personal AI assistant on WhatsApp, and we’re slowly rolling out beta access as we scale up our infra.

The idea is simple: message it like you would message an assistant.

It can help with things like:
- daily news/topic briefs
- research and monitoring
- reminders and scheduled tasks
- summarizing links, files, emails, or notes
- creating small AI agents for recurring workflows
- connecting apps like Gmail, Calendar, Sheets, Slack, and more

We’re looking for early users to test it, break it, and tell us what use-cases they’d actually want an assistant like this to handle.

You can join the beta here:
https://zynth.ai/whatsapp-ai-agent

Would love feedback, feature requests, and examples of tasks you’d want to automate on WhatsApp.

u/nuanda92 β€” 12 hours ago

Built a single Hermes agent that trades across Stocks, Crypto & Polymarket using one API

I wanted to build one agent that could handle trading decisions across multiple markets β€” stocks, crypto, and Polymarket β€” without juggling different APIs and data formats.

Most financial data providers require separate integrations, different response structures, and inconsistent error handling. This quickly becomes messy when you’re trying to build a single autonomous agent.

I ended up using Kapit (an agent-native financial data API) as the single source of truth. It gives one consistent schema across stocks, crypto, and Polymarket, plus structured error recovery that agents can actually act on.

With this setup, I now have a single Hermes agent that can:

- Pull real-time stock and crypto prices

- Check Polymarket markets and probabilities

- Make triage decisions (e.g., compare crypto moves vs Polymarket events)

- Operate with just one API key and one response format

Still early, but it’s been much cleaner than managing multiple data sources.

reddit.com
u/Visible-Register56 β€” 14 hours ago

M4 Max 36gb, oMLX - model recommendation?

I have tried a few models, but cant find a suitable model thats useful and runs on my machine.

Gemma4:27b has good speed (outputting up to 40t/s) and is awesome for textbased stuff, but is sloppy AF when it comes to tool calling and agent work. With this setup i have a nice chat bot but is not very useful as an agent helper.

I tried Qwen3.6-27B-4bit, which runs so unbearably slow (1.5t/s) so i can not even test if it works better with hermes.

I tried GLM-4.7-Flash-MLX-6bit and it seems to have okay speed and at least can reliably call the tools, but seems to crash the omlx server frequently... thus not very usable

I know that 36gb memory is not a lot for local llms but is there a good sweet spot model for accuracy, tool calling and speed?

reddit.com
u/kentabenno β€” 19 hours ago

how do you secure secret keys on server running hermes agent with full terminal access

i am running hermes agent on VPS (nothing else on the machine). i gave it full terminal access outside Docker because inside Docker it felt too restricted β€” it could not do much. But now i realize it can potentially read all my API keys, .env files…

and now i am thinking about either: running it as a dedicated user with restricted file permissions, or inside docker with proper configuration

has anyone found a setup that keeps the agent fully functional while actually isolating secrets? would love to hear whats working for people…

reddit.com
u/mf-mj β€” 20 hours ago