u/dxzzzzzz

A comprehensive method to brutally reduce your Agentic AI token cost by at least 95%, aka a summary of current token reduction method. Running it for only 15$/month

The core concept

1.Organize your bootstrapping files in a tree-like structure. Let LLM indexing information, rather than load all of the agent/tool/skill markdown at once.

Just like how wo store a billion people's reddit ID in our disk and we won't use a list to store and retrieve it.(That's how agent frame work is doing with bootstrapping and system prompting)

We use a B-tree. We reduce the complexity from O(n) to O(log(n)). Same for LLM and make LLM get necessary information

2.Use AI to compress and compact bootstrapping files, and make these files indexing detailed knowledge

3.Layer your models: Use a very lightweight but long context capable model as the primary model. It will be able to capture your idea and understand your intention. Only switch to expensive SOTA models when you really have to deal with intelligence-heavy mission (You are overthrowing theory of general relativity), manually, or let your small model decide

4.Bootstrapping bypassed: Write a simple python script to direct send message to LLM. Ask your light weight model to take the minimum necessary context and information out from your huge chat history. And then send the important stuff to openrouter/GPT/Claude.

5.Using openclaw console commands and server terminal.

/new, /compact,/status,/usage full

Make good use of this commands to reduce token cost and get an idea of how fast you are burning your cash

Directly controlling your server, typing openclaw gateway restart yourself, instead of let AI do it to reduce related token cost

6.CPU-fying task. If most of your day is repeating certain logic, turn it into a python script. If you tend to anaylize stock using a specific technical approach, convert it into a python script or ask your light weight LLM to convert it for you.

Turn your repetitive tasks into crons, this take tasks from GPU to CPU and it will save you hundreds of $ per month.

If you want to analysis a data sheet, you just don't send the sheet to LLM, you ask your LLM to write a python pandas code to get the result

  1. Deal with heartbeat. Reduce its frequency.

Unless you are feeling deeply depressed, lack friends, or have an intense craving for new notifications from instant messaging apps, you should reduce the frequency of the "Heartbeat" function—or even disable it entirely. Heartbeat proactively pings the LLM using specific logic to prompt a response; however, this response does not actually help you resolve any substantive issues. You are a geek, most of your time is in front of a monitor so you are already always with your agent.

You might even consider offloading Heartbeat tasks to the CPU.

For instance, if you want to scan an agent's metadata to check for the presence of plaintext keys or tokens, there is clearly no need to have an LLM perform this check every half hour—a process that is both costly and insecure. You can resolve this entirely using regular expressions; the CPU can instantly scan all documents within the agent's directory and pinpoint any files containing leaked secrets.

Note: This is for openclaw but it also applies to any agent framework based on system prompt engineering.

You can simply ask your agent to read this document for you, and he will be able to understand it....You don't need to worry at all. Your agent will explain it to you how to use this method

Note: If you want to read it please switch to markdown version

# OpenClaw Token Optimization Techniques - Complete Analysis


> **Author**: User A · Agent-X  
> **Last Updated**: 2026-05-19  
> **Applicable Version**: OpenClaw 2026.4.23+  
> **Target Audience**: Technical personnel interested in AI Agent / LLM cost optimization  


---


## Table of Contents


- [OpenClaw Token Optimization Techniques - Complete Analysis](
#openclaw-token-optimization-techniques---complete-analysis
)
  - [Table of Contents](
#table-of-contents
)
  - [Part 1: Principles - Hidden Costs of System Prompts](
#part-1-principles---hidden-costs-of-system-prompts
)
    - [Bootstrap File Loading Mechanism](
#bootstrap-file-loading-mechanism
)
    - [Context Window and Compaction Mechanism](
#context-window-and-compaction-mechanism
)
    - [Fixed Overhead per New Session](
#fixed-overhead-per-new-session
)
  - [Part 2: Testing - Bootstrap File Quantitative Analysis](
#part-2-testing---bootstrap-file-quantitative-analysis
)
    - [File Volume Before Optimization](
#file-volume-before-optimization
)
    - [File Volume After Optimization](
#file-volume-after-optimization
)
    - [Cumulative Consumption Comparison by Usage Scenario](
#cumulative-consumption-comparison-by-usage-scenario
)
  - [Part 3: Optimization - Seven Core Techniques](
#part-3-optimization---seven-core-techniques
)
    - [1. Tree-Structured Document Architecture (Old: Single File → New: Multi-Layer Index)](
#1-tree-structured-document-architecture-old-single-file--new-multi-layer-index
)
      - [Optimization Principle](
#optimization-principle
)
      - [Measured Data](
#measured-data
)
      - [Cost Savings (Monthly)](
#cost-savings-monthly
)
      - [Resource Consumption Changes](
#resource-consumption-changes
)
    - [2. AI Auto-Compression (Compaction)](
#2-ai-auto-compression-compaction
)
      - [Optimization Principle](
#optimization-principle-1
)
      - [Measured Comparison](
#measured-comparison
)
      - [Cost Savings](
#cost-savings
)
      - [Resource Consumption Changes](
#resource-consumption-changes-1
)
    - [3. Local Model Management of Lightweight Tasks (QMD / Ollama)](
#3-local-model-management-of-lightweight-tasks-qmd--ollama
)
      - [Optimization Principle](
#optimization-principle-2
)
      - [QMD Application](
#qmd-application
)
      - [Measured Data](
#measured-data-1
)
      - [Cost Savings](
#cost-savings-1
)
      - [Resource Consumption Changes](
#resource-consumption-changes-2
)
    - [4. Direct Script-to-API Calls, Bypassing Bootstrap](
#4-direct-script-to-api-calls-bypassing-bootstrap
)
      - [Optimization Principle](
#optimization-principle-3
)
      - [Measured Data](
#measured-data-2
)
      - [Resource Consumption Changes](
#resource-consumption-changes-3
)
    - [5. Console Commands Replace LLM Conversation](
#5-console-commands-replace-llm-conversation
)
      - [Optimization Principle](
#optimization-principle-4
)
      - [Practical Application](
#practical-application
)
      - [Resource Consumption Changes](
#resource-consumption-changes-4
)
    - [6. Daily Logic CPU-fication (Python Cron Direct Push)](
#6-daily-logic-cpu-fication-python-cron-direct-push
)
      - [Optimization Principle](
#optimization-principle-5
)
      - [Implemented CPU-fied Tasks](
#implemented-cpu-fied-tasks
)
      - [Measured Comparison](
#measured-comparison-1
)
      - [Technical Implementation](
#technical-implementation
)
      - [Resource Consumption Changes](
#resource-consumption-changes-5
)
    - [7. Intelligent Demands Pulled Back from LLM to CPU (Heartbeat Checklist-ification)](
#7-intelligent-demands-pulled-back-from-llm-to-cpu-heartbeat-checklist-ification
)
      - [Optimization Principle](
#optimization-principle-6
)
      - [Transformation Comparison](
#transformation-comparison
)
      - [Measured Data](
#measured-data-3
)
      - [Cost Savings](
#cost-savings-2
)
      - [Resource Consumption Changes](
#resource-consumption-changes-6
)
  - [Comprehensive Benefit Assessment](
#comprehensive-benefit-assessment
)
    - [Monthly Cost Comparison Summary](
#monthly-cost-comparison-summary
)
    - [Annualized Comparison](
#annualized-comparison
)
    - [Beyond Just Saving Money](
#beyond-just-saving-money
)
  - [Appendix 1: Model Pricing Reference](
#appendix-1-model-pricing-reference
)
  - [Appendix 2: Vectorization of Skill Descriptors](
#appendix-2-vectorization-of-skill-descriptors
)
  - [Conclusion](
#conclusion
)


---


## Part 1: Principles - Hidden Costs of System Prompts


### Bootstrap File Loading Mechanism


Each time `/new` or `/reset` is executed to create a new session, the OpenClaw runtime automatically loads the following content as **System Prompt + Startup Context**:


| File | Loading Method | Purpose |
|------|----------------|---------|
| `AGENTS.md` | System Prompt Injection | Agent behavior instruction tree |
| `SOUL.md` | System Prompt Injection | Personality definition |
| `USER.md` | System Prompt Injection | User information |
| `HEARTBEAT.md` | System Prompt Injection | Scheduled task checklist |
| `TOOLS.md` | System Prompt Injection | Local tool configuration |
| `MEMORY.md` | Startup Context | Long-term memory |
| `memory/*.md` (past 2 days) | Startup Context | Daily work logs (≤2800 characters) |


These files are **not visible in the conversation history**, but **consume actual context window**. Every LLM inference must process this content.


### Context Window and Compaction Mechanism


OpenClaw's compaction mechanism uses a `mode: safeguard` strategy:


- **Trigger Condition**: Automatically triggered when conversation history + bootstrap approach the context limit
- **Compression Method**: Generate summaries of early conversations, retain recent details
- **Problem**: If the bootstrap file itself is large, less space remains for actual conversations, compaction triggers more frequently, and each compaction consumes tokens


### Fixed Overhead per New Session


Using the default model MiniMax M2.7 (200K context window) as an example:


> **Before Optimization**: bootstrap ~25,000 bytes ≈ ~6,250 tokens  
> **After Optimization**: bootstrap ~8,300 bytes ≈ ~2,075 tokens  
> 
> Each session startup saves **~4,175 tokens**, not including subsequent chain effects of compaction in conversations.


The same principle applies to models like DeepSeek V3.2 (200K context). If your daily usage involves frequent `/new` / `/reset` (e.g., task switching, context cleanup), the savings double.


---


## Part 2: Testing - Bootstrap File Quantitative Analysis


> All data below are based on actual file measurements. Sensitive content has been anonymized: usernames → "User A", Agent names → "Agent-X".


### File Volume Before Optimization


| File | Lines | Bytes | Estimated Tokens | Main Content |
|------|-------|-------|------------------|--------------|
| AGENTS.md | ~300 | ~12,000 | ~3,000 | Behavior rules, skill index, memory rules, quick decisions all mixed |
| MEMORY.md | ~200 | ~8,000 | ~2,000 | Holdings info, built systems, technical architecture, user goals |
| SOUL.md | 36 | 1,673 | ~418 | Personality definition |
| USER.md | 11 | 278 | ~70 | Username/timezone/preferences |
| TOOLS.md | 34 | 827 | ~207 | Search toolchain, local configuration |
| HEARTBEAT.md | 28 | 1,681 | ~420 | Heartbeat checklist |
| **Total** | **~609** | **~24,459** | **~6,115** | |


### File Volume After Optimization


| File | Lines | Bytes | Estimated Tokens | Change |
|------|-------|-------|------------------|--------|
| AGENTS.md | 56 | 2,278 | ~570 | ⬇️ **-81%** |
| MEMORY.md | 62 | 1,589 | ~397 | ⬇️ **-80%** |
| SOUL.md | 36 | 1,673 | ~418 | — |
| USER.md | 11 | 278 | ~70 | — |
| TOOLS.md | 34 | 827 | ~207 | — |
| HEARTBEAT.md | 28 | 1,681 | ~420 | — |
| **Total** | **~227** | **~8,326** | **~2,082** | ⬇️ **-66%** |


> Extracted detailed rules moved to `docs/` subdirectory (5 files, 9,452 bytes total), loaded on-demand by LLM via `read` tool, no longer injected with bootstrap.


### Cumulative Consumption Comparison by Usage Scenario


Assuming typical usage patterns:
- **Daily Conversation**: 10 rounds per day, average 500 tokens input + 200 tokens output per round
- **Lightweight Tasks**: 2 tasks per day, 3,000 tokens context each
- **Session Rebuild**: ~3 times per day `/new` or `/reset`


**Monthly Consumption Before Optimization**:
```
Daily conversation: 10 × 700 = 7,000 tokens
Daily tasks: 2 × 3,000 = 6,000 tokens
Bootstrap loading (×3): 3 × 6,115 = 18,345 tokens
────────────────────────────
Daily total: 31,345 tokens
Monthly total: 31,345 × 30 ≈ 940,350 tokens ≈ 0.94M tokens
```


**Monthly Consumption After Optimization**:
```
Daily conversation: 10 × 700 = 7,000 tokens
Daily tasks: 2 × 3,000 = 6,000 tokens  
Bootstrap loading (×3): 3 × 2,082 = 6,246 tokens
────────────────────────────
Daily total: 19,246 tokens
Monthly total: 19,246 × 30 ≈ 577,380 tokens ≈ 0.58M tokens
```


**Bootstrap optimization alone saves ~0.36M tokens/month**. Combined with other optimization techniques, total savings far exceed this number.


---


## Part 3: Optimization - Seven Core Techniques


### 1. Tree-Structured Document Architecture (Old: Single File → New: Multi-Layer Index)


#### Optimization Principle


The more content crammed into an AI Agent's system prompt, the more it must "read before acting" for each inference. Traditional approaches mix all rules, indexes, and memories into one large file (e.g., AGENTS.md with 300 lines), requiring the LLM to process all 300 lines before thinking about your problem.


**Solution**: Shrink AGENTS.md and MEMORY.md to index files (<60 lines), split detailed rules by module into `docs/` subdirectory. The LLM sees only the index at startup and reads specific documents on demand.


```
workspace-qqclaw/
├── AGENTS.md          (56 lines) ← Top-level index, contains document tree
├── MEMORY.md          (62 lines) ← Summary memory
├── docs/
│   ├── OPENROUTER.md  (68 lines)
│   ├── WEB-SEARCH.md  (43 lines)
│   ├── MEMORY-SYSTEM.md (64 lines)
│   ├── TRADE-MONITOR.md (97 lines)
│   └── MULTI-SEARCH.md  (94 lines)
```


#### Measured Data


| Metric | Before Optimization | After Optimization | Savings |
|--------|---------------------|-------------------|---------|
| AGENTS.md tokens | ~3,000 | ~570 | **81%** |
| MEMORY.md tokens | ~2,000 | ~397 | **80%** |
| Bootstrap Total | ~6,115 | ~2,082 | **66%** |
| Total Documentation | ~24,459 bytes | ~17,778 bytes (incl. docs/) | Comparable lines, structural optimization |


#### Cost Savings (Monthly)


Based on Sonnet ($9/MT average) calculation:


```
Before: 6,115 tokens × 3 sessions/day × 30 days = 550,350 tokens × $9/MT = $4.95/month (bootstrap only)
After: 2,082 tokens × 3 sessions/day × 30 days = 187,380 tokens × $9/MT = $1.68/month (bootstrap only)
Savings: $0.27/month
```


Seems small? But this is just the bootstrap single-point saving. The real payoff: **smaller system prompts mean each conversation processes ~4,000 fewer tokens, compaction triggers less frequently, and more conversation rounds fit in the window.**


#### Resource Consumption Changes


| Resource | Change |
|----------|--------|
| Context per-round inference | ⬇️ ~4,000 tokens |
| Compaction trigger frequency | ⬇️ Delayed trigger (longer effective conversation) |
| LLM response latency | ⬇️ Slight decrease (reduced prompt processing) |
| Documentation maintainability | ⬆️ Improved (modular, changes don't affect others) |


---


### 2. AI Auto-Compression (Compaction)


#### Optimization Principle


OpenClaw's compaction mechanism is like "automatic conversation history summarization":


- When conversation history + system prompt approach the context limit, early conversations are compressed into summaries
- In `mode: safeguard`, the system automatically triggers at the safety boundary
- Compressed content is preserved in summary form, freeing space for new conversations


**Why does this save tokens?** If your context window is 200K tokens, without compression, each inference must process the full conversation history (possibly 50K-100K tokens). After compression, early history becomes a 1K-2K summary, and new conversations only need 10K-30K tokens.


#### Measured Comparison


| Scenario | No Compression | With Compression (safeguard) |
|----------|----------------|------------------------------|
| Context after 100 rounds | ~120,000 tokens | ~25,000 tokens |
| Per-round token consumption | ~1,200 (full) | ~600 (summary + new round) |
| LLM failure critical point | ~170 rounds | Theoretically infinite |


#### Cost Savings


Based on Haiku long conversation scenario:


```
No compression: 100 rounds × 1,200 tokens/round = 120,000 tokens/day × $9/MT × 30 days = $32.4/month
With compression: 100 rounds × 600 tokens/round = 60,000 tokens/day × $9/MT × 30 days = $16.2/month
Savings: $1.33/month
```


Calculated per day with 100 conversation rounds. Combined with other optimizations, total savings are even more impressive.


#### Resource Consumption Changes


| Resource | Change |
|----------|--------|
| GPU (inference burden) | ⬇️ Prompt processing reduced ~50% |
| Context utilization | ⬆️ Improved (more effective conversations) |
| Summary quality | ⚠️ Summaries may lose details (safeguard mode is conservative, low risk) |


---


### 3. Local Model Management of Lightweight Tasks (QMD / Ollama)


#### Optimization Principle


Not all tasks require large models. OpenClaw supports a tiered model strategy:


| Task Type | Model | Cost | Notes |
|-----------|-------|------|-------|
| Heartbeat Detection | Ollama qwen2.5:3b | **$0** | Local CPU inference |
| Security Audit | Ollama qwen2.5:3b | **$0** | Scans every 5 minutes |
| Memory Retrieval | QMD 2.1.0 | **$0** | Local semantic search |
| Complex Conversation | MiniMax / DeepSeek | Paid | Only complex tasks via cloud |


#### QMD Application


QMD (v2.1.0) is a local embedded vector retrieval engine for semantic search in the memory system:


- **Builtin**: SQLite + FTS5 full-text search (has latency)
- **QMD**: Standalone sidecar process, local vector search (zero latency, zero API cost)


```
QMD Search Process:
memory_search(query) → QMD sidecar → local embedding → return Top-K results
No external API calls throughout, token consumption = 0
```


#### Measured Data


| Metric | Before | After | Savings |
|--------|--------|-------|---------|
| Heartbeat Detection (per 30min) | ~200 tokens/call cloud | 0 tokens (qwen local) | 100% |
| Security Audit (per 5min) | ~500 tokens/call cloud | 0 tokens (qwen local) | 100% |
| Memory Retrieval | ~1,500 tokens/call (cloud semantic search) | 0 tokens (QMD local) | 100% |


#### Cost Savings
Based On GPT5.4 Nano
```
Heartbeat Detection: 48 calls/day × 200 tokens × $0.73/MT × 30 days = $0.21/month
Security Audit: 288 calls/day × 500 tokens × $0.73/MT × 30 days = $3.20/month
Memory Retrieval: 10 calls/day × 1,500 tokens × $0.73/MT × 30 days = $0.33/month
────────────────────────────────────────────────────────
Total Savings: ~$3.74/month
```


#### Resource Consumption Changes


| Resource | Change |
|----------|--------|
| CPU Usage | ⬆️ Slight increase (qwen 3b + QMD inference) |
| GPU Usage | ⬇️ Significant decrease (high-frequency tasks offline) |
| Network I/O | ⬇️ Fewer API calls |
| Response Speed | ⬆️ Faster local inference (no network latency) |


---


### 4. Direct Script-to-API Calls, Bypassing Bootstrap


#### Optimization Principle


Traditional approach: Have the LLM read workspace context → analyze problem → return result. But many repetitive tasks (like portfolio analysis, market briefs) can **directly call APIs with Python scripts**, bypassing the LLM bootstrap process.


```
Traditional Path (wasting tokens):
cron trigger → LLM load bootstrap(2K tokens) → understand task(500 tokens) → curl API → return(800 tokens)


Optimized Path (zero bootstrap):
cron trigger → Python direct API call → format output → QQ push
Never passes through LLM


or


You use a medium level model like Minimax 2.7 token plan with 10$ monothly. And you ask your agent to rely on Minimax 2.7(Or local LLM). When you need to solve a complicated logic problem or difficult task, you ask your agent on Minimax to take only the necessary text out and send it to openrouter's bigger LLMs, thus bypassing the bootstarp
and the entire history context.
```


**Key Script**: Write a python sciprt such as "ask_openrouter.py",
and put it in your agent's workspace. When you feel the problem is complex and it is better to solve it with bigger and expensive models, you ask your agent to use this script to use openrouter's big models.


```python
# Minimal OpenRouter call — no workspace pollution
# Send pure requests directly, don't load AGENTS/SOUL/MEMORY etc.
```


Similarly, `ask_openrouter_search.py` bypasses the LLM for direct web searches. (Openrouter allows you to add a :online suffix to any model to enable web search)


#### Measured Data


Using "portfolio deep analysis" task as example:


| Path | Tokens per Task | Cost (Sonnet) |
|------|-----------------|----------------|
| Via LLM channel(even without any history) | ~9,000 tokens | $0.081 |
| Python direct call OpenRouter | ~1,200 tokens (API input/output only) | $0.01 |
| **Savings** | **87%** | — |


With 2 analysis tasks daily: **$0.007 × 30 = $0.21/month** (this single task isn't expensive, but the pattern scales to all cron tasks)
If you are using 10 cron to do the news analysis,you save 2.1$/month


#### Resource Consumption Changes


| Resource | Change |
|----------|--------|
| GPU / LLM Inference | ⬇️ Bypass bootstrap, reduce ~3,300 tokens/task |
| Network Calls | → Flat (still need API calls) |
| Maintainability | ⬆️ Python scripts more controllable than prompt engineering |


---


### 5. Console Commands Replace LLM Conversation


#### Optimization Principle


Many operations don't need LLM involvement at all. For example, restarting services, checking status, running known scripts — use shell's `exec` tool directly, no need for the LLM to "understand" and then "act".


PS: This part requires user's intervention


```
User: "Restart openclaw"
LLM method: Load bootstrap → understand intent → generate command → execute → return result
            (~3,000 tokens wasted)


exec method: Directly execute openclaw gateway restart
             (0 tokens)
```


#### Practical Application


| Scenario | LLM Channel | exec Direct | Savings |
|----------|-------------|-------------|---------|
| Restart Service | ~3,000 tokens | 0 tokens | 100% |
| Check Service Status | ~3,500 tokens | 0 tokens | 100% |
| Run Monitoring Script | ~2,000 tokens | 0 tokens | 100% |
| View Logs | ~2,500 tokens | 0 tokens | 100% |


Based on GPT5.4 nano:
With 5 such operations per day as example: **~14,000 tokens × $0.74/MT × 30 = $0.31/month**


#### Resource Consumption Changes


| Resource | Change |
|----------|--------|
| LLM Inference | ⬇️ Zero LLM for daily maintenance |
| Response Speed | ⬆️ Immediate execution (no LLM processing latency) |
| Error Rate | ⬇️ Deterministic execution (no hallucination risk) |


---


### 6. Daily Logic CPU-fication (Python Cron Direct Push)


#### Optimization Principle


If high-frequency scheduled tasks (like market monitoring, price pushes) go through LLM each time, token consumption is astronomical. Correct approach:


```
Python script → fetch data → conditional judgment → push directly via QQ/notification channel
Never goes through LLM, GPU untouched
```


#### Implemented CPU-fied Tasks


| Task | Frequency | Method | Savings |
|------|-----------|--------|---------|
| 📊 Intraday Monitoring | Every 10 min | Python `intraday_watch.py` direct IM push | 100% |
| 🪙 BTC/ETH Monitoring | Every 15 min | Python `price_monitor.py` direct IM push | 100% |
| 🌤️ Airticket Check | Every 2 hours | Python `airticket_monitor.py` direct IM push | 100% |
| 🌡️ Weather Forecast | 2x per day | Python `weather_monitor.py` direct IM push | 100% |
| 🔐 Security Scan | Every 30 min | qwen2.5:3b local scan | 100% |


#### Measured Comparison


If these 5 tasks went through LLM (based on Sonnet):


```
Intraday: 144 calls/day × 1,500 tokens = 216,000 tokens/day
BTC Monitoring: 96 calls/day × 1,200 tokens = 115,200 tokens/day
METAR: 8 calls/day × 1,000 tokens = 8,000 tokens/day
Weather: 2 calls/day × 1,000 tokens = 2,000 tokens/day
Security Scan: 288 calls/day × 500 tokens = 144,000 tokens/day
──────────────────────────────────────────────
Daily: 485,200 tokens → 14.6M/month
Monthly Cost: 14.6M × $9/MT = $131/month
```


**Actual Monthly Cost: $0** (all CPU-fied, zero LLM consumption)


#### Technical Implementation


```python
# intraday_check.py Core Logic
# 1. Fetch market data (LongBridge API)
# 2. Calculate volatility
# 3. Conditional judgment: index >1.2% or stock >2%
# 4. subprocess Popen direct IM push (8s timeout to prevent hang)
# 5. Zero LLM tokens throughout
```


#### Resource Consumption Changes


| Resource | Change |
|----------|--------|
| LLM API Calls | ⬇️ Reduce ~500 calls/day |
| CPU Usage | ⬆️ Python script polling cost |
| Real-time Performance | ⬆️ No need to wait for LLM processing |
| Reliability | ⬆️ Deterministic logic (no hallucinations) |


---


### 7. Intelligent Demands Pulled Back from LLM to CPU (Heartbeat Checklist-ification)


#### Optimization Principle


Heartbeat is OpenClaw's "heartbeat" mechanism — periodically triggers specific tasks. But if heartbeat content is written as vague natural language, the LLM must "understand" it each time.


**Checklist-ification Transformation**: Convert heartbeat prompts into structured execution checklists, run with the lightest model (qwen2.5:3b, locally free), only do simple status confirmation.


#### Transformation Comparison


Before Transformation (traditional heartbeat):
```
"You are a security audit expert, please check system configurations..."
→ LLM (MiniMax) must understand this prompt → inference → execution
→ ~500 tokens/call
```


After Transformation (checklist-ified heartbeat):
```yaml
heartbeat:
  model: ollama/qwen2.5:3b        # Local free model
  lightContext: true               # Don't load full workspace
  prompt: "Execute steps: 1.Read cron file 2.Check key leaks 3.Output HEARTBEAT_OK"
# If output == HEARTBEAT_OK → nothing happens
# If output != HEARTBEAT_OK → push to user
```


#### Measured Data


| Metric | Before | After | Savings |
|--------|--------|-------|---------|
| Heartbeat Model | GPT Nano ($0.75/MT) | qwen local ($0) | 100% |
| Tokens per Step | ~500 | ~200 (but free) | 100% cost |
| Context Mode | full (2K bootstrap) | lightContext (no bootstrap) | Additional 2K savings |
| Security Audit | MiniMax ($0.75/MT) | qwen local ($0) | 100% |


#### Cost Savings
Based on GPT5.4 Nano
```
Heartbeat Detection: 48 calls/day × (200+2000) tokens × $0.74/MT × 30 = $2.34/month
Security Audit: 288 calls/day × (500+2000) tokens × $0.74/MT × 30 = $12.00/month
──────────────────────────────────────────────────────────────
Total Savings: ~$14.34/month
```


> This is just single-agent user savings. Enterprise deployments (multiple agents, multiple workspaces) multiply this by the number of agents.


#### Resource Consumption Changes


| Resource | Change |
|----------|--------|
| API Cost | ⬇️ Two highest-frequency tasks completely offline |
| CPU Usage | ⬆️ qwen2.5:3b continuous running (~3GB RAM) |
| Security | ⬆️ Security audit independent of external APIs (data stays server-side) |
| Response Latency | ⬆️ Faster (local model inference <50ms) |


---


## Comprehensive Benefit Assessment


### Monthly Cost Comparison Summary
Based on mixed use of Sonnet, GPT 5.4 Nano...
| Optimization | Before | After | Monthly Savings |
|--------------|--------|-------|-----------------|
| ① Tree-Structured Documents | $0.41 | $0.14 | $0.27 |
| ② Compaction | $2.66 (no compression long-term session) | $1.33 | $1.33 |
| ③ Local Models (QMD + Ollama) | $3.74 | $0 | $3.74 |
| ④ Direct Script API 10 crons | $2.4 | $0.3 | $2.1 |
| ⑤ exec Replacing LLM | $0.31 | $0 | $0.31 |
| ⑥ CPU-fying cron Tasks | $131 | $0 | $131 |
| ⑦ Heartbeat Checklist-ification | $14.34 | $0 | $14.34 |
| **Total** | **$154.86** | **$1.77** | **$153.09/month** |


> Above is conservative single-user usage estimation. In actual use, conversation patterns vary, savings fluctuate, but trends are consistent.


### Annualized Comparison


| Metric | Before Optimization | After Optimization |
|--------|---------------------|-------------------|
| Annual API Cost | ~$1829 | ~$200 |
| Annual Savings | — | **~$1629** |
| Carbon Reduction (est.) | — | ~95% API calls reduction |


### Beyond Just Saving Money


| Dimension | Benefit |
|-----------|---------|
| ⚡ **Response Speed** | High-frequency tasks from LLM polling → CPU direct push, latency from seconds to milliseconds |
| 🔒 **Privacy & Security** | Memory retrieval and data audit localized, no need to upload to third-party APIs |
| 🛡️ **Stability** | CPU tasks independent of API availability, won't break due to API downtime |
| 📐 **Maintainability** | Rule files modularized, changing one piece doesn't affect others |
| 🧪 **Testability** | Python scripts can be unit tested, LLM prompts rely only on "feelings" |


---


## Appendix 1: Model Pricing Reference


| Model | Provider | Input $/MT | Output $/MT | Average $/MT |
|-------|----------|-----------|-----------|-------------|
| MiniMax M2.7(fixed number monthly if token plan ) | MiniMax API | $0.279 | $1.20 | $0.74 |
| DeepSeek V3.2 | OpenRouter | $0.252 | $0.378 | $0.315 |
| DeepSeek V4 Flash | OpenRouter | $0.112 | $0.224 | $0.168 |
| DeepSeek V4 Pro | OpenRouter | $0.435 | $0.87 | $0.6525 |
| Gemini Flash 3 | OpenRouter | $0.25 | $1.5 | $0.875 |
| GPT-5.4 Nano | OpenRouter | $0.2 | $1.25 | $0.7 |
| GPT-5.5 Pro | OpenRouter | $30 | $180 | $105 |
| Claude Opus 4.7 | OpenRouter | $30 | $150 | $90 |
| Claude Sonnet 4.6 | OpenRouter | $3 | $15 | $9 |
| Grok 4.3 | OpenRouter | $1.25 | $2.5 | $1.875|
| Qwen2.5:3b | Ollama Local | $0 | $0 | $0 |


---


## Appendix 2: Vectorization of Skill Descriptors


If you install too many skills in openclaw, all skill descriptors will appear in your agent's system prompt. You can have your Agent install a RAG module, combined with openclaw message hooks, to intercept messages before sending to LLM, vectorize them, compare with local skill vector chunks, and only pull relevant skill portions into the system prompt. This can save you tens of thousands to hundreds of thousands of tokens.


---


## Conclusion


The core idea of seven optimization techniques can be summarized in one sentence:


> **Transform LLM from "all-purpose butler" to "expert advisor" — CPU-fy daily operations, let complex reasoning go to large models.**


This is not just a cost-saving strategy, but an architectural philosophy: AI Agent intelligence should be layered — high-frequency low-complexity tasks processed locally on CPU, low-frequency high-complexity tasks processed by cloud large models. This ensures both response speed and privacy while making every API dollar count.


---


*This document is based on OpenClaw 2026.4.23 testing. Data compiled on 2026-05-19 with sensitive information anonymized. Model prices based on current announcement at that time.*
reddit.com
u/dxzzzzzz — 16 hours ago

A comprehensive method to brutally reduce your Agentic AI token cost by at least 95%, aka a summary of current token reduction method

The core concept

1.Organize your bootstrapping files in a tree-like structure. Let LLM indexing information, rather than load all of the agent/tool/skill markdown at once.

Just like how wo store a billion people's reddit ID in our disk and we won't use a list to store and retrieve it.(That's how agent frame work is doing with bootstrapping and system prompting)

We use a B-tree. We reduce the complexity from O(n) to O(log(n)). Same for LLM and make LLM get necessary information

2.Use AI to compress and compact bootstrapping files, and make these files indexing detailed knowledge

3.Layer your models: Use a very lightweight but long context capable model as the primary model. It will be able to capture your idea and understand your intention. Only switch to expensive SOTA models when you really have to deal with intelligence-heavy mission (You are overthrowing theory of general relativity), manually, or let your small model decide

4.Bootstrapping bypassed: Write a simple python script to direct send message to LLM. Ask your light weight model to take the minimum necessary context and information out from your huge chat history. And then send the important stuff to openrouter/GPT/Claude.

5.Using openclaw console commands and server terminal.

/new, /compact,/status,/usage full

Make good use of this commands to reduce token cost and get an idea of how fast you are burning your cash

Directly controlling your server, typing openclaw gateway restart yourself, instead of let AI do it to reduce related token cost

6.CPU-fying task. If most of your day is repeating certain logic, turn it into a python script. If you tend to anaylize stock using a specific technical approach, convert it into a python script or ask your light weight LLM to convert it for you.

Turn your repetitive tasks into crons, this take tasks from GPU to CPU and it will save you hundreds of $ per month.

If you want to analysis a data sheet, you just don't send the sheet to LLM, you ask your LLM to write a python pandas code to get the result.

You have a powerful CPU on your client device. It computes blazing fast, make use of it rather than outsourcing all your intelligence work to a whole rack of GPUs in data center

  1. Deal with heartbeat. Reduce its frequency.

Unless you are feeling deeply depressed, lack friends, or have an intense craving for new notifications from instant messaging apps, you should reduce the frequency of the "Heartbeat" function—or even disable it entirely. Heartbeat proactively pings the LLM using specific logic to prompt a response; however, this response does not actually help you resolve any substantive issues. You are a geek, most of your time is in front of a monitor so you are already always with your agent.

You might even consider offloading Heartbeat tasks to the CPU.

For instance, if you want to scan an agent's metadata to check for the presence of plaintext keys or tokens, there is clearly no need to have an LLM perform this check every half hour—a process that is both costly and insecure. You can resolve this entirely using regular expressions; the CPU can instantly scan all documents within the agent's directory and pinpoint any files containing leaked secrets.

You can simply ask your agent to read this document for you, and he will be able to understand it....You don't need to worry at all. Your agent will explain it to you how to use this method

Note: If you want to read it please switch to markdown version

# OpenClaw Token Optimization Techniques - Complete Analysis


> 
**Author**
: User A · Agent-X  
> 
**Last Updated**
: 2026-05-19  
> 
**Applicable Version**
: OpenClaw 2026.4.23+  
> 
**Target Audience**
: Technical personnel interested in AI Agent / LLM cost optimization  


---


## Table of Contents


- [OpenClaw Token Optimization Techniques - Complete Analysis](
#openclaw-token-optimization-techniques---complete-analysis
)
  - [Table of Contents](
#table-of-contents
)
  - [Part 1: Principles - Hidden Costs of System Prompts](
#part-1-principles---hidden-costs-of-system-prompts
)
    - [Bootstrap File Loading Mechanism](
#bootstrap-file-loading-mechanism
)
    - [Context Window and Compaction Mechanism](
#context-window-and-compaction-mechanism
)
    - [Fixed Overhead per New Session](
#fixed-overhead-per-new-session
)
  - [Part 2: Testing - Bootstrap File Quantitative Analysis](
#part-2-testing---bootstrap-file-quantitative-analysis
)
    - [File Volume Before Optimization](
#file-volume-before-optimization
)
    - [File Volume After Optimization](
#file-volume-after-optimization
)
    - [Cumulative Consumption Comparison by Usage Scenario](
#cumulative-consumption-comparison-by-usage-scenario
)
  - [Part 3: Optimization - Seven Core Techniques](
#part-3-optimization---seven-core-techniques
)
    - [1. Tree-Structured Document Architecture (Old: Single File → New: Multi-Layer Index)](
#1-tree-structured-document-architecture-old-single-file--new-multi-layer-index
)
      - [Optimization Principle](
#optimization-principle
)
      - [Measured Data](
#measured-data
)
      - [Cost Savings (Monthly)](
#cost-savings-monthly
)
      - [Resource Consumption Changes](
#resource-consumption-changes
)
    - [2. AI Auto-Compression (Compaction)](
#2-ai-auto-compression-compaction
)
      - [Optimization Principle](
#optimization-principle-1
)
      - [Measured Comparison](
#measured-comparison
)
      - [Cost Savings](
#cost-savings
)
      - [Resource Consumption Changes](
#resource-consumption-changes-1
)
    - [3. Local Model Management of Lightweight Tasks (QMD / Ollama)](
#3-local-model-management-of-lightweight-tasks-qmd--ollama
)
      - [Optimization Principle](
#optimization-principle-2
)
      - [QMD Application](
#qmd-application
)
      - [Measured Data](
#measured-data-1
)
      - [Cost Savings](
#cost-savings-1
)
      - [Resource Consumption Changes](
#resource-consumption-changes-2
)
    - [4. Direct Script-to-API Calls, Bypassing Bootstrap](
#4-direct-script-to-api-calls-bypassing-bootstrap
)
      - [Optimization Principle](
#optimization-principle-3
)
      - [Measured Data](
#measured-data-2
)
      - [Resource Consumption Changes](
#resource-consumption-changes-3
)
    - [5. Console Commands Replace LLM Conversation](
#5-console-commands-replace-llm-conversation
)
      - [Optimization Principle](
#optimization-principle-4
)
      - [Practical Application](
#practical-application
)
      - [Resource Consumption Changes](
#resource-consumption-changes-4
)
    - [6. Daily Logic CPU-fication (Python Cron Direct Push)](
#6-daily-logic-cpu-fication-python-cron-direct-push
)
      - [Optimization Principle](
#optimization-principle-5
)
      - [Implemented CPU-fied Tasks](
#implemented-cpu-fied-tasks
)
      - [Measured Comparison](
#measured-comparison-1
)
      - [Technical Implementation](
#technical-implementation
)
      - [Resource Consumption Changes](
#resource-consumption-changes-5
)
    - [7. Intelligent Demands Pulled Back from LLM to CPU (Heartbeat Checklist-ification)](
#7-intelligent-demands-pulled-back-from-llm-to-cpu-heartbeat-checklist-ification
)
      - [Optimization Principle](
#optimization-principle-6
)
      - [Transformation Comparison](
#transformation-comparison
)
      - [Measured Data](
#measured-data-3
)
      - [Cost Savings](
#cost-savings-2
)
      - [Resource Consumption Changes](
#resource-consumption-changes-6
)
  - [Comprehensive Benefit Assessment](
#comprehensive-benefit-assessment
)
    - [Monthly Cost Comparison Summary](
#monthly-cost-comparison-summary
)
    - [Annualized Comparison](
#annualized-comparison
)
    - [Beyond Just Saving Money](
#beyond-just-saving-money
)
  - [Appendix 1: Model Pricing Reference](
#appendix-1-model-pricing-reference
)
  - [Appendix 2: Vectorization of Skill Descriptors](
#appendix-2-vectorization-of-skill-descriptors
)
  - [Conclusion](
#conclusion
)


---


## Part 1: Principles - Hidden Costs of System Prompts


### Bootstrap File Loading Mechanism


Each time `/new` or `/reset` is executed to create a new session, the OpenClaw runtime automatically loads the following content as 
**System Prompt + Startup Context**
:


| File | Loading Method | Purpose |
|------|----------------|---------|
| `AGENTS.md` | System Prompt Injection | Agent behavior instruction tree |
| `SOUL.md` | System Prompt Injection | Personality definition |
| `USER.md` | System Prompt Injection | User information |
| `HEARTBEAT.md` | System Prompt Injection | Scheduled task checklist |
| `TOOLS.md` | System Prompt Injection | Local tool configuration |
| `MEMORY.md` | Startup Context | Long-term memory |
| `memory/*.md` (past 2 days) | Startup Context | Daily work logs (≤2800 characters) |


These files are 
**not visible in the conversation history**
, but 
**consume actual context window**
. Every LLM inference must process this content.


### Context Window and Compaction Mechanism


OpenClaw's compaction mechanism uses a `mode: safeguard` strategy:


- 
**Trigger Condition**
: Automatically triggered when conversation history + bootstrap approach the context limit
- 
**Compression Method**
: Generate summaries of early conversations, retain recent details
- 
**Problem**
: If the bootstrap file itself is large, less space remains for actual conversations, compaction triggers more frequently, and each compaction consumes tokens


### Fixed Overhead per New Session


Using the default model MiniMax M2.7 (200K context window) as an example:


> 
**Before Optimization**
: bootstrap ~25,000 bytes ≈ ~6,250 tokens  
> 
**After Optimization**
: bootstrap ~8,300 bytes ≈ ~2,075 tokens  
> 
> Each session startup saves 
**~4,175 tokens**
, not including subsequent chain effects of compaction in conversations.


The same principle applies to models like DeepSeek V3.2 (200K context). If your daily usage involves frequent `/new` / `/reset` (e.g., task switching, context cleanup), the savings double.


---


## Part 2: Testing - Bootstrap File Quantitative Analysis


> All data below are based on actual file measurements. Sensitive content has been anonymized: usernames → "User A", Agent names → "Agent-X".


### File Volume Before Optimization


| File | Lines | Bytes | Estimated Tokens | Main Content |
|------|-------|-------|------------------|--------------|
| AGENTS.md | ~300 | ~12,000 | ~3,000 | Behavior rules, skill index, memory rules, quick decisions all mixed |
| MEMORY.md | ~200 | ~8,000 | ~2,000 | Holdings info, built systems, technical architecture, user goals |
| SOUL.md | 36 | 1,673 | ~418 | Personality definition |
| USER.md | 11 | 278 | ~70 | Username/timezone/preferences |
| TOOLS.md | 34 | 827 | ~207 | Search toolchain, local configuration |
| HEARTBEAT.md | 28 | 1,681 | ~420 | Heartbeat checklist |
| 
**Total**
 | 
**~609**
 | 
**~24,459**
 | 
**~6,115**
 | |


### File Volume After Optimization


| File | Lines | Bytes | Estimated Tokens | Change |
|------|-------|-------|------------------|--------|
| AGENTS.md | 56 | 2,278 | ~570 | ⬇️ 
**-81%**
 |
| MEMORY.md | 62 | 1,589 | ~397 | ⬇️ 
**-80%**
 |
| SOUL.md | 36 | 1,673 | ~418 | — |
| USER.md | 11 | 278 | ~70 | — |
| TOOLS.md | 34 | 827 | ~207 | — |
| HEARTBEAT.md | 28 | 1,681 | ~420 | — |
| 
**Total**
 | 
**~227**
 | 
**~8,326**
 | 
**~2,082**
 | ⬇️ 
**-66%**
 |


> Extracted detailed rules moved to `docs/` subdirectory (5 files, 9,452 bytes total), loaded on-demand by LLM via `read` tool, no longer injected with bootstrap.


### Cumulative Consumption Comparison by Usage Scenario


Assuming typical usage patterns:
- 
**Daily Conversation**
: 10 rounds per day, average 500 tokens input + 200 tokens output per round
- 
**Lightweight Tasks**
: 2 tasks per day, 3,000 tokens context each
- 
**Session Rebuild**
: ~3 times per day `/new` or `/reset`


**Monthly Consumption Before Optimization**
:
```
Daily conversation: 10 × 700 = 7,000 tokens
Daily tasks: 2 × 3,000 = 6,000 tokens
Bootstrap loading (×3): 3 × 6,115 = 18,345 tokens
────────────────────────────
Daily total: 31,345 tokens
Monthly total: 31,345 × 30 ≈ 940,350 tokens ≈ 0.94M tokens
```


**Monthly Consumption After Optimization**
:
```
Daily conversation: 10 × 700 = 7,000 tokens
Daily tasks: 2 × 3,000 = 6,000 tokens  
Bootstrap loading (×3): 3 × 2,082 = 6,246 tokens
────────────────────────────
Daily total: 19,246 tokens
Monthly total: 19,246 × 30 ≈ 577,380 tokens ≈ 0.58M tokens
```


**Bootstrap optimization alone saves ~0.36M tokens/month**
. Combined with other optimization techniques, total savings far exceed this number.


---


## Part 3: Optimization - Seven Core Techniques


### 1. Tree-Structured Document Architecture (Old: Single File → New: Multi-Layer Index)


#### Optimization Principle


The more content crammed into an AI Agent's system prompt, the more it must "read before acting" for each inference. Traditional approaches mix all rules, indexes, and memories into one large file (e.g., AGENTS.md with 300 lines), requiring the LLM to process all 300 lines before thinking about your problem.


**Solution**
: Shrink AGENTS.md and MEMORY.md to index files (<60 lines), split detailed rules by module into `docs/` subdirectory. The LLM sees only the index at startup and reads specific documents on demand.


```
workspace-qqclaw/
├── AGENTS.md          (56 lines) ← Top-level index, contains document tree
├── MEMORY.md          (62 lines) ← Summary memory
├── docs/
│   ├── OPENROUTER.md  (68 lines)
│   ├── WEB-SEARCH.md  (43 lines)
│   ├── MEMORY-SYSTEM.md (64 lines)
│   ├── TRADE-MONITOR.md (97 lines)
│   └── MULTI-SEARCH.md  (94 lines)
```


#### Measured Data


| Metric | Before Optimization | After Optimization | Savings |
|--------|---------------------|-------------------|---------|
| AGENTS.md tokens | ~3,000 | ~570 | 
**81%**
 |
| MEMORY.md tokens | ~2,000 | ~397 | 
**80%**
 |
| Bootstrap Total | ~6,115 | ~2,082 | 
**66%**
 |
| Total Documentation | ~24,459 bytes | ~17,778 bytes (incl. docs/) | Comparable lines, structural optimization |


#### Cost Savings (Monthly)


Based on Sonnet ($9/MT average) calculation:


```
Before: 6,115 tokens × 3 sessions/day × 30 days = 550,350 tokens × $9/MT = $4.95/month (bootstrap only)
After: 2,082 tokens × 3 sessions/day × 30 days = 187,380 tokens × $9/MT = $1.68/month (bootstrap only)
Savings: $0.27/month
```


Seems small? But this is just the bootstrap single-point saving. The real payoff: 
**smaller system prompts mean each conversation processes ~4,000 fewer tokens, compaction triggers less frequently, and more conversation rounds fit in the window.**


#### Resource Consumption Changes


| Resource | Change |
|----------|--------|
| Context per-round inference | ⬇️ ~4,000 tokens |
| Compaction trigger frequency | ⬇️ Delayed trigger (longer effective conversation) |
| LLM response latency | ⬇️ Slight decrease (reduced prompt processing) |
| Documentation maintainability | ⬆️ Improved (modular, changes don't affect others) |


---


### 2. AI Auto-Compression (Compaction)


#### Optimization Principle


OpenClaw's compaction mechanism is like "automatic conversation history summarization":


- When conversation history + system prompt approach the context limit, early conversations are compressed into summaries
- In `mode: safeguard`, the system automatically triggers at the safety boundary
- Compressed content is preserved in summary form, freeing space for new conversations


**Why does this save tokens?**
 If your context window is 200K tokens, without compression, each inference must process the full conversation history (possibly 50K-100K tokens). After compression, early history becomes a 1K-2K summary, and new conversations only need 10K-30K tokens.


#### Measured Comparison


| Scenario | No Compression | With Compression (safeguard) |
|----------|----------------|------------------------------|
| Context after 100 rounds | ~120,000 tokens | ~25,000 tokens |
| Per-round token consumption | ~1,200 (full) | ~600 (summary + new round) |
| LLM failure critical point | ~170 rounds | Theoretically infinite |


#### Cost Savings


Based on Haiku long conversation scenario:


```
No compression: 100 rounds × 1,200 tokens/round = 120,000 tokens/day × $9/MT × 30 days = $32.4/month
With compression: 100 rounds × 600 tokens/round = 60,000 tokens/day × $9/MT × 30 days = $16.2/month
Savings: $1.33/month
```


Calculated per day with 100 conversation rounds. Combined with other optimizations, total savings are even more impressive.


#### Resource Consumption Changes


| Resource | Change |
|----------|--------|
| GPU (inference burden) | ⬇️ Prompt processing reduced ~50% |
| Context utilization | ⬆️ Improved (more effective conversations) |
| Summary quality | ⚠️ Summaries may lose details (safeguard mode is conservative, low risk) |


---


### 3. Local Model Management of Lightweight Tasks (QMD / Ollama)


#### Optimization Principle


Not all tasks require large models. OpenClaw supports a tiered model strategy:


| Task Type | Model | Cost | Notes |
|-----------|-------|------|-------|
| Heartbeat Detection | Ollama qwen2.5:3b | 
**$0**
 | Local CPU inference |
| Security Audit | Ollama qwen2.5:3b | 
**$0**
 | Scans every 5 minutes |
| Memory Retrieval | QMD 2.1.0 | 
**$0**
 | Local semantic search |
| Complex Conversation | MiniMax / DeepSeek | Paid | Only complex tasks via cloud |


#### QMD Application


QMD (v2.1.0) is a local embedded vector retrieval engine for semantic search in the memory system:


- 
**Builtin**
: SQLite + FTS5 full-text search (has latency)
- 
**QMD**
: Standalone sidecar process, local vector search (zero latency, zero API cost)


```
QMD Search Process:
memory_search(query) → QMD sidecar → local embedding → return Top-K results
No external API calls throughout, token consumption = 0
```


#### Measured Data


| Metric | Before | After | Savings |
|--------|--------|-------|---------|
| Heartbeat Detection (per 30min) | ~200 tokens/call cloud | 0 tokens (qwen local) | 100% |
| Security Audit (per 5min) | ~500 tokens/call cloud | 0 tokens (qwen local) | 100% |
| Memory Retrieval | ~1,500 tokens/call (cloud semantic search) | 0 tokens (QMD local) | 100% |


#### Cost Savings
Based On GPT5.4 Nano
```
Heartbeat Detection: 48 calls/day × 200 tokens × $0.73/MT × 30 days = $0.21/month
Security Audit: 288 calls/day × 500 tokens × $0.73/MT × 30 days = $3.20/month
Memory Retrieval: 10 calls/day × 1,500 tokens × $0.73/MT × 30 days = $0.33/month
────────────────────────────────────────────────────────
Total Savings: ~$3.74/month
```


#### Resource Consumption Changes


| Resource | Change |
|----------|--------|
| CPU Usage | ⬆️ Slight increase (qwen 3b + QMD inference) |
| GPU Usage | ⬇️ Significant decrease (high-frequency tasks offline) |
| Network I/O | ⬇️ Fewer API calls |
| Response Speed | ⬆️ Faster local inference (no network latency) |


---


### 4. Direct Script-to-API Calls, Bypassing Bootstrap


#### Optimization Principle


Traditional approach: Have the LLM read workspace context → analyze problem → return result. But many repetitive tasks (like portfolio analysis, market briefs) can 
**directly call APIs with Python scripts**
, bypassing the LLM bootstrap process.


```
Traditional Path (wasting tokens):
cron trigger → LLM load bootstrap(2K tokens) → understand task(500 tokens) → curl API → return(800 tokens)


Optimized Path (zero bootstrap):
cron trigger → Python direct API call → format output → QQ push
Never passes through LLM


or


You use a medium level model like Minimax 2.7 token plan with 10$ monothly. And you ask your agent to rely on Minimax 2.7(Or local LLM). When you need to solve a complicated logic problem or difficult task, you ask your agent on Minimax to take only the necessary text out and send it to openrouter's bigger LLMs, thus bypassing the bootstarp
and the entire history context.
```


**Key Script**
: Write a python sciprt such as "ask_openrouter.py",
and put it in your agent's workspace. When you feel the problem is complex and it is better to solve it with bigger and expensive models, you ask your agent to use this script to use openrouter's big models.


```python
# Minimal OpenRouter call — no workspace pollution
# Send pure requests directly, don't load AGENTS/SOUL/MEMORY etc.
```


Similarly, `ask_openrouter_search.py` bypasses the LLM for direct web searches. (Openrouter allows you to add a :online suffix to any model to enable web search)


#### Measured Data


Using "portfolio deep analysis" task as example:


| Path | Tokens per Task | Cost (Sonnet) |
|------|-----------------|----------------|
| Via LLM channel(even without any history) | ~9,000 tokens | $0.081 |
| Python direct call OpenRouter | ~1,200 tokens (API input/output only) | $0.01 |
| 
**Savings**
 | 
**87%**
 | — |


With 2 analysis tasks daily: 
**$0.007 × 30 = $0.21/month**
 (this single task isn't expensive, but the pattern scales to all cron tasks)
If you are using 10 cron to do the news analysis,you save 2.1$/month


#### Resource Consumption Changes


| Resource | Change |
|----------|--------|
| GPU / LLM Inference | ⬇️ Bypass bootstrap, reduce ~3,300 tokens/task |
| Network Calls | → Flat (still need API calls) |
| Maintainability | ⬆️ Python scripts more controllable than prompt engineering |


---


### 5. Console Commands Replace LLM Conversation


#### Optimization Principle


Many operations don't need LLM involvement at all. For example, restarting services, checking status, running known scripts — use shell's `exec` tool directly, no need for the LLM to "understand" and then "act".


PS: This part requires user's intervention


```
User: "Restart openclaw"
LLM method: Load bootstrap → understand intent → generate command → execute → return result
            (~3,000 tokens wasted)


exec method: Directly execute openclaw gateway restart
             (0 tokens)
```


#### Practical Application


| Scenario | LLM Channel | exec Direct | Savings |
|----------|-------------|-------------|---------|
| Restart Service | ~3,000 tokens | 0 tokens | 100% |
| Check Service Status | ~3,500 tokens | 0 tokens | 100% |
| Run Monitoring Script | ~2,000 tokens | 0 tokens | 100% |
| View Logs | ~2,500 tokens | 0 tokens | 100% |


Based on GPT5.4 nano:
With 5 such operations per day as example: 
**~14,000 tokens × $0.74/MT × 30 = $0.31/month**


#### Resource Consumption Changes


| Resource | Change |
|----------|--------|
| LLM Inference | ⬇️ Zero LLM for daily maintenance |
| Response Speed | ⬆️ Immediate execution (no LLM processing latency) |
| Error Rate | ⬇️ Deterministic execution (no hallucination risk) |


---


### 6. Daily Logic CPU-fication (Python Cron Direct Push)


#### Optimization Principle


If high-frequency scheduled tasks (like market monitoring, price pushes) go through LLM each time, token consumption is astronomical. Correct approach:


```
Python script → fetch data → conditional judgment → push directly via QQ/notification channel
Never goes through LLM, GPU untouched
```


#### Implemented CPU-fied Tasks


| Task | Frequency | Method | Savings |
|------|-----------|--------|---------|
| 📊 Intraday Monitoring | Every 10 min | Python `intraday_watch.py` direct IM push | 100% |
| 🪙 BTC/ETH Monitoring | Every 15 min | Python `price_monitor.py` direct IM push | 100% |
| 🌤️ Airticket Check | Every 2 hours | Python `airticket_monitor.py` direct IM push | 100% |
| 🌡️ Weather Forecast | 2x per day | Python `weather_monitor.py` direct IM push | 100% |
| 🔐 Security Scan | Every 30 min | qwen2.5:3b local scan | 100% |


#### Measured Comparison


If these 5 tasks went through LLM (based on Sonnet):


```
Intraday: 144 calls/day × 1,500 tokens = 216,000 tokens/day
BTC Monitoring: 96 calls/day × 1,200 tokens = 115,200 tokens/day
METAR: 8 calls/day × 1,000 tokens = 8,000 tokens/day
Weather: 2 calls/day × 1,000 tokens = 2,000 tokens/day
Security Scan: 288 calls/day × 500 tokens = 144,000 tokens/day
──────────────────────────────────────────────
Daily: 485,200 tokens → 14.6M/month
Monthly Cost: 14.6M × $9/MT = $131/month
```


**Actual Monthly Cost: $0**
 (all CPU-fied, zero LLM consumption)


#### Technical Implementation


```python
# intraday_check.py Core Logic
# 1. Fetch market data (LongBridge API)
# 2. Calculate volatility
# 3. Conditional judgment: index >1.2% or stock >2%
# 4. subprocess Popen direct IM push (8s timeout to prevent hang)
# 5. Zero LLM tokens throughout
```


#### Resource Consumption Changes


| Resource | Change |
|----------|--------|
| LLM API Calls | ⬇️ Reduce ~500 calls/day |
| CPU Usage | ⬆️ Python script polling cost |
| Real-time Performance | ⬆️ No need to wait for LLM processing |
| Reliability | ⬆️ Deterministic logic (no hallucinations) |


---


### 7. Intelligent Demands Pulled Back from LLM to CPU (Heartbeat Checklist-ification)


#### Optimization Principle


Heartbeat is OpenClaw's "heartbeat" mechanism — periodically triggers specific tasks. But if heartbeat content is written as vague natural language, the LLM must "understand" it each time.


**Checklist-ification Transformation**
: Convert heartbeat prompts into structured execution checklists, run with the lightest model (qwen2.5:3b, locally free), only do simple status confirmation.


#### Transformation Comparison


Before Transformation (traditional heartbeat):
```
"You are a security audit expert, please check system configurations..."
→ LLM (MiniMax) must understand this prompt → inference → execution
→ ~500 tokens/call
```


After Transformation (checklist-ified heartbeat):
```yaml
heartbeat:
  model: ollama/qwen2.5:3b        # Local free model
  lightContext: true               # Don't load full workspace
  prompt: "Execute steps: 1.Read cron file 2.Check key leaks 3.Output HEARTBEAT_OK"
# If output == HEARTBEAT_OK → nothing happens
# If output != HEARTBEAT_OK → push to user
```


#### Measured Data


| Metric | Before | After | Savings |
|--------|--------|-------|---------|
| Heartbeat Model | GPT Nano ($0.75/MT) | qwen local ($0) | 100% |
| Tokens per Step | ~500 | ~200 (but free) | 100% cost |
| Context Mode | full (2K bootstrap) | lightContext (no bootstrap) | Additional 2K savings |
| Security Audit | MiniMax ($0.75/MT) | qwen local ($0) | 100% |


#### Cost Savings
Based on GPT5.4 Nano
```
Heartbeat Detection: 48 calls/day × (200+2000) tokens × $0.74/MT × 30 = $2.34/month
Security Audit: 288 calls/day × (500+2000) tokens × $0.74/MT × 30 = $12.00/month
──────────────────────────────────────────────────────────────
Total Savings: ~$14.34/month
```


> This is just single-agent user savings. Enterprise deployments (multiple agents, multiple workspaces) multiply this by the number of agents.


#### Resource Consumption Changes


| Resource | Change |
|----------|--------|
| API Cost | ⬇️ Two highest-frequency tasks completely offline |
| CPU Usage | ⬆️ qwen2.5:3b continuous running (~3GB RAM) |
| Security | ⬆️ Security audit independent of external APIs (data stays server-side) |
| Response Latency | ⬆️ Faster (local model inference <50ms) |


---


## Comprehensive Benefit Assessment


### Monthly Cost Comparison Summary
Based on mixed use of Sonnet, GPT 5.4 Nano...
| Optimization | Before | After | Monthly Savings |
|--------------|--------|-------|-----------------|
| ① Tree-Structured Documents | $0.41 | $0.14 | $0.27 |
| ② Compaction | $2.66 (no compression long-term session) | $1.33 | $1.33 |
| ③ Local Models (QMD + Ollama) | $3.74 | $0 | $3.74 |
| ④ Direct Script API 10 crons | $2.4 | $0.3 | $2.1 |
| ⑤ exec Replacing LLM | $0.31 | $0 | $0.31 |
| ⑥ CPU-fying cron Tasks | $131 | $0 | $131 |
| ⑦ Heartbeat Checklist-ification | $14.34 | $0 | $14.34 |
| 
**Total**
 | 
**$154.86**
 | 
**$1.77**
 | 
**$153.09/month**
 |


> Above is conservative single-user usage estimation. In actual use, conversation patterns vary, savings fluctuate, but trends are consistent.


### Annualized Comparison


| Metric | Before Optimization | After Optimization |
|--------|---------------------|-------------------|
| Annual API Cost | ~$1829 | ~$200 |
| Annual Savings | — | 
**~$1629**
 |
| Carbon Reduction (est.) | — | ~95% API calls reduction |


### Beyond Just Saving Money


| Dimension | Benefit |
|-----------|---------|
| ⚡ 
**Response Speed**
 | High-frequency tasks from LLM polling → CPU direct push, latency from seconds to milliseconds |
| 🔒 
**Privacy & Security**
 | Memory retrieval and data audit localized, no need to upload to third-party APIs |
| 🛡️ 
**Stability**
 | CPU tasks independent of API availability, won't break due to API downtime |
| 📐 
**Maintainability**
 | Rule files modularized, changing one piece doesn't affect others |
| 🧪 
**Testability**
 | Python scripts can be unit tested, LLM prompts rely only on "feelings" |


---


## Appendix 1: Model Pricing Reference


| Model | Provider | Input $/MT | Output $/MT | Average $/MT |
|-------|----------|-----------|-----------|-------------|
| MiniMax M2.7(fixed number monthly if token plan ) | MiniMax API | $0.279 | $1.20 | $0.74 |
| DeepSeek V3.2 | OpenRouter | $0.252 | $0.378 | $0.315 |
| DeepSeek V4 Flash | OpenRouter | $0.112 | $0.224 | $0.168 |
| DeepSeek V4 Pro | OpenRouter | $0.435 | $0.87 | $0.6525 |
| Gemini Flash 3 | OpenRouter | $0.25 | $1.5 | $0.875 |
| GPT-5.4 Nano | OpenRouter | $0.2 | $1.25 | $0.7 |
| GPT-5.5 Pro | OpenRouter | $30 | $180 | $105 |
| Claude Opus 4.7 | OpenRouter | $30 | $150 | $90 |
| Claude Sonnet 4.6 | OpenRouter | $3 | $15 | $9 |
| Grok 4.3 | OpenRouter | $1.25 | $2.5 | $1.875|
| Qwen2.5:3b | Ollama Local | $0 | $0 | $0 |


---


## Appendix 2: Vectorization of Skill Descriptors


If you install too many skills in openclaw, all skill descriptors will appear in your agent's system prompt. You can have your Agent install a RAG module, combined with openclaw message hooks, to intercept messages before sending to LLM, vectorize them, compare with local skill vector chunks, and only pull relevant skill portions into the system prompt. This can save you tens of thousands to hundreds of thousands of tokens.


---


## Conclusion


The core idea of seven optimization techniques can be summarized in one sentence:


> 
**Transform LLM from "all-purpose butler" to "expert advisor" — CPU-fy daily operations, let complex reasoning go to large models.**


This is not just a cost-saving strategy, but an architectural philosophy: AI Agent intelligence should be layered — high-frequency low-complexity tasks processed locally on CPU, low-frequency high-complexity tasks processed by cloud large models. This ensures both response speed and privacy while making every API dollar count.


---


*This document is based on OpenClaw 2026.4.23 testing. Data compiled on 2026-05-19 with sensitive information anonymized. Model prices based on current announcement at that time.*
reddit.com
u/dxzzzzzz — 17 hours ago

We may build too many data centers, from a computer nerd's point of view.

Currently, the market is deep in FOMO called "compute thirst," pushing the semiconductor sector, especially chips, to historic highs. The FOMO crowd, based on today's extremely primitive and crude "vibe coding" habits, has linearly extrapolated a future where compute will never be enough.

But if we pierce through the noise and start from basic computer science and underlying architecture, the logic behind this Semi-FOMO (fear of missing out on semiconductors) frenzy is built on quicksand. Most of the current hardware shortage is fake demand, wildly inflated by inefficient software architecture, outdated math, and capital narratives.

  1. The GPU "Shortage": Token Bubbles Inflated Ten Thousand Times

Take a typical Tech-FOMO-heavy tool like "OpenClaw." The root cause of its ten-thousand-fold token explosion is a brutally crude context management mechanism. To maintain so-called "character immersion" and long-term memory, every dialogue round mindlessly pumps in massive system prompts: Agent setting files (SOUL.md, IDENTITY.md), user profiles (USER.md), and huge long-term memory dumps (MEMORY.md). These verbose, static files, plus chat history and tool outputs, are all resubmitted blindly. Within a few turns, the context balloons to millions of tokens. For most daily life and office scenarios, this ocean of context offers near-zero marginal improvement to answers. It's just burning cloud GPU HBM bandwidth for no good reason.

Look at the professional software industry that actually uses LLMs heavily. Tools like GitHub Copilot and Cursor keep their system prompts super lean. The "shortage" built on this mindless, brute-force feeding is purely fake demand—easily fixed with software. Just introduce a sensible session initialization scheme: load core configs only at startup, cut auto-loading of long history and memory dumps, and fetch context on-demand with dynamic retrieval (e.g., memory_search()). That instantly wipes out pointless compute waste.

Let's do a brutally simple number comparison for a typical 50-turn office chat:

[Under unoptimized, crude OpenClaw architecture]

System files (SOUL.md, IDENTITY.md, USER.md etc.) total 5,000 tokens.

Static user memory (MEMORY.md) – all past project backgrounds, code snippets, long texts – conservative estimate 50,000 tokens.

Each turn (user query + agent reply + tool logs) averages 1,000 tokens.

No smart truncation or dynamic retrieval, so the 55,000 base tokens are fully resubmitted every single turn, plus all previous history.

Turn 1 tokens sent: 55,000 (base) + 1,000 (current) = 56,000.

Turn 50 tokens sent: 55,000 + 50*1,000 = 105,000.

Because of arithmetic series accumulation, total tokens consumed over 50 turns: 4.025 million. That means a huge API bill for the user, and precious GPU HBM bandwidth choked by pointless static duplicate data.

[Under professionally optimized architecture – like Cursor's logic]

System prompt aggressively trimmed, only essential instructions → compressed under 1,000 tokens.

Long-term memory is never loaded directly. Pre-vectorized into local lightweight DB. When user mentions a relevant issue, agent triggers memory_search(), pulls only the 3-5 most relevant code blocks or text snippets → extra ≤1,500 tokens.

Chat history uses sliding window or native KV cache reuse → only last 5 turns explicitly exposed → ~5,000 tokens.

So whether it's turn 1 or 50, tokens sent per request are rock-solid capped at: 1,000 (base) + 1,500 (dynamic) + 5,000 (history) + 1,000 (new turn) = ~8,500 tokens.

Same 50-turn conversation: total token consumption only 425,000.

  1. Layered Architecture: A 95% Cliff Drop in VRAM Demand

Another absurdity: using a sledgehammer to crack a nut – sending every user prompt mindlessly to the most expensive, trillion-parameter supermodel.

For most daily problems, a mid-sized 40B–100B model is already overkill. Add web search and local RAG, and domain-specific models can run entirely on edge devices or local networks. Only truly long-tail tasks – super-long context, multiple domains, heavy latent semantic inference – need the cloud giants. And given real human business needs and mental limits, those high-intensity tasks are extremely rare.

In future competition, the cost war will be won by solutions using smart routing (e.g., AIOS). Once this on-demand calling becomes industry consensus, average VRAM consumption per user worldwide will drop from multiple terabytes to ~0.05TB. Cloud LLM compute demand will plummet over 95%.

  1. The Deliberately Hyped CPU Demand: A Gross Violation of CS Common Sense

Recent talk about a "booming" data center CPU demand is purely bullying non-experts. Going from 8:1 GPU:CPU ratio to 4:1 is reasonable. But pushing to 2:1 or even 1:1 violates fundamental computer engineering.

The hype rests on two false claims: massive task scheduling, and agent tasks running in the cloud. Both collapse under simple math.

[1] Task scheduling's zero-cost reality

Take a typical data center CPU: AMD EPYC 9754 (Genoa). 128 physical cores (256 threads), 2.25 GHz base, 3.1 GHz boost. Even with a lazy FIFO queue, each enqueue/dequeue (with lock allocation, pointer moves, state update) costs at most 500–1000 clock cycles. One core at 2.25 GHz processes 2.25e9 cycles per second. One scheduling task: 1000 cycles / 2.25e9 = ~444 nanoseconds. Now imagine a psychotic flood of 1,000,000 agent tasks per second. That needs: 1,000,000 * 444 ns = 0.44 seconds of single-core CPU time. With 128 cores, you haven't even used half of one core. The other 127.5 cores sit idle, watching Netflix. "Task management" demanding a 1:1 CPU purchase ratio is a mathematical joke.

[2] Local execution is already here

The argument that everyday agent tasks (reading/writing docs, editing code, shell commands, file preprocessing) should become "cloud CPU load" ignores edge computing power.

Take a mundane M3 MacBook Air: 8 cores, up to 4.05 GHz, ~1.5 TFLOPS FP32. Or an Intel i7-13700K: 16 cores, 24 threads, 5.4 GHz.

Processing a 1MB PDF or Python source file: string walks, regex, memory copies – a few million to ten million low-level instructions. On a 4 GHz client CPU with high IPC, that's 2–3 milliseconds. Shell commands, unzipping – bottleneck is SSD I/O. Modern PCIe 4.0 NVMe drives do 5-7 GB/s. These tasks are nearly instant. Your local CPU cores serve only you – no cloud queuing. One local PC can simultaneously run hundreds of background agents doing heavy RAG vector searches and file ops, CPU usage maybe under 15%. Packaging these millisecond-scale, perfectly edge-absorbable tasks as "massive cloud CPU pressure" is architectural fraud, pure anxiety-marketing to sell more servers.

  1. The Dense Matrix Dead End: 90% of cold neurons Will Idle

Current SOTA models are brutally inefficient because the industry has doubled down on dense KV matrices (Transformer). Academics have long known that highly skewed models are extremely sparse – at any moment, only about 1/10 of neurons actually need to compute. But instead of using near-lossless compression like SVD, LoRA, or embracing sparse MoE architectures, we're stuck on brute force.

Every dense KV matrix model can be sparsified. When sparse Transformers and matrix dimensionality reduction go mainstream – and they will – 9/10 of today's cloud compute investments will sit idle, staring at the wall.

  1. Unified Memory Architecture Flips the Table: A Zero-Sum Game Within Semis

The current cloud compute moat is built on a physical defect: separate system RAM and VRAM in traditional x86. But ARM + Apple Silicon (Mac) unified memory architecture (UMA) is about to demolish that moat from the hardware floor.

Wall Street bundles the whole semiconductor sector together, ignoring the brutal internal battle: if memory wins – making high-bandwidth, high-capacity memory dirt cheap – then logic/GPU loses massively. And vice versa.

PCI-E bottleneck dies: traditional PCs and data centers suffer under PCI-E bus limits, forcing expensive HBM-equipped GPUs. UMA kills that bottleneck. CPU, GPU, NPU share one massive memory pool. The von Neumann bottleneck of data shuffling vanishes.

512GB singularity & 120B OSS charging: once edge devices commonly have 512GB+ unified memory, cloud inference collapses. For inference (not training/fine-tuning), a UMA machine can locally run a 120B-class open-source model (e.g., optimized Llama-3 variant or large MoE, even GPT-120B-OSS). Those 120B beasts are already approaching or matching cloud SOTA on complex reasoning, code, long-text understanding.

Cloud revenue absolute zero: when a local 120B model + local vector DB + real-time web search performs indistinguishably from a trillion-parameter cloud model on professional knowledge, the killer blow is: marginal inference cost = zero. Cloud API providers will see revenue wiped out from serious professional users.

  1. The Absurdity of Trampling Software Stocks, and the O(N^2) Doom Curse

As the market chases semiconductor FOMO, it's been trampling software and old-school tech stocks into the dirt – utterly ridiculous. Any real CS grad knows computers are organic integrations of software and hardware.

Any single software architectural advance (e.g., layered scheduling of different model sizes) delivers disproportionate efficiency gains and user value. There's no perfect operator or perfect hardware for everything. No Free Lunch Theorem holds: no one model works best on all tasks. If AI paradigm shifts again (e.g., neuromorphic systems inspired by the human brain), today's frantic data center spending will get slaughtered. Why? Because Transformer attention is O(N^2) – among the most uneconomical computational patterns possible. The day someone breaks the O(N^2) curse with better algorithms, global VRAM demand will get square-rooted. Or higher-rooted.

Epilogue: The Pricing Absurdity from 1GB to 1000GB, and the Geeks' "Dark Compute" Party

To invest in tech, you really should understand tech. Blindly chasing Nasdaq in FOMO is dangerous, because paradigm shifts can wreck companies or whole sectors overnight.

Long term, yes, we need to build cloud and edge AI compute for the world. But normally, infrastructure cycles last 10–20 years (Wall Street won't build data centers for developing countries for free – real economy needed). Short term, AI data centers for North America already provide about 1–2GB of VRAM per person – that's the actual footprint so far. But the current MAG10 market cap prices in something like 80–100GB of HBM VRAM per capita for developed nations (and that's based on Hopper/Ada Lovelace-era hardware). That's... a bit crazy. Not everyone will be hammering the cloud 24/7. As process nodes shrink, the same ten trillion dollars will produce a parabolic rise in "VRAM per capita." When it passes 1000GB per person, the only way that's not absurd is if every human brain, 24x7, dreams up insane cross-domain, million-token context tasks requiring deep latent inference – and sends them all to the cloud to fill every last gigabyte.

When the dust settles, the geeks who understood the low-level logic and watched from their local PCs and PCI-E trenches will likely, like after the Dot-Com bust, inherit mountains of cloud hardware. All those surplus XPU hours will become free firewood for building a new world.

reddit.com
u/dxzzzzzz — 6 days ago