u/digitalhobbit

Please critique my planned Hermes Agent setup

Please critique my planned Hermes Agent setup

After briefly experimenting with Hermes Agent on my Macbook (with a local Gemma 4 model), I'm about to set up a more robust instance on a VPS. I'd appreciate your input on my config - both the choices I already pinned down and the ones I'm still deciding on (such as the specific LLMs).

For context: My use cases are still pretty open ended and exploratory, and will evolve over time. While I have a lot of experience with coding agents (Claude Code in particular) and I've built my own agentic pipelines, I'm fairly new to more general, always-on agents like Hermes. Never used OpenClaw etc. I envision using Hermes for extended research tasks (say into YouTube stats and patterns, app opportunities, gamedev, various niches, etc.) as well as daily briefings, using both my calendar and Todoist as well as current news. I want Hermes to be able to generate and share markdown reports and potentially other files (images, videos, PDFs) as well. I'd like to be able to forward emails to Hermes to have it act on them. I'm less likely to use it for coding directly, but will reassess this later.

Below is a high level list of my planned config, with some more details below:

  • VPS: Hetzner CPX22 (2 VCPU, 4 GB RAM, 80 GB SSD
  • LLM Provider: OpenRouter
  • LLMs: TBD, see below
  • Memory Provider: ByteRover
  • Web Search Provider: TBD; Tavily or Firecrawl
  • Long Term Data Storage: Obsidian Vault and Google Drive (details below)
  • Messaging: Telegram

More details on some of these choices and open questions below.

VPS:

Hetzner because of competitive pricing. (I've mostly used DigitalOcean in the past, but their 4 GB instance is 2.5x the cost.)

I believe CPX22 with 4 GB RAM and 2 VCPUs is the right sweet spot for my needs?

LLMs:

This will likely require some experimentation. For custom apps, I've mostly used different flavors of Gemini (e.g. Gemini 2.5 Pro, Flash, and Flash Lite, depending on use case). So that's definitely a contender. Flash might be a good default model.

DeepSeek V4 also seems attractive, primarily because of the low cost.

Open to open source models like Qwen, Gemma 4, or Kimi 2 as well.

I'll read through more of the recommendations in this subreddit, but let me know if you have any particular recommendations for combos that have worked well for you.

Memory:

After researching the officially supported options, I landed on ByteRover. I like the file based approach and git semantics, as well as the tiered search. At least on paper, it seems more than suitable for my needs. I'd just use the local setup, with backup to Github.

I considered Hindsight, as the idea of a knowledge graph sounds compelling, and I've had great results with Postgres and pgvector for my own apps. But realistically, this is overkill for my needs right now.

Web Search:

Firecrawl and Tavily seem like the most popular options. Tavily seems to have the more generous free tier, but Firecrawl seems more commonly suggested. Any thoughts on the trade-offs here? Any alternative recommendations?

Data Storage:

A combination of Obsidian and Google Drive.

I already use Obsidian as my personal note taking app. I would set up a separate vault for Hermes Agent and sync this to a Github repo. I envision using this for more detailed, longer term data. Things like research reports, daily briefings, etc.

I want to explore ByteRover's "swarm" feature as well. It sounds like it can perform federated searches across its own memory and my Obsidian vault, which sounds compelling.

Google Drive is already my main cloud storage for personal and business related files. I would give Hermes read-only access to specific folders that might be needed for certain tasks. I would only give it write access to a dedicated "Hermes" folder.

I realize there are several ways to set up Google Drive support. I lean towards using the official Google Workspace CLI; see below.

Google Workspace:

I would create a dedicated Hermes account under my existing Google Workspace domain. That way, I can cleanly provision access to Google Drive, Email, Calendar, etc.

The official Google Workspace CLI sounds like the cleanest solution. That way, I can not only access Google Drive, but also Email etc. The CLI comes with agent skills that should make the Hermes integration pretty straightforward and robust. I should even be able to leverage ModelArmor to scan incoming emails to prevent prompt injection.

Other Integrations:

Telegram for messaging. (Perhaps Discord in the future.)

Todoist; haven't looked into plugins / MCPs / APIs yet.

Please let me know if you have any feedback or suggestions for improvement on this config. Thanks! Looking forward to getting deeper into Hermes Agent and uncovering more use cases over time. 😄

(Edit: Added a section for Web Search.)

u/digitalhobbit — 10 hours ago

I'm relatively new to running things locally - most of my AI work so far has been against Gemini APIs - but I spent a weekend building a small recipe-generator app using a fully local stack to get a feel for it. Wanted to share a few things I bumped into and ask for input from people with more mileage.

Stack: Gemma 4 via Ollama, Pydantic AI for structured output, FLUX.1-schnell via diffusers for images. Running on a 4090 with 24GB VRAM, i9-13900k CPU, 64GB RAM.

A few observations:

E4B ended up being my best fit, which surprised me. I originally assumed I'd want the largest variant I could fit (so 31B, or maybe the 26B MoE). But for structured output via Pydantic AI, E4B was both faster and more reliable. The larger variants weren't just slower; they actually failed more often. I'd bump into repetition collapse: the model getting stuck in loops of repeated tokens or nonsense strings instead of producing valid JSON. My guess is that the larger Gemma 4 variants are more strongly tuned for thinking-mode behavior, and constraining them to immediate structured output pushes them somewhere they don't handle well. Curious if anyone else has seen this and found ways around it.

Here's an example of the nonsense output that 26B and 31B generated (the app is supposed to return a list of suggested dishes to choose from):

Suggested dishes:
  1. Crispy Tofu Stir-Fry with Rainbow Veggie Medley- Medley- Medley- Med             — Pan-seared pan-seared pan-seared pan-seared pan-seared pan-seared pan-seared pan-seared pan-seared pan-seared pan-seared pan-seared pan-seared pan-seared pan-seared pan-sedescription_of_one_line_of_s_
  2. Thai-Green-Curry-with-Silken-Tofu-and-Green-Veggie-Crunch-Crunch-Crunch-Crunch- — Creamy, coconut-based curry-curry-curry-ly,ly,ly,ly,ly,ly,ly,ly,ly,ly,ly,ly,ly,ly,ly,ly,ly,ly,ly,ly,ly,ly,ly,ly,ly,ly,ly,ly,ly,ly,ly,ly,ly,ly,ly,ly,ly,ly,ly,ly,ly,ly,ly,ly,ly,ly,ly,ly,ly,ly,ly,ly,ly,|
  3. Sesame-Seared-Tofu-Banh-Mi-with-Dpickled-stuffed-stuffed-stuffed-stuffed-stuffed — A crusty baguette-baguette-baguette-stuffed-stuffed-stuffed-stuffed-stuffed-stuffed-stuffed-stuffed-stuffed-stuffed-stuffed-stuffed-stuffed-stuffed-stuffed-stuffed-stuffed-stuffed-stuffed-stuffed-stof

Pydantic AI's ToolOutput was unreliable, but NativeOutput worked. Pydantic AI defaults to tool calling for structured output, which works great for me with Gemini. Against Ollama / Gemma 4, I was getting frequent failures - sometimes empty responses, sometimes tool calls that didn't validate. Switching to NativeOutput (which maps to Ollama's format parameter with a JSON schema, i.e. server-side constrained decoding) made it solid. Dropping the temperature to 0.2 also helped.

My read is that smaller models fumble the meta-task of "format a tool call correctly," whereas constrained decoding just forces tokens that fit the schema. But I'd love to hear if folks running larger local models stick with tool-calling or also prefer native structured output.

The uv + PyTorch CUDA gotcha. This one might be obvious to people who've been here a while, but it caught me off guard. Every time I ran uv sync, uv silently reverted PyTorch to the CPU build. The fix was to pin the CUDA wheel index in pyproject.toml:

[[tool.uv.index]]
url = "https://download.pytorch.org/whl/cu126"
name = "pytorch-cuda"
explicit = true

[tool.uv.sources]
torch = { index = "pytorch-cuda" }
torchvision = { index = "pytorch-cuda" }

After that, it stuck.

FLUX.1-schnell was a pleasant surprise. A few seconds per image on the 4090, no offloading tricks needed. Quality is good enough that I haven't felt the urge to try FLUX-dev yet.

Overall I came away pretty optimistic. The quality isn't quite at Gemini 2.5 Pro level for the writing parts, but it's a lot closer than I expected, and the speed on consumer hardware is fine. I'm starting to think about which parts of my actual production pipeline could move local. Curious what others have found, especially anyone who's tried mixing local for high-volume cheap steps and cloud for the heavier reasoning.

Recorded the whole build (debugging included) as a video if anyone wants to see the messy version: https://youtu.be/tXbBnkdemqE.
Proof of concept code is here: https://github.com/digitalhobbit/gammavibe-labs/tree/main/local-recipe-generator.

reddit.com
u/digitalhobbit — 13 days ago