r/ollama

▲ 70 r/ollama+1 crossposts

Ollama Gemma4:31b on 3090 - FP,Q8,Q4 Benchmark

I was looking for user benchmarks this morning to see what others have been able to do on their 3090s. Nothing seemed to exist anywhere so I had Claude run them.

In case anyone is interested:

Gemma 4 31B Dense — Flash Attention + Q4 KV Cache on RTX 3090 (24GB)

Two Ollama env vars completely transformed this model's usability. The dense model went from a 16K context ceiling at 15 tok/s to full speed through 128K.

Before (FP16 KV, no Flash Attention):

Context   tok/s    VRAM
8K        15.4     22,166 MiB
16K       15.4     23,590 MiB
32K       7.5 ⚠️   23,950 MiB
64K       3.8 ⚠️   23,660 MiB

After (FA + Q4_0 KV Cache):

Context   tok/s    VRAM
8K        29.8     20,960 MiB
16K       29.8     21,136 MiB
32K       29.6     21,528 MiB
64K       29.6     22,312 MiB
100K      29.5     23,246 MiB
128K      29.6     23,930 MiB
200K      14.1 ⚠️  23,630 MiB
256K      10.0 ⚠️  24,110 MiB

Config (add to Ollama systemd service):

OLLAMA_FLASH_ATTENTION=1
OLLAMA_KV_CACHE_TYPE=q4_0
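As a systemd drop-in, that can look like this (a sketch; assumes the default `ollama.service` unit from the Linux install script):

```ini
# /etc/systemd/system/ollama.service.d/override.conf
[Service]
Environment="OLLAMA_FLASH_ATTENTION=1"
Environment="OLLAMA_KV_CACHE_TYPE=q4_0"
```

Then `sudo systemctl daemon-reload && sudo systemctl restart ollama` to pick it up.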

Why it works so well on Gemma 4: Only 10 of 60 layers use global attention with full KV cache. The other 50 use sliding window (512-1024 tokens), so the KV cache barely grows with context. Q4 quantization on an already-small KV cache keeps everything in VRAM through 128K with zero CPU offload.
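A rough back-of-envelope for why the cache stays so small (a sketch; the KV head count and head dim below are illustrative defaults, not Gemma's actual config):

```python
def kv_cache_bytes(ctx_len, n_global, n_sliding, window,
                   n_kv_heads=8, head_dim=128, bytes_per_elem=0.5):
    """Approximate KV-cache size: K and V tensors per layer.

    Global-attention layers cache all ctx_len tokens; sliding-window
    layers cap at `window` tokens. q4_0 is roughly 0.5 bytes/element.
    """
    per_token = 2 * n_kv_heads * head_dim * bytes_per_elem  # K + V
    return (n_global * ctx_len + n_sliding * min(ctx_len, window)) * per_token

# At 128K context, the 50 sliding-window layers contribute almost nothing:
full = kv_cache_bytes(131072, n_global=10, n_sliding=50, window=1024)
print(f"{full / 2**30:.2f} GiB")  # vs ~7.5 GiB if all 60 layers were global
```

Under these assumptions the whole cache lands around 1.3 GiB at 128K, which is why it never spills out of the 24GB card.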

The Gemma 3 KV cache speed bug (which dropped tok/s by 80%) does not appear on Gemma 4 with Ollama 0.20.2.

Hardware: RTX 3090 24GB, Ollama 0.20.2, gemma4:31b Q4_K_M

reddit.com
u/---NiKoS--- — 16 hours ago
▲ 7 r/ollama

OpenStitch, open-source AI UI prototyping tool that runs locally with Ollama

https://reddit.com/link/1sci25b/video/fpqaqqnjn6tg1/player

Built this over the past few days. You describe a screen (or drop a screenshot, or sketch a wireframe) and it generates rendered, interactive  frontend code on an infinite canvas. Link screens into flows and prototype them in-app.           

Runs fully local with Ollama. No cloud, no accounts. OpenRouter works too  if you want stronger vision models.                                                                                                                   

Main workflows:                                                            

  - Generate: describe a full product, get multiple screens with a shared design system

  - Screenshot to UI: drop a screenshot or wireframe sketch, get a code replica     

  - Iterate: refine any screen with follow-up prompts

Stack: React + FastAPI + SQLite + Ollama. Runs via Docker Compose.                                    
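For anyone curious what that stack wires up to, a Compose file for it might look roughly like this (purely illustrative; service names and ports are guesses, and the repo ships its own docker-compose.yml):

```yaml
services:
  ollama:
    image: ollama/ollama          # official Ollama image
    volumes:
      - ollama:/root/.ollama      # persist pulled models
  app:                            # OpenStitch backend + frontend (name assumed)
    build: .
    environment:
      OLLAMA_HOST: http://ollama:11434
    ports:
      - "3000:3000"               # port assumed
volumes:
  ollama: {}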

Tested with Qwen3-coder:30b for code and Qwen3.5-122B-A10B for vision.                            

https://github.com/iohelder/openstitch 

u/heldernoid — 8 hours ago
▲ 4 r/ollama+1 crossposts

how are you guys running mlx-community/gemma-4-31b-8bit on Mac?

mlx-lm? mlx-vlm? i'm having a lot of trouble getting it to run and then getting it to work properly. i sent a quick test using curl and it answered me correctly on the first try, but the 2nd time, when i used curl with a different prompt, instead of giving me a 'correct' response it just started spewing out random prompts.

Gemini thinks it has something to do with the chat template?

all i'm trying to do is manually benchmark the 3 variants that I have on my 64GB m1 max:

  • Gemma 4 Q4 GGUF: Unsloth
  • Gemma 4 Q6 GGUF: Unsloth
  • Gemma 4 8-bit MLX: Unsloth, converted by MLX-community

I want to test the speed and quality of each to see if MLX is worth keeping for its speed at the cost of "quality"
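The "random prompts" symptom usually does mean the chat template wasn't applied: the model gets raw text instead of turn markers, so it just continues the transcript. A simplified sketch of what a Gemma-style template produces (the real template ships with the model's tokenizer config; this is my approximation):

```python
def to_gemma_prompt(messages):
    """Wrap each turn in Gemma-style markers and cue the model to answer."""
    parts = []
    for m in messages:
        role = "model" if m["role"] == "assistant" else "user"
        parts.append(f"<start_of_turn>{role}\n{m['content']}<end_of_turn>\n")
    parts.append("<start_of_turn>model\n")  # generation prompt
    return "".join(parts)

prompt = to_gemma_prompt([{"role": "user", "content": "hi"}])
# With mlx-lm in Python, tokenizer.apply_chat_template(messages,
# add_generation_prompt=True) does this for you; POSTing the bare string
# to a raw completion endpoint skips it, and the model free-associates.
```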

u/PinkySwearNotABot — 6 hours ago
▲ 7 r/ollama+1 crossposts

Gemma4 Web Search Hang

I have no problem using Open WebUI with self-hosted SearXNG for web search, self-hosted Firecrawl as the web loader, and gpt-oss (either 20b or 120b). However, it does not work with Gemma4 (either 26b or 31b): it can still search and load, but it stops at "Retrieved ... sources", nothing more, and nothing shows in the log. Has anyone experienced this?

u/Top_Ad_5318 — 10 hours ago
▲ 16 r/ollama

Difference between RAM and VRAM

So I have a system with 64GB RAM and an NVIDIA GeForce RTX 3080 Ti. I am confused about RAM vs VRAM; I see that my GPU has 12GB of VRAM.

I wanted to give Gemma4:31b a try but I see that it is 20GB in size. I am a noob, so forgive me, but can't that load into RAM instead of the GPU? Also, based on my config, any good agentic coding models you can suggest? I know I'm not going to get the same as what Claude offers.
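(To the question above: yes. When a model doesn't fit in VRAM, Ollama puts as many layers as fit on the GPU and runs the rest from system RAM on the CPU, which works but is much slower. A rough sketch of the split; the layer count and overhead reserve are assumptions:)

```python
def gpu_cpu_split(model_gb, vram_gb, n_layers, reserve_gb=1.5):
    """Roughly how many transformer layers fit on the GPU.

    `reserve_gb` leaves headroom for the KV cache and driver overhead.
    """
    per_layer_gb = model_gb / n_layers
    on_gpu = max(0, min(n_layers, int((vram_gb - reserve_gb) / per_layer_gb)))
    return on_gpu, n_layers - on_gpu

# A 20GB model on a 12GB 3080 Ti, assuming ~60 layers:
print(gpu_cpu_split(20, 12, 60))  # about half the layers spill to system RAM
```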

u/ConclusionUnique3963 — 18 hours ago
▲ 11 r/ollama+1 crossposts

looking for beta testers - iOS AI chat app that is private, using ollama cloud open source models with ONLY local storage, tool use, web search, mark down etc.

I'm a big fan of open source models, but I don't have the hardware to run them myself. Ollama cloud solves that problem, which is great, but I wanted to use them on my phone too.

So I made this app, with all the things like markdown, web search, document upload, image upload, etc.

It’s called Poly_Chat.

Key elements:

  • switch between models easily
  • actually compare outputs
  • proper chat that keeps context
  • send PDFs / images
  • web search + simple tools
  • clean markdown (code actually looks right)
  • privacy and local storage
  • no one processing my data in the background
  • works with the free ollama cloud tier

The main thing for me:

Your chats stay on your phone.
No tracking, no analytics, you know where your data goes.

I didn’t build this as a startup idea or anything — I just wanted something that didn’t annoy me.

Figured other people might want the same thing.

Would love some people to test it, tell me any bug and any ideas for things to make it better!

https://preview.redd.it/z57xd5ymm4tg1.png?width=978&format=png&auto=webp&s=f7f4c42c715f664458402ded47fd7d93afb3e4b2

I plan to make this project open source once i clean up my code a bit...

https://testflight.apple.com/join/VVxXx65X

u/Left-Cauliflower-235 — 20 hours ago
▲ 3 r/ollama

Openclaw + Ollama on Pi5? well...

Guys, really need your help here.

I got a Pi 5 with 8GB RAM. It works perfectly with cloud models and also locally with "ollama run llama3.2:1b", but when I try to make it work via openclaw it's "thinking" forever without replying.

It seems like it's something on the openclaw side, since it does work when talking to ollama directly...

any advice?

u/ParaPilot8 — 8 hours ago
Images 1-4 — Day 2 of building in public: finally added a second brain UI for my agent (+ a pixel 3D office)
▲ 3 r/buildinpublic+2 crossposts

Day 2 of building in public: finally added a second brain UI for my agent (+ a pixel 3D office)

i made these changes on nanobot today, and this is one of those days where nothing looks different from outside, but internally everything shifted. until now, everything was happening through telegram, and that's still the main interface, but i kept hitting this gap: once the conversation ends, everything kind of disappears. there's no persistent layer where you can actually see what the agent is doing over time. so instead of trying to fix my behavior again, i built something around it.

what i wanted was simple: a second brain that keeps track of everything without me needing to constantly ask.

so now there’s a web UI that sits alongside telegram, not replacing it, just making everything visible and structured.

here’s what’s in place right now:

  • dashboard → shared workspace where tasks live, i can add things and the agent can pick them up and execute
  • recent activity → probably the most important part, shows what the agent is actually doing (completed tasks, generated docs, notes, assignments) and you can open anything to see full details
  • cron job viewer → all scheduled jobs in one place, what’s active, paused, when it runs next (this used to be completely invisible)
  • channel/auth layer → users can connect and configure things from the web instead of doing everything manually
  • pixel 3D office (first person view) → yeah this is experimental, but you can actually walk inside a workspace with agents, desks, screens… models are still basic but structure is there

telegram is still where everything starts: commands, quick checks, fast interaction. the web UI is just the layer where everything lands and stays. so now it's more like:
telegram → input
agent → runs / executes
web UI → shows state (second brain)

today was just frontend. nothing is wired to the actual backend yet, so it’s all static for now. later tonight i’ll start integrating this with nanobot’s codebase so the activity, tasks, cron, everything becomes real.

this is starting to feel less like a bot and more like a system that just keeps running in the background whether i’m there or not. still rough, still early, but yeah… day 2.

curious what you guys think about this direction

if you guys wanna take a look at UI only here it is : second-brain

u/Fine_Factor_456 — 10 hours ago
▲ 1 r/ollama

Found a bug while installing...

Anyone with a similar bug?

But the real reason we're at this step is that Ollama v0.20.2 has a Metal shader bug on macOS Tahoe 26.3.1: the bfloat/half types don't compile in the MetalPerformancePrimitives shaders. No model loads via Ollama on this macOS version. That's why I switched to MLX-LM (Apple's native framework for inference on Apple Silicon), which has its own Metal implementation and should work.

Summary of the path so far:

  1. Ollama installed (OK)
  2. gemma4:e4b downloaded (9.6GB, OK)
  3. Model won't load: crash in the Metal shader (Ollama + Tahoe bug)
  4. Tried CPU mode: same crash (Ollama compiles the Metal shaders anyway)
  5. Switched to MLX-LM: installed OK, now downloading the model from HuggingFace

Once the MLX model finishes downloading, inference itself should be fast (~50-80 tok/s on the M5).

u/AlanHelu — 6 hours ago
▲ 2 r/LocalLLaMA+1 crossposts

Best coding agent + model for strix halo 128 machine

I recently got my hands on a Strix Halo machine and was very excited to test my coding project. My key stack is Next.js and Python for the most part. I tried qwen3-next-coder at 4-bit quantization with 64k context with OpenCode, but I kept running into a failed tool-calling loop on file writes every time the context hit 20k.

Is that what people are experiencing? Is there a better way to do local coding agent?

u/Fireforce008 — 10 hours ago
▲ 3 r/ollama+1 crossposts

Looking for Community help testing/breaking/improving a memory integrated Ai hub

I was going to use AI to write this post but I thought it would be best to write it myself, so forgive my spelling and grammar mistakes 😬.

I've been fixated on AI memory for the past few years. After countless failed attempts and RAG reskins, I finally designed something new: "Viidnessmem and Mimir" (you may have seen my post about Mimir a few weeks ago).

I wanted to make something that's simple to use, completely free, and local for anyone, without the hassle of figuring out how to set up my system. This led to Mimir's Memory Hub: an open-source, fully local AI agent hub designed to work with any existing framework you may already use (Ollama, vLLM, APIs, local GGUF with llama.cpp, and more). The aim of this hub is to bring open-source AI to everyone with a community-driven project, "built for the community, by the community". I'm currently looking for anyone who'd be interested in testing/breaking/improving this hub.

Now, for anyone still reading that's interested in the technical side, here's a brief overview of what makes Mimir's Memory Hub different:

The Memory System (Mimir)

Memory isn't a vector database dump. Every memory has 34 fields including emotion, importance, stability, encoding mood, novelty score, narrative arc position, drift history, and more.

Memory lifecycle:

  1. Encoding: new memories are scored for novelty (compared to last 20 memories), deduplicated (Jaccard ≥ 0.55 = merge), checked for flashbulb conditions, and indexed in both a BM25 inverted index and a semantic embedding index
  2. Consolidation: Huginn (pattern detection) runs every ~15 memories, Muninn (merge/prune/strengthen) runs periodically, gist compression kicks in after 90 days
  3. Recall: 5-stage hybrid retrieval: BM25 keyword → semantic search → spreading activation through the memory graph → mood-congruent filtering → composite reranking
  4. Decay: exponential decay based on spaced-repetition stability. Each time a memory is accessed with sufficient spacing (≥12 hours), stability grows by ×1.8 with diminishing returns. Cap at 180 days
  5. Death: memories below 0.01 vividness are archived to the "attic" (recoverable, not deleted)
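The Jaccard dedup in step 1 can be sketched like this (my reading of the post, not the repo's actual code; the 0.55 merge threshold is from above):

```python
def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity between two memory texts."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not (sa or sb):
        return 0.0
    return len(sa & sb) / len(sa | sb)

def should_merge(new_mem: str, existing: str, threshold: float = 0.55) -> bool:
    """Near-duplicate memories get merged rather than stored twice."""
    return jaccard(new_mem, existing) >= threshold

print(should_merge("met alice for coffee today", "met alice for coffee"))
# 4 shared tokens / 5 total = 0.8 ≥ 0.55, so these would merge
```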

Special memory types:

  • Flashbulb: high arousal (≥0.6) + high importance (≥8) = locked in with 120-day stability floor and 85% minimum vividness. Like how you remember exactly where you were on 9/11
  • Anchored: identity-level foundational memories. 90-day stability floor, 30% vividness floor. Never fully fade
  • Cherished: sentimental favourites, decay-resistant
  • Gist: after 90 days, non-protected memories compress to first 15 words

Retrieval scoring weights:

  • 30% BM25 keyword match
  • 30% semantic similarity (all-MiniLM-L6-v2, 384-dim vectors)
  • 20% vividness (decayed importance)
  • 10% mood congruence (you recall happy memories when happy)
  • 10% recency (5-day half-life)
  • Plus bonuses for cherished (×1.1), temporal relevance, visual memories, primed memories, spreading activation discoveries
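Those weights compose into something like the following (a sketch of the weighting as described; component scores are made up for illustration):

```python
WEIGHTS = {"bm25": 0.30, "semantic": 0.30, "vividness": 0.20,
           "mood": 0.10, "recency": 0.10}

def composite_score(scores: dict, cherished: bool = False) -> float:
    """Weighted sum of per-component scores in [0, 1], then bonuses."""
    base = sum(WEIGHTS[k] * scores[k] for k in WEIGHTS)
    return base * 1.1 if cherished else base  # cherished ×1.1 bonus

s = composite_score({"bm25": 0.8, "semantic": 0.9, "vividness": 0.5,
                     "mood": 0.4, "recency": 0.6}, cherished=True)
# 0.24 + 0.27 + 0.10 + 0.04 + 0.06 = 0.71, ×1.1 ≈ 0.781
```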

Other systems like RAG/Letta/Mem0 etc. are planned to be added as standalone systems or additional memory backends, but currently Mimir is the default.

Neurochemistry Engine (5 Neurotransmitters)

Real-time simulation of 5 chemicals that actually affect behaviour:

Chemical       | Baseline | Decay rate         | What it controls
Dopamine       | 0.50     | Fast (20 min)      | Memory encoding strength (±30% importance)
Cortisol       | 0.30     | Slow (46 min)      | Attention width, flashbulb triggering (>0.70), Yerkes-Dodson performance curve
Serotonin      | 0.60     | Very slow (69 min) | Mood stability: low serotonin = moods stick, high = moods pass quickly
Oxytocin       | 0.40     | Moderate (35 min)  | Social memory encoding boost (up to +40%)
Norepinephrine | 0.50     | Fastest (17 min)   | Alert attention: high NE = more focused, low NE = better consolidation

10 event types trigger specific chemical profiles: surprise_positive, surprise_negative, conflict, warmth, novelty, resolution, achievement, loss, humor, stress.
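If we read the decay rates in the table as half-life-style returns to baseline (an assumption on my part; the repo may use a different curve), the relaxation looks like:

```python
def level_after(current, baseline, half_life_min, minutes):
    """Exponentially relax a chemical level back toward its baseline."""
    remaining = 0.5 ** (minutes / half_life_min)
    return baseline + (current - baseline) * remaining

# Dopamine spiked to 0.9 (baseline 0.50, 20-min decay): after 20 minutes
# half the excursion is gone; after an hour only 1/8 of it remains.
print(level_after(0.9, 0.50, 20, 20))  # ≈ 0.70
print(level_after(0.9, 0.50, 20, 60))  # ≈ 0.55
```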

Mood System (PAD Model)

42 emotion labels mapped to 3D vectors: Pleasure-Arousal-Dominance. Mood updates via exponential moving average (α = 0.3 × serotonin-adjusted decay). Real-time tracking with persistent mood history and trajectory analysis (improving/declining/stable, variability detection, breakthrough patterns).
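The EMA update described is, in sketch form (α shown fixed here; per the post it is actually 0.3 × a serotonin-adjusted decay):

```python
def update_mood(mood, stimulus, alpha=0.3):
    """Exponential moving average over a PAD (pleasure, arousal, dominance) vector."""
    return tuple((1 - alpha) * m + alpha * s for m, s in zip(mood, stimulus))

mood = (0.0, 0.0, 0.0)                      # neutral start
mood = update_mood(mood, (1.0, 0.5, 0.2))   # a strongly positive, mildly arousing event
# mood drifts 30% of the way toward the stimulus: ≈ (0.30, 0.15, 0.06)
```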

Mood-reactive UI: 46 emotions mapped to HSL accent colors. The entire UI shifts color smoothly in real-time as the AI's mood changes.

Presets & How They Use Memory

Mimir's Memory Hub comes with 6 preset modes, each designed to get the most out of Mimir for those use cases.

Preset    | Memory focus                                           | Chemistry | Key tags
Companion | Emotional bonds, social impressions, cherished moments | On        | <remember> <cherish> <social> <remind>
Agent     | Tasks, solutions, lessons learned, artifacts           | Off       | <task> <solution> <remind>
Character | Full emotional range, narrative arcs, dreaming         | On        | <remember> <cherish>, all emotion tags
Writer    | Story tracking, chapters, characters, world rules      | On        | <remember> <task>, creative memory
Assistant | Appointments, notes, files, daily planning             | Off       | <task> <remind> <solution>
Custom    | User-configured                                        | On        | All available

Companion uses high emotion weight (0.8), social priority, and neurochemistry to build genuine relationships. Tracks people you mention, remembers feelings, cherishes meaningful moments.

Agent uses low emotion weight (0.2), task priority, 21 tools (file R/W, shell, code execution, web search, HTTP requests, screenshots, clipboard, etc.), and solution pattern matching. Learns from past failures via the Zeigarnik-boosted lesson system.

Character maxes emotion weight (1.0) for full immersive roleplay. The AI's mood genuinely influences responses, chemistry creates real emotional dynamics, and the rage quit mechanic means sustained negativity causes the AI to walk out.

Writer balances creativity (0.5 emotion) with project tracking. Remembers your story's characters, plot threads, chapters completed, world rules, and writing style.

Assistant is pure utility (0.15 emotion) with full tool access for appointments, reminders, file management, and daily planning.

Platform Features

10 LLM backends: Ollama, OpenAI, Anthropic, Google, OpenRouter, vLLM, OpenAI-Compatible, Custom, Local GGUF (llama-cpp-python), HuggingFace Transformers (SafeTensors GPU)

21 tools for Agent/Assistant: file read/write/search/grep, web search (DuckDuckGo or SearXNG), fetch pages, HTTP requests, shell exec, Python code execution, screenshot, clipboard, system info, diff, PDF read, CSV query, regex replace, weather, date/time, JSON parse, open apps

MCP support: Model Context Protocol with stdio and SSE transports. Auto-discovers tools from connected servers.

Vision: VL model detection (llava, moondream, qwen-vl, etc.), mmproj/CLIP for GGUF models, BLIP fallback text description for non-vision models

TTS: Edge TTS (free, many voices), HuggingFace Maya1 (GPU local), llama-server GGUF. Per-agent voice override. Browser SpeechSynthesis fallback.

STT: faster-whisper with push-to-hold mic button. Model sizes from tiny to large-v3.

Multi-agent chat: Multiple agents in one conversation. Three turn modes (address by name, sequential, all respond). Three view modes (combined, tabs, columns).

Character/Agent editor: Full creation interface + SillyTavern character card import (single or bulk). Per-agent model, backend, voice, and preset override. Isolated memory per agent.

8 visualizations: Yggdrasil graph, memory landscape, mood timeline, cherished wall, neurochemistry chart, relationships graph, topic clusters, memory attic.

See the repo for more info: Kronic90/Mimirs-Memory-Hub (Mimir's Memory Hub: multi-agent AI chat with persistent memory and SillyTavern compatibility).

u/Upper-Promotion8574 — 15 hours ago
▲ 2 r/ollama

[Ollama Cloud] - Qwen3.5 / Minimax 2.7 / Deepseek 3.1,3.2

I'm using Antigravity with Ultra and Opus 4.6 exclusively.
It's now a joke: for almost $300, after a few prompts you need to wait hours.
I need to find a full replacement for AG.
So I'm now testing Opencode using Qwen3.5:397b and Minimax (but buggy sometimes).
Has anyone used Roocode / Kilocode, and with which model / structure?
I heard Kilocode's architect mode seems powerful.

u/Hamzo-kun — 14 hours ago
▲ 2 r/ollama+1 crossposts

Hermes-agent -- What is this message about?

I recently tested Hermes Agent using gemma4:26b and I am incredibly impressed with the results; specifically, its ability to handle autonomous coding tasks with minimal prompting.

That said, I am encountering a recurring message:

>"Reasoning-only response looks like implicit context pressure — attempting compression"

I am confused as to why this is occurring given my hardware configuration. I have 32GB of VRAM (2x16GB), and `nvtop` shows only ~23GB in use. Additionally, the Ollama runner is only consuming 3.5GB of system RAM.

Why would the system report "context pressure" when there is clearly available VRAM?

u/Turbulent-Carpet-528 — 15 hours ago
▲ 1 r/ollama+1 crossposts

Dedicated EPYC servers for Ollama — real CPU inference benchmarks on CCX33 through CCX63

Running a managed Ollama deployment service (NestAI). Just shipped dedicated AMD EPYC CCX tiers. Sharing what each tier actually gives you for inference since I couldn't find good benchmarks for Hetzner CCX + Ollama anywhere.

Hardware is Hetzner CCX (EPYC Milan/Genoa dedicated vCPU):

CCX33 (8 dedicated vCPU, 32GB RAM) — +$29/mo:

  • Mistral 7B: ~12-15 tok/s
  • DeepSeek R1 14B: ~5-7 tok/s
  • Qwen 2.5 32B Q4: fits but slow, ~3-4 tok/s

CCX43 (16 dedicated vCPU, 64GB RAM) — +$59/mo:

  • Mistral 7B: ~15-18 tok/s
  • Phi-4 14B: ~7-10 tok/s
  • DeepSeek R1 32B: ~5-7 tok/s
  • Llama 3.3 70B Q4: fits, ~2-3 tok/s

CCX53 (32 dedicated vCPU, 128GB RAM) — +$119/mo:

  • 7B models: ~20+ tok/s
  • 32B models: ~8-10 tok/s
  • 70B models: ~3-5 tok/s
  • Can load multiple models simultaneously

CCX63 (48 dedicated vCPU, 192GB RAM) — +$179/mo:

  • Can run 70B + 7B simultaneously
  • Best case 70B: ~4-6 tok/s
  • Enough RAM for multiple 32B models loaded at once

All running Ollama latest with OLLAMA_FLASH_ATTENTION=1, OLLAMA_KEEP_ALIVE=-1, Q4_K_M quantization default.

The big difference from shared vCPU isn't peak speed — it's consistency. Shared CX43 can spike to 15 tok/s at 3 AM and drop to 6 tok/s at peak hours. Dedicated stays flat.

Still CPU, not GPU. If someone asks "why not just get an A4000" — you're absolutely right for raw performance. But for teams that need data residency guarantees (EU/GDPR, Singapore/PDPA) and can't ship data to a GPU cloud provider, dedicated CPU in a specific Hetzner datacenter is the tradeoff.

These tiers are add-ons on top of NestAI's managed plans ($39-299/mo). The managed part handles provisioning, Open WebUI, SSL, monitoring, team auth.

u/chiruwonder — 11 hours ago
▲ 2 r/ollama

tried running qwen 3.5:9b, it seems to have no memory or output? confused if i'm using it wrong

i'm trying to use ollama with qwen 3.5:9b on an m1 pro chip macbook pro - canirun.ai says it should run well

every time I ask a question it goes into a thinking loop for several minutes, and when that's done it gives no output. another thing: it seems to have no memory of the conversation. every new question seems to be treated as a separate session with 0 context. is that because of this particular model? or is it just an ollama thing? is there another model that would be recommended? for reference, i'm trying to do a personal finance analysis by feeding it my bank and credit card statements

i'm new to ollama and trying to use open source models so I have no clue about how any of this works tbh so any help would be appreciated.
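(On the "no memory" part: Ollama's API is stateless per request; the `/api/chat` endpoint only sees whatever history you resend each turn, so clients have to accumulate the message list themselves. A sketch, with the HTTP call left commented out so it doesn't need a running server:)

```python
history = []  # the client, not the server, owns the conversation

def chat_body(user_msg, model="qwen3.5:9b"):
    """Build the /api/chat payload; the whole history must be resent each turn."""
    history.append({"role": "user", "content": user_msg})
    return {"model": model, "messages": list(history), "stream": False}

body = chat_body("summarize my March statement")
# import requests
# reply = requests.post("http://localhost:11434/api/chat", json=body).json()
# history.append(reply["message"])  # keep the assistant turn for next time
```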

u/Fickle_Currency_478 — 18 hours ago
▲ 0 r/ollama

Zora - Your Ai Co Worker

So I've been building something for the last few months and I've finally open-sourced it.

It's called Zora: basically Jarvis, but it runs on your own hardware. No cloud, no subscriptions, no data leaving your machine (unless you use plan mode, which can use Codex if you wish).

She runs a custom trained AI model on Apple Silicon, handles my emails, WhatsApp, Teams, triages my inbox, preps me before meetings with talking points about the people I'm meeting, tracks my commitments, monitors my infrastructure, and even works overnight while I sleep.

The brain fits on a 16GB Mac Mini with headroom. I built a custom Metal GPU kernel for 3-bit KV cache compression to make that possible. She has 150+ tools, learns how I talk to different people, and drafts replies with my tone.

Additionally, you can add compute resource using the node functionality with unlimited nodes/compute potential. This is handled all through the orchestrator layer.

She also has her own 3D office that she decorates herself. Plants grow over time. She picks her own pet. It's the little things.

It's still early, and there are sharp edges, but it's real and it works. Built with MLX, FastAPI, and a lot of late nights.

You can even run Claude Code against a local model through Zora on MLX, using the built-in API.

If you've got a Mac and you're into AI/self-hosting, give it a go. Or just have a look at the README.

It's free, open source and always will be.

https://github.com/Azkabanned/Zora

Would love to hear what people think. Contributions welcome.

u/Covert-Agenda — 14 hours ago
Week