r/ollama

▲ 70 r/ollama+1 crossposts

Ollama Gemma4:31b on 3090 - FP,Q8,Q4 Benchmark

I was looking for user benchmarks this morning to see what others have been able to do on their 3090s. Nothing seemed to exist anywhere so I had Claude run them.

In case anyone is interested:

Gemma 4 31B Dense — Flash Attention + Q4 KV Cache on RTX 3090 (24GB)

Two Ollama env vars completely transformed this model's usability. The dense model went from a 16K context ceiling at 15 tok/s to full speed through 128K.

Before (FP16 KV, no Flash Attention):

Context   tok/s    VRAM
8K        15.4     22,166 MiB
16K       15.4     23,590 MiB
32K       7.5 ⚠️   23,950 MiB
64K       3.8 ⚠️   23,660 MiB

After (FA + Q4_0 KV Cache):

Context   tok/s    VRAM
8K        29.8     20,960 MiB
16K       29.8     21,136 MiB
32K       29.6     21,528 MiB
64K       29.6     22,312 MiB
100K      29.5     23,246 MiB
128K      29.6     23,930 MiB
200K      14.1 ⚠️  23,630 MiB
256K      10.0 ⚠️  24,110 MiB

Config (add to Ollama systemd service):

OLLAMA_FLASH_ATTENTION=1
OLLAMA_KV_CACHE_TYPE=q4_0
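As a systemd drop-in, that can look like this (a sketch; assumes the default `ollama.service` unit from the Linux install script):

```ini
# /etc/systemd/system/ollama.service.d/override.conf
[Service]
Environment="OLLAMA_FLASH_ATTENTION=1"
Environment="OLLAMA_KV_CACHE_TYPE=q4_0"
```

Then `sudo systemctl daemon-reload && sudo systemctl restart ollama` to pick it up.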

Why it works so well on Gemma 4: Only 10 of 60 layers use global attention with full KV cache. The other 50 use sliding window (512-1024 tokens), so the KV cache barely grows with context. Q4 quantization on an already-small KV cache keeps everything in VRAM through 128K with zero CPU offload.
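A rough back-of-envelope for why the cache stays so small (a sketch; the KV head count and head dim below are illustrative defaults, not Gemma's actual config):

```python
def kv_cache_bytes(ctx_len, n_global, n_sliding, window,
                   n_kv_heads=8, head_dim=128, bytes_per_elem=0.5):
    """Approximate KV-cache size: K and V tensors per layer.

    Global-attention layers cache all ctx_len tokens; sliding-window
    layers cap at `window` tokens. q4_0 is roughly 0.5 bytes/element.
    """
    per_token = 2 * n_kv_heads * head_dim * bytes_per_elem  # K + V
    return (n_global * ctx_len + n_sliding * min(ctx_len, window)) * per_token

# At 128K context, the 50 sliding-window layers contribute almost nothing:
full = kv_cache_bytes(131072, n_global=10, n_sliding=50, window=1024)
print(f"{full / 2**30:.2f} GiB")  # vs ~7.5 GiB if all 60 layers were global
```

Under these assumptions the whole cache lands around 1.3 GiB at 128K, which is why it never spills out of the 24GB card.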

The Gemma 3 KV cache speed bug (which dropped tok/s by 80%) does not appear on Gemma 4 with Ollama 0.20.2.

Hardware: RTX 3090 24GB, Ollama 0.20.2, gemma4:31b Q4_K_M

reddit.com
u/---NiKoS--- — 16 hours ago
▲ 7 r/ollama

OpenStitch, open-source AI UI prototyping tool that runs locally with Ollama

https://reddit.com/link/1sci25b/video/fpqaqqnjn6tg1/player

Built this over the past few days. You describe a screen (or drop a screenshot, or sketch a wireframe) and it generates rendered, interactive  frontend code on an infinite canvas. Link screens into flows and prototype them in-app.           

Runs fully local with Ollama. No cloud, no accounts. OpenRouter works too  if you want stronger vision models.                                                                                                                   

Main workflows:                                                            

  - Generate: describe a full product, get multiple screens with a shared design system

  - Screenshot to UI: drop a screenshot or wireframe sketch, get a code replica     

  - Iterate: refine any screen with follow-up prompts

Stack: React + FastAPI + SQLite + Ollama. Runs via Docker Compose.                                    
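For anyone curious what that stack wires up to, a Compose file for it might look roughly like this (purely illustrative; service names and ports are guesses, and the repo ships its own docker-compose.yml):

```yaml
services:
  ollama:
    image: ollama/ollama          # official Ollama image
    volumes:
      - ollama:/root/.ollama      # persist pulled models
  app:                            # OpenStitch backend + frontend (name assumed)
    build: .
    environment:
      OLLAMA_HOST: http://ollama:11434
    ports:
      - "3000:3000"               # port assumed
volumes:
  ollama: {}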

Tested with Qwen3-coder:30b for code and Qwen3.5-122B-A10B for vision.                            

https://github.com/iohelder/openstitch 

u/heldernoid — 8 hours ago
▲ 4 r/ollama+1 crossposts

how are you guys running mlx-community/gemma-4-31b-8bit on Mac?

mlx-lm? mlx-vlm? i'm having a lot of trouble getting it to run and then getting it to work properly. i sent a quick test using curl and it answered me correctly on the first try, but the 2nd time, when i used curl with a different prompt, instead of giving me a 'correct' response it just started spewing out random prompts.

Gemini thinks it has something to do with the chat template?

all i'm trying to do is manually benchmark the 3 variants that I have on my 64GB m1 max:

  • Gemma 4 Q4 GGUF: Unsloth
  • Gemma 4 Q6 GGUF: Unsloth
  • Gemma 4 8-bit MLX: Unsloth, converted by MLX-community

I want to test the speed and quality of each to see if MLX is worth keeping for its speed at the cost of "quality"
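The "random prompts" symptom usually does mean the chat template wasn't applied: the model gets raw text instead of turn markers, so it just continues the transcript. A simplified sketch of what a Gemma-style template produces (the real template ships with the model's tokenizer config; this is my approximation):

```python
def to_gemma_prompt(messages):
    """Wrap each turn in Gemma-style markers and cue the model to answer."""
    parts = []
    for m in messages:
        role = "model" if m["role"] == "assistant" else "user"
        parts.append(f"<start_of_turn>{role}\n{m['content']}<end_of_turn>\n")
    parts.append("<start_of_turn>model\n")  # generation prompt
    return "".join(parts)

prompt = to_gemma_prompt([{"role": "user", "content": "hi"}])
# With mlx-lm in Python, tokenizer.apply_chat_template(messages,
# add_generation_prompt=True) does this for you; POSTing the bare string
# to a raw completion endpoint skips it, and the model free-associates.
```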

u/PinkySwearNotABot — 6 hours ago
▲ 7 r/ollama+1 crossposts

Gemma4 Web Search Hang

I have no problem using Open WebUI with self-hosted SearXNG for web search, self-hosted Firecrawl as the web loader, and gpt-oss (either 20b or 120b). However, it does not work with Gemma4 (either 26b or 31b): it can still search and load, but it stops at "Retrieved ... sources", nothing more, and nothing shows in the log. Has anyone experienced this?

u/Top_Ad_5318 — 10 hours ago
▲ 16 r/ollama

Difference between RAM and VRAM

So I have a system with 64GB RAM and an NVIDIA GeForce RTX 3080 Ti. I am confused about RAM vs VRAM; I see that my GPU has 12GB of VRAM.

I wanted to give Gemma4:31b a try but I see that it is 20GB in size. I am a noob, so forgive me, but can't that load into RAM instead of the GPU? Also, based on my config, any good agentic coding models you can suggest? I know I'm not going to get the same as what Claude offers.
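(To the question above: yes. When a model doesn't fit in VRAM, Ollama puts as many layers as fit on the GPU and runs the rest from system RAM on the CPU, which works but is much slower. A rough sketch of the split; the layer count and overhead reserve are assumptions:)

```python
def gpu_cpu_split(model_gb, vram_gb, n_layers, reserve_gb=1.5):
    """Roughly how many transformer layers fit on the GPU.

    `reserve_gb` leaves headroom for the KV cache and driver overhead.
    """
    per_layer_gb = model_gb / n_layers
    on_gpu = max(0, min(n_layers, int((vram_gb - reserve_gb) / per_layer_gb)))
    return on_gpu, n_layers - on_gpu

# A 20GB model on a 12GB 3080 Ti, assuming ~60 layers:
print(gpu_cpu_split(20, 12, 60))  # about half the layers spill to system RAM
```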

u/ConclusionUnique3963 — 18 hours ago
▲ 11 r/ollama+1 crossposts

looking for beta testers - iOS AI chat app that is private, using ollama cloud open source models with ONLY local storage, tool use, web search, mark down etc.

I'm a big fan of open source models, but I don't have the hardware to run them myself. Ollama cloud solves that problem, which is great, but I wanted to use them on my phone too.

So I made this app, with all the things like markdown, web search, document upload, image upload, etc.

It’s called Poly_Chat.

Key elements:

  • switch between models easily
  • actually compare outputs
  • proper chat that keeps context
  • send PDFs / images
  • web search + simple tools
  • clean markdown (code actually looks right)
  • privacy and local storage
  • no one processing my data in the background
  • works with the free ollama cloud tier

The main thing for me:

Your chats stay on your phone.
No tracking, no analytics, you know where your data goes.

I didn’t build this as a startup idea or anything — I just wanted something that didn’t annoy me.

Figured other people might want the same thing.

Would love some people to test it, tell me any bug and any ideas for things to make it better!

https://preview.redd.it/z57xd5ymm4tg1.png?width=978&format=png&auto=webp&s=f7f4c42c715f664458402ded47fd7d93afb3e4b2

I plan to make this project open source once i clean up my code a bit...

https://testflight.apple.com/join/VVxXx65X

u/Left-Cauliflower-235 — 20 hours ago
▲ 3 r/ollama

Openclaw + Ollama on Pi5? well...

Guys, really need your help here.

I got a Pi 5 with 8GB RAM. It works perfectly with cloud models and also locally with "ollama run llama3.2:1b", but when I try to make it work via openclaw it's "thinking" forever without replying.

It seems like it's something on the openclaw side, since it does work when talking to ollama directly...

any advice?

u/ParaPilot8 — 8 hours ago
Images 1-4 — Day 2 of building in public: finally added a second brain UI for my agent (+ a pixel 3D office)
▲ 3 r/buildinpublic+2 crossposts

Day 2 of building in public: finally added a second brain UI for my agent (+ a pixel 3D office)

i made these changes on nanobot today, and this is one of those days where nothing looks different from outside, but internally everything shifted. until now, everything was happening through telegram, and that's still the main interface, but i kept hitting this gap: once the conversation ends, everything kind of disappears. there's no persistent layer where you can actually see what the agent is doing over time. so instead of trying to fix my behavior again, i built something around it.

what i wanted was simple: a second brain that keeps track of everything without me needing to constantly ask.

so now there’s a web UI that sits alongside telegram, not replacing it, just making everything visible and structured.

here’s what’s in place right now:

  • dashboard → shared workspace where tasks live, i can add things and the agent can pick them up and execute
  • recent activity → probably the most important part, shows what the agent is actually doing (completed tasks, generated docs, notes, assignments) and you can open anything to see full details
  • cron job viewer → all scheduled jobs in one place, what’s active, paused, when it runs next (this used to be completely invisible)
  • channel/auth layer → users can connect and configure things from the web instead of doing everything manually
  • pixel 3D office (first person view) → yeah this is experimental, but you can actually walk inside a workspace with agents, desks, screens… models are still basic but structure is there

telegram is still where everything starts: commands, quick checks, fast interaction. the web UI is just the layer where everything lands and stays. so now it's more like:
telegram → input
agent → runs / executes
web UI → shows state (second brain)

today was just frontend. nothing is wired to the actual backend yet, so it’s all static for now. later tonight i’ll start integrating this with nanobot’s codebase so the activity, tasks, cron, everything becomes real.

this is starting to feel less like a bot and more like a system that just keeps running in the background whether i’m there or not. still rough, still early, but yeah… day 2.

curious what you guys think about this direction

if you guys wanna take a look at UI only here it is : second-brain

u/Fine_Factor_456 — 10 hours ago
▲ 1 r/ollama

Found a bug while installing...

Anyone with a similar bug?

But the real reason we're at this step is that Ollama v0.20.2 has a Metal shader bug on macOS Tahoe 26.3.1: the bfloat/half types don't compile in the MetalPerformancePrimitives shaders. No model loads via Ollama on this macOS version. That's why I switched to MLX-LM (Apple's native framework for inference on Apple Silicon), which has its own Metal implementation and should work.

Summary of the path so far:

  1. Ollama installed (OK)
  2. gemma4:e4b downloaded (9.6GB, OK)
  3. Model won't load: crash in the Metal shader (Ollama + Tahoe bug)
  4. Tried CPU mode: same crash (Ollama compiles the Metal shaders anyway)
  5. Switched to MLX-LM: installed OK, now downloading the model from HuggingFace

Once the MLX model finishes downloading, inference itself should be fast (~50-80 tok/s on the M5).

u/AlanHelu — 6 hours ago
▲ 2 r/LocalLLaMA+1 crossposts

Best coding agent + model for strix halo 128 machine

I recently got my hands on a Strix Halo machine and was very excited to test my coding project. My key stack is Next.js and Python for the most part. I tried qwen3-next-coder at 4-bit quantization with 64k context with OpenCode, but I kept running into a failed tool-calling loop on file writes every time the context hit 20k.

Is that what people are experiencing? Is there a better way to do local coding agent?

u/Fireforce008 — 10 hours ago
▲ 3 r/ollama+1 crossposts

Looking for Community help testing/breaking/improving a memory integrated Ai hub

I was going to use AI to write this post but I thought it would be best to write it myself, so forgive my spelling and grammar mistakes 😬.

I've been fixated on AI memory for the past few years. After countless failed attempts and RAG reskins, I finally designed something new: "Viidnessmem and Mimir" (you may have seen my post about Mimir a few weeks ago).

I wanted to make something that's simple to use, completely free, and local for anyone, without the hassle of figuring out how to set up my system. This led to Mimir's Memory Hub: an open-source, fully local AI agent hub designed to work with any existing framework you may already use (Ollama, vLLM, APIs, local GGUF with llama.cpp, and more). The aim of this hub is to bring open-source AI to everyone with a community-driven project, "built for the community, by the community". I'm currently looking for anyone who'd be interested in testing/breaking/improving this hub.

Now, for anyone still reading that's interested in the technical side, here's a brief overview of what makes Mimir's Memory Hub different:

The Memory System (Mimir)

Memory isn't a vector database dump. Every memory has 34 fields including emotion, importance, stability, encoding mood, novelty score, narrative arc position, drift history, and more.

Memory lifecycle:

  1. Encoding: new memories are scored for novelty (compared to last 20 memories), deduplicated (Jaccard ≥ 0.55 = merge), checked for flashbulb conditions, and indexed in both a BM25 inverted index and a semantic embedding index
  2. Consolidation: Huginn (pattern detection) runs every ~15 memories, Muninn (merge/prune/strengthen) runs periodically, gist compression kicks in after 90 days
  3. Recall: 5-stage hybrid retrieval: BM25 keyword → semantic search → spreading activation through the memory graph → mood-congruent filtering → composite reranking
  4. Decay: exponential decay based on spaced-repetition stability. Each time a memory is accessed with sufficient spacing (≥12 hours), stability grows by ×1.8 with diminishing returns. Cap at 180 days
  5. Death: memories below 0.01 vividness are archived to the "attic" (recoverable, not deleted)
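The Jaccard dedup in step 1 can be sketched like this (my reading of the post, not the repo's actual code; the 0.55 merge threshold is from above):

```python
def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity between two memory texts."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not (sa or sb):
        return 0.0
    return len(sa & sb) / len(sa | sb)

def should_merge(new_mem: str, existing: str, threshold: float = 0.55) -> bool:
    """Near-duplicate memories get merged rather than stored twice."""
    return jaccard(new_mem, existing) >= threshold

print(should_merge("met alice for coffee today", "met alice for coffee"))
# 4 shared tokens / 5 total = 0.8 ≥ 0.55, so these would merge
```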

Special memory types:

  • Flashbulb: high arousal (≥0.6) + high importance (≥8) = locked in with 120-day stability floor and 85% minimum vividness. Like how you remember exactly where you were on 9/11
  • Anchored: identity-level foundational memories. 90-day stability floor, 30% vividness floor. Never fully fade
  • Cherished: sentimental favourites, decay-resistant
  • Gist: after 90 days, non-protected memories compress to first 15 words

Retrieval scoring weights:

  • 30% BM25 keyword match
  • 30% semantic similarity (all-MiniLM-L6-v2, 384-dim vectors)
  • 20% vividness (decayed importance)
  • 10% mood congruence (you recall happy memories when happy)
  • 10% recency (5-day half-life)
  • Plus bonuses for cherished (×1.1), temporal relevance, visual memories, primed memories, spreading activation discoveries
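Those weights compose into something like the following (a sketch of the weighting as described; component scores are made up for illustration):

```python
WEIGHTS = {"bm25": 0.30, "semantic": 0.30, "vividness": 0.20,
           "mood": 0.10, "recency": 0.10}

def composite_score(scores: dict, cherished: bool = False) -> float:
    """Weighted sum of per-component scores in [0, 1], then bonuses."""
    base = sum(WEIGHTS[k] * scores[k] for k in WEIGHTS)
    return base * 1.1 if cherished else base  # cherished ×1.1 bonus

s = composite_score({"bm25": 0.8, "semantic": 0.9, "vividness": 0.5,
                     "mood": 0.4, "recency": 0.6}, cherished=True)
# 0.24 + 0.27 + 0.10 + 0.04 + 0.06 = 0.71, ×1.1 ≈ 0.781
```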

Other systems like RAG/Letta/Mem0 etc. are planned to be added as standalone systems or additional memory backends, but currently Mimir is the default.

Neurochemistry Engine (5 Neurotransmitters)

Real-time simulation of 5 chemicals that actually affect behaviour:

Chemical       | Baseline | Decay rate         | What it controls
Dopamine       | 0.50     | Fast (20 min)      | Memory encoding strength (±30% importance)
Cortisol       | 0.30     | Slow (46 min)      | Attention width, flashbulb triggering (>0.70), Yerkes-Dodson performance curve
Serotonin      | 0.60     | Very slow (69 min) | Mood stability: low serotonin = moods stick, high = moods pass quickly
Oxytocin       | 0.40     | Moderate (35 min)  | Social memory encoding boost (up to +40%)
Norepinephrine | 0.50     | Fastest (17 min)   | Alert attention: high NE = more focused, low NE = better consolidation

10 event types trigger specific chemical profiles: surprise_positive, surprise_negative, conflict, warmth, novelty, resolution, achievement, loss, humor, stress.
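If we read the decay rates in the table as half-life-style returns to baseline (an assumption on my part; the repo may use a different curve), the relaxation looks like:

```python
def level_after(current, baseline, half_life_min, minutes):
    """Exponentially relax a chemical level back toward its baseline."""
    remaining = 0.5 ** (minutes / half_life_min)
    return baseline + (current - baseline) * remaining

# Dopamine spiked to 0.9 (baseline 0.50, 20-min decay): after 20 minutes
# half the excursion is gone; after an hour only 1/8 of it remains.
print(level_after(0.9, 0.50, 20, 20))  # ≈ 0.70
print(level_after(0.9, 0.50, 20, 60))  # ≈ 0.55
```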

Mood System (PAD Model)

42 emotion labels mapped to 3D vectors: Pleasure-Arousal-Dominance. Mood updates via exponential moving average (α = 0.3 × serotonin-adjusted decay). Real-time tracking with persistent mood history and trajectory analysis (improving/declining/stable, variability detection, breakthrough patterns).
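The EMA update described is, in sketch form (α shown fixed here; per the post it is actually 0.3 × a serotonin-adjusted decay):

```python
def update_mood(mood, stimulus, alpha=0.3):
    """Exponential moving average over a PAD (pleasure, arousal, dominance) vector."""
    return tuple((1 - alpha) * m + alpha * s for m, s in zip(mood, stimulus))

mood = (0.0, 0.0, 0.0)                      # neutral start
mood = update_mood(mood, (1.0, 0.5, 0.2))   # a strongly positive, mildly arousing event
# mood drifts 30% of the way toward the stimulus: ≈ (0.30, 0.15, 0.06)
```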

Mood-reactive UI: 46 emotions mapped to HSL accent colors. The entire UI shifts color smoothly in real-time as the AI's mood changes.

Presets & How They Use Memory

Mimir's Memory Hub comes with 6 preset modes, each designed to get the most out of Mimir for those use cases.

Preset    | Memory focus                                           | Chemistry | Key tags
Companion | Emotional bonds, social impressions, cherished moments | On        | <remember> <cherish> <social> <remind>
Agent     | Tasks, solutions, lessons learned, artifacts           | Off       | <task> <solution> <remind>
Character | Full emotional range, narrative arcs, dreaming         | On        | <remember> <cherish>, all emotion tags
Writer    | Story tracking, chapters, characters, world rules      | On        | <remember> <task>, creative memory
Assistant | Appointments, notes, files, daily planning             | Off       | <task> <remind> <solution>
Custom    | User-configured                                        | On        | All available

Companion uses high emotion weight (0.8), social priority, and neurochemistry to build genuine relationships. Tracks people you mention, remembers feelings, cherishes meaningful moments.

Agent uses low emotion weight (0.2), task priority, 21 tools (file R/W, shell, code execution, web search, HTTP requests, screenshots, clipboard, etc.), and solution pattern matching. Learns from past failures via the Zeigarnik-boosted lesson system.

Character maxes emotion weight (1.0) for full immersive roleplay. The AI's mood genuinely influences responses, chemistry creates real emotional dynamics, and the rage quit mechanic means sustained negativity causes the AI to walk out.

Writer balances creativity (0.5 emotion) with project tracking. Remembers your story's characters, plot threads, chapters completed, world rules, and writing style.

Assistant is pure utility (0.15 emotion) with full tool access for appointments, reminders, file management, and daily planning.

Platform Features

10 LLM backends: Ollama, OpenAI, Anthropic, Google, OpenRouter, vLLM, OpenAI-Compatible, Custom, Local GGUF (llama-cpp-python), HuggingFace Transformers (SafeTensors GPU)

21 tools for Agent/Assistant: file read/write/search/grep, web search (DuckDuckGo or SearXNG), fetch pages, HTTP requests, shell exec, Python code execution, screenshot, clipboard, system info, diff, PDF read, CSV query, regex replace, weather, date/time, JSON parse, open apps

MCP support: Model Context Protocol with stdio and SSE transports. Auto-discovers tools from connected servers.

Vision: VL model detection (llava, moondream, qwen-vl, etc.), mmproj/CLIP for GGUF models, BLIP fallback text description for non-vision models

TTS: Edge TTS (free, many voices), HuggingFace Maya1 (GPU local), llama-server GGUF. Per-agent voice override. Browser SpeechSynthesis fallback.

STT: faster-whisper with push-to-hold mic button. Model sizes from tiny to large-v3.

Multi-agent chat: Multiple agents in one conversation. Three turn modes (address by name, sequential, all respond). Three view modes (combined, tabs, columns).

Character/Agent editor: Full creation interface + SillyTavern character card import (single or bulk). Per-agent model, backend, voice, and preset override. Isolated memory per agent.

8 visualizations: Yggdrasil graph, memory landscape, mood timeline, cherished wall, neurochemistry chart, relationships graph, topic clusters, memory attic.

See the repo for more info: Kronic90/Mimirs-Memory-Hub (Mimir's Memory Hub: multi-agent AI chat with persistent memory and SillyTavern compatibility).

u/Upper-Promotion8574 — 15 hours ago
▲ 2 r/ollama

[Ollama Cloud] - Qwen3.5 / Minimax 2.7 / Deepseek 3.1,3.2

I'm using Antigravity with Ultra and Opus 4.6 exclusively.
It's now a joke: for almost $300, after a few prompts you need to wait hours.
I need to find a full replacement for AG.
So I'm now testing Opencode using Qwen3.5:397b and Minimax (but buggy sometimes).
Has anyone used Roocode / Kilocode, and with which model / structure?
I heard Kilocode's architect mode seems powerful.

u/Hamzo-kun — 14 hours ago
▲ 2 r/ollama+1 crossposts

Hermes-agent -- What is this message about?

I recently tested Hermes Agent using gemma4:26b and I am incredibly impressed with the results; specifically, its ability to handle autonomous coding tasks with minimal prompting.

That said, I am encountering a recurring message:

>"Reasoning-only response looks like implicit context pressure — attempting compression"

I am confused as to why this is occurring given my hardware configuration. I have 32GB of VRAM (2x16GB), and `nvtop` shows only ~23GB in use. Additionally, the Ollama runner is only consuming 3.5GB of system RAM.

Why would the system report "context pressure" when there is clearly available VRAM?

u/Turbulent-Carpet-528 — 15 hours ago
▲ 1 r/ollama+1 crossposts

Dedicated EPYC servers for Ollama — real CPU inference benchmarks on CCX33 through CCX63

Running a managed Ollama deployment service (NestAI). Just shipped dedicated AMD EPYC CCX tiers. Sharing what each tier actually gives you for inference since I couldn't find good benchmarks for Hetzner CCX + Ollama anywhere.

Hardware is Hetzner CCX (EPYC Milan/Genoa dedicated vCPU):

CCX33 (8 dedicated vCPU, 32GB RAM) — +$29/mo:

  • Mistral 7B: ~12-15 tok/s
  • DeepSeek R1 14B: ~5-7 tok/s
  • Qwen 2.5 32B Q4: fits but slow, ~3-4 tok/s

CCX43 (16 dedicated vCPU, 64GB RAM) — +$59/mo:

  • Mistral 7B: ~15-18 tok/s
  • Phi-4 14B: ~7-10 tok/s
  • DeepSeek R1 32B: ~5-7 tok/s
  • Llama 3.3 70B Q4: fits, ~2-3 tok/s

CCX53 (32 dedicated vCPU, 128GB RAM) — +$119/mo:

  • 7B models: ~20+ tok/s
  • 32B models: ~8-10 tok/s
  • 70B models: ~3-5 tok/s
  • Can load multiple models simultaneously

CCX63 (48 dedicated vCPU, 192GB RAM) — +$179/mo:

  • Can run 70B + 7B simultaneously
  • Best case 70B: ~4-6 tok/s
  • Enough RAM for multiple 32B models loaded at once

All running Ollama latest with OLLAMA_FLASH_ATTENTION=1, OLLAMA_KEEP_ALIVE=-1, Q4_K_M quantization default.

The big difference from shared vCPU isn't peak speed — it's consistency. Shared CX43 can spike to 15 tok/s at 3 AM and drop to 6 tok/s at peak hours. Dedicated stays flat.

Still CPU, not GPU. If someone asks "why not just get an A4000" — you're absolutely right for raw performance. But for teams that need data residency guarantees (EU/GDPR, Singapore/PDPA) and can't ship data to a GPU cloud provider, dedicated CPU in a specific Hetzner datacenter is the tradeoff.

These tiers are add-ons on top of NestAI's managed plans ($39-299/mo). The managed part handles provisioning, Open WebUI, SSL, monitoring, team auth.

u/chiruwonder — 11 hours ago
▲ 2 r/ollama

tried running qwen 3.5:9b, it seems to have no memory or output? confused if i'm using it wrong

i'm trying to use ollama with qwen 3.5:9b on an m1 pro chip macbook pro - canirun.ai says it should run well

every time I ask a question it goes into a thinking loop for several minutes, and when that's done it gives no output. another thing: it seems to have no memory of the conversation. every new question seems to be treated as a separate session with 0 context. is that because of this particular model? or is it just an ollama thing? is there another model that would be recommended? for reference, i'm trying to do a personal finance analysis by feeding it my bank and credit card statements

i'm new to ollama and trying to use open source models so I have no clue about how any of this works tbh so any help would be appreciated.
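(On the "no memory" part: Ollama's API is stateless per request; the `/api/chat` endpoint only sees whatever history you resend each turn, so clients have to accumulate the message list themselves. A sketch, with the HTTP call left commented out so it doesn't need a running server:)

```python
history = []  # the client, not the server, owns the conversation

def chat_body(user_msg, model="qwen3.5:9b"):
    """Build the /api/chat payload; the whole history must be resent each turn."""
    history.append({"role": "user", "content": user_msg})
    return {"model": model, "messages": list(history), "stream": False}

body = chat_body("summarize my March statement")
# import requests
# reply = requests.post("http://localhost:11434/api/chat", json=body).json()
# history.append(reply["message"])  # keep the assistant turn for next time
```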

u/Fickle_Currency_478 — 18 hours ago
▲ 0 r/ollama

Zora - Your Ai Co Worker

So I've been building something for the last few months and I've finally open-sourced it.

It's called Zora: basically Jarvis, but it runs on your own hardware. No cloud, no subscriptions, no data leaving your machine (unless you use plan mode, which can use Codex if you wish).

She runs a custom trained AI model on Apple Silicon, handles my emails, WhatsApp, Teams, triages my inbox, preps me before meetings with talking points about the people I'm meeting, tracks my commitments, monitors my infrastructure, and even works overnight while I sleep.

The brain fits on a 16GB Mac Mini with headroom. I built a custom Metal GPU kernel for 3-bit KV cache compression to make that possible. She has 150+ tools, learns how I talk to different people, and drafts replies with my tone.

Additionally, you can add compute resource using the node functionality with unlimited nodes/compute potential. This is handled all through the orchestrator layer.

She also has her own 3D office that she decorates herself. Plants grow over time. She picks her own pet. It's the little things.

It's still early, and there are sharp edges, but it's real and it works. Built with MLX, FastAPI, and a lot of late nights.

You can even run Claude Code against a local model through Zora on MLX, using the built-in API.

If you've got a Mac and you're into AI/self-hosting, give it a go. Or just have a look at the README.

It's free, open source and always will be.

https://github.com/Azkabanned/Zora

Would love to hear what people think. Contributions welcome.

u/Covert-Agenda — 14 hours ago
Week