This sub has become a cesspool of vibecoded slop
We need a bot that automatically rejects any post that begins with "I built a..."
Hey everyone,
I wanted to share a project I've been working on called Glia. It is a 100% offline, local-first RAG and memory layer designed to connect your AI web chats (Claude, ChatGPT, DeepSeek) with your local developer tools (Claude Code, Cursor, Windsurf) using a unified local database.
I wanted something lightweight that did not require pulling heavy Docker containers or subscribing to third-party memory APIs. I settled on a Node.js + SQLite architecture running sqlite-vec (for 768-dim float32 embeddings) alongside SQLite FTS5 for hybrid search, powered completely by local Ollama instances.
We just launched a live website that outlines the details and demonstrates the features in action:
Technical Stack & Features:
The extension works on Claude.ai, ChatGPT, DeepSeek, Gemini, Grok, and Mistral. The MCP server runs out of the same backend database for your terminal agent or Cursor.
You can set it up with a single command: npx glia-ai-setup
Glia is completely open-source (MIT). If you like the local-first approach or want to contribute to the SQLite vector pipeline, PRs are very welcome, and a star on GitHub helps the project get discovered!
I would appreciate any feedback on the SQLite hybrid search scaling, the scoring fusion algorithm (RAG pipeline details are in RAG_PIPELINE.md), or local graph extraction performance!
Every AI coding agent I've used treats security as a permission prompt: "allow this bash command? y/N". That's fine for catching rm -rf / mid-agent. It does nothing about the prompt that just got built from your repo and is about to ship a .env value, a private key, or a customer ID to api.anthropic.com.
So I wrote gnoma, a coding agent in Go where security isn't a permission UI — it's a layer the rest of the code can't bypass.
Architecture, top to bottom:
SafeProvider. There is one code path from gnoma's internals to any LLM endpoint, and it goes through a scanner that runs regex patterns (AWS keys, GCP service accounts, Stripe, GitHub PATs, private-key PEMs, etc.) plus a Shannon-entropy detector on the outgoing message and system prompt. Hits are redacted, blocked, or warned per config, before the network call.
git diff that surfaces a private key, a cat .env, a curl response: all scanned before the LLM ever sees them. Same scanner, opposite direction.
plugin.json SHA-256-pinned on first load. Manifest changes on disk = plugin refuses to load. SSH host-key discipline, applied to LLM tooling. No opt-out.
EvalSymlinks errors, so the caller skips the symlink check, so the write proceeds through a symlinked parent and lands outside the workspace" gets defeated by walking back to an existing ancestor, resolving it, then rejoining the tail.
default, accept_edits, bypass, plan, deny, auto). Deny rules fire before any mode check, including bypass. Compound commands like echo ok && rm -rf / are split with a proper POSIX shell parser, so an rm -rf deny isn't smuggled past in a && chain.
Ctrl+X toggles a mode where the session isn't persisted, the router doesn't learn from the turn, and there's no on-disk trace of the conversation.
What it actually is, beyond the security layer:
A provider-agnostic coding agent. Multi-armed bandit router across whatever providers you have configured — cloud or local. A tiny SLM (≤1B, on Ollama / llama.cpp / llamafile) classifies every prompt and handles the trivial ones itself so the heavy model only runs on real work. MCP servers, skills, hooks, plugins. One static Go binary, CGO_ENABLED=0, no Node/Python runtime.
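To make the outgoing scanner idea from the architecture list concrete, here is a minimal sketch in Python (gnoma itself is written in Go, and its real rule set is not shown in this post). The two patterns, the 20-character token length, and the 4.0 bits-per-character threshold are all illustrative assumptions, not gnoma's actual configuration.

```python
import math
import re
from collections import Counter

# Illustrative patterns only (assumption): not gnoma's real rules.
PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key_pem": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
}


def shannon_entropy(s: str) -> float:
    """Bits per character of the empirical character distribution of s."""
    if not s:
        return 0.0
    counts = Counter(s)
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def scan(text: str, min_len: int = 20, threshold: float = 4.0):
    """Return (rule_name, matched_text) pairs for regex and entropy hits."""
    findings = []
    for name, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            findings.append((name, match.group(0)))
    # Entropy pass: long token-like runs whose characters look random.
    for token in re.findall(r"[A-Za-z0-9+/=_\-]{%d,}" % min_len, text):
        if shannon_entropy(token) >= threshold:
            findings.append(("high_entropy", token))
    return findings


def redact(text: str, findings) -> str:
    """Replace every finding with a placeholder before the text leaves the machine."""
    for _, secret in findings:
        text = text.replace(secret, "[REDACTED]")
    return text
```

A caller would run `redact(prompt, scan(prompt))` on the final prompt right before the network call. A real scanner also needs an allow-list, since hashes and long identifiers can trip the entropy check.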
What it doesn't do:
curl, that's still on you.
plugin.json, not the binaries it references. Treat the plugin directory itself as a filesystem-permissions trust boundary.
Install:
# pre-built binary (linux / macos / windows × amd64 / arm64)
# grab the archive for your platform:
https://github.com/VikingOwl91/gnoma/releases
# go install
go install somegit.dev/Owlibou/gnoma/cmd/gnoma@latest
# docker (multi-arch)
docker pull ghcr.io/vikingowl91/gnoma:latest
docker run --rm -it -v "$PWD:/workspace" ghcr.io/vikingowl91/gnoma:latest
# from source
git clone https://github.com/VikingOwl91/gnoma && cd gnoma && make build
Point at any OpenAI-compatible endpoint:
gnoma
gnoma --provider ollama --model qwen2.5-coder:3b
gnoma --provider llamacpp # uses whatever your llama-server reports
Apache-2.0. Source: https://github.com/VikingOwl91/gnoma
Happy to go deep on the firewall design, the TOFU threat model, or the path canonicalization edge cases.
Currently using OpenClaw with Claude Opus 4.7 for browser automation workflows — pulling listings, researching properties, drafting documents, running multi-step agent tasks. Paying $280/month between Claude and Codex subscriptions.
Seriously considering a Mac Studio M4 Ultra 192GB to run local AI and cut that bill down. From everything I've read, the best local setup gets you to roughly 85% of cloud quality.
My main questions for anyone who's actually run both side by side:
Not a developer, more of a power user running automated real estate workflows. Privacy is a plus but mainly trying to figure out if the quality drop is something I'd feel constantly or just on edge cases.
I use Claude Code via Ollama to manipulate files and folders on my MacBook.
I’ve tried smaller models like Gemma 4 and Qwen 2.5 Coder in 7B, but they don’t work well (or maybe I just don’t know how to use them properly).
I’ve also tried larger 14B models, such as Qwen2.5-Coder-14B, but when I run a prompt, my MacBook slows down a lot, sometimes freezes for a few seconds, and I have to wait several minutes. I was wondering if this is normal.
I genuinely feel like local AI is being massively underestimated right now.
Not because the models are bad anymore, but because the experience around them is still too technical for most people. Cloud AI dominates mostly because it’s simple: you open an app and it just works.
But local AI already has huge advantages in privacy, ownership and long-term cost, and hardware keeps getting better every year. That’s why I honestly think the future is hybrid AI: local by default, cloud only when needed.
So I started building a project called Euler around this idea. The goal is to make local AI feel as seamless as using ChatGPT — your own AI node running at home, accessible from any device, with optional cloud fallback when you need more power.
Still early, but I really think local AI is missing its “ChatGPT moment” in terms of usability.
So I need to know: would you actually use something like this? Or am I building this for no one?
I’d love to know because I’ve been spending a lot of time on this.
I have questions about the repo: I want to know how to get it seen so that people can give me real feedback. Maybe it’s useful for someone who can build on it and make something better.
thanks for ur time :)
https://github.com/Hash-7777/HashCortX
(Please be positive: I got negative feedback at first because I didn’t know to commit and push as I worked. I kept local saves and then pushed everything as one commit when I finished the whole thing.)
A folder of .md files is not memory.
It’s a storage dump.
Useful AI memory needs more than “search old notes and pray”:
- semantic recall, so related ideas surface even when wording differs
- entities, so different terms for the same thing don’t become random blobs
- relationships, so the system knows how things connect
- provenance, so it can trace where facts came from
- correction + forgetting, because stale memory is worse than no memory
- background consolidation, because raw chat logs are mostly sludge
Thoth uses a local personal knowledge graph + FAISS semantic search + graph expansion + document ingestion + wiki export.
So yes, you can still get readable notes.
But underneath, the assistant isn’t just rifling through markdown like a raccoon in a filing cabinet.
It’s building structured personal context it can retrieve, update, connect, and reason over.
That’s the difference between “I saved your notes” and “I actually know what matters.”
Relevant references:
FAISS docs: efficient similarity search and clustering of dense vectors.
Microsoft GraphRAG: combines text extraction, network analysis, LLM prompting, and summarisation for richer understanding of text datasets.
GraphRAG survey on arXiv: graphs encode heterogeneous and relational information, making them useful for retrieval-augmented generation.
Thoth README memory features: personal knowledge graph, typed relations, FAISS semantic recall, graph expansion, document extraction, wiki export, Dream Cycle refinement.
Hi, I'm new to Open WebUI, I installed Open WebUI in Docker, it works well.
However, upon installation it asked me to install Ollama locally, I skipped it. Can I still install it now, afterwards? I want to use some local LLMs.
I tried to search for the answer myself but couldn't find it.
If you run multiple models in the same session, be it a coding LLM, a reasoning LLM, different ComfyUI checkpoints depending on what you're generating, you already know the problem. Every swap loads gigabytes off disk. Fast NVMe makes it bearable. SATA or spinning rust makes it genuinely painful. And Windows will evict those file cache pages whenever something else needs memory, so you can't count on the OS keeping them warm for you.
I wrote a Windows app called EWE (Extended Weights Exchanger) that addresses this directly. You add your models to a "warm map," set a RAM budget, and EWE pins the weights using Windows memory APIs so they can't be evicted. The next time any application loads that model, it reads from RAM instead of going back to disk. On my setup, swaps that were taking 60-90 seconds now take under 5 seconds.
It's not magic - you need enough system RAM to hold what you want to keep warm. But if you have spare RAM sitting idle while you work, this is a pretty direct use for it.
The app is at https://accord-gpu.com/ewe/ if you want to look at what it does. Currently collecting free early access accounts and enrollments for beta access to the products I'm building. EWE is going to be a one-time purchase (no subscription), and I want to get real users on it before setting the price.
A few things I'm genuinely curious about from this community:
Honest feedback is more useful than encouragement here. If this solves a problem you don't actually have I'd rather know now.
I'm passionate about local LLMs and self-learning AI. I've always wondered: why can't an AI agent work like a human? Have a local brain; when asked, think first; if unsure, ask someone smarter (a cloud model, or search); then learn from the answer so next time you don't need to ask.
I have been trying to build autodidact, an open-source AI agent that learns from its cloud queries - the local model handles what it knows, escalates to a cloud model when uncertain, then distills the response into permanent local memory. Next similar query gets answered locally, for free.
In a 30-query session on my dev workload: 67% local-or-memory, $0.70 saved vs an all-cloud baseline. The more you use it, the cheaper and faster it gets.
What's in v1.0:
• Confidence-based routing (logprob_uncertainty + GSA pre-screen + refusal detection). Validated AUROC 0.65–0.83 across 3 model families × 2 datasets.
• Hybrid retrieval: BM25 (FTS5) + vector (FAISS), fused via Reciprocal Rank Fusion.
• Document synthesis - `autodidact learn <path>` extracts key facts in the background, not just chunks.
• Five setup modes: Local+Cloud (default), Cloud+Cloud (no GPU), Local+Local (offline learning), custom OpenAI-compatible server, Local-only.
• All state in one portable SQLite file.
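For anyone unfamiliar with the fusion step in the hybrid retrieval bullet, here is a minimal sketch of Reciprocal Rank Fusion. It is not taken from the autodidact source: the function name is hypothetical, and `k=60` is the constant commonly used with RRF, assumed here rather than confirmed for this project.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1. Ties are broken by document id for determinism.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists from the two retrievers.
bm25_hits = ["a", "b", "c"]    # keyword search (e.g. FTS5)
vector_hits = ["b", "d", "a"]  # embedding search (e.g. FAISS)
fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
```

RRF only looks at rank positions, so the BM25 and vector scores never need to be normalized against each other.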
The routing layer is grounded in a paper I published recently (https://arxiv.org/abs/2605.02241) - average token log-probability matches or beats trained routing classifiers (RouteLLM-style) at zero per-model training cost, and transfers across query distributions where supervised baselines collapse.
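As a rough illustration of the average log-probability signal only (the GSA pre-screen and refusal detection are left out), routing can be sketched like this. The function names and the `-0.5` threshold are hypothetical, not values from autodidact or the paper; in practice the threshold would be tuned on held-out queries.

```python
def mean_logprob(token_logprobs):
    """Average per-token log-probability of a generated answer."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return sum(token_logprobs) / len(token_logprobs)


def route(token_logprobs, threshold=-0.5):
    """Keep the local answer when the model was confident, else escalate.

    threshold is a hypothetical cut-off chosen for illustration.
    """
    return "local" if mean_logprob(token_logprobs) >= threshold else "cloud"
```

For example, an answer with token log-probabilities `[-0.1, -0.2, -0.3]` averages -0.2 and stays local, while `[-1.0, -2.0]` averages -1.5 and is escalated.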
What's NOT in v1.0 (designed and scoped):
• Tool execution / ReAct loop (v2.0)
• Skill extraction — only fact extraction so far (v2.0)
• MCP server for Claude Desktop / Cursor / Gemini CLI (v2.0)
• OpenAI-compatible proxy mode (`autodidact serve` — v1.5)
• Topic-based knowledge pages instead of flat facts (v1.5)
Looking for early adopters and contributors - especially anyone with opinions on local LLM routing, RAG retrieval pipelines, or the v2.0 agent surface. What I'd love feedback on:
• Is the routing decision (logprob + GSA + refusal detection) the
right combo, or am I overweighting one signal?
• How would you structure the skill-extraction step in v2.0 - extract
procedures from cloud responses, or learn from observed task
completions?
• What's missing from the "good first issues" list for someone wanting
to contribute?
Repo: https://github.com/BuffaloTechRider/Autodidact
Install: pip install autodidact
Quickstart: autodidact init && autodidact learn <code or document path> && autodidact chat
Happy to answer questions.
I’m creating an edge device that will run a Llama model with RAG over a knowledge database for a general niche: off-grid survival. I also have a couple of other things I want to build into the application/UI, including offline maps and Meshtastic support. Should I custom-code my own UI or use Open WebUI?
I'm kinda new to this world of running LLMs locally with Ollama and stuff, so my terminology might not be spot on.
For my first project (one that I'll keep working on for a long time to make it better and better) I'm making a voice assistant (with tools), but I'm kinda stuck on choosing an LLM. I can't use a model with more than 8B parameters because I have a 4050 and a voice assistant needs to be fast. So far I've tried these models and had these problems with them:
Gemma4:e4b -> it loses context and starts behaving completely randomly sometimes, especially after exchanging a few dialogues. i guess it might be because of the context capabilities of the model.
qwen2.5:7b -> qwen models have very strict guardrails which hinder them from fully roleplaying a character (like billy butcher from the boys because of the language).
mistral:7b -> instead of calling a tool, it just leaks the json inside the response, and idk how to solve that. i thought of manually extracting the tool calls from the response but for that too I'll have to teach the model this in a system prompt to call tools in a defined way. Is there any other way of doing this or should i just do this manual extraction? also yeah, sometimes it was calling tools (in the response only) even when there was no need.
Hermes3:8b -> okay, this one's case is special... it completely ignores the system prompt, calls tools randomly, and sometimes calls them even when they are not required. I've heard that the model is pretty good in itself but it just isn't working.
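For the manual extraction idea mentioned under mistral:7b, a minimal sketch might look like the following. It assumes the leaked JSON is an object with `name` and `arguments` keys; that shape is an assumption and depends on what the system prompt asks the model to emit. The function name is hypothetical.

```python
import json


def extract_tool_calls(text):
    """Pull JSON objects shaped like {"name": ..., "arguments": ...} out of free text."""
    decoder = json.JSONDecoder()
    calls = []
    i = 0
    while True:
        start = text.find("{", i)
        if start == -1:
            break
        try:
            # raw_decode parses one JSON value starting at `start` and
            # returns it together with the index where it ended.
            obj, end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            i = start + 1  # not valid JSON here, keep scanning
            continue
        if isinstance(obj, dict) and "name" in obj and "arguments" in obj:
            calls.append(obj)
        i = end
    return calls
```

This is only a fallback for models that print the call instead of returning it in a structured field. It cannot stop the model from calling a tool when none is needed.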
I'm using Ollama's python library to communicate with the models. and for the chat history, I've set a limit on the messages array that deletes the oldest message when the array grows more than 10 entries (having assistant, tool, and user as separate entries). system prompt always remains at index 0.
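The trimming described above looks roughly like this sketch (the function name is made up for illustration, and the limit of 10 is read as counting the system prompt). One addition that is not in my current code, and is only a guess at a possible cause of the odd behaviour: after trimming, it also drops leading assistant/tool entries so the window never starts with a tool result whose request was cut off.

```python
def trim_history(messages, max_entries=10):
    """Keep the system prompt at index 0 plus the most recent entries.

    messages is a list of {"role": ..., "content": ...} dicts with the
    system prompt first. max_entries counts the system prompt too.
    """
    if len(messages) <= max_entries:
        return list(messages)
    system, rest = messages[0], messages[1:]
    keep = max(max_entries - 1, 0)
    window = rest[len(rest) - keep:]
    # Assumption: start the window on a user turn so no assistant or tool
    # entry is left without the message that triggered it.
    while window and window[0]["role"] != "user":
        window = window[1:]
    return [system] + window
```

With 10 entries and tool calls in the mix, the model may only be seeing two or three real exchanges, so a larger window could be worth testing too.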
Can you please help me by telling me what I need to learn, whether I'm missing any basic concepts, and how I can tackle these problems?
Hey everyone! I threw together a lightweight local proxy that lets you use AgentRouter models directly in OpenCode.
It runs an OpenAI-compatible server, relays streaming requests to AgentRouter, and all you need is the opencode.jsonc from the repo to configure it.
Repo: https://github.com/Fares-Nosair/opencode-agentrouter-support
Open to feedback and contributions! 🚀
It seems to say “1M context” in the model information.
I ran it using the command `ollama launch claude --model deepseek-v4-pro:cloud`.
How many times have you heard: "I can build anything, but no clients"? The 'build it and they will come' strategy is a trap. You need a rendezvous with a real-world bottleneck. 🥂 rundevoo . sbs is a marketplace where businesses post the actual problems they're willing to pay to solve. No gatekeeping, just pure bottlenecks waiting for a genius. Stop guessing what the market wants and just go find a problem that's already screaming for a solution.
I am a basic Ollama user: I just install the app, use Open WebUI, and that's it. My question is this: I am thinking of using DeepSeek R1 as the planning model and the Unsloth Qwen 3.6 35B for coding in Cline in VS Code. Since I have just a 5090 and 128GB of system RAM, instead of constantly unloading a model completely and reading it back from the SSD, I thought maybe I could use my RAM as the storage, keep the models there, and load/unload them from/to RAM instead?
I am not asking to use RAM instead of VRAM (which Ollama also does automatically). I am just asking whether it would be possible to make Ollama keep the UNUSED model in RAM, and how much speed that would give me compared to an NVMe SSD with 3-4GB/s read speed. Are we talking about a few seconds, which can be ignored, or would it matter?
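To put rough numbers on the question, here is a back-of-the-envelope calculation. The 20 GB model size and the 25 GB/s RAM-to-GPU rate are assumptions for illustration only; the 3.5 GB/s figure is the middle of the SSD range quoted above. Real load times also include allocation and setup overhead, which this ignores.

```python
def load_seconds(model_gb, bandwidth_gb_per_s):
    """Idealised transfer time: size divided by sustained bandwidth."""
    return model_gb / bandwidth_gb_per_s


MODEL_GB = 20.0         # hypothetical quantized model file size
NVME_GB_PER_S = 3.5     # middle of the 3-4GB/s range from the post
RAM_TO_GPU_GB_PER_S = 25.0  # assumed effective PCIe transfer rate

from_ssd = load_seconds(MODEL_GB, NVME_GB_PER_S)
from_ram = load_seconds(MODEL_GB, RAM_TO_GPU_GB_PER_S)
print(f"from NVMe: ~{from_ssd:.1f}s, from RAM: ~{from_ram:.1f}s")
```

Under these assumptions the gap is a few seconds per swap, which only adds up if the planning and coding models alternate often.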
Hey guys,
I wanted to share a side project I've been building on my laptop for the past few weeks. It's called HERO ZAN, and it's basically a fully offline, private AI assistant that can speak, listen, and see through the webcam without using any external APIs or cloud services.
I wanted something that supports Arabic natively, has a low latency, and doesn't melt my system resources. Here is the stack I ended up using to make it work:
Ollama as the backend for the LLM (I'm using qwen2.5-coder:7b since it handles Arabic really well and gives solid reasoning).
Faster-Whisper (medium model) for speech-to-text. It's surprisingly fast on local hardware.
Piper TTS for the voice output. Finding a good, natural-sounding local Arabic TTS was a pain, but Piper ONNX models did the trick.
Moondream (via Ollama) for the vision part. If you ask it "شايف إيه؟" (What do you see?), it grabs a frame from the webcam and describes it.
CustomTkinter for a simple GUI, featuring a small animated cartoon face that changes its expression depending on what the assistant is doing (thinking, listening, talking, etc.).
Everything runs locally on my machine (I'm currently testing it on a standard AMD Ryzen 5 Pro setup with 8GB RAM, and it runs smoothly without choking the system). It also has local chat history and an optional local web search via DuckDuckGo if needed.
The main reason I built this was to prove to myself that we don't need massive server farms or expensive API subscriptions to have a functional, multi-modal assistant that respects privacy 100%.
The code is fully open-source. If you want to check it out, run it locally, or contribute, here is the repo:
https://github.com/MHR-X/hero-zan
Let me know if you have any questions about the setup, the Piper TTS integration, or the performance!
I developed an LSP, VS-Code extension and NPM package, please try it out and give me your thoughts!