
r/LocalLLM



We gave 12 LLMs a startup to run for a year. GLM-5 nearly matched Claude Opus 4.6 at 11× lower cost.
We built YC-Bench, a benchmark where an LLM plays CEO of a simulated startup over a full year (~hundreds of turns). It manages employees, picks contracts, handles payroll, and survives a market where ~35% of clients secretly inflate work requirements after you accept their task. Feedback is delayed and sparse with no hand-holding.
12 models, 3 seeds each. Here's the leaderboard:
- 🥇 Claude Opus 4.6 - $1.27M avg final funds (~$86/run in API cost)
- 🥈 GLM-5 - $1.21M avg (~$7.62/run)
- 🥉 GPT-5.4 - $1.00M avg (~$23/run)
- Everyone else - below starting capital of $200K. Several went bankrupt.
GLM-5 is the finding we keep coming back to. It's within 5% of Opus on raw performance and costs a fraction as much to run. For anyone building production agentic pipelines, the cost-efficiency curve here is real, and Kimi-K2.5 actually tops the revenue-per-API-dollar chart at 2.5× better than the next model.
The benchmark exposes something most evals miss: long-horizon coherence under delayed feedback. When you can't tell immediately whether a decision was good, most models collapse into loops, abandon strategies they just wrote, or keep accepting tasks from clients they've already identified as bad.
The strongest predictor of success wasn't model size or benchmark score; it was whether the model actively used a persistent scratchpad to record what it learned. Top models rewrote their notes ~34 times per run. Bottom models averaged 0–2 entries.
📄 Paper: https://arxiv.org/abs/2604.01212
🌐 Leaderboard: https://collinear-ai.github.io/yc-bench/
💻 Code (fully open-source): https://github.com/collinear-ai/yc-bench
Feel free to run it with your own models; happy to answer any questions!

Oracle slashes 30k jobs, Slop is not necessarily the future, Coding agents could make free software matter again and many other AI links from Hacker News
Hey everyone, I just sent the 26th issue of AI Hacker Newsletter, a weekly roundup of the best AI links and discussions from Hacker News. Here are some of the links:
- Coding agents could make free software matter again - comments
- AI got the blame for the Iran school bombing. The truth is more worrying - comments
- Slop is not necessarily the future - comments
- Oracle slashes 30k jobs - comments
- OpenAI closes funding round at an $852B valuation - comments
If you enjoy such links, I send over 30 every week. You can subscribe here: https://hackernewsai.com/
quant.cpp — 7x longer LLM context in pure C (Gemma 4 26B on 16GB Mac)
URL: https://github.com/quantumaikr/quant.cpp
I built a minimal LLM inference engine in pure C (67K LOC, zero dependencies) with one goal: extend context length without adding hardware.
The key insight: LLM inference memory is dominated by the KV cache, not model weights. Compressing the KV cache to 4-bit keys + Q4 values gives 6.9x memory reduction with negligible quality loss.
Real numbers on a 16GB Mac (M1 Pro):
| Model | FP16 KV (llama.cpp) | Compressed KV (quant.cpp) | Gain |
|---|---|---|---|
| Llama 3.2 3B | ~50K tokens | ~350K tokens | 6.9x |
| Gemma 4 26B-A4B (MoE) | ~4K tokens | ~30K tokens | 6.9x |
How it works:
- Keys: uniform 4-bit min-max quantization per 128-element block
- Values: Q4 nibble quantization with per-block scales
- Delta mode: store key[t] - key[t-1] instead of absolute keys (like video P-frames), enabling 3-bit at +1.3% PPL
- QK-norm aware: models like Gemma 4 automatically use FP32 keys + Q4 values (sparse key distributions break low-bit quantization)
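For intuition, here is a minimal Python sketch of the per-block min-max scheme described above. The block size (128 in the post, smaller here) and helper names are illustrative; the real implementation packs nibbles in C.

```python
# Sketch of uniform 4-bit min-max quantization per block: each float is
# mapped to a 0..15 code plus a per-block (min, scale) pair.

def quantize_block(block):
    """Return 4-bit codes and the (min, scale) needed to reconstruct."""
    lo, hi = min(block), max(block)
    scale = (hi - lo) / 15 or 1.0          # 15 steps between min and max
    codes = [round((x - lo) / scale) for x in block]
    return codes, lo, scale

def dequantize_block(codes, lo, scale):
    return [lo + c * scale for c in codes]

block = [0.12, -0.5, 0.33, 0.9, -0.1, 0.0, 0.45, -0.77]
codes, lo, scale = quantize_block(block)
recon = dequantize_block(codes, lo, scale)
max_err = max(abs(a - b) for a, b in zip(block, recon))
```

The rounding error is bounded by half a quantization step, which is why narrow per-block ranges keep quality loss small.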
Quality (WikiText-2 PPL, SmolLM2 1.7B):
- FP32 baseline: 14.63
- 4-bit K + Q4 V: 14.57 (+0.0%)
- Delta 3-bit K + Q4 V: 14.82 (+1.3%)
- vs llama.cpp Q4 KV: llama.cpp Q4_0 KV gives PPL +10.6%; quant.cpp gives +0.0%. Same bit budget, 10x less degradation.
Code philosophy: 67K lines of C11. No frameworks, no CUDA required. The full forward pass fits in one file. Ships as a single-header quant.h (15K LOC) you can drop into any C project.
Supported models: Llama 3.2, Qwen 3.5, Gemma 3/4, MoE (128 experts).
./quant model.gguf -p "hello" -k uniform_4b -v q4 # that's it
Feedback welcome. Particularly interested in: (1) what context length you'd need for your use case, (2) which models to prioritize next.
Anticipated questions:
"Why not just use llama.cpp?" — llama.cpp is fast. quant.cpp goes further. Same model, same hardware: llama.cpp runs out of memory at 8K context, quant.cpp keeps going to 30K. Different tools for different problems.
"How does delta compression work?" — Adjacent key vectors differ by ~30% of their range. Instead of storing absolute 3-bit keys (PPL +62%), we store 3-bit deltas (PPL +1.3%). Every 64 tokens, an FP32 I-frame prevents drift. Same idea as video compression.
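The delta/I-frame idea above can be sketched in a few lines. This is a toy scalar version under stated assumptions: the I-frame interval (64) matches the post, but the step size and plain-float quantizer are illustrative, not the repo's 3-bit packing.

```python
I_FRAME = 64          # full-precision anchor every 64 tokens, per the post

def encode(keys, step=0.05):
    """Store quantized deltas against the *reconstructed* previous value,
    so quantization error never accumulates past half a step."""
    out, recon = [], 0.0
    for t, k in enumerate(keys):
        if t % I_FRAME == 0:
            out.append(("abs", k))            # "I-frame": absolute value
            recon = k
        else:
            q = round((k - recon) / step)     # 3-bit integer in the real thing
            out.append(("delta", q))
            recon += q * step
    return out

def decode(stream, step=0.05):
    vals = []
    for kind, v in stream:
        vals.append(v if kind == "abs" else vals[-1] + v * step)
    return vals

keys = [0.1 * (t % 7) for t in range(200)]    # toy key scalars
recon = decode(encode(keys))
err = max(abs(a - b) for a, b in zip(keys, recon))
```

Deltas are taken against the decoder's reconstruction rather than the true previous key; that, plus the periodic absolute anchors, is what prevents drift.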
"What about GPU?" — Metal shaders included and working. But the bigger win is memory: GPU VRAM is even more constrained than system RAM, making KV compression even more valuable there.
"Single header?" — Yes, quant.h is 15K lines. #define QUANT_IMPLEMENTATION in one .c file and compile with cc app.c -lm -lpthread. Full GGUF loading, tokenization, and inference. No cmake, no build system.

A diabolical new version of Poison Fountain is up and running. More difficult to filter and more damaging. As usual, no action is required from proxy operators.
Metalhead (Black Mirror)
Gemma 4 31B is sweeping the floor with GLM 5.1
I've been using both side by side this evening while working on a project. Basically I'd paste a chunk of creative text into chat and tell it to dismantle it thesis by thesis, then I'd check whether the criticism was actually sound and submit the next iteration of the file, incorporating my solutions to the criticism. Then move on to the next segment, next file, repeat ad infinitum.
What I found is that Gemma 4 31B keeps track of the important points very cleanly and maintains an unbiased approach over more subsequent turns. GLM basically turns into a yes-man immediately: "Woah! Such a genius solution! You really did it! This is so much better, production ready! Poosh-poosh!" Gemma can take at least 3-4 rounds of back and forth, stay constructive, and tell you outright if you just sidestepped the problem instead of presenting a valid counterargument. Not as bluntly and unapologetically as it could have, but compared to GLM, oof, I'll take it.
Along the way it also proposed some suggestions that seemed really efficient, if not out of the box. Example: say you've got 4 "actors" that need to interact dynamically in a predictable, logical way. Instead of creating a 4x4 boolean yes/no-gate matrix where the system checks who-"yes"-who and who-"no"-who, you condense it into 6 vectors, each carrying an instruction for which type of interaction plays out when the linked pair is called. It's actually a really simple and even obvious optimization, but GLM never considered it for some reason until I just told it. Okay, don't take this as proof of some moronic point; it's just the specific example I experienced.
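The pairwise trick in that anecdote is just the observation that 4 actors have C(4,2) = 6 unordered pairs, versus 16 matrix cells. A tiny sketch (actor names and interaction labels are made up for illustration):

```python
from itertools import combinations

actors = ["A", "B", "C", "D"]

# One rule per unordered pair instead of a 4x4 yes/no matrix.
interactions = {frozenset(p): "ignore" for p in combinations(actors, 2)}
interactions[frozenset(("A", "C"))] = "trade"   # override one pair's behavior

def interaction(x, y):
    """Symmetric lookup: interaction(x, y) == interaction(y, x)."""
    return interactions[frozenset((x, y))]
```

Using frozenset keys makes the symmetry structural, so there is no duplicated who-yes-who bookkeeping to keep consistent.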
Gemma sometimes did not even use thinking. It just gave a straight response, and it was still statistically more useful than the average GLM response.
GLM would always think for a thousand or two tokens, even if the actual response was like 300, all to say "all good bossmang!"
It also seemed like Gemma was more confident at retrieving/recreating stuff from way earlier in the conversation: rewriting whole pages of text exactly one-to-one on demand, or incorporating a bit from one point in the chat into a passage from a different point, without a detailed explanation of which exact snippets I meant. I caught GLM just hallucinating certain parts instead. Well, the token meter probably never went above ~30k, so I dunno if that's really impressive by today's standards or not.
On average I'd say GLM wasted about 60% of my requests by returning useless or worthless output. With Gemma 4 it felt like only 30% of the time it went nowhere. But the share of "amazing" responses, a completely made-up metric of mine, was roughly the same at maybe 10%. Anyway, what I'm getting at is: Gemma 4 is far from a perfect model, that's still a fantasy, but for a literally 30B-bracket model to feel so much more useful than GLM's flagship surprised the hell out of me.
A big milestone for local inference.

Built a zero-allocation, header-only C++ Qwen tokenizer that is nearly 20x faster than OpenAI's tiktoken
I'm into HPC and static, zero-allocation, zero-dependency C++ software. I was studying how BPE tokenizers work, so I decided to build this project: a hardcoded Qwen tokenizer for LLM developers.
I know the tokenization phase of LLM inference accounts for less than 2% of total time, so it's practically negligible, but I just love this kind of programming; it's an educational project for me to learn and build some intuition.
Surprisingly after combining multiple different optimization techniques, it scored really high numbers in benchmarks. I thought it was a fluke at first, tried different tests, and so far it completely holds up.
On a 12-thread Ryzen 5 3600 desktop CPU with a 1 GB English text corpus:
- My Frokenizer: 1009 MB/s
- OpenAI Tiktoken: ~ 50 MB/s
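For readers new to BPE, here is a toy greedy encoder (not Frokenizer's algorithm, which is heavily optimized C++; the merge table is invented for the example): repeatedly apply the earliest-learned merge that matches an adjacent pair.

```python
# Rank = order the merge was learned; lower rank merges apply first.
merges = {("h", "e"): 0, ("l", "l"): 1, ("he", "ll"): 2, ("hell", "o"): 3}

def bpe(word):
    toks = list(word)
    while True:
        # Collect every applicable merge with its rank and position.
        pairs = [(merges[p], i) for i, p in enumerate(zip(toks, toks[1:]))
                 if p in merges]
        if not pairs:
            return toks                       # no learned merge applies
        _, i = min(pairs)                     # lowest rank wins
        toks = toks[:i] + [toks[i] + toks[i + 1]] + toks[i + 2:]
```

The fast implementations avoid this quadratic rescan with priority queues, precomputed automata, or regex pre-splitting; the semantics are the same.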
For code, tests and benchmarking:
https://github.com/yassa9/frokenizer
Qwen 3.5 distilled Opus 4.6 2B, offline on my Samsung laptop in battery mode, with decent performance and quality in a self-designed chat interface generating a short document
Crap computer with DDR2 + external Nvidia R9 GPU? Slower, but can one make it work?
Hey all, I know what I'm about to say may be laughable and unideal, but is there a way to make this work? I like local but can't afford a big-budget local AI setup. Can I just put an Nvidia R9 in an external GPU case (with PSU), plug it into an old computer, and run a slow Ollama server? It doesn't have much RAM, like 8 or 16 GB of slow DDR, but can I make it use swap space or something for big code ingestions? I don't mind waiting hours for results; I just don't want to deal with these model quotas when coding. I tried searching for this use case in the sub but can't find a clear answer.

I built a pytest-style framework for AI agent tool chains (no LLM calls)
Built ToolGuard - a deterministic testing and reliability runtime layer for AI tool execution.
I kept running into the same issue: my agents weren't failing because of poor reasoning, but because of execution layer crashes—bad JSON, missing fields, wrong types, etc. Existing eval tools didn't really help here and were too slow/expensive.
Instead of calling an LLM, ToolGuard parses your Pydantic schemas/type hints and programmatically injects 40+ hallucination edge cases (nulls, schema mismatches, malformed payloads) directly into your Python functions to prove exactly where things will break in production. It runs locally in <1 second and costs $0.
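This is not ToolGuard's actual code, but the general idea (reading a function's type hints and feeding it malformed payloads to find crashes before any LLM is involved) can be sketched like this; the tool function and fault list are invented for illustration:

```python
from typing import get_type_hints

# A sample of "hallucination" edge cases an agent might pass a tool.
FAULTS = [None, "", "not-a-number", [], {}, -1, 10**18]

def fuzz_tool(fn):
    """Call fn once per fault per parameter; collect (param, fault, error)."""
    hints = {k: v for k, v in get_type_hints(fn).items() if k != "return"}
    failures = []
    for param in hints:
        for fault in FAULTS:
            # Valid-ish filler (1) everywhere except the parameter under test.
            kwargs = {p: (fault if p == param else 1) for p in hints}
            try:
                fn(**kwargs)
            except Exception as e:
                failures.append((param, fault, type(e).__name__))
    return failures

def get_invoice(user_id: int, amount: int) -> str:   # hypothetical tool
    return f"invoice-{user_id:04d}-{amount}"

crashes = fuzz_tool(get_invoice)
```

Every entry in `crashes` is a concrete input that would have crashed the execution layer in production, found deterministically and for free.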
I just pushed the v1.2.0 Enterprise Update which adds:
- Local Crash Replay: when an agent crashes in production or testing, it automatically dumps a structured .json payload. Run toolguard replay <file.json> and it pipes the exact crashing state back into your local Python function so you can see the stack trace locally.
- Edge-Case Coverage Metrics: the terminal now generates pytest-style coverage metrics, telling you exactly which of the 8 hallucination vectors your code is still vulnerable to (e.g., Coverage: 25% | Untested: array_overflow, null_injection).
- Live Textual Dashboard: passing --dashboard opens a dark-mode terminal UI that streams concurrent fuzzing results and tracks crashes in real time.
- Framework Integrations: works out of the box with the live PyPI implementations of LangChain (@tool), CrewAI, Microsoft AutoGen, OpenAI Swarm, LlamaIndex, FastAPI (middleware), and the Vercel AI SDK.
- CI/CD PR Bot & Webhooks: comments directly on GitHub PRs to block fragile agent code from merging, and intercepts production crashes with low-latency alerts to Slack/Datadog.
Would love feedback on the approach, especially from people building multi-step agent systems!
What do you wish local AI on phones could do, but still can’t?
I’m less interested in what already works, and more in what still feels missing.
I'm working on a mobile app with local AI that provides not just chatbot features but real use cases, and I really need your thoughts!
A lot of mobile local AI right now feels like "look, it runs" or "here's an offline chatbot," but I'm curious where people still feel the gap is.
What do you wish local AI on phones could do really well, but still can’t?
Could be anything:
- something you’ve tried to do and current apps are too clunky for
- something that would make local AI genuinely better than cloud for you
- some super specific niche use case that no one has nailed yet
Basically, what’s the missing piece?
What’s the thing where, if someone built it properly, you’d actually use it all the time?


Built a CLI AI security tool in Python using Ollama as the LLM backend — agentic loop lets the AI request its own tool runs mid-analysis
The interesting part technically: the AI can write [TOOL: nmap -sV x.x.x.x] or [SEARCH: CVE-2024-xxxx] in its response, and the Python CLI intercepts these tags, runs the actual commands, and feeds the results back into the next prompt, up to 6 rounds per session.
Totally OSS, no API key, just a fine-tuned LLM backend.
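The interception loop described above is straightforward to sketch. This is a hypothetical reimplementation, not the tool's own code: the real thing shells out to nmap etc., while here the runner and model are stubbed so the example is self-contained.

```python
import re

TAG = re.compile(r"\[(TOOL|SEARCH): ([^\]]+)\]")
MAX_ROUNDS = 6                                # cap from the post

def run_tool(kind, arg):                      # stub for subprocess / web search
    return f"<{kind.lower()} result for {arg!r}>"

def agent_loop(model, prompt):
    for _ in range(MAX_ROUNDS):
        reply = model(prompt)
        tags = TAG.findall(reply)
        if not tags:                          # no tool request: final answer
            return reply
        results = [run_tool(k, a) for k, a in tags]
        # Feed tool output back into the next prompt.
        prompt = reply + "\n\nTool output:\n" + "\n".join(results)
    return reply

# Toy model: requests one scan, then answers once it sees tool output.
def toy_model(prompt):
    if "Tool output" not in prompt:
        return "Let me check. [TOOL: nmap -sV 10.0.0.1]"
    return "Port 22 open, per the scan."
```

The round cap matters: without it, a model that keeps emitting tags would loop forever and rack up tool executions.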
Installing for the first time! Quick question about picking the best distribution for my situation
I built a pretty beefy machine to do 2 things on:
- Training an LLM
- Playing games on Steam
My hardware is:
- Processor - AMD Ryzen 9 9900X 4.4 GHz 12-Core
- Motherboard - ASRock B850I Lightning Mini ITX
- Ram - Corsair Vengeance 64 GB (2 x 32 GB) DDR5-5200 CL40
- Storage - Western Digital WD_Black SN850X 2TB
- GPU - Gigabyte WINDFORCE SFF GeForce RTX 5070 Ti 16GB
I'm coming from Windows, so I'd like the transition to be as seamless as possible. From what I've read, the best pick for my use cases would be Kubuntu 24.04 LTS.
Anyone disagree? Thanks in advance ^^
Any downside of a local LLM over one of the web ones?
I ran into a limit on Claude and thought it was dumb. I have an M1 16GB Mini and am looking to run something locally. Would my machine be too slow? Would I run into any potential issues? I'm not a crazy user by any means, mostly exploring; I have some use cases, but nothing that needs to run 24/7 or anything. Though it would be nice to give it a research task to run overnight.
[P] GPU friendly lossless 12-bit BF16 format with 0.03% escape rate and 1 integer ADD decode works for AMD & NVIDIA
Hi everyone, I'm from Australia :) I just released a new research prototype.
It’s a lossless BF16 compression format that stores weights in 12 bits by replacing the 8-bit exponent with a 4-bit group code.
For 99.97% of weights, decoding is just one integer ADD.
Byte-aligned split storage: true 12-bit per weight, no 16-bit padding waste, and zero HBM read amplification.
Yes, 12-bit, not 11-bit! The main idea was not just to compress weights more, but to make the format GPU-friendly enough to use directly during inference:
- sign + mantissa: exactly 1 byte per element
- group code: two 4-bit nibbles packed into exactly 1 byte, shared across two elements
- 1.33x smaller than BF16
- Fixed-rate 12-bit per weight, no entropy coding
- Zero precision loss bit-perfect reconstruction
- Fused decode + matmul, so there is effectively no separate decompression stage
- Byte-aligned storage, no LUT, no bitstream parsing
- Works on both NVIDIA and AMD
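To make the "one integer ADD" claim concrete, here is an illustrative sketch of the idea (not the repo's exact layout): BF16 exponents in a weight tensor cluster tightly, so you can store `exp - base` in 4 bits and recover the exponent with a single add; the rare values outside the 16-value window are the "escapes".

```python
import struct

def bf16_fields(x):
    """Split a float into BF16 sign, 8-bit exponent, top-7 mantissa bits."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0] >> 16   # top 16 bits
    return bits >> 15, (bits >> 7) & 0xFF, bits & 0x7F

def encode(weights):
    exps = [bf16_fields(w)[1] for w in weights]
    base = min(exps)                           # per-group base exponent
    packed, escapes = [], []
    for i, w in enumerate(weights):
        s, e, m = bf16_fields(w)
        if e - base > 15:
            escapes.append((i, (s, e, m)))     # rare outlier: keep full 16 bits
        else:
            packed.append((s, e - base, m))    # 1 + 4 + 7 = 12 bits
    return base, packed, escapes

def decode_exp(code, base):
    return code + base                         # the "one integer ADD"

weights = [0.5, 0.25, -0.125, 0.75, 1.5]
base, packed, escapes = encode(weights)
```

The 0.03% escape rates reported above say exactly how often the 4-bit window fails, which is why the fixed-rate 12-bit path dominates in practice.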
Some results so far:
Single-user (B=1), RTX 5070 Ti
- Llama 2 7B: 64.7 tok/s (1.47x vs vLLM)
- Mistral 7B: 60.0 tok/s (1.10x vs vLLM)
- Llama 3.1 8B: 57.0 tok/s (vLLM OOM on 16 GB)
Multi-user (B=256), total tok/s
- Llama 2 7B: 2931 vs 1086 in vLLM (2.70x)
- Mistral 7B: 2554 vs 872 in vLLM (2.93x)
It also seems surprisingly stable across model types:
- Llama 3.1 405B: 0.034% escape rate
- Mixtral 8x7B: 0.050%
- SDXL UNet: 0.233%
- CogVideoX 2B: 0.128%
So far this is tested on BF16 safetensors only.
Repo: https://github.com/cenconq25/Turbo-Lossless
Also worth noting: the V3 fused decode+GEMM kernel uses tensor-core patterns inspired by ZipServ / ZipGEMM (Fan et al., ASPLOS 2026).
Happy to hear criticism, edge cases, or reasons this idea won’t scale.
Thanks for your time : )

Gemma 4 is matching GPT-5.1 on MMLU-Pro and within Elo. What are we even paying for anymore?
Everyone knows by now that Google just dropped Gemma 4, and I had to double-check the numbers a few times because they're insane.
31B params, and it runs on a single GPU. And it's putting up these numbers across benchmarks:
- Arena Elo ~1452 (GPT-5.1 ~1475, basically same tier)
- MMLU-Pro 85.2% (slightly higher than GPT-5.1)
- GPQA Diamond 84.3% (a bit behind but close enough)
I mean, this is not some massive cluster model; you can run this locally. How??
Not long ago, open models were clearly a step behind. Now you're looking at something you can download and run yourself sitting right next to a $200/month flagship on the benchmarks that matter for general reasoning.
The only place there's still a noticeable gap is coding-heavy stuff like SWE-bench, but everything else feels… uncomfortably close.
If that's the new trend, I'm curious how long the big labs can hold onto their current valuations.
Best models to tune with GRPO for my use case?
I'm working on a project where I'll be fine-tuning LLMs with GRPO on a 170K-sample dataset for explainable LJP (legal judgment prediction, where the model predicts case outcomes and generates step-by-step reasoning citing the facts). I'm considering models like GPT OSS 20B or Qwen 3.5 27B, with a slight preference for Qwen 3.5 27B because of its strong reasoning capabilities.
I recently obtained a 96GB VRAM workstation (RTX PRO 6000) to handle the RL rollouts, which should give some solid headroom for larger models.
What are your recommendations for the best open-source models for GRPO fine-tuning in 2026? Any advice on structuring explainable LJP rewards would also be appreciated.
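On reward structure: one way people shape rewards for tasks like this is to combine verdict correctness with how well the reasoning cites the case facts. The sketch below is purely hypothetical (the weights and string matching are placeholders, not an established recipe) but shows the kind of scalar reward a GRPO trainer consumes per completion.

```python
def ljp_reward(completion, gold_verdict, case_facts):
    """Toy reward: 1.0 for the right verdict, up to 0.5 for citing facts."""
    verdict_ok = gold_verdict.lower() in completion.lower()
    cited = sum(f.lower() in completion.lower() for f in case_facts)
    grounding = cited / len(case_facts) if case_facts else 0.0
    return 1.0 * verdict_ok + 0.5 * grounding   # weights are arbitrary

r = ljp_reward("Verdict: guilty, because the defendant fled the scene.",
               "guilty", ["fled the scene", "no alibi"])
```

In practice you would replace substring matching with a proper fact-entailment check, but splitting the reward into outcome plus grounding terms is the key design choice for explainable predictions.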
Thanks!
What are some good uses for local LLMs? Say I can do <=32B params.
What are you using them for?
Omnidex - simple multi-agent POC
Built a weekend project called Omnidex, a local multi-agent LLM runner.
In this demo, 3 agents work together:
Orchestrator: decides which agent to call
Research Agent: summarizes papers + saves outputs
Chat Agent: handles general queries
No hardcoded routing. The orchestrator decides based on a heuristic routing system. Running fully local on Gemma 4 (2B).
Some takeaways:
Local LLMs can make education accessible offline (no internet needed)
Agent systems are more heuristic than deterministic, very different way of building software
Feels like the future is building tools, then letting agents use them (instead of hardcoding flows)
What is the threshold where local llm is no longer viable for coding?
I have read many of the posts in this subreddit on this subject but I have a personal perspective that leads me to ask this question again.
I am a sysadmin professionally, with only limited scripting experience in that domain. However, I've recently realized what Claude Code allows me to do in terms of generating much more advanced code as an amateur. My assumption is that we are in a loss-leader phase and this service will not be available at $20/mo forever. So I'm curious whether there's any point in exploring if smallish local models can meet my very introductory needs in this area, or if that would simply be disappointing and a waste of money on hardware.
Specifically, my expertise is limited to things like creating scrapers and similar tools to collect and record information from various sources on events like sports, arts, music, and food, then using an LLM to infer whether to notify me based on a preference system built for this purpose. Who knows what I might want to build in the future, but that's where I'm starting, and I'm assuming it's a basic difficulty level.
Using local models able to run on 64G of VRAM/Unified, would I be able to generate this code somewhat similarly to how well I can using Claude Code now or is this completely unrealistic?