
Netflix just dropped their first public model on Hugging Face: VOID (Video Object and Interaction Deletion).
Hugging Face netflix/void-model: https://huggingface.co/netflix/void-model
Project page - GitHub: https://github.com/Netflix/void-model


Been playing with the new Gemma 4 models. It's amazing, great even, but boy did it make me appreciate the level of quality the Qwen team produced. I'm also able to run much larger context windows on my standard consumer hardware.



I am still processing this lol.
I had Gemini 3 Pro Deepthink try to solve a complex security puzzle (which was secretly an unwinnable paradox). It spit out this incredibly professional-looking, highly structured answer after about 15 minutes of reasoning. Just for fun, I passed its solution over to Gemma 4 (31B) (with tools enabled).
Gemma completely tore it apart. It caught a hard physical constraint violation and a fake math equation that Gemini tried to sneak by me to force the answer. It explicitly called out the fatal logic flaw and told Gemini it was "blinded by the professionalism of the output." Brutal.
The craziest part? I fed the 31B's arguments back to Deepthink... and it immediately folded, acknowledging that its internal verification failed and its logic was broken.
I've attached the HTML log so you guys can read the whole debate. The fact that a 31B open-weight model can perform an agentic peer-review and bully a frontier MoE model into submission is insane to me. Check out the file.
TIL: Bigger model isn't smarter... Well, at least not all the time.

I mean, I have 40GB of VRAM and I still cannot fit the entire Unsloth Gemma-4-31B-it-UD-Q8 (35GB) even at 2K context size unless I quantize the KV cache to Q4. WTF? For comparison, I can fit the entire UD-Q8 Qwen3.5-27B at full context without KV quantization!
If I have to run a Q4 Gemma-4-31B-it-UD with a Q8 KV cache, then I am better off just using Qwen3.5-27B. After all, the latter beats the former in basically all benchmarks.
What's your experience with the Gemma-4 models so far?

Another day another git pull

Hey r/LocalLLaMA,
A few of you asked about my hardware setup in my previous post. I promised photos and details. Here's the full story of how a tiny MiniPC ended up with 120 GB VRAM across 4 GPUs — and the frustrating journey to get there. (Of course we love to fool ourselves with those numbers — nvidia-smi says ~115 GB usable. The other 5 GB? CUDA overhead. Gone. Poof.)
TL;DR: AOOSTAR GEM 10 Pro Max MiniPC, 3x Tesla P40 (24 GB each) + 1x Quadro RTX 8000 (48 GB) = ~120 GB VRAM (~115 GB usable). Runs 235B parameter models fully GPU-resident, 24/7, at ~60W idle. Cost me way too many evenings and one ruined fan grille.
I originally bought it as a simple home server. Then I discovered that you can hang GPUs off it. That's where things got out of hand.
Before buying anything, I asked AOOSTAR support if the GEM 10 could drive two eGPU adapters simultaneously via OCuLink + USB4. They confirmed it, so I went ahead and bought the AG01 (OCuLink) + AG02 (USB4) together with two Tesla P40s. Plugged them in — both worked immediately. 48 GB total VRAM from day one. The MiniPC handles both OCuLink and USB4 simultaneously — they don't share lanes.
Now I could run 80B MoE models. I thought "this is great, I'm done."
I was not done.
This is where it gets creative. I bought an M.2-to-OCuLink adapter, opened up the MiniPC, plugged it into one of the two free M.2 slots. Then I realized I needed to get the OCuLink cable out of the case somehow.
Solution: I took a saw to the fan grille on the side panel. Cut a slot just wide enough for the cable. Not pretty, but it works. Connected another AG01 adapter with a third P40. 72 GB total.
I bought a Quadro RTX 8000 (48 GB) with the plan to eventually replace all P40s with RTX 8000s for maximum VRAM. The dream: 4x 48 GB = 192 GB.
First problem: The RTX 8000 would NOT work in the AG01 connected via the internal M.2-to-OCuLink adapter. It wouldn't even complete POST — just hung at the handshake. The P40s worked fine in the same slot. Tried different BIOS settings, tried the Smokeless BIOS tool to access hidden UEFI variables — nothing helped.
So I moved it to the AG02 (USB4). It worked there, but that meant I lost the opportunity to expand the system to four RTX 8000 in total. Days of frustration.
By chance I stumbled upon ReBarUEFI by xCuri0. The problem was that the GEM 10's BIOS doesn't expose Resizable BAR settings, and the RTX 8000 needs a BAR larger than the default 256 MB to work over OCuLink. The P40s are older and don't care.
ReBarState writes the BAR size directly into the UEFI NVRAM. I set it to 4 GB, rebooted — and suddenly the RTX 8000 worked over OCuLink. In the AG01, in the M.2-to-OCuLink adapter, everywhere. I nearly fell off my chair.
Big shout-out to AOOSTAR support — they were involved from day one. They confirmed dual-eGPU would work before I bought anything, said internal M.2-to-OCuLink should work in principle (it did), and confirmed "Above 4G Decoding" is enabled in the BIOS even though there's no visible toggle. Fast responses, honest answers. Can't complain.
With ReBAR sorted, I bought one more AG01 adapter and another M.2-to-OCuLink adapter (second sawed slot in the fan grille). Final configuration:
| GPU | VRAM | Connection | Adapter |
|---|---|---|---|
| Tesla P40 #1 | 24 GB | OCuLink (external port) | AG01 |
| Tesla P40 #2 | 24 GB | M.2 → OCuLink (internal, sawed grille) | AG01 |
| Tesla P40 #3 | 24 GB | M.2 → OCuLink (internal, sawed grille) | AG01 |
| RTX 8000 | 48 GB | USB4 (external port) | AG02 |
| Total | 120 GB (~115 GB usable) | | |
Each connection runs at PCIe x4 — not shared, not throttled. Measured and verified. It's not x16 server speed, but for LLM inference where you're mostly doing sequential matrix multiplications, it's absolutely fine.
The P40s and RTX 8000 are server/workstation cards: passively cooled, designed for chassis airflow that doesn't exist on an open shelf. So I designed and 3D-printed fan adapters (including a custom one for the RTX 8000) and mounted BFB1012HH fans on each card with a temperature-controlled fan controller. I initially tried higher-CFM fans of the same size (BFB1012VH), but they were unbearably loud and didn't actually cool any better. The BFB1012HH are the sweet spot: quiet enough to live with, even at full speed. Works great; even at 100% GPU load on a single card, nvidia-smi rarely shows temperatures above 50C. The eGPU adapters have small built-in fans, but I've rarely heard them spin up; they just pass through PCIe, not much to cool there.
| Component | Price | Source |
|---|---|---|
| AOOSTAR GEM 10 MiniPC | ~EUR450 | New (bought before the RAM price surge — should have gotten the 64GB version) |
| Tesla P40 #1 + #2 | ~EUR190 each | AliExpress (+ customs to EU) |
| Tesla P40 #3 | ~EUR200 | AliExpress (+ customs) |
| RTX 8000 | ~EUR1,200 | Used, Germany |
| AG01 eGPU adapter (x3) | ~EUR155 each | AOOSTAR |
| AG02 eGPU adapter (x1) | ~EUR210 | AOOSTAR |
| M.2-to-OCuLink adapters (x2, K49SQBK, PCIe 5.0, active chip) | ~EUR45-50 each + customs | AliExpress |
| BFB1012HH fans (x4) | ~EUR10 each | AliExpress |
| PWM fan controllers w/ temp probes (x4) | ~EUR10 each | AliExpress |
| 3D-printed fan adapters | Free (self-printed) | |
| Total | ~EUR3,200 | |
For ~EUR3,200 you get a 120 GB VRAM (~115 GB usable) inference server that runs 235B models 24/7 at ~60W idle. Not bad. The RTX 8000 is the big-ticket item; if you go all-P40 (4x 24 GB = 96 GB) you'd be under EUR2,000. Try to match that with a proper server rack.
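The per-item prices above roughly reconcile with the quoted total; a quick back-of-envelope check (prices as listed, customs and shipping excluded):

```python
# Rough cost check from the price table above (EUR, approximate, before customs).
parts = {
    "GEM 10 MiniPC": 450,
    "Tesla P40 x3": 190 + 190 + 200,
    "RTX 8000": 1200,
    "AG01 x3": 155 * 3,
    "AG02": 210,
    "M.2-to-OCuLink x2": 48 * 2,
    "BFB1012HH fans x4": 10 * 4,
    "fan controllers x4": 10 * 4,
}
total = sum(parts.values())
print(total)  # 3081 -> customs and extras land it near the quoted ~EUR3,200
```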
All running through llama.cpp via llama-swap with Direct-IO and flash attention. Model swaps take ~20-30 seconds thanks to Direct-IO memory mapping.
| Model | Size | Quant | GPUs | Tensor Split | Context | KV Cache | TG tok/s |
|---|---|---|---|---|---|---|---|
| Qwen3-4B Instruct | 4B | Q8_0 | 1 (RTX 8000) | — | 262K | f16 | ~30 |
| Qwen3-14B Base | 14B | Q4_K_M | 1 (RTX 8000) | — | 41K | f16 | ~25 |
| Qwen3-30B-A3B Instruct | 30B MoE | Q8_0 | 2 | — | 262K | f16 | ~35 |
| Qwen3-VL-30B-A3B (Vision) | 30B MoE | Q8_0 | 2 | — | 262K | f16 | ~30 |
| GPT-OSS-120B-A5B | 120B MoE | Q8_K_XL | 2 | 2:1:1:1 | 131K | f16 | ~50 |
| Qwen3-Next-80B-A3B | 80B MoE | Q8_K_XL | 4 | 22:9:9:8 | 262K | f16 | ~35 |
| Qwen3.5-122B-A10B | 122B MoE | Q5_K_XL | 4 | 2:1:1:1 | 262K | f16 | ~20 |
| Nemotron-3-Super-120B | 120B NAS-MoE | Q5_K_XL | 4 | 2:1:1:1 | 874K | f16 | ~17 |
| Qwen3-235B-A22B Instruct | 235B MoE | Q3_K_XL | 4 | 2:1:1:1 | 112K | q8_0 | ~11 |
All models GPU-only (ngl=99), flash-attn, Direct-IO, mlock. Context sizes auto-calibrated by AIfred to maximize available VRAM. The 2:1:1:1 tensor split means RTX 8000 gets twice as many layers as each P40 (proportional to VRAM: 48:24:24:24). Qwen3-Next-80B uses a custom 22:9:9:8 split optimized by AIfred's calibration algorithm.
llama-swap handles model lifecycle — models auto-swap on request, Direct-IO makes loading near-instant (memory-mapped), full init ~20-30s.
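The proportional-split rule is easy to sketch. This little helper is my own illustration (not part of AIfred): it reduces per-GPU VRAM to the smallest integer ratio you could pass to llama.cpp's tensor split:

```python
from functools import reduce
from math import gcd

# Reduce per-GPU VRAM (GB) to the smallest integer ratio for a tensor split.
def tensor_split(vram_gb):
    g = reduce(gcd, vram_gb)
    return [v // g for v in vram_gb]

# RTX 8000 first, then the three P40s:
print(tensor_split([48, 24, 24, 24]))  # -> [2, 1, 1, 1]
```

Note that the 22:9:9:8 split for Qwen3-Next shows a calibrated split can deviate from pure VRAM proportionality when per-layer sizes vary.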
Next upgrade: If I can get another RTX 8000 at a reasonable price, I'll swap out a P40. The dream of 4x RTX 8000 = 192 GB VRAM is still alive — now that ReBAR is sorted, it's just a matter of finding the cards.
Frankenstein MiniPC — close-up of the MiniPC with OCuLink and USB4 cables, eGPU adapters
The MiniPC (bottom center) with OCuLink cables running to the AG01 adapters and USB4 to the AG02. Yes, those are two Ethernet cables (yellow) — one for LAN, one for direct point-to-point RPC to my dev machine.
The full setup — eGPU shelf of doom
The complete "server rack" — a wooden shelf with 3x AG01 + 1x AG02 eGPU adapters, each holding a GPU. The desk fan is for me, not the GPUs :-)
GitHub: https://github.com/Peuqui/AIfred-Intelligence-Legacy
All of this powers AIfred Intelligence — my self-hosted AI assistant with multi-agent debates, web research, voice cloning, and more. Previous posts: original | benchmarks
Now, if someone points out that for EUR3,200 you could have gotten a 128 GB unified memory MiniPC and called it a day — yeah, you're probably not wrong. But I didn't know from the start where this was going or how much it would end up costing. It just... escalated. One GPU became two, two became four, and suddenly I'm sawing fan grilles. That's how hobbies work, right? And honestly, the building was half the fun.
If you're thinking about a similar setup — feel free to ask. I've made all the mistakes so you don't have to :-)
Best, Peuqui

Just got Gemma 4 31B running at full 256K context on a single RTX 5090 using TurboQuant KV cache compression.
| Component | Spec |
|---|---|
| GPU | NVIDIA GeForce RTX 5090 (32GB VRAM) |
| CPU | AMD Ryzen 9 9950X3D (16-core) |
| RAM | 64GB DDR5 |
| OS | Windows 11 |
- Model: gemma-4-31B-it-UD-Q4_K_XL from Unsloth (17.46 GiB)
- Branch: feature/turboquant-kv-cache, merged with latest upstream master for Gemma 4 support
- KV cache: turbo3 (3-bit PolarQuant + Hadamard rotation, ~4.5x compression vs f16)
- Flags: --n-gpu-layers 99 --no-mmap --flash-attn on --cache-type-k turbo3 --cache-type-v turbo3

| Test | Speed (t/s) |
|---|---|
| pp4096 | 3,362.71 |
| pp16384 | 3,047.00 |
| pp65536 | 2,077.96 |
| pp131072 | 1,428.80 |
| pp262144 | 899.55 |
| tg128 | 61.51 |
256K full context fits on a single 5090 — The turbo3 KV cache compresses K/V from 8 bits to effectively 3 bits with near-zero quality loss (based on the TurboQuant paper, arXiv 2504.19874). Without it, 256K would be impossible on 32GB VRAM.
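The headroom math behind that claim, treating GB and GiB loosely and ignoring runtime overheads (my own back-of-envelope, not a measured figure):

```python
# 32 GB card minus 17.46 GiB of Q4_K_XL weights leaves the KV-cache budget.
headroom_gib = 32 - 17.46
# turbo3 is ~4.5x smaller than f16, so that budget holds ~4.5x more context
# than an f16 KV cache could:
f16_equivalent_gib = headroom_gib * 4.5
print(round(headroom_gib, 2), round(f16_equivalent_gib, 2))  # 14.54 65.43
```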
Prompt processing scales predictably — Roughly halving speed per 4x context increase due to O(n²) attention.
Token generation is constant — 61.5 t/s regardless of context length. Memory bandwidth bound.
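You can sanity-check the scaling claim against the pp table above; the "halving per 4x context" holds best at the larger contexts, where attention dominates over the linear-cost parts:

```python
# pp throughput (t/s) per context size, from the benchmark table above.
pp = {4096: 3362.71, 16384: 3047.00, 65536: 2077.96, 131072: 1428.80, 262144: 899.55}
for small, big in [(4096, 16384), (16384, 65536), (65536, 262144)]:
    # each step is a 4x (or 16x for the last pair 65536 -> 262144? no: 4x) jump
    print(f"{small}->{big}: {pp[small] / pp[big]:.2f}x slower")
# 1.10x, 1.47x, 2.31x: the slowdown per 4x step approaches ~2x as context grows
```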
Gemma 4 support required fixes — Had to fix an MSVC bug in llama.cpp where std::transform with (const bool*) fails to correctly read GGUF bool arrays beyond ~48 elements in Release builds. This breaks the SWA (sliding window attention) layer pattern for Gemma 4's hybrid attention architecture. Fix: replace with manual uint8_t* loop.
If you're building TheTom's TurboQuant fork on Windows:
- ggml-turbo-quant.c — add #define _USE_MATH_DEFINES before #include <math.h> (MSVC doesn't define M_PI by default)
- ggml-cpu/ops.cpp — add extern "C" int turbo3_cpu_wht_group_size; at file scope (C/C++ linkage mismatch)
- llama-model-loader.cpp — replace the std::transform((const bool*)...) in get_arr() with a manual uint8_t* loop (MSVC optimization bug with bool pointer casting)
- Build with -DBUILD_SHARED_LIBS=OFF to avoid DLL symbol export issues with the turbo globals
- Use -DCMAKE_CUDA_ARCHITECTURES=120a for RTX 5090 (sm_120a required for MXFP4 tensor core instructions)
I'm into HPC and static, zero-allocation, zero-dependency C++ software. I was studying BPE tokenizers and how they work, so I decided to build this project. I hardcoded the Qwen tokenizer for LLM developers.
I know the tokenization phase accounts for less than 2% of total LLM inference time, so it's practically negligible, but I just "love" this kind of programming. It's an educational project for me to learn and build some intuition.
Surprisingly, after combining multiple optimization techniques, it scored really high numbers in benchmarks. I thought it was a fluke at first, but I tried different tests and so far it completely holds up.
For a 12 threads Ryzen 5 3600 desktop CPU, 1 GB of English Text Corpus:
- My Frokenizer: 1009 MB/s
- OpenAI Tiktoken: ~ 50 MB/s
For code, tests and benchmarking:
https://github.com/yassa9/frokenizer
idk but this thing feels like magic in the palm of my hands. I'm running it on my Pixel 10 Pro with Google's AI Edge Gallery. The phone is only using CPU acceleration for some reason, so the E4B version felt a little too slow. With the E2B, however, it runs perfectly: faster than I can read and follow along, and the app has some function calling. I run it at the max 32K context and switch thinking on and off when I need to.
It seems ridiculously intelligent. Feels like a 7B model.
I'm sure there is some recency bias here. But just having it run at this speed on my phone, with its intelligence, feels special.
Are you guys having a good experience with the E models?
Waiting for artificialanalysis to produce intelligence index, but I see it's good. Gemma 26b a4b is the same speed on Mac Studio M1 Ultra as Qwen3.5 35b a3b (~1000pp, ~60tg at 20k context length, llama.cpp). And in my short test, it behaves way, way better than Qwen, not even close. Chain of thoughts on Gemma is concise, helpful and coherent while Qwen does a lot of inner-gaslighting, and also loops a lot on default settings. Visual understanding is very good, and multilingual seems good as well. Tested Q4_K_XL on both.
I wonder if mlx-vlm properly handles prompt caching for Gemma (it doesn't work for Qwen 3.5).
Too bad its KV cache is gonna be monstrous, as it doesn't implement many tricks to reduce it; hopefully TurboQuant will help with that soon. [edit] SWA gives some benefit, so the KV cache is not as bad as I thought: people report that the full 260K tokens @ fp16 is about 22GB of VRAM (for the KV cache alone; the quantized model is another ~18GB @ Q4_K_XL). It's much less compact than in Qwen3.5 or Nemotron, but I can't say they did nothing to reduce the KV cache footprint.
I expect censorship to be dogshit, I saw that e4b loves to refuse any and all medical advice. Maybe good prompting will mitigate that as "heretic" and "abliterated" versions seem to damage performance in many cases.
No formatting because this is handwritten by a human for a change.
[edit] Worth noting that Google's AI Studio version of Gemma 26b a4b is very bad. It underperforms my GGUF, with tokenizer issues :)

Hi! Just checking, am I the only one who has serious issues with Gemma 4 locally?
I've played around with Gemma 4 using Unsloth quants on llama.cpp, and it's seriously broken. I'm using the latest changes from llama.cpp, along with the recommended temperature, top-p, and top-k.
Giving it an article and asking it to list all typos along with the corrected version gives total nonsense. Here is a random news article I tested it with: https://www.bbc.com/news/articles/ce843ge47z4o
I've tried the 26B MoE, I've tried the 31B, and I've tried UD-Q8_K_XL, Q8_0, and UD-Q4_K_XL. They all have the same issue.
As a control, I tested the same thing in Google AI Studio, and there the models work great, finding actual typos instead of the nonsense I get locally.

My first Gemma 4 uncensors are out. Two models dropping today, the E4B (4B) and E2B (2B). Both Aggressive variants, both fully multimodal.
Aggressive means no refusals. I don't do any personality changes or alterations. The ORIGINAL Google release, just uncensored.
Gemma 4 E4B (4B): https://huggingface.co/HauhauCS/Gemma-4-E4B-Uncensored-HauhauCS-Aggressive
Gemma 4 E2B (2B): https://huggingface.co/HauhauCS/Gemma-4-E2B-Uncensored-HauhauCS-Aggressive
0/465 refusals* on both. Fully unlocked with zero capability loss.
These are natively multimodal so text, image, video, and audio all in one model. The mmproj file is included for vision/audio support.
What's included:
E4B: Q8_K_P, Q6_K_P, Q5_K_P, Q5_K_M, Q4_K_P, Q4_K_M, IQ4_XS, Q3_K_P, Q3_K_M, IQ3_M, Q2_K_P + mmproj
E2B: Q8_K_P, Q6_K_P, Q5_K_P, Q4_K_P, Q3_K_P, IQ3_M, Q2_K_P + mmproj
All quants generated with imatrix. K_P quants use model-specific analysis to preserve quality where it matters most, effectively 1-2 quant levels better at only ~5-15% larger file size. Fully compatible with llama.cpp, LM Studio, or anything that reads GGUF (Ollama might need tweaking by the user).
Quick specs (both models):
- 42 layers (E4B) / 35 layers (E2B)
- Mixed sliding window + full attention
- 131K native context
- Natively multimodal (text, image, video, audio)
- KV shared layers for memory efficiency
Sampling from Google: temp=1.0, top_p=0.95, top_k=64. Use --jinja flag with llama.cpp.
Note: HuggingFace's hardware compatibility widget doesn't recognize K_P quants so click "View +X variants" or go to Files and versions to see all downloads. K_P showing "?" in LM Studio is cosmetic only, model loads fine.
Coming up next: Gemma 4 E31B (dense) and E26B-A4B (MoE). Working on those now and will release them as soon as I'm satisfied with the quality. The small models were straightforward, the big ones need more attention.
*Google is now using techniques similar to NVIDIA's GenRM, generative reward models that act as internal critics, making true, complete uncensoring an increasingly challenging field. These models didn't get as much manual testing time at longer context as my other releases. I expect 99.999% of users won't hit edge cases, but the asterisk is there for honesty. Also: the E2B is a 2B model. Temper expectations accordingly, it's impressive for its size but don't expect it to rival anything above 7B.
All my models: HuggingFace-HauhauCS
As a side note, I'm currently working on a very cool project, which I will resume as soon as I publish the other two Gemma models.
Just tested Gemma 4 2B locally on old rtx2060 6GB VRAM and used Qwen3.5 in all sizes intensively, in customer projects before.
First impression from Gemma 4 2B: It's better, faster, uses less memory than q3.5 2B. More agentic, better mermaid charts, better chat output, better structured output.
It seems like either the q3.5 models are benchmaxed (although they really were much better than the competition) or Google is playing it down. Gemma 4 2B "seems" / "feels" more like Q3.5 9B to me.

TLDR: add -np 1 to your llama.cpp launch command if you are the only user, cuts SWA cache VRAM by 3x instantly
So I was messing around with Gemma 4 and noticed the dense model hogs a massive chunk of VRAM before you even start generating anything. If you are on 16GB you might be hitting OOM and wondering why.
The culprit is the SWA (Sliding Window Attention) KV cache. It allocates in F16 and does not get quantized like the rest of the KV cache. A couple days ago ggerganov merged a PR that accidentally made this worse by keeping the SWA portion unquantized even when you have KV cache quantization enabled. It got reverted about 2 hours later here https://github.com/ggml-org/llama.cpp/pull/21332 so make sure you are on a recent build.
A few things that actually help with VRAM:
The SWA cache size is calculated as roughly (sliding window size × number of parallel sequences) + micro-batch size. So if your server is defaulting to 4 parallel slots, you are paying 3x the memory compared to a single-user setup. Adding -np 1 to your launch command if you are just chatting solo cuts the SWA cache from around 900MB down to about 300MB on the 26B model, and from 3200MB to just 1200MB for the 31B dense model.
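To make the formula concrete, here is a rough size model. The layer/head/dimension numbers below are illustrative assumptions, not Gemma 4's actual config, so only the scaling with -np matters here:

```python
# SWA KV-cache tokens per the formula above: window * n_parallel + ubatch.
def swa_cache_mib(window, n_parallel, ubatch, n_swa_layers, n_kv_heads, head_dim):
    tokens = window * n_parallel + ubatch
    # K and V planes, f16 = 2 bytes per element
    return tokens * n_swa_layers * n_kv_heads * head_dim * 2 * 2 / 2**20

# Illustrative config; compare the default 4 slots vs -np 1:
print(round(swa_cache_mib(4096, 4, 512, 24, 8, 128)))  # 1584
print(round(swa_cache_mib(4096, 1, 512, 24, 8, 128)))  # 432, roughly the 3x saving
```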
Also watch out for -ub (ubatch size). The default is 512 and that is fine. If you or some guide told you to set -ub 4096 for speed, that bloats the SWA buffer massively. Just leave it at default unless you have VRAM to burn.
On 16GB with the dense 31B model you can still run decent context with IQ3 or Q3_K quantization, but you will likely need to drop the mmproj (vision) to fit 30K+ context (fp16 KV). With -np 1 and the default ubatch it becomes much more manageable.

Ran a quick inference sweep on gemma 4 31B in NVFP4 (using nvidia/Gemma-4-31B-IT-NVFP4). The NVFP4 checkpoint is 32GB, half of the BF16 size from google (63GB), likely a mix of BF16 and FP4 roughly equal to FP8 in size. This model uses a ton of VRAM for kv cache. I dropped the kv cache precision to FP8.
All numbers are steady-state averages under sustained load using locust and numbers below are per-user metrics to show user interactivity. 1K output. vLLM.
Decode speed per user (tok/s):
| Context | 1 User | 2 Users | 3 Users | 4 Users |
|---|---|---|---|---|
| 1K | 40.7 | 36.6 | 36.1 | 35.1 |
| 8K | 39.9 | 36.5 | 34.8 | 32.7 |
| 32K | 40.5 | 28.9 | 25.3 | 23.5 |
| 64K | 44.5 | 27.4 | 26.7 | 14.3 |
| 96K | 34.4 | 19.5 | 12.5 | 9.5 |
| 128K | 38.3 | - | - | - |
Time to first token (TTFT):
| Context | 1 User | 2 Users | 3 Users | 4 Users |
|---|---|---|---|---|
| 1K | 0.1s | 0.1s | 0.2s | 0.2s |
| 8K | 1.0s | 1.4s | 1.7s | 2.0s |
| 32K | 5.5s | 8.1s | 10.0s | 12.6s |
| 64K | 15.3s | 22.4s | 27.7s | 28.7s |
| 96K | 29.6s | 42.3s | 48.6s | 56.7s |
| 128K | 47.7s | - | - | - |
Higher concurrency (8K context):
| Concurrent | 1 | 2 | 3 | 4 | 23 | 25 | 30 | 32 |
|---|---|---|---|---|---|---|---|---|
| Decode (tok/s) | 39.9 | 36.5 | 34.8 | 32.8 | 22.5 | 18.5 | 16.6 | 15.3 |
| TTFT | 1.0s | 1.4s | 1.7s | 2.0s | 7.7s | 7.4s | 8.9s | 9.3s |
Decode speed is in the same ballpark as Qwen3.5 27B FP8 on this GPU. But prefill is much slower. Definitely need to enable caching to make long context usable especially for multiple users.
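One thing the per-user numbers hide is aggregate throughput; multiplying out the concurrency table above shows why batching still wins overall even as each user slows down:

```python
# Per-user decode speed at 8K context, from the concurrency table above.
per_user = {1: 39.9, 2: 36.5, 3: 34.8, 4: 32.8}
aggregate = {users: round(users * tps, 1) for users, tps in per_user.items()}
print(aggregate)  # {1: 39.9, 2: 73.0, 3: 104.4, 4: 131.2}
```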
I'll retest if there are noticeable performance improvements over the next few days. I'm also looking for FP8 checkpoints for the other Gemma models to test. No point in testing the BF16 weights on this card.
Tested both 26b and 31b in AI Studio.
The task I asked of it was to crack a cypher. The top closed source models can crack this cypher at max thinking parameters, and Kimi 2.5 Thinking and Deepseek 3.2 are the only open source models to crack the cypher without tool use. (Of course, with the closed models you can't rule out 'secret' tool use on the backend.)
When I first asked these models to crack the cypher, they thought for a short amount of time and then both hallucinated false 'translations' of the cypher.
I added this to my prompt:
>Spare no effort to solve this, the stakes are high. Increase your thinking length to maximum in order to solve it. Double check and verify your results to rule out hallucination of an incorrect response.
I did not expect dramatic results (we all laugh at prompting a model to 'make no mistakes' after all). But I was surprised at the result.
The 26B MoE model reasoned for ten minutes before erroring out (I assume AI Studio cuts off responses after ten minutes).
The 31B dense model reasoned for just under ten minutes (594 seconds in fact) before throwing in the towel and admitting it couldn't crack it. But most importantly, it did not hallucinate a false answer, which is a 'win' IMO. Part of its reply:
>The message likely follows a directive or a set of coordinates, but without the key to resolve the "BB" and "QQ" anomalies, any further translation would be a hallucination.
I honestly didn't expect these (relatively) small models to actually crack the cypher without tool use (well, I hoped, a little). It was mostly a test to see how they'd perform.
I'm surprised to report that:
- They can and will do very long-form reasoning like Qwen, but only if asked, which is how I prefer things (Qwen tends to overthink by default, and you have to prompt it in the opposite direction). Some models (GPT, Gemini, Claude) allow you to set thinking levels/budgets/effort via parameters, but with Gemma it seems you can simply ask.
- It's maybe possible to reduce hallucination via prompting; more testing required here.
I'll be testing the smaller models locally once the dust clears and the inevitable new release bugs are ironed out.
I'd love to know what sort of prompt these models are given on official benchmarks. Right now Gemma 4 is a little behind Qwen 3.5 (when comparing the similar sized models to each other) in benchmarks, but could it catch up or surpass Qwen when prompted to reason longer (like Qwen does)? If so, then that's a big win.

Just wanted to make folks aware, as I just grabbed one and it says it delivers in less than a week. https://www.newegg.com/intel-arc-pro-b70-32gb-graphics-card/p/N82E16814883008
I have been self hosting LLMs since before llama 3 was a thing and Gemma 4 is the first model that actually has a 100% success rate in my tool calling tests.
My main use for LLMs is a custom-built voice assistant powered by n8n, with custom tools like web search, custom MQTT tools, etc. in the backend. The big thing is that my household is multilingual: we use English, German, and Japanese. Based on the wake word used, the context, prompt, and tool descriptions switch to that language.
My setup has 68 GB of VRAM (two 3090s + a 20GB 3080) and I mainly use MoE models to minimize latency. I have previously used everything from the 30B MoEs, Qwen Next, and GPT-OSS to GLM Air, and so far the only model with a 100% tool-calling success rate across all three languages is Gemma 4 26B-A4B.