r/LocalLLaMA

▲ 15 r/LocalLLaMA

Let’s talk quants of Gemma and Qwen - 16 vs Q8 vs Q4 - any experiences?

Some people say they’d never go under Q8, and others say they find Q3 acceptable! What’s your take?

reddit.com

u/Borkato — 4 hours ago

▲ 48 r/LocalLLaMA

LM Studio finally added support for MTP Speculative Decoding

https://preview.redd.it/1uuzjm0ll72h1.png?width=923&format=png&auto=webp&s=1af7d7594be1e08ff7ad6797e2bc53e9410769a3

Update to 0.4.14 Build 2 (Beta) and make sure your llama.cpp engine is 2.15.0.

https://preview.redd.it/x0vdwjb3n72h1.png?width=742&format=png&auto=webp&s=6367de44208004d2f50194d78a542c46b040dceb

You must also select "Manually choose model load parameters" and enable MTP there before loading the model; it is NOT on by default.

reddit.com

u/pigeon57434 — 3 hours ago

▲ 87 r/LocalLLaMA

48GB VRAM users, what are your daily drivers? Do you wish you had more VRAM? What would you run if you did?

I’m upgrading from 32 to 48 soon and am excited but I’m curious what y’all run!

reddit.com

u/Borkato — 8 hours ago

▲ 20 r/LocalLLaMA

Running DeepSeek-V4 locally with 4x legacy RTX 2080 Ti ($2k budget setup). Custom Turing kernels, W8A8 quantization, and 255 prefill tok/s!

Hey r/DeepSeek,

Who says we need an H100 cluster or the latest expensive GPUs to run frontier MoE models? I wanted to see how far we could push a single node of consumer legacy hardware, so we spent less than $2,500 total to build a budget machine that successfully runs DeepSeek-V4-Flash (284B total, 13B active) locally!

Surprisingly, we managed to hit around 255 prefill tokens/s with a very tight memory budget.

https://preview.redd.it/cfefgc71732h1.png?width=1772&format=png&auto=webp&s=5c673acca7a2a73cfbd0d2059e25102462c56dfc

Here is a quick breakdown of how we achieved this "legacy donkey pulling a massive MoE chariot" feat via hardware-software co-optimization:

⚡️ The Technical Breakthroughs

Custom Turing CUDA Kernels: The 2080 Ti Tensor Cores are still capable, but PCIe Gen3 and VRAM bandwidth are huge bottlenecks. We rewrote custom CUDA kernels tailored specifically for the Turing architecture to accelerate W8A8 (INT8) matrix multiplication, heavily alleviating the bandwidth choke.
Heterogeneous Inference: Optimized static memory splitting and dynamic offloading between the 4x 11/22GB VRAM and 1TB system RAM. 100% of the hardware capacity is utilized.
Computation-Communication Overlap: Implemented a pipelined execution strategy to hide the massive multi-GPU communication overhead caused by MoE routing.
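
For readers unfamiliar with the term, W8A8 means both weights and activations are stored as 8-bit integers, multiplied in integer arithmetic, and rescaled afterwards. The snippet below is a minimal pure-Python sketch of that idea with symmetric per-tensor scales. It is not the repository's CUDA kernel; the function names `quantize_int8` and `w8a8_dot` and the sample values are made up for illustration.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: returns (list of ints in [-127, 127], scale)."""
    max_abs = max(abs(v) for v in values) or 1.0  # avoid a zero scale for all-zero input
    scale = max_abs / 127.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def w8a8_dot(weights, activations):
    """Dot product with INT8 weights and INT8 activations (hypothetical helper)."""
    qw, scale_w = quantize_int8(weights)
    qa, scale_a = quantize_int8(activations)
    # Integer multiply-accumulate; real kernels accumulate in int32 on Tensor Cores.
    acc = sum(w * a for w, a in zip(qw, qa))
    # One floating-point rescale at the end recovers the approximate real value.
    return acc * scale_w * scale_a


weights = [0.5, -1.0, 0.25, 0.75]
activations = [2.0, 1.0, -4.0, 0.5]

exact = sum(w * a for w, a in zip(weights, activations))
approx = w8a8_dot(weights, activations)
print(f"fp result: {exact:.4f}, W8A8 result: {approx:.4f}")
```

Each value occupies one byte instead of two, which is the reason this format eases a bandwidth bottleneck; the cost is a small rounding error in the result.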

https://preview.redd.it/5ltwol3z632h1.png?width=2414&format=png&auto=webp&s=6c4c4dcf62737f7f5dcb9a5b8d4aa3f422f7edae

🖥️ Budget Hardware Specs

CPU: Intel Xeon E5-2696 v4 (The classic budget king for multi-core)
GPU: 4x RTX 2080 Ti (11/22GB each)
RAM: 1TB DDR4 ECC

The entire implementation, deployment script, and preliminary tech report are 100% open-sourced. I'd love to hear your thoughts, benchmarks, or feedback from fellow system/compiler hackers here!

🔗 GitHub Repository: https://github.com/lvyufeng/deepseek-v4-2080ti

(Note: I submitted the detailed report to arXiv a few days ago, but it’s currently caught in the manual moderation queue—likely because a rookie author throwing a 2080 Ti at DeepSeek-V4 triggered their review boundaries lol. Will update with the arXiv link once it's cleared!)

https://reddit.com/link/1ti5sxu/video/uu9ea2l0v62h1/player

https://reddit.com/link/1ti5sxu/video/if6alov1v62h1/player

reddit.com

u/Known_Ice9380 — 5 hours ago

▲ 18 r/LocalLLaMA

Newbie vibe coding experience: Shifting from Claude Sonnet 4.6 to Qwen3.6-35B-A3B-UD-Q6_K

This is really just a post for those with shallow understanding of all this stuff, those not yet ready or capable of diving into the deeper end of vibe coding/llms. It might not be a helpful post for anyone more advanced than that.

I have been working on a Python Pygame project for about two months. It is now sitting at roughly 30k lines of code across 55 modules. I have been using Visual Studio Code, Copilot Pro+, and around three times the cost of pro+ in additional premium requests per month.

I initially started with Claude Opus, which was brilliant, but it became too expensive. I then moved to Claude Sonnet 4.6, which worked reasonably well at first. But over time I started seeing more and more messages like, “Sorry, the response hit the length limit. Please rephrase your prompt.” It also began struggling to resolve some bugs, even after many prompt attempts.

Generally, the thinking and reasoning periods seemed to get longer without producing useful outcomes, which meant tokens were being spent for very little return. I tried several ways to minimise this, but the same issues kept coming up.

I decided to install Ollama and Cline and use Qwen3.6... which has been going really well. It has already solved a few bugs that Sonnet seemed unable to resolve. I do need to be more mindful with prompts and context window management, but that feels like less of an obstacle than the issues I was having with Sonnet.

When my Copilot Pro+ allowance refreshes, I plan to get Claude Opus to review the code and give me a sense of how well Qwen3.6 has handled things. If the review is positive, I think that may be the end of my Copilot subscription for now.

I also want to acknowledge that before leaving Opus, I used it to modularise the program from one large monolithic Python file into smaller files and modules, with each file responsible for a specific part of the game. I think that made a big difference and helped both Sonnet and Qwen3.6 work much more effectively. For any newbie coders, I do think there is good merit in getting Claude Opus to set up and structure your program initially.

For context, my hardware is probably above average, with a 5090 and a 4000 Pro (56 GB of VRAM), running a 250k context on Qwen3.6 within Cline.

reddit.com

u/sooki10 — 8 hours ago

▲ 164 r/LocalLLaMA

Intel's Crescent Island PCB Leaks, Showing a Massive Xe3P GPU, 16-Pin Connector, 160GB LPDDR5X as Intel Sidesteps the HBM Shortage

Upcoming Intel Xe3P data center GPU with 20 8GB LPDDR5X modules for a total of 160GB, bypassing HBM shortages.

Assuming a 32-bit interface per module, that's a 640-bit wide memory interface, or a 10-channel memory interface if converted to the 64-bit wide desktop equivalent. At 8800-9500 MT/s, that's 704-760 GB/s of memory bandwidth.
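
The arithmetic behind those figures can be checked with a few lines of Python. The 32-bit-per-module width is the post's own assumption, and the function name is just for illustration.

```python
def bandwidth_gb_s(modules: int, bits_per_module: int, mt_per_s: int) -> float:
    """Peak bandwidth in GB/s: bus width in bytes times transfers per second."""
    bus_bytes = modules * bits_per_module / 8
    return bus_bytes * mt_per_s / 1000  # bytes * MT/s = MB/s, then /1000 for GB/s


bus_bits = 20 * 32            # 20 modules, 32 bits each (assumed)
channels_64bit = bus_bits // 64
low = bandwidth_gb_s(20, 32, 8800)
high = bandwidth_gb_s(20, 32, 9500)
print(bus_bits, channels_64bit, low, high)
```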

wccftech.com

u/FullstackSensei — 11 hours ago

▲ 56 r/LocalLLaMA

Google AI Edge Gallery v1.0.13 & v1.0.14 updates: Gemma 4 Multi-Token Prediction, Pixel TPU support, experimental MCP, new skills, now saves chat history

u/AnticitizenPrime — 9 hours ago

▲ 59 r/LocalLLaMA+1 crossposts

Here are my KV cache quantization benchmarks: TurboQuant is overrated but saved by TCQ, q5 deserves more attention, and symmetric q8 might be a waste of VRAM

Greetings from TurboQuant's former biggest defender, now a middle-sized, niche-aware TurboQuant defender. Today I'm presenting the results of thoroughly exploring the world of PPL and KLD benchmarks on my single RTX 3090 using BeeLlama v0.1.2, with some backstory of unsuccessfully trying other tests and then re-exploring PPL and KLD even more thoroughly to compensate.

Tests were done with Qwen 3.6 27B (Q5_K_S and IQ4_XS) at 64k and 128k context, so a decent model with decent quants at decent context length. Basically the setup we 24 GB VRAM folks are actually using, making the results actually grounded. I'm not in any position to talk shit about the vLLM study, but it really looked like a "how to invest and become rich if you already have $1,000,000" book to me, with regular 4-bit and 5-bit quants missing from the comparison.

Here are my findings:

PPL Hides the Tail, KLD Exposes It. Through q4_0, the entire PPL range stays under 0.01 above bf16. Even turbo3_tcq only adds ~0.02 PPL. But 99.9% KL divergence tells a different story: while q5_0 (at 34.4% of bf16) is obviously behind q8_0, it's still not bad. But then q4_0's tail KLD is 32% worse than q5_0's. Now this is what breaks your tool calls and JSON structure.
Rotation closed the gap at 4 bits. llama.cpp already applies random rotation to KV vectors before quantizing, which is the same basic trick TurboQuant uses. At 4 bits, turbo4 has no quality advantage over q4_0, saves almost no memory, and runs 17% slower. TurboQuant's value is at 2-3 bits where it has no alternatives anyways.
TCQ saves the low end. turbo3_tcq is consistently much better than plain turbo3, and turbo2_tcq is much better than turbo2. They are a legit solution for cases where you need aggressive compression. Now what is TCQ, you might ask? Luckily, the article covers this as well!
Asymmetric KV beats symmetric at the same size. q5_0/q4_0 is the same memory as q4_1/q4_1 but beats it across all test configs in 99.9% precision. After K reaches q5_0, the next useful bit goes to V, not to q5_1 K.
Higher model precision means more cache damage. Q5_K_S took 3-5% more 99.9% precision damage than IQ4_XS at the same cache quant. Model and KV cache quants are not independent, and it's better to balance their quants rather than focusing on only one or the other, as they both feed from the same VRAM pool.
q8 is mostly a luxury tier, unless you have spare VRAM. q8_0/q5_0 at 43.8% of bf16 KV keeps 99.9% precision at 93.7-98.2% across configs, so full q8_0/q8_0 at 53.1% is mostly validation when you don't struggle with VRAM anyways.
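
The "% of bf16" figures quoted above follow from how many bits each cache type spends per element. Here is a rough sketch that reproduces them, assuming the standard ggml block layouts (32 elements per block, a 2-byte scale, plus a 2-byte minimum for the `_1` variants) and equal element counts for K and V. The turbo types are left out because their layouts are not given in the post.

```python
# (bytes per block, elements per block), assuming standard ggml block layouts
BLOCKS = {
    "bf16": (64, 32),
    "q8_0": (34, 32),
    "q5_1": (24, 32),
    "q5_0": (22, 32),
    "q4_1": (20, 32),
    "q4_0": (18, 32),
}


def bits_per_element(cache_type: str) -> float:
    block_bytes, block_elems = BLOCKS[cache_type]
    return block_bytes * 8 / block_elems


def kv_fraction_of_bf16(k_type: str, v_type: str) -> float:
    """KV cache size relative to bf16/bf16, assuming K and V hold equally many elements."""
    return (bits_per_element(k_type) + bits_per_element(v_type)) / (2 * 16)


for k_type, v_type in [("q8_0", "q8_0"), ("q8_0", "q5_0"), ("q5_0", "q5_0"),
                       ("q5_0", "q4_0"), ("q4_1", "q4_1")]:
    print(f"{k_type}/{v_type}: {kv_fraction_of_bf16(k_type, v_type):.1%} of bf16")
```

Under these assumptions q5_0/q4_0 and q4_1/q4_1 both average 5 bits per element, which is why the post can compare them at equal memory.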

Here's the article, with all the data and quite a bit of analysis:
https://anbeeld.com/articles/kv-cache-quantization-benchmarks-for-long-context

u/Anbeeld — 12 hours ago

▲ 25 r/LocalLLaMA

New SOTA 1B model? HRM-text

Saw this video by them. Seems interesting, but tbh the benchmarks seem too good to be true. I'm not super knowledgeable on how models think, so can anyone more knowledgeable explain what exactly is happening? And its pros and cons?

GitHub: https://github.com/sapientinc/HRM-Text Hugging Face: https://huggingface.co/sapientinc/HRM-Text-1B

I'm not affiliated with them in any way, just saw the video on YouTube.

youtu.be

u/vandalieu_zakkart — 9 hours ago

▲ 194 r/LocalLLaMA+2 crossposts

I built a new type of AI tool; it generates 3D objects composed of their constituent parts (instead of the monolithic solid blobs all 3D AI generators produce).

The video shows a washing machine with separate, functional internal parts. It's even shown animated, because of accurate internal hinge and socket design.

This is a new technique compared to how AI is currently used to generate 3D objects. State of the art 3D generators like Meshy or Tripo operate as if molding a 3D shape out of clay.

In contrast, my technique does not generate a 3D shape at all.

It generates code - which in turn runs, generating the 3D object you see. A byproduct of that approach is getting a 3D object with separate, functional parts (which is what we actually wanted).

The project is free and on github: https://github.com/RareSense/Nova3D

Some generated examples:
- Boston Dynamics-style robot dog: https://imgur.com/a/CqMYgrF
- Microwave (random, but shows part separation well): https://imgur.com/a/hIqIJdr
- Internal assembly generation: https://imgur.com/a/JxDZ7Wd

Would love to hear feedback.

u/mhb-11 — 12 hours ago

▲ 250 r/LocalLLaMA

got my first "rm -rf /" today

My agent decided to test whether the harmful-command block worked by issuing an rm -rf /

Thankfully it worked, so the only damage was a mild heart attack.

I implemented a sandbox immediately afterwards.

EDIT: for those wondering, I was implementing a bash command whitelist and also bubblewrap for isolation. I did the whitelist implementation first and that was the command the agent chose to test it 😂 bwrap got done quickly afterwards!
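
For anyone wondering what a bash command whitelist check might look like, here is a minimal sketch in Python. It is not the poster's implementation: the allowed command set, the rejected characters, and the function name are all hypothetical, and a check like this is only a first line of defence, not a substitute for a sandbox such as bubblewrap.

```python
import shlex

ALLOWED_COMMANDS = {"ls", "cat", "grep", "echo"}   # hypothetical allowlist
FORBIDDEN_CHARS = set(";|&`$><\n")                 # reject chaining, substitution, redirects


def is_allowed(command: str) -> bool:
    """Return True only if the command is a single allowlisted program call."""
    if any(ch in FORBIDDEN_CHARS for ch in command):
        return False
    try:
        tokens = shlex.split(command)
    except ValueError:          # e.g. unbalanced quotes
        return False
    return bool(tokens) and tokens[0] in ALLOWED_COMMANDS


for candidate in ["ls -la", "rm -rf /", "ls; rm -rf /", "echo 'unterminated"]:
    print(f"{candidate!r}: {'allowed' if is_allowed(candidate) else 'blocked'}")
```

Rejecting anything not explicitly listed is the safer default: a blocklist has to anticipate every dangerous command, while an allowlist only has to enumerate the safe ones.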

reddit.com

u/DeltaSqueezer — 16 hours ago

▲ 39 r/LocalLLaMA

Nemotron-Labs-Diffusion from NVIDIA

Model Overview

Nemotron-Labs-Diffusion is a tri-mode language model that supports both AR decoding and diffusion-based parallel decoding by simply switching the attention pattern of the same model during inference. The synergy between these two modes enables a third mode, called self-speculation: the same model performs diffusion-based parallel drafting and AR verification with shared KV cache, achieving high acceptance lengths and decoding efficiency. The seamless mode switching by simply changing attention patterns enables high efficiency at different concurrency levels in varying deployment scenarios with one single model.

https://preview.redd.it/mwyq7b7hx42h1.png?width=3915&format=png&auto=webp&s=744bd87267338a6236269a8d915b185cff8a82d2

Highlights

SOTA 3B, 8B, 14B dense LM family (base, instruct, and vision-language variants) supporting AR, diffusion, and self-speculation with the focus on decode efficiency.
Generation moved from a memory-bound regime toward a compute-bound regime. Model weights are loaded once and reused to compute multiple tokens during generation.
Self-speculation uses diffusion for drafting and AR for verification, providing a stronger alternative to MTP approaches:
- 3x higher acceptance length and 2.2x speed-up vs. Qwen3-8B-Eagle3 in SGLang.
- 5.9× tokens per forward over Qwen3-8B (no MTP) with the same accuracy.
Real-device speed-up across platforms:
- DGX Spark (8B, concurrency 1): 2.7x faster with 112 tok/sec vs. 41.8 tok/sec AR using w4a16.
- GB200 (8B, concurrency 1): 3.3x faster with 850 tok/sec vs. 253 tok/sec AR and 360 tok/sec Eagle3. Custom CUDA kernels boost to 1015 tok/sec (4x).
Diffusion speedup-of-light analysis shows that throughput can be further doubled (vs. current best) for a single user with better sampling - future research.
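
The draft-then-verify loop behind self-speculation can be illustrated with a toy greedy example. This is only a sketch of generic speculative verification: in the real model the drafter and verifier are two attention modes of the same weights sharing a KV cache and verification is one batched forward pass, whereas here they are two made-up Python functions called sequentially.

```python
def target_next(tokens):
    """Toy 'AR verifier': the true next token is (last + 1) mod 10."""
    return (tokens[-1] + 1) % 10


def draft_tokens(tokens, k):
    """Toy 'drafter': mostly agrees with the verifier but is wrong whenever 7 is due."""
    out, last = [], tokens[-1]
    for _ in range(k):
        last = (last + 1) % 10
        if last == 7:
            last = 0            # deliberate drafting mistake
        out.append(last)
    return out


def speculative_step(prefix, k):
    """Accept the longest drafted prefix the verifier agrees with, plus one verifier token."""
    accepted = []
    for tok in draft_tokens(prefix, k):
        expected = target_next(prefix + accepted)
        if tok != expected:
            accepted.append(expected)   # verifier's correction ends the step
            return accepted
        accepted.append(tok)
    accepted.append(target_next(prefix + accepted))  # all drafts accepted: bonus token
    return accepted


print(speculative_step([3], k=4))
print(speculative_step([0], k=4))
```

The output is always what the verifier alone would have produced; drafting only changes how many tokens are confirmed per verification step, which is what the "acceptance length" figures above measure.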

https://huggingface.co/nvidia/Nemotron-Labs-Diffusion-VLM-8B

https://huggingface.co/nvidia/Nemotron-Labs-Diffusion-14B-Base

https://huggingface.co/nvidia/Nemotron-Labs-Diffusion-14B

https://huggingface.co/nvidia/Nemotron-Labs-Diffusion-8B-Base

https://huggingface.co/nvidia/Nemotron-Labs-Diffusion-8B

https://huggingface.co/nvidia/Nemotron-Labs-Diffusion-3B-Base

https://huggingface.co/nvidia/Nemotron-Labs-Diffusion-3B

reddit.com

u/jacek2023 — 12 hours ago

▲ 667 r/LocalLLaMA+2 crossposts

New open source multimodal model does it all...with only 3b parameters

Lance is a lightweight native unified multimodal model that supports image and video understanding, generation, and editing within a single framework.

Efficient at 3B scale. With only 3B active parameters, Lance delivers strong performance across image generation, image editing, and video generation benchmarks.

huggingface.co

u/uxl — 18 hours ago

▲ 71 r/LocalLLaMA+1 crossposts

Public Repository "Codegraph" claims to reduce Claude, Cursor, Codex, and OpenCode API tool calls by 94% locally, an innovation that could directly offset the most recent Claude API pricing model.

Author Colbymchenry has developed a tool that lets Claude's Explore agents use a pre-indexed knowledge graph (symbol relationships, call graphs, and code structure). Agents query the graph instantly instead of scanning files, which he says reduces API tool calls by up to 94% while speeding up usage by roughly 77%.

Codebase Improvement Table:

Codebase	With CG	Without CG	Improvement
VS Code · TypeScript	3 calls, 17s	52 calls, 1m 37s	94% fewer · 82% faster
Excalidraw · TypeScript	3 calls, 29s	47 calls, 1m 45s	94% fewer · 72% faster
Claude Code · Python + Rust	3 calls, 39s	40 calls, 1m 8s	93% fewer · 43% faster
Claude Code · Java	1 call, 19s	26 calls, 1m 22s	96% fewer · 77% faster
Alamofire · Swift	3 calls, 22s	32 calls, 1m 39s	91% fewer · 78% faster
Swift Compiler · Swift/C++	6 calls, 35s	37 calls, 2m 8s	84% fewer · 73% faster
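
The "Improvement" column can be re-derived from the raw call counts and timings. The sketch below does that for every row, using integer round-half-up arithmetic; the row values are copied from the table and the helper name is illustrative.

```python
def pct_reduction(with_cg: int, without_cg: int) -> int:
    """Percent reduction relative to the 'without' value, rounded half up."""
    diff = without_cg - with_cg
    return (200 * diff + without_cg) // (2 * without_cg)


# (codebase, calls with, seconds with, calls without, seconds without,
#  claimed % fewer calls, claimed % faster)
ROWS = [
    ("VS Code · TypeScript", 3, 17, 52, 97, 94, 82),
    ("Excalidraw · TypeScript", 3, 29, 47, 105, 94, 72),
    ("Claude Code · Python + Rust", 3, 39, 40, 68, 93, 43),
    ("Claude Code · Java", 1, 19, 26, 82, 96, 77),
    ("Alamofire · Swift", 3, 22, 32, 99, 91, 78),
    ("Swift Compiler · Swift/C++", 6, 35, 37, 128, 84, 73),
]

mismatches = [
    name
    for name, c_with, t_with, c_without, t_without, calls_pct, time_pct in ROWS
    if (pct_reduction(c_with, c_without), pct_reduction(t_with, t_without))
    != (calls_pct, time_pct)
]
print("rows that do not match the claimed percentages:", mismatches)
```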

github.com

u/NetTechMan — 13 hours ago

▲ 64 r/LocalLLaMA

Carbon: Decoding the Language of Life

https://preview.redd.it/rajj11v7j42h1.png?width=1744&format=png&auto=webp&s=72381de22a9bac4b30a59498d549bb09df075df3

Hey, it's loubna from Hugging Face. Very happy to share our latest release: Carbon 🧬, a family of open DNA foundation models. Carbon-3B matches the current SOTA (Evo2-7B) while being 275x faster.

We borrowed a lot from how modern LLMs are trained and from our SmolLM work, but DNA isn't language. Genomes are noisy, redundant, and shaped by evolution rather than communication. So we adjusted the recipe:

Tokenizer. Most genomic models tokenize at the nucleotide level, which blows up sequence length. BPE is the obvious LLM-style fix, but it doesn't behave well on DNA. We use deterministic 6-mer tokens (one token = 6 nucleotides): 6× shorter sequences and cheaper attention.

Training loss. With 6-mer tokens, cross-entropy scores a prediction that gets 5 of 6 nucleotides right the same as one that's completely wrong. This gets brittle late in training and produces loss spikes. We switch mid-training to a more flexible factorized loss (FNS).

Data. Genomes are mostly sparse, repetitive background. We curate down to a staged functional DNA + mRNA mixture, with every ratio chosen by ablation. Like mixing a web corpus, but for biology.
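
To make the tokenizer point concrete, here is a minimal sketch of deterministic, non-overlapping 6-mer tokenization. It is an illustration, not Carbon's actual tokenizer: the vocabulary ordering, the handling of a trailing partial 6-mer (dropped here), and the lack of support for ambiguous bases such as N are all assumptions.

```python
from itertools import product

K = 6
# One id per possible 6-mer over A/C/G/T: 4**6 = 4096 tokens (ordering is an assumption).
VOCAB = {"".join(kmer): idx for idx, kmer in enumerate(product("ACGT", repeat=K))}


def tokenize(seq: str) -> list[int]:
    """Split a DNA string into non-overlapping 6-mers and map each to its id."""
    seq = seq.upper()
    usable = len(seq) - len(seq) % K        # drop a trailing partial 6-mer (assumption)
    return [VOCAB[seq[i:i + K]] for i in range(0, usable, K)]


sequence = "ACGT" * 9                        # 36 nucleotides
tokens = tokenize(sequence)
print(len(VOCAB), len(sequence), len(tokens))
```

It also shows why plain cross-entropy is harsh here: `AAAAAA` and `AAAAAC` differ in a single nucleotide but are two unrelated token ids, so a near miss is scored the same as a completely wrong prediction.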

- Technical report: https://github.com/huggingface/carbon/blob/main/tech-report.pdf
- Demo (with a biology primer for our ML friends): https://huggingface.co/spaces/HuggingFaceBio/carbon-demo

Happy to answer questions in the comments 🤗

reddit.com

u/loubnabnl — 13 hours ago

▲ 13 r/LocalLLaMA

Qwen3.6:27B VRAM 16GB 5080: MTP Quant, Speeds, and Configs

For those of you running Qwen3.6:27B on 16GB VRAM, what quantization did you settle on?

For my primary purpose as a HA voice assistant, I've found my ideal target to be >50 tg and >800 pp. Qwen3.5:9B works really fast, but I'm experimenting with higher intelligence. Offloaded the vision model to CPU because it is infrequently used.

Currently running Qwen3.6-27B-Q3_K_S.gguf with 64 layers on GPU at the following speeds:

prompt eval time =     462.66 ms /   507 tokens (    0.91 ms per token,  1095.83 tokens per second)
       eval time =   18710.17 ms /   884 tokens (   21.17 ms per token,    47.25 tokens per second)
      total time =   19172.84 ms /  1391 tokens
draft acceptance rate = 0.59677 (  481 accepted /   806 generated)

prompt eval time =    6001.34 ms /  8561 tokens (    0.70 ms per token,  1426.51 tokens per second)
       eval time =    2404.46 ms /   147 tokens (   16.36 ms per token,    61.14 tokens per second)
      total time =    8405.80 ms /  8708 tokens
draft acceptance rate = 0.80357 (   90 accepted /   112 generated)
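
For anyone comparing their own logs against the 50 tg / 800 pp targets, the throughput and acceptance figures can be recomputed from the raw numbers. A small sketch follows (the helper name is illustrative; small differences in the last digit come from the rounded millisecond values in the log).

```python
def tokens_per_second(elapsed_ms: float, tokens: int) -> float:
    """Throughput from a llama.cpp timing line: tokens divided by elapsed seconds."""
    return tokens / (elapsed_ms / 1000)


runs = [
    # (prompt ms, prompt tokens, eval ms, eval tokens, drafts accepted, drafts generated)
    (462.66, 507, 18710.17, 884, 481, 806),
    (6001.34, 8561, 2404.46, 147, 90, 112),
]

for p_ms, p_tok, e_ms, e_tok, accepted, generated in runs:
    pp = tokens_per_second(p_ms, p_tok)
    tg = tokens_per_second(e_ms, e_tok)
    acceptance = accepted / generated
    print(f"pp {pp:.1f} tok/s, tg {tg:.1f} tok/s, acceptance {acceptance:.3f}, "
          f"meets targets: {pp > 800 and tg > 50}")
```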

Config:

      -m /models/Qwen3.6-27B/Qwen3.6-27B-Q3_K_S.gguf
      --mmproj /models/Qwen3.6-27B/mmproj-BF16.gguf
      --no-mmproj-offload
      --host 0.0.0.0
      --port 8080
      --jinja
      -fa on
      --temp 0.6
      --top-p 0.95
      --top-k 20
      --min_p 0.0
      --presence-penalty 1.5
      --repeat-penalty 1.0
      --cache-ram 0
      --fit on
      -np 2
      --fit-ctx 32000
      --cache-type-k q8_0
      --cache-type-v q8_0
      --cache-type-k-draft q8_0
      --cache-type-v-draft q8_0
      --log-verbosity 4
      --chat-template-kwargs '{"preserve_thinking": true}'
      --spec-type draft-mtp
      --spec-draft-n-max 2

reddit.com

u/InternationalNebula7 — 10 hours ago

▲ 109 r/LocalLLaMA

Time to update llama.cpp to get some MTP improvements!

https://github.com/ggml-org/llama.cpp/pull/23269

reddit.com

u/PixelatedCaffeine — 18 hours ago

▲ 44 r/LocalLLaMA

The pacman benchmark: finally a viable local agentic coding agent with Qwen 3.6 27b

One way I like to test new models is by one-shotting (with a good prompt) a single-webpage clone of the classic arcade game pacman. I usually do 3 attempts and keep the best one. So far all of them, including Anthropic, ChatGPT and Google models, have failed, most of them miserably. The best one until now was GLM 5.1.

That was until I tried it with Qwen 3.6 27b F16. Out of 3 attempts, 2 were the best by far, with the top result only having minor errors! However, as soon as I dropped to 8-bit quantisation, I could not replicate those good results even after trying 5+ times. This goes to show what I have been saying for a long time, based on my experience: there is a world of difference between a 16-bit and an 8-bit quant, despite most people claiming it is lossless, or nearly lossless.

The results were so good, and since it just happened that I was testing the llama.cpp MTP speculative decoding PR (not yet merged at that time) with my own quants, and developing my own fixed jinja chat template for Qwen 3.5/3.6, I thought why not try to push Qwen 3.6 27b F16 through a proper agentic coding workflow. I think the results were brilliant, and they speak for themselves. You can try the full single page game here:

https://guigand.com/pacman

Lessons learned and observations:

* A good chat template is critical. The official chat template was unusable because it only targets vLLM and is therefore full of errors in other tools. I started with community templates, which were improvements, but still had many quirks. This is why I started fixing the bugs one by one in the official template, and slowly improving it. The beginning of the agentic sessions was painful due to many quirks and errors. But slowly it improved, and once I got the template well tuned, it felt like I had unlocked a new level of intelligence in the model.

* MTP speculative decoding does not accelerate all tasks identically. Basically it is most efficient at deterministic tasks like coding, and least efficient at creative tasks like brainstorming. I wrote about it here: https://www.reddit.com/r/LocalLLaMA/comments/1t9gcar/mtp_benchmark_results_the_nature_of_the/ - For this pacman development, my generation speed varied between 8 tok/s and 18 tok/s depending on the task. For reference, without MTP, I get 6.6 tok/s with the same model and quant.

* Not all harnesses are equal, both in terms of code quality and in terms of impact on speed. Most of us already know that the coding harness has a huge impact on quality, with Claude Code being considered the gold standard; this is what I use for normal daily coding. In this case I started with Qwen CLI, mostly because of the chat template problems, on the principle that if there was one harness more likely to handle Qwen LLM specifics well, it would be their own harness. I was actually pleasantly surprised, and Qwen CLI delivered far beyond what I was expecting! In the later stages, I switched back to Claude Code, mostly to verify that the final chat template was working properly there too. I did not notice any improvement in process or code quality. What I noticed though, is that developing in Claude Code was a lot slower than in Qwen CLI! This is due to all the extra prompts built into Claude Code. With a local model that has such a slow tok/s, it can make the difference between being usable and being borderline hair pulling...

* Context management and caching are super efficient in this model. Do not interfere with them. They work great, so let them do their thing. Do not use any skill, plugin, etc., that manipulates the cache or context. This will confuse the model and make it a lot dumber and more error prone.

* Tool calls, context compaction, shell usage, subagents, and parallel subagents work flawlessly. Initially they did not though, and it took me a long time and lots of work to get them right through chat template fixes and improvements. I actually only used context compaction for testing, and it was fine, as usual in Claude Code.

* High context is usable without too much degradation. Maximum context size is 256k tokens I believe. Most of the time I planned the tasks to stay below 100k, but there were a few times I pushed it slightly over 150k. I did notice slightly reduced capabilities, but nothing major. The main reasons why I tried to keep it low is to get the best reasoning capabilities, as with all other models, but also speed started to decrease as the context usage grew.

* Apart from Gemini, this is the first model that impressed me with its audio knowledge. As a composer, musician, psychoacoustic scientist, and audio engineer, I pay a lot of attention to good audio. In this case, I tasked it to do some advanced audio manipulation and creation. All the audio in the game comes from Qwen having programmed the web audio synthesizer in a highly advanced and complex way. This is not MIDI, not simple wavetables, not samples. It takes into account psychoacoustic properties tuned to human hearing, with the use of harmonics, distortion, layers, and various effects. Truly impressive work. The only exception is the waka-waka sound, for which I had to make it use a sample (the same method was used in the original arcade game).

* I can live with slow token generation speed. I used to think that I needed a minimum of 70 to 80 tok/s for viable development. But this was usable, gave me time to do other things in parallel, and also to better reflect on the agentic tasks. I would probably not use it for large projects with my current hardware, but for small to medium projects it is definitely acceptable.

If you read until here, let me know what you think, and I hope you enjoy the game.

Dev environment: macOS, apple silicon M2 max, 96GB RAM, llama.cpp server with OpenAI and Anthropic API endpoints.

>Edit: Qwen Code has a default timeout of 8 mins, and a default maximum response size of 8000 tokens. With a slower model, like this one, I was getting frequent timeouts initially. And with large planning/brainstorming/coding sessions, I was occasionally getting the response truncated, which required reprocessing. I solved it by making the following changes to my ~/.qwen/settings.json file:

  "modelProviders": {
    "openai": [
      {
        ...
        "generationConfig": {
          ...
          "timeout": 1800000,
          "maxRetries": -1,
          "samplingParams": {
            "max_tokens": 32768
          }
        }
      }
    ]
  },

u/ex-arman68 — 16 hours ago

▲ 763 r/LocalLLaMA

Qwen is cooking hard

I am waiting for 122B and new 27B

u/jacek2023 — 24 hours ago

▲ 73 r/LocalLLaMA

Meet the Fleet of BlackBeard

My website is currently down, sorry about that.

Here is my current full AI homelab setup:

#0 i3 7100, 32gb ddr4, 2x8tb archive nas (Archiving models here. Can serve them via samba nas if needed without needing to download them again.)

#1 ryzen 5600, 64gb ddr4, gtx1070 (Privateer, works surprisingly fast at running 35b a3b)

#2 ryzen 5950x, 128gb ddr4, rtx5060ti, strix x570f, Asus TUF gt502 (Manowar, can expand to 2x5060ti, without problems. tried with 2x3090, it overheats.)

#3 ryzen 9950x3d, 256gb ddr5, rtx5090, gigabyte ai top b850, corsair air5400 (Capt.'s ship. can have one more 3090 there, tried it works fine.)

#4 threadripper 1950x, 128gb ddr4, 4 x 3090, gigabyte designare x399 (the Kraken!, -still building this, waiting for the risers to arrive)

All of them are running on... Linux Mint 22. I will probably go buy 10gbE cards later to connect all of them together on a pentagram to summon some demonlord

u/BlackBeardAI — 18 hours ago
