r/unsloth

▲ 70 r/unsloth

4-bit Qwen3.6 MTP GGUF cited 70+ websites with one prompt!

4-bit Qwen3.6 MTP GGUF managed to search 70+ sites from a single prompt.

Try this locally with Unsloth Studio on 20GB RAM.

Unsloth now supports automatic MTP + speculative decoding for supported models. Unsloth also now auto-selects the best MTP settings for your specific device (Mac, CPU, GPU etc.)

We also fixed many bugs and issues including tokens/s not showing up correctly and MTP not being applied properly.

GitHub: https://github.com/unslothai/unsloth

u/yoracale — 18 hours ago
▲ 54 r/unsloth+1 crossposts

RX 7900 XTX vs Radeon AI PRO R9700 — llama.cpp Vulkan vs ROCm (6 models, token-gen)

Setup: llama.cpp llama-bench, -fa 1 -ngl 99 -ctk q8_0 -ctv q8_0 -p 512,2048 -n 128,256 -r 3, 300 W power cap on both cards. Models are unsloth GGUFs (UD-IQ4_XS / UD-Q4_K_XL); gpt-oss-20b is the ggml-org native MXFP4. R9700 = RDNA4/gfx1201, 7900 XTX = RDNA3/gfx1100. R9700 runs measured one day earlier, identical config.
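For anyone who wants to reproduce the run, here is a minimal sketch that assembles the llama-bench command from the flags quoted in the setup. The binary name on PATH and the model filename are assumptions; swap in your own paths.

```python
import shlex


def llama_bench_argv(model_path, binary="llama-bench"):
    """Build the argument list using the flags quoted in the post.

    `binary` and `model_path` are placeholders (assumptions), not values from the post.
    """
    return [
        binary,
        "-m", model_path,   # GGUF file to benchmark
        "-fa", "1",         # flash attention on
        "-ngl", "99",       # offload all layers to the GPU
        "-ctk", "q8_0",     # K cache type
        "-ctv", "q8_0",     # V cache type
        "-p", "512,2048",   # prompt (prefill) sizes
        "-n", "128,256",    # generation lengths
        "-r", "3",          # repetitions per test
    ]


if __name__ == "__main__":
    # "model.gguf" is a hypothetical filename.
    print(shlex.join(llama_bench_argv("model.gguf")))
```

Pass the list to `subprocess.run` to launch it; the 300 W power cap is set outside llama-bench and is not covered here.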

Takeaways:

- 7900 XTX beats the R9700 by +24–29% on token-gen across the whole slate: memory bandwidth (384-bit vs 256-bit).

- Vulkan > ROCm for token-gen on both architectures, huge on MoE (XTX: +33–64%).

- Prefill flips it: ROCm pp2048 is ~8–17% faster on dense models (e.g. Qwen-27B IQ4: ROCm 1022 vs Vulkan 870 t/s).
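The percentages in these takeaways are plain ratios. A small sketch of the arithmetic, using the one pair of numbers the post gives (Qwen-27B IQ4, pp2048); the function name is just for illustration:

```python
def percent_faster(fast, slow):
    """Return how much faster `fast` is than `slow`, in percent."""
    return (fast / slow - 1.0) * 100.0


if __name__ == "__main__":
    rocm_pp2048 = 1022.0    # t/s, from the post
    vulkan_pp2048 = 870.0   # t/s, from the post
    print(f"ROCm prefill advantage: {percent_faster(rocm_pp2048, vulkan_pp2048):.1f}%")

    # Bus-width ratio only; real bandwidth also depends on the memory
    # data rate, which the post does not state.
    print(f"Bus width ratio: {384 / 256:.2f}x")
```

The first figure lands at the top of the quoted ~8–17% range. The bus is 1.5x wider while the measured token-gen gap is +24–29%, so bus width alone does not predict the gap.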

greetings Ginmarr

u/Ginmarr — 20 hours ago
▲ 28 r/unsloth

PinchBench and Tau2 may matter more than one more AIME headline

For agent models, PinchBench and Tau2 may matter more than one more AIME headline.

I still think AIME and GPQA matter. They say something real about capability ceilings. For agent models, though, I reach first for execution-heavy, tool-heavy, multi-step signals. That is why Ring-2.6-1T caught my eye: PinchBench: 87.60, Tau2-Bench Telecom: 95.32, and ClawEval: 63.82 sit alongside AIME 26: 95.83, GPQA Diamond: 88.27, and ARC-AGI-V2: 66.18. For production-style agents, I care first about whether the model can keep a workflow moving, coordinate tools cleanly, and avoid spending deep reasoning on every intermediate step. The public high / xhigh framing fits that story too, with deeper reasoning available when you need it instead of dominating every path.

u/Tricky_Season2969 — 18 hours ago

I made my own organization on Hugging Face solely for releasing small distills of bigger models

I recently started my own Hugging Face org called CoNDeNse-AI focused on making smaller, lightweight distilled AI models that are easier to run on normal hardware 🙌

Org: https://huggingface.co/CoNDeNse-AI

Most of the training is done on Kaggle using 2x T4 GPUs, so a big part of the project is figuring out how to get the best possible results from limited hardware. Because of this, we unfortunately can’t currently make proper distills based on newer/larger Qwen 3.5 base models since Kaggle struggles heavily with them during training and distillation.

Some current projects are:

- GLM-5.1-Qwen3-1.7B-CoNDeNse

- GLM-5.1-Qwen3-0.6B-CoNDeNse

- GLM-5.1-Qwen3-1.7B-CoNDeNse-GGUF

The 1.7B versions mainly focus on preserving reasoning, coding, and multilingual capabilities while reducing overhead, while the 0.6B variant is more focused on accessibility and lower-end hardware support. The GGUF release is aimed at easier local inference in things like llama.cpp and LM Studio 💻

The org is still very experimental, so alongside proper releases there are also research checkpoints, quantization tests, and random experiments that may or may not work 😅

Would love feedback from people working on low-resource training/distillation setups.

u/Capital_Savings_9942 — 17 hours ago
▲ 597 r/unsloth+1 crossposts

Run Qwen3.6 MTP GGUFs locally!

Hey guys, Qwen3.6 can run ~1.4–2.2× faster with no accuracy change due to MTP. You can run this locally on just 18GB RAM, VRAM or unified memory.

The Qwen3.6 Unsloth GGUFs are now out of experimental mode, llama.cpp has merged many PRs, and MTP is now properly supported in Unsloth. MTP is now ready!

Please use the latest Unsloth `v0.1.41-beta`, not `v0.1.405-beta` which is older.

Qwen3.6-27B MTP can run at 160 tokens/s. Qwen3.6-35B-A3B MTP GGUF reaches 240 tokens/s. We also uploaded MTP GGUFs for Qwen3.5!

27B MTP GGUF: https://huggingface.co/unsloth/Qwen3.6-27B-MTP-GGUF
35B-A3B MTP GGUF: https://huggingface.co/unsloth/Qwen3.6-35B-A3B-MTP-GGUF

Guide: https://unsloth.ai/docs/models/qwen3.6#mtp-guide

Thank you! We've got lots of releases this week as well.

u/yoracale — 2 days ago
▲ 145 r/unsloth

Run Qwen3.6 MTP GGUFs in Unsloth Studio!

Hey guys, Qwen3.6 MTP GGUFs now work in Unsloth Studio: https://github.com/unslothai/unsloth

Just update Unsloth Studio or do a fresh install.

macOS, Linux, WSL:

curl -fsSL https://unsloth.ai/install.sh | sh

Windows PowerShell:

irm https://unsloth.ai/install.ps1 | iex

As always huge thanks to llama.cpp and devs for making this possible.

We'll be doing a new pypi release with lots of new updates tomorrow! Lots!!!

u/yoracale — 3 days ago

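Korean input doesn't work in the latest Unsloth Studio version. Has anyone solved this?

Ever since I updated Unsloth Studio, I can't type Korean at all in the prompt input.
My environment is a Mac mini.

Is this an Unsloth Studio bug? Or is there a way to fix multilingual input such as Korean?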

u/More-Sail-6170 — 2 days ago
▲ 33 r/unsloth

Any plans to update Qwen3 CoderNext with MTP?

The Unsloth team has been truly amazing in breadth and depth of releases. I’m super excited to try the 27b in particular.

The Qwen3 CoderNext model has actually surprised me in functionality when thinking is less valuable, like feeding Aider.

I would be grateful if this got MTP turned on with Unsloth’s high quality quants!!

Anyone else a fan?

u/fixedupperfan — 3 days ago
▲ 435 r/unsloth

Qwen3.6 MTP Unsloth GGUFs now 1.8x faster!

Qwen3.6 MTP Unsloth GGUFs now run **1.8x faster, increased from 1.4x just two days ago!** This is due to llama.cpp adding --spec-draft-p-min 0.75!

Args have also changed from
--spec-type mtp
to
--spec-type draft-mtp

Also increase --spec-draft-n-max from 2 to 6

We also released Qwen3.5-0.8B, 2B, 4B, 9B MTP GGUFs! We'll be providing more soon!

For folks who find the new updated branch to have some perf regression, set --spec-draft-p-min to 0.0 to get the old behavior - we provided a plot of the old branch (red) vs the new branch (blue / green) as well.

Also you can use 2 speculative decoding algos - you can add ngram via --spec-type ngram-mod,draft-mtp - the perf isn't yet optimized so I'll do more benchmarks to find better numbers - see https://github.com/ggml-org/llama.cpp/pull/22673
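Putting the flag changes above together, here is a rough sketch of a launcher helper. The `--spec-*` flags and their values are the ones named in this post; the `llama-server` binary name on PATH and the model filename are assumptions, and the flags may change again as the branch evolves.

```python
import shlex


def mtp_server_args(model_path, n_max=6, p_min=0.75, use_ngram=False):
    """Assemble llama-server arguments for MTP speculative decoding.

    Flag names come from the post; `model_path` is a placeholder.
    """
    # The post says ngram can be combined with MTP via a comma-separated list.
    spec_type = "ngram-mod,draft-mtp" if use_ngram else "draft-mtp"
    return [
        "llama-server",
        "-m", model_path,
        "--spec-type", spec_type,
        "--spec-draft-n-max", str(n_max),
        "--spec-draft-p-min", str(p_min),
    ]


if __name__ == "__main__":
    # New defaults from the post.
    print(shlex.join(mtp_server_args("model.gguf")))
    # Old behavior if the new branch regresses for you: p-min 0.0.
    print(shlex.join(mtp_server_args("model.gguf", p_min=0.0)))
```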

Guide for MTP: https://unsloth.ai/docs/models/qwen3.6#mtp-guide

u/danielhanchen — 5 days ago

Unsloth Studio is only loading into RAM

I finally got around to installing Unsloth Studio and am giving it a whirl on my Windows machine. I have an RTX6k and am trying to load the latest QWEN MTP model you guys created. However no matter what I do, all models are loading in RAM instead of the GPU.

From what I see online, it most likely means there's a CUDA driver mismatch. But I installed 12.9 and updated all my environment variables, so I'm not sure why this is still happening.

Any thoughts or help?

u/Demonicated — 4 days ago
▲ 253 r/unsloth

Ring-2.6-1T has been open-sourced!!!

Ring-2.6-1T is a 1T-parameter-scale thinking model with 63B active parameters, built for real-world agent workflows that require both strong capability and operational efficiency. It is optimized for coding agents, tool use, and long-horizon task execution, delivering leading results on benchmarks including PinchBench, ClawEval, TAU2-Bench, and GAIA2-search.

With adaptive reasoning effort across high and xhigh modes, Ring-2.6-1T dynamically allocates reasoning budget based on task complexity. This enables stronger performance with lower token overhead, especially in tool-heavy and multi-turn agent workflows.

Ring-2.6-1T is designed for advanced coding agents, complex reasoning pipelines, and large-scale autonomous systems where execution quality, latency, and cost efficiency all matter.

https://huggingface.co/inclusionAI/Ring-2.6-1T

u/rulingarayashiki — 6 days ago
▲ 309 r/unsloth

Qwen3.6 MTP Unsloth Experimental GGUFs

Hey guys, some of you may have seen our Qwen3.6 MTP GGUFs. MTP (Multi Token Prediction) speculative decoding enables models like Qwen3.6 to have ~1.4-2x faster generation with no change in accuracy. This gives Qwen3.6 27B and 35B-A3B a >1.4x speed-up over the original baseline, which is especially useful for local models.

Qwen3.6 27B can now do 140 tokens / s generation and Qwen3.6 35B-A3B 220 tokens / s generation! See MTP Benchmarks for more details.

Regarding draft tokens, we found 2 to be the best. The acceptance rate defs drops as you draft more tokens, so it's probs best in general to stick with 2. For coding, maybe 3 will work fine since more tokens probs get accepted.
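To see why extra draft tokens give diminishing returns, here is a back-of-the-envelope sketch using the standard speculative decoding estimate. It assumes each draft token is accepted independently with the same probability, which is a simplification and not something measured in this post. The acceptance rates below are made-up illustration values.

```python
def expected_tokens_per_step(accept_prob, n_draft):
    """Expected tokens produced per verification pass.

    Uses (1 - a^(n+1)) / (1 - a), assuming every draft token is accepted
    independently with probability `accept_prob` (an idealization).
    """
    if accept_prob >= 1.0:
        # Every draft accepted, plus the one token the main model adds.
        return float(n_draft + 1)
    return (1.0 - accept_prob ** (n_draft + 1)) / (1.0 - accept_prob)


if __name__ == "__main__":
    for accept_prob in (0.6, 0.8, 0.9):  # hypothetical acceptance rates
        row = ", ".join(
            f"n={n}: {expected_tokens_per_step(accept_prob, n):.2f}"
            for n in (1, 2, 3, 6)
        )
        print(f"accept={accept_prob}: {row}")
```

Each extra draft token adds less than the one before it, and every draft still costs compute. If acceptance also falls as drafts get longer, the real gain is smaller than this estimate.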

You must use the specific llama.cpp PR branch which we give instructions for in our guide below. Unsloth Studio will support it once the PR is merged.

We're now uploading MTP quants for Qwen3.5 smaller models. Thank you!

u/yoracale — 7 days ago

Tried Qwen3.6-35B-A3B-MTP-GGUF:UD-Q2_K_XL on Panther Lake X9 388H, B390

Got a new laptop, a Lenovo Slim 7i (32GB RAM),
and was wondering if it can run this.

Prompt - Write a 500 word explanation of Rust ownership with examples.

[ Prompt: 13.5 t/s | Generation: 6.7 t/s ]

u/unluckybitch18 — 6 days ago
▲ 11 r/unsloth

Looping issue with MTP on Qwen3.6

Hi, I have a looping issue when I try the new MTP branch version of llama.cpp.
My config:

[*]

chat-template-kwargs = {"preserve_thinking":true}

reasoning-budget = 4096

reasoning-budget-message = "Reasoning budget reached. Conclude the analysis and provide the final answer."

device = Vulkan1

gpu-layers = all

no-mmproj-offload = 1

batch-size = 2048

ctx-size = 128000

ubatch-size = 512

temp = 0.6

top-p = 0.95

top-k = 20

min-p = 0.00

presence-penalty=0.0

repeat-penalty=1.0

cache-prompt = 1

timeout = 600

reasoning = on

image-min-tokens = 1024

metrics = 1

fit-target = 0

no-mmap = 1

jinja = 1

prio = 3

reasoning = on

no-warmup = 1

parallel = 1

flash-attn = on

port = 8001

threads = 16

threads-batch = 16

cache-type-k = q8_0

cache-type-v = q8_0

kv-unified = true

ctx-checkpoints = 64

checkpoint-every-n-tokens = 2048

cache-ram = 20480

mlock = 1

main-gpu = 1

verbose=1

[Qwen3.6-27B-MTP-UD-Q6_K]

model = C:\Users\user\.cache\huggingface\hub\models--unsloth--Qwen3.6-27B-MTP-GGUF\snapshots\53b097416d6346f849b530e4bc1b5590dfe9d758\Qwen3.6-27B-Q6_K.gguf

mmproj = C:\Users\user\.cache\huggingface\hub\models--unsloth--Qwen3.6-27B-MTP-GGUF\snapshots\53b097416d6346f849b530e4bc1b5590dfe9d758\mmproj-BF16.gguf

cache-type-k = q4_1

cache-type-v = q4_1

spec-type = draft-mtp

spec-draft-n-max = 2
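A quick way to sanity-check which values a model actually ends up with is to merge the two sections yourself. This sketch assumes the model section overrides `[*]` key by key. That is my reading of the preset format, not confirmed behavior, and the sample below is a trimmed copy of the config above.

```python
import configparser

# Trimmed copy of the posted config, including the duplicated `reasoning` key.
SAMPLE = """
[*]
ctx-size = 128000
cache-type-k = q8_0
cache-type-v = q8_0
reasoning = on
reasoning = on

[Qwen3.6-27B-MTP-UD-Q6_K]
cache-type-k = q4_1
cache-type-v = q4_1
spec-type = draft-mtp
spec-draft-n-max = 2
"""


def effective_settings(text, model):
    """Merge global `[*]` keys with one model section (model wins).

    The override order is an assumption about how the presets are applied.
    """
    # strict=False tolerates duplicate keys (last one wins);
    # interpolation=None keeps values containing '%' untouched.
    parser = configparser.ConfigParser(strict=False, interpolation=None)
    parser.read_string(text)
    merged = dict(parser["*"])
    merged.update(parser[model])
    return merged


if __name__ == "__main__":
    settings = effective_settings(SAMPLE, "Qwen3.6-27B-MTP-UD-Q6_K")
    for key in sorted(settings):
        print(f"{key} = {settings[key]}")
```

Under that assumption, the per-model `q4_1` KV cache replaces the global `q8_0`, so the model runs with a more aggressively quantized cache than the `[*]` block suggests.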

---------

i can see in terminal the LLM looping

[53923] srv update_slots: run slots completed
[53923] que start_loop: waiting for new tasks
[53923] que start_loop: processing new tasks
[53923] que start_loop: processing task, id = 1798
[53923] que start_loop: update slots
[53923] srv update_slots: posting NEXT_RESPONSE
[53923] que post: new task, id = 1799, front = 0
[53923] slot get_n_draft_: id 0 | task 0 | max possible draft: 15217
[53923] slot update_batch: id 0 | task 0 | generate_draft: id=4013, #tokens=20320, #draft=1, pos_next=20320
[53923] srv update_slots: decoding batch, n_tokens = 2
[53923] set_adapters_lora: adapters = 0000000000000000
[53923] adapters_lora_are_same: adapters = 0000000000000000
[53923] set_embeddings: value = 1
[53923] slot update_slots: id 0 | task 0 | restoring speculative checkpoint (pos_min = 20319, pos_max = 20319, size = 748)

[53923] srv update_slots: run slots completed
[53923] que start_loop: waiting for new tasks
[53923] que start_loop: processing new tasks
[53923] que start_loop: processing task, id = 1799
[53923] que start_loop: update slots
[53923] srv update_slots: posting NEXT_RESPONSE
[53923] que post: new task, id = 1800, front = 0
[53923] slot get_n_draft_: id 0 | task 0 | max possible draft: 15217
[53923] slot update_batch: id 0 | task 0 | generate_draft: id=4013, #tokens=20320, #draft=1, pos_next=20320
[53923] srv update_slots: decoding batch, n_tokens = 2
[53923] set_adapters_lora: adapters = 0000000000000000
[53923] adapters_lora_are_same: adapters = 0000000000000000
[53923] set_embeddings: value = 1
[53923] slot update_slots: id 0 | task 0 | restoring speculative checkpoint (pos_min = 20319, pos_max = 20319, size = 748)

[53923] srv update_slots: run slots completed
[53923] que start_loop: waiting for new tasks
[53923] que start_loop: processing new tasks
[53923] que start_loop: processing task, id = 1800
[53923] que start_loop: update slots
[53923] srv update_slots: posting NEXT_RESPONSE
[53923] que post: new task, id = 1801, front = 0
[53923] slot get_n_draft_: id 0 | task 0 | max possible draft: 15217
[53923] slot update_batch: id 0 | task 0 | generate_draft: id=4013, #tokens=20320, #draft=1, pos_next=20320
[53923] srv update_slots: decoding batch, n_tokens = 2
[53923] set_adapters_lora: adapters = 0000000000000000
[53923] adapters_lora_are_same: adapters = 0000000000000000
[53923] set_embeddings: value = 1
[53923] slot update_slots: id 0 | task 0 | restoring speculative checkpoint (pos_min = 20319, pos_max = 20319, size = 748)
[53923]
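What stands out in the log is that the queue task id keeps increasing (1798, 1799, 1800) while `pos_next` stays at 20320, so no token is ever committed. Here is a small sketch for spotting that pattern in a saved server log. The regex is written against the line format shown above and may need adjusting for other llama.cpp builds.

```python
import re

# Matches the generate_draft entries as they appear in the posted log.
DRAFT_RE = re.compile(
    r"generate_draft: id=\d+, #tokens=\d+, #draft=\d+, pos_next=(\d+)"
)


def longest_stall(log_text):
    """Return (pos_next, run_length) for the longest run of identical pos_next."""
    best = (None, 0)
    current, run = None, 0
    for match in DRAFT_RE.finditer(log_text):
        pos = int(match.group(1))
        if pos == current:
            run += 1
        else:
            current, run = pos, 1
        if run > best[1]:
            best = (current, run)
    return best


if __name__ == "__main__":
    entry = (
        "slot update_batch: id 0 | task 0 | generate_draft: id=4013, "
        "#tokens=20320, #draft=1, pos_next=20320\n"
    )
    print(longest_stall(entry * 3))
```

A long run at one position means the draft is rejected and the checkpoint restored every pass, which is useful detail to include in a bug report.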

----

Does anybody else have this issue or, better yet, a solution? It loops until the timeout.

u/Easy-Ride3366 — 6 days ago
▲ 4 r/unsloth+1 crossposts

Newbie: Can I use unsloth to load any model on hugging face?

In a project I've been asked to load models and do inference in my app directly with Unsloth.
This is the model: Qwen/Qwen3-ASR-0.6B (Hugging Face)

Is it possible, or do I "push back" like Claude told me to?

u/AnakinVader066 — 5 days ago
▲ 469 r/unsloth

Unsloth joins PyTorch Ecosystem!

Hey guys, we're super excited to announce that Unsloth has officially joined the PyTorch Ecosystem! 🔥🦥

In case you didn't know, Unsloth is an open-source project that makes training & running models more accurate and faster with less compute. Our mission is to make local AI accessible to everyone. Unsloth will remain as an independent open-source project, separate from the PyTorch Foundation.

Blog: https://unsloth.ai/blog/pytorch

GitHub: https://github.com/unslothai/unsloth

Thanks to all of you for making this possible! 💕

u/yoracale — 9 days ago
▲ 33 r/unsloth+2 crossposts

I wrote a paper on HoloKV: Using CDMA Phase-Shifting to achieve O(N/k) KV-Cache Compression. Looking for Triton/CUDA collaborators.

Hey everyone,

I’m a 22-year-old independent researcher, and I’ve been trying to tackle the "Memory Wall" for long-context LLMs. Standard methods either quantize precision (which hits a hard limit) or use token eviction (which degrades reasoning).

I just published an open research draft for a different geometric approach called HoloKV.

The concept: Instead of appending new memory slots, HoloKV multiplexes (stacks) k tokens into a single physical memory slot. It uses deterministic +1/-1 orthogonal phase keys (inspired by CDMA telecommunications) to separate the signals.

To make it work natively with modern architectures, I introduced:

  1. Variance Normalization: A sqrt(k) penalty to prevent Softmax entropy collapse caused by superimposing vectors.
  2. Strict Even-Boundary Rule: A constraint on phase-key generation that perfectly preserves the 2D rotary commutative math of RoPE (Llama/Qwen).
  3. LoRA Denoising: Injecting Query/Value LoRA adapters via Knowledge Distillation to natively filter out the Gaussian background static.
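For readers unfamiliar with the CDMA idea this borrows from, here is a minimal pure-Python sketch of classic code-division multiplexing with +1/-1 Walsh-Hadamard codes. It only illustrates how orthogonal keys let superimposed signals be separated again. It is not the HoloKV method itself, which packs k tokens into one slot and therefore has to deal with crosstalk noise. All function names are made up for the illustration.

```python
def hadamard(n):
    """Sylvester construction of an n x n matrix of +1/-1 orthogonal rows."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    rows = [[1]]
    while len(rows) < n:
        rows = [r + r for r in rows] + [r + [-x for x in r] for r in rows]
    return rows


def multiplex(symbols, codes):
    """Superimpose all symbols on one channel, each spread by its own code."""
    length = len(codes[0])
    return [sum(s * c[t] for s, c in zip(symbols, codes)) for t in range(length)]


def demultiplex(channel, code):
    """Recover one symbol by correlating the channel with its code."""
    return sum(x * c for x, c in zip(channel, code)) / len(code)


if __name__ == "__main__":
    codes = hadamard(4)
    symbols = [3.0, -1.5, 0.25, 2.0]
    channel = multiplex(symbols, codes)
    print([demultiplex(channel, code) for code in codes])
```

Recovery is exact here because the channel is as long as the number of codes. Squeezing k signals into fewer dimensions gives up that exactness, which is presumably what the normalization and LoRA denoising steps above are meant to handle.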

The Ask:
I have successfully built the mathematical simulator in PyTorch to prove the orthogonal extraction and RoPE preservation work. However, I am a solo dev working on a GTX 1650. To actually realize the 75%+ physical VRAM savings, this needs a custom SRAM Active Accumulation Buffer written in OpenAI Triton or CUDA to prevent the "Read-Modify-Write" penalty.

I am open-sourcing the math and the paper. If there are any Triton/FlashAttention kernel engineers here who want to collaborate and help me build the hardware kernel, please reach out or open a PR!

**Paper & Code:** https://github.com/0sami0/HoloKV

u/5anez — 7 days ago