Nemotron-Labs-Diffusion from NVIDIA

Model Overview

Nemotron-Labs-Diffusion is a tri-mode language model that supports both AR decoding and diffusion-based parallel decoding by simply switching the attention pattern of the same model during inference. The synergy between these two modes enables a third mode, called self-speculation: the same model performs diffusion-based parallel drafting and AR verification with shared KV cache, achieving high acceptance lengths and decoding efficiency. The seamless mode switching by simply changing attention patterns enables high efficiency at different concurrency levels in varying deployment scenarios with one single model.

https://preview.redd.it/mwyq7b7hx42h1.png?width=3915&format=png&auto=webp&s=744bd87267338a6236269a8d915b185cff8a82d2

Highlights

SOTA 3B, 8B, 14B dense LM family (base, instruct, and vision-language variants) supporting AR, diffusion, and self-speculation with the focus on decode efficiency.
Generation moved from a memory-bound regime toward a compute-bound regime. Model weights are loaded once and reused to compute multiple tokens during generation.
Self-speculation uses diffusion for drafting and AR for verification, providing a stronger alternative to MTP approaches:
- 3x higher acceptance length and 2.2x speed-up vs. Qwen3-8B-Eagle3 in SGLang.
- 5.9× tokens per forward over Qwen3-8B (no MTP) with the same accuracy.
Real-device speed-up across platforms:
- DGX Spark (8B, concurrency 1): 2.7x faster with 112 tok/sec vs. 41.8 tok/sec AR using w4a16.
- GB200 (8B, concurrency 1): 3.3x faster with 850 tok/sec vs. 253 tok/sec AR and 360 tok/sec Eagle3. Custom CUDA kernels boost to 1015 tok/sec (4x).
Diffusion speedup-of-light analysis shows that throughput can be further doubled (vs. current best) for a single user with better sampling - future research.

https://huggingface.co/nvidia/Nemotron-Labs-Diffusion-VLM-8B

https://huggingface.co/nvidia/Nemotron-Labs-Diffusion-14B-Base

https://huggingface.co/nvidia/Nemotron-Labs-Diffusion-14B

https://huggingface.co/nvidia/Nemotron-Labs-Diffusion-8B-Base

https://huggingface.co/nvidia/Nemotron-Labs-Diffusion-8B

https://huggingface.co/nvidia/Nemotron-Labs-Diffusion-3B-Base

https://huggingface.co/nvidia/Nemotron-Labs-Diffusion-3B

reddit.com

u/jacek2023 — 13 hours ago

▲ 765 r/LocalLLaMA

Qwen is cooking hard

I am waiting for 122B and new 27B

u/jacek2023 — 1 day ago

▲ 55 r/dio

Lord Of The Last Day

let's listen to some good music

youtu.be

u/jacek2023 — 5 days ago

▲ 13 r/progmetal

A Forest of Stars - Sway, Draped In Vague

the more I listen to the new album by A Forest of Stars the better it sounds

youtu.be

u/jacek2023 — 6 days ago

▲ 63 r/LocalLLaMA

server, webui: support continue generation on reasoning models by ServeurpersoCom · Pull Request #22727 · ggml-org/llama.cpp

now you can CONTINUE

github.com

u/jacek2023 — 3 days ago

▲ 1.5k r/Polska

Relaks

z X

u/jacek2023 — 9 days ago

▲ 126 r/Opeth

it's boring guys, stop it

I am following this sub for Opeth topics, not for "funny" images

reddit.com

u/jacek2023 — 12 days ago

▲ 17 r/LocalLLaMA

https://huggingface.co/XiaomiMiMo/MiMo-V2.5

Model Summary

Architecture: Sparse MoE (Mixture of Experts), 310B total / 15B activated parameters
Context Length: Up to 1M tokens
Modalities: Text, Image, Video, Audio
Vision Encoder: 729M-param ViT (28 layers: 24 SWA + 4 Full)
Audio Encoder: 261M-param Audio Transformer (24 layers: 12 SWA + 12 Full)
Multi-Token Prediction (MTP): 329M parameters, 3 layers

u/jacek2023 — 13 days ago

▲ 6 r/doommetal+1 crossposts

new Draconian

u/jacek2023 — 13 days ago

▲ 64 r/LocalLLaMA

Qwen/Qwen3.6-35B-A3B was released 22 days ago

Qwen/Qwen3.6-27B was released 15 days ago

Let's predict when we can expect the 9B and 122B versions

reddit.com

u/jacek2023 — 13 days ago

▲ 108 r/LocalLLaMA

and they are not happy for some reason

the image shows the search result for: Google Chrome 4GB (it's possible that these articles are bullshit, I just know that some people found it in their Chrome)

u/jacek2023 — 14 days ago

▲ 20 r/LocalLLaMA

Here is the actual speed of Mistral Medium Q3 running locally on 3x3090

first some Python

https://preview.redd.it/3blnqya7o0zg1.png?width=1670&format=png&auto=webp&s=bab477f9889c16558044ccebb22e3ebfb6a56118

https://preview.redd.it/76a3j6u7o0zg1.png?width=1620&format=png&auto=webp&s=e302a90ae32a7d01959dfee5f7a921dc73ef20b5

https://preview.redd.it/xmd5tzj8o0zg1.png?width=1276&format=png&auto=webp&s=45bc1d77391da81049b6f026dcf6a4af40dc9ec3

then svg

https://preview.redd.it/8q5am5alo0zg1.png?width=1594&format=png&auto=webp&s=a7feeb832c17481526838e8488f4be3069f56443

https://preview.redd.it/u4mbv1klo0zg1.png?width=1600&format=png&auto=webp&s=7c83a3437c67ebefe1b0339861f05b9d67c6f030

https://preview.redd.it/e8vw83rlo0zg1.png?width=782&format=png&auto=webp&s=fadb4f04bba756056d38049c465d0f7a4323b66d

then html

https://preview.redd.it/zs9c36xbp0zg1.png?width=1626&format=png&auto=webp&s=428cb84d3158e4285eb4f1d47283646e876f55be

https://preview.redd.it/6dw74a5cp0zg1.png?width=1540&format=png&auto=webp&s=cc5af763d980329c0d98064e4f53265cfdf9ec2f

https://preview.redd.it/4s3zccecp0zg1.png?width=3796&format=png&auto=webp&s=6defbc181dcbee1fe4523559792e1642aaf504f8

https://preview.redd.it/30n07tlcp0zg1.png?width=3782&format=png&auto=webp&s=4ae343f915f4f70e48bc17add7ff856e1af5ceab

reddit.com

u/jacek2023 — 16 days ago

▲ 540 r/povertyLocalLLaMA+1 crossposts

https://huggingface.co/unsloth/Mistral-Medium-3.5-128B-GGUF

Mistral Medium 3.5 128B

Mistral Medium 3.5 is our first flagship merged model. It is a dense 128B model with a 256k context window, handling instruction-following, reasoning, and coding in a single set of weights. Mistral Medium 3.5 replaces its predecessor Mistral Medium 3.1 and Magistral in Le Chat. It also replaces Devstral 2 in our coding agent Vibe. Concretely, expect better performance for instruct, reasoning and coding tasks in a new unified model in comparison with our previous released models.

Reasoning effort is configurable per request, so the same model can answer a quick chat reply or work through a complex agentic run. We trained the vision encoder from scratch to handle variable image sizes and aspect ratios.

Find more information on our blog.

Key Features

Mistral Medium 3.5 includes the following architectural choices:

Dense 128B parameters.
256k context length.
Multimodal input: Accepts both text and image input, with text output.
Instruct and Reasoning functionalities with function calls (reasoning effort configurable per request).

Mistral Medium 3.5 offers the following capabilities:

Reasoning Mode: Toggle between fast instant reply mode and reasoning mode, boosting performance with test-time compute when requested.
Vision: Analyzes images and provides insights based on visual content, in addition to text.
Multilingual: Supports dozens of languages, including English, French, Spanish, German, Italian, Portuguese, Dutch, Chinese, Japanese, Korean, and Arabic.
System Prompt: Strong adherence and support for system prompts.
Agentic: Best-in-class agentic capabilities with native function calling and JSON output.
Large Context Window: Supports a 256k context window.

We release this model under a Modified MIT License: Open-source license for both commercial and non-commercial use with exceptions for companies with large revenue.

Recommended Settings

Reasoning Effort:
- 'none' → Do not use reasoning
- 'high' → Use reasoning (recommended for complex prompts and agentic usage) Use reasoning_effort="high" for complex tasks and agentic coding.
Temperature: 0.7 for reasoning_effort="high". Temp between 0.0 and 0.7 for reasoning_effort="none" depending on the task. Generally, lower means answer that are more to the point and higher allows the model to be more creative. It is a good practice to try different values in order to improve the model performance to meet your demands.

u/jacek2023 — 20 days ago

▲ 27 r/LocalLLaMA

another day with pi + gemma 26B

u/jacek2023 — 21 days ago

▲ 70 r/davinciresolve

https://forum.blackmagicdesign.com/viewtopic.php?f=21&t=235587&sid=4cddfcec2a8cd658797d0ff9da05b7c1

from forum:

We are pleased to announce the release of DaVinci Resolve Studio 21.0b2 . This release is available at no charge for existing customers from our support web site.

https://www.blackmagicdesign.com/support/family/davinci-resolve-and-fusion

For DaVinci Resolve 21.0b2, we have taken efforts to keep the project libraries compatible with DaVinci Resolve 20.3.2. While this allows you to access the project library with 20.3.2, individual projects created or opened in 21.0b2 will no longer be accessible in 20.3.2. We recommend a full project library backup as well as individual project backups before opening projects in 21.0b2.

What's New in DaVinci Resolve 21.0b2

The following features have been added or updated.

More consistent crop flip and rotate actions in the Photo page.
Improved handling of crop resolutions for NEF, CR3 and RAF.
Nikon NEF lens distortion and lens vignette controls.
Speed and quality improvements for IntelliSearch better mode analysis.
Improved face identification for IntelliSearch analysis on Windows ARM.
Improved default ease profiles for retime curves.
Keyframe editor uses visible clip extents for curve normalization.
Color font and emoji rendering in Fusion page is RCM aware.
Foveated rendering controls for Apple Vision Pro workflows.
Support for dragging multiple OGraf or Lottie assets into a timeline.
Action to switch to source viewer mode in the Photo page.
Double click effects to apply to current photo.
Photo album quick export now uses output color space.
Looks section in viewer tools in the bottom toolbar.
Navigate photo albums with left and right arrow key in Color page.
Trim pass render tag can be used in file name and path in the Deliver page.
Improved HEIC thumbnail refresh for photo albums on Windows and Linux.
Multiple Macro Editor improvements.
Address issue with exporting stills with original file name.
Address decode issues with some RAF and 12-bit NEF stills.
Address incorrect UltraNR noise profiles for images.
Address fine audio volume adjustment using shift key.
Address auto select controls not working for source timelines.
Address issue with world pose for Apple Vision Pro workflows.
Address stretched text issue with VR360 titles and subtitles.
Address black frames in visionOS Review output with MainConcept MV-HEVC encoder.
Address edge pixel artifacts in immersive video clips added to VR180 or VR360 timelines.
Improved default node positioning for imported and generated nodes.
Address multiple issues with Cryptomatte and 3d renderer.
Scripting API support to classify audio for media pool clips and folders.
Scripting API support for speaker detection in audio transcription.
General performance and stability improvements.

Minimum System Requirements for macOS

macOS 15 Sequoia or later.
8 GB of system memory or 16 GB when using Fusion.
At least 16 GB for advanced AI tools.
At least 32 GB for background rendering and analysis.
For monitoring, Blackmagic Design Desktop Video 12.9 or later.
Apple Silicon based computer.

Minimum System Requirements for Windows

Windows 10 Creators Update.
16 GB of system memory or 32 GB when using Fusion.
For monitoring, Blackmagic Design Desktop Video 12.9 or later.
Integrated GPU or discrete GPU with at least 4 GB of VRAM.
At least 16 GB VRAM for advanced AI tools.
At least 32 GB RAM and 12 GB VRAM for background rendering.
GPU which supports OpenCL 1.2 or CUDA 12.8.
AMD/Intel official drivers from your GPU manufacturer.
NVIDIA Studio driver 581.57 or newer.

Minimum System Requirements for Windows for Arm

Windows 11 for ARM.
Qualcomm Snapdragon X Elite series processor.
16 GB of system memory or 32 GB for 4K or when using Fusion.

Minimum System Requirements for Linux

Rocky Linux 8.6.
32 GB of system memory.
For monitoring, Blackmagic Design Desktop Video 12.9 or later.
Discrete GPU with at least 4 GB of VRAM.
At least 16 GB VRAM for advanced AI tools.
At least 32 GB RAM and 12 GB VRAM for background rendering.
GPU which supports OpenCL 1.2 or CUDA 12.8.
AMD official drivers from your GPU manufacturer.
NVIDIA Studio driver 580.119.02 or newer.

Rohit Gupta

DaVinci Resolve Software Development
Blackmagic Design

reddit.com

u/jacek2023 — 22 days ago

▲ 49 r/LocalLLaMA

https://huggingface.co/ggml-org/NVIDIA-Nemotron-3-Nano-Omni

NVIDIA Nemotron 3 Nano Omni is a multimodal large language model that unifies video, audio, image, and text understanding to support enterprise-grade Q&A, summarization, transcription, and document intelligence workflows. It extends the Nemotron Nano family with integrated video+speech comprehension, Graphical User Interface (GUI), Optical Character Recognition (OCR), and speech transcription capabilities, enabling end-to-end processing of rich enterprise content such as meeting recordings, M&E assets, training videos, and complex business documents. NVIDIA Nemotron 3 Nano Omni was developed by NVIDIA as part of the Nemotron model family.

This model is available for commercial use.

This model was improved using Qwen3-VL-30B-A3B-Instruct, Qwen3.5-122B-A10B, Qwen3.5-397B-A17B, Qwen2.5-VL-72B-Instruct, and gpt-oss-120b. For more information, please see the Training Dataset section below.

u/jacek2023 — 22 days ago

▲ 698 r/LocalLLaMA

words of wisdom

u/jacek2023 — 22 days ago

▲ 49 r/LocalLLaMA

Tutorial from the Google guy,

I use very similar setup (llama.cpp instead of lmstudio)

u/jacek2023 — 23 days ago