NVIDIA AI Releases Nemotron-Labs-Diffusion: A Tri-Mode Language Model with 6× Tokens Per Forward Over Qwen3-8B

NVIDIA researchers have released Nemotron-Labs-Diffusion, a language model family that unifies three decoding modes in one architecture. The model supports autoregressive (AR) decoding, diffusion-based parallel decoding, and self-speculation decoding. It is available in 3B, 8B, and 14B parameter sizes. The family includes base, instruct, and vision-language variants.

Sequential Decoding Limits Throughput

Standard autoregressive (AR) language models generate text one token at a time, left to right. Each token depends on all previous tokens. This sequential dependency limits GPU parallelism per generation step. The result is low hardware utilization at low batch sizes — the typical setting for single-user or edge deployment.

Diffusion language models (LMs) offer a different approach. Instead of generating tokens sequentially, they denoise multiple tokens in parallel per forward pass. This enables higher throughput. The tradeoff has been accuracy: diffusion LMs have consistently lagged behind AR models on benchmarks, requiring substantially more data to reach comparable performance. A key reason is that diffusion training treats all token permutations uniformly, rather than leveraging the strong left-to-right prior inherent in natural language.

https://d1qx31qr3h6wln.cloudfront.net/publications/Nemotron_Diffusion_Tech_Report_v1.pdf?VersionId=db8_EMO8B.vmU26.jr7Le9pN3MqcUDNL

What Is a Tri-Mode Language Model?

Nemotron-Labs-Diffusion is trained on a joint AR-diffusion objective. At inference time, it operates in three modes depending on the deployment context. There are no mode-specific architectural modifications — the same weights serve all three modes.

AR mode is standard left-to-right autoregressive decoding using causal attention. This mode is best suited for high-concurrency cloud serving.

Diffusion mode denoises multiple tokens in parallel within a fixed-length block. The sequence is partitioned into contiguous blocks. Within each block, tokens attend bidirectionally. Across blocks, attention remains causal, so prior blocks can reuse their KV cache. A lightweight trained sampler predicts, per masked position, whether the model’s top-1 prediction at the current denoising step is correct. Positions predicted as correct are committed in that step. This allows the model to commit multiple tokens per forward pass.

Self-speculation mode uses the diffusion pathway to draft candidate tokens and the AR pathway to verify them, within the same single model. No auxiliary draft model or separate prediction head is required. The diffusion pathway generates a block of k candidate tokens in parallel. The AR pathway then runs a second forward pass over those candidates using causal attention, verifying the longest contiguous prefix that matches AR predictions. Each cycle produces between 1 and k+1 verified tokens. This contrasts with Multi-Token Prediction (MTP) methods such as Eagle3, which use small auxiliary draft heads attached to an AR backbone.

Training

The joint training objective combines an AR next-token prediction loss and a block-wise diffusion denoising loss:

ℒ(θ) = ℒ_AR(θ) + α · ℒ_diff(θ)

The coefficient α is set to 0.3 across all training stages. Ablation experiments varying α from 0.1 to 1.0 show that both AR-mode and diffusion-mode accuracy peak at α = 0.3. No value in the range [0.1, 0.5] improves one mode at the expense of the other — the two objectives rise and fall together.

Two-stage training first trains the model purely on the AR objective for 1 trillion tokens, building strong left-to-right linguistic priors. Stage 2 then introduces the joint objective for 300 billion additional tokens. In ablations, two-stage training contributed +5.74% average accuracy. Adding the AR loss contributed the single largest gain at +7.48%. Global loss averaging — treating all tokens across a batch equally rather than averaging per-sequence first — contributed +2.12% by reducing gradient variance from variable diffusion masking ratios. Cumulatively, the full training pipeline improved the baseline by 16.05% average accuracy.

All models are initialized from pretrained Ministral3 base models, not trained from scratch. Training was performed on 256 NVIDIA H100 GPUs. Instruct models are trained via supervised fine-tuning (SFT) on 45 billion tokens on top of the base models, using the same joint AR-diffusion objective with α = 0.3. The training and inference pipeline is released through Megatron Bridge.

LoRA-Enhanced Linear Self-Speculation

The base diffusion-to-AR alignment in self-speculation can be improved with a LoRA adapter. This adapter is fine-tuned on the diffusion draft pathway to better align its output with the AR verifier. It targets only the o_proj layer of the attention module (rank 128, α = 512, approximately 36M trainable parameters, 0.4% of the backbone). LoRA tuning improves tokens per forward (TPF) by 14.4%, 32.5%, and 27.6% at the 3B, 8B, and 14B scales respectively, with negligible accuracy change.

Speed-of-Light Analysis

The research team reports a speed-of-light (SOL) analysis — a theoretical upper bound on tokens per forward pass achievable by the diffusion mode, assuming an oracle sampler that correctly identifies all positions that can be safely committed in parallel.

At block length 32, the SOL acceptance rate reaches 7.60× on average, exceeding 10× on coding and multilingual tasks. Current confidence-based sampling achieves approximately 3× TPF at comparable accuracy, leaving a large gap to the SOL ceiling.

Comparing against linear self-speculation: both approach similar acceptance rates (6.82× for linear self-speculation vs. 7.60× SOL). However, the real tokens per forward pass (TPF) gap is much larger — 6.02× for SOL versus 3.41× for linear self-speculation, a 76.5% difference. Linear self-speculation requires two forward passes per cycle (one diffusion draft, one AR verify) and accepts only a contiguous prefix. These two constraints cap its real TPF well below SOL, even when drafter and verifier are well aligned.

NVIDIA introduces Nemotron-Labs-Diffusion, a 3B/8B/14B model family achieving 5.99× tokens per forward over Qwen3-8B using self-speculation decoding. — https://d1qx31qr3h6wln.cloudfront.net/publications/Nemotron_Diffusion_Tech_Report_v1.pdf?VersionId=db8_EMO8B.vmU26.jr7Le9pN3MqcUDNL

Benchmark Results

On the 10-task instruct evaluation (HumanEval, MBPP, LiveCodeBench-CPP, GSM8K, Math500, AIME24, AIME25, GPQA, IFEval, MMLU):

NLD-8B AR mode: 63.61% average accuracy, versus 62.75% for Qwen3-8B and 58.02% for Ministral3-8B-Instruct.

NLD-8B diffusion mode: 63.18% average accuracy with 2.57× TPF.

NLD-8B LoRA-tuned linear self-speculation: 62.81% average accuracy with 5.99× TPF.

NLD-8B quadratic self-speculation: 64.04% average accuracy with 6.38× TPF.

On SPEED-Bench with SGLang on an NVIDIA GB200 GPU, linear self-speculation achieves 4× higher throughput than Qwen3-8B and 3.3× speedup over the NLD-8B AR mode at concurrency 1 (3.97× with an optimized CUDA kernel). Compared to Qwen3-8B-Eagle3, linear self-speculation delivers a 2.4×, 2.3×, and 1.8× speedup at batch size 1 on GB200, RTX Pro 6000, and DGX Spark respectively.

Acceptance length is the underlying reason for this advantage. Across SPEED-Bench categories, NLD achieves average acceptance lengths of 5.46 (native) and 6.82 (with LoRA) tokens per draft step. Eagle3 averages 2.75 and Qwen3-9B-MTP averages 4.24. On the four diffusion-friendly categories — coding, math, reasoning, and multilingual — the gap widens further: 8.69 for NLD-LoRA versus 2.81 for Eagle3.

At 14B scale with LoRA-tuned linear self-speculation, NLD-14B achieves 66.36% average accuracy at 5.96× TPF, outperforming Qwen3-14B at 65.17% accuracy in AR mode.

The vision-language model, Nemotron-Labs-Diffusion-VLM-8B, extends the same framework to multimodal tasks. In linear self-speculation mode, it achieves 3.63× to 7.45× TPF — the higher end for responses over 200 tokens — with a 0.1% average accuracy drop versus AR mode.

Marktechpost’s Visual Explainer

Overview
Three Modes
Install
Basic Usage
Self-Speculation
Production Serving
When to Use

What is Nemotron-Labs-Diffusion?

A single model checkpoint. Three decoding modes. No architecture changes.

Nemotron-Labs-Diffusion is a language model family from NVIDIA that combines autoregressive (AR) decoding, diffusion-based parallel decoding, and self-speculation decoding in one set of weights. You switch modes at inference time by changing the attention pattern — no separate model files needed.

Sizes: 3B · 8B · 14B

Variants: Base · Instruct · VLM

Requires: transformers ≥ 5.0.0

License: NVIDIA Nemotron Open Model

5.99×

Tokens per forward vs Qwen3-8B (Linear Self-Speculation, 8B)

3.3×

Throughput over AR mode at concurrency 1 (GB200)

2.4×

Faster than Qwen3-8B-Eagle3 at batch size 1 (GB200)

63.61%

Avg accuracy, 8B AR mode vs 62.75% Qwen3-8B

The Three Decoding Modes

Same weights. Different attention pattern. Pick based on your deployment.

Mode 1

AR Decoding

Standard left-to-right generation using causal attention. One token per forward pass. Compatible with all existing AR serving infrastructure.

Best for: high-concurrency cloud serving where GPU compute is fully saturated by batching.

Mode 2

Diffusion Decoding

Denoises multiple tokens per block in parallel. Adjust the threshold value to trade accuracy for higher throughput. 2.57× TPF at threshold 0.9.

Best for: flexible accuracy–throughput tradeoff from one model.

Mode 3

Self-Speculation

Diffusion drafts k tokens in parallel. AR verifies them in a second pass. Accepts the longest matching prefix. No auxiliary model or extra heads needed.

Best for: low-concurrency or single-user inference where per-user speed matters most.

How mode switching works: You call a different method on the same model object — ar_generate(), generate(), or linear_spec_generate(). The model weights do not change.

Installation

Two pip installs. CUDA-capable GPU required.

The model uses trust_remote_code=True because custom modeling code is bundled with the checkpoint on Hugging Face. Install peft only if you plan to use the LoRA-enhanced self-speculation mode.

Step 1 — core dependencies

Copy
pip install “transformers>=5.0.0” torch accelerate

Step 2 — optional: LoRA-enhanced self-speculation

Copy
pip install peft

Step 3 — load model (swap model ID for 3B or 14B)

Copy
from transformers import AutoModel, AutoTokenizer
import torch

# Available: nvidia/Nemotron-Labs-Diffusion-3B
# nvidia/Nemotron-Labs-Diffusion-8B
# nvidia/Nemotron-Labs-Diffusion-14B
repo = “nvidia/Nemotron-Labs-Diffusion-8B”

tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModel.from_pretrained(repo, trust_remote_code=True)
model = model.cuda().to(torch.bfloat16)

Basic Usage — All Three Modes

Prepare the prompt once. Choose a generate call.

All three modes share the same tokenization step. The variable nfe (num function evals) returned alongside output IDs lets you measure how many forward passes were used to produce the output.

Shared — build prompt_ids

Copy
history = [{“role”: “user”, “content”: “Explain gradient descent.”}]
prompt = tokenizer.apply_chat_template(history, tokenize=False,
add_generation_prompt=True)
prompt_ids = tokenizer(prompt, return_tensors=“pt”).input_ids.to(“cuda”)

AR Mode — standard autoregressive

Copy
out_ids, nfe = model.ar_generate(prompt_ids, max_new_tokens=512)

Diffusion Mode — parallel decoding (threshold adjusts speed vs accuracy)

Copy
out_ids, nfe = model.generate(
prompt_ids,
max_new_tokens=512,
block_length=32,
threshold=0.9,
eos_token_id=tokenizer.eos_token_id
)

Decode output — same for all modes

Copy
text = tokenizer.batch_decode(
out_ids[:, prompt_ids.shape[1]:], skip_special_tokens=True
)[0]
print(f“Output: {text}\nNFE: {nfe}”)

Self-Speculation + LoRA Drafter

Highest per-user throughput. Optional LoRA for higher acceptance length.

Without LoRA, average acceptance length is 5.46 tokens per draft step. With LoRA it rises to 6.82, versus 2.75 for Eagle3 and 4.24 for Qwen3-9B-MTP. The LoRA adapter is stored inside the same Hugging Face repo under linear_spec_lora/.

Linear self-speculation — without LoRA

Copy
out_ids, nfe = model.linear_spec_generate(
prompt_ids,
max_new_tokens=512,
block_length=32,
eos_token_id=tokenizer.eos_token_id
)

Linear self-speculation — with LoRA drafter (recommended)

Copy
from peft import PeftModel

repo = “nvidia/Nemotron-Labs-Diffusion-8B”
model = AutoModel.from_pretrained(repo, trust_remote_code=True)
model = model.cuda().to(torch.bfloat16)

# Attach the LoRA adapter from the same repo
model = PeftModel.from_pretrained(
model, repo, subfolder=“linear_spec_lora”
).eval()

# Unwrap to call linear_spec_generate directly
base = model.model

out_ids, nfe = base.linear_spec_generate(
prompt_ids,
max_new_tokens=512,
block_length=32,
eos_token_id=tokenizer.eos_token_id
)
print(tokenizer.decode(
out_ids[0, prompt_ids.shape[1]:], skip_special_tokens=True
))
print(f“NFE: {nfe}”)

Production Serving: vLLM & SGLang

OpenAI-compatible API. Standard curl calls work out of the box.

SGLang was used for all SPEED-Bench measurements in the paper and is the recommended serving framework for self-speculation mode. Both frameworks expose an OpenAI-compatible /v1/chat/completions endpoint.

vLLM — install and serve

Copy
pip install vllm
vllm serve “nvidia/Nemotron-Labs-Diffusion-8B”

SGLang — install and serve

Copy
pip install sglang
python3 -m sglang.launch_server \
–model-path “nvidia/Nemotron-Labs-Diffusion-8B” \
–host 0.0.0.0 –port 30000

Call either server — OpenAI-compatible

Copy
curl -X POST “http://localhost:30000/v1/chat/completions” \
-H “Content-Type: application/json” \
–data ‘{
“model”: “nvidia/Nemotron-Labs-Diffusion-8B”,
“messages”: [{ “role”: “user”, “content”: “Your prompt here.” }]
}’

SGLang with Docker

Copy
docker run –gpus all –shm-size 32g -p 30000:30000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
–env “HF_TOKEN=<your_token>” –ipc=host \
lmsysorg/sglang:latest \
python3 -m sglang.launch_server \
–model-path “nvidia/Nemotron-Labs-Diffusion-8B” \
–host 0.0.0.0 –port 30000

When to Use Each Mode

Match the mode to your deployment context.

Scenario
Mode
Reason

High-concurrency API (many users)
ar_generate()
GPU is fully saturated by batching. Sequential decoding is not the bottleneck.

Single-user or edge inference
linear_spec_generate() + LoRA
3.3× over AR on GB200. 2.4× over Eagle3 at batch size 1.

Adjustable speed vs accuracy
generate() — diffusion
Tune threshold between 0 and 1. Lower threshold = more tokens per pass = lower accuracy.

Existing AR serving stack
ar_generate()
Drop-in replacement. No infrastructure changes needed.

Coding, math, multilingual tasks
linear_spec_generate() + LoRA
Acceptance length peaks on structured content: 8.57× coding, 8.14× math.

Vision-language, long responses
VLM — linear_spec_generate()
Up to 7.45× TPF on responses over 200 tokens. 0.1% accuracy drop vs AR.

Model collection on Hugging Face: huggingface.co/collections/nvidia/nemotron-labs-diffusion — includes 3B, 8B, 14B base, instruct, and VLM checkpoints.

Key Takeaways

Nemotron-Labs-Diffusion unifies AR, diffusion, and self-speculation decoding in one model, with no mode-specific architectural changes.

Joint AR-diffusion training is not a tradeoff — both objectives peak at α=0.3 and improve together.

Self-speculation mode achieves 5.99× TPF on the 8B model, with 2.4× higher throughput than Qwen3-8B-Eagle3 at batch size 1 on GB200.

Higher acceptance length is the key differentiator: NLD-LoRA averages 6.82 tokens per draft step versus 2.75 for Eagle3 and 4.24 for MTP.

Speed-of-light analysis shows the diffusion mode has a theoretical ceiling of 7.60× TPF — current confidence-based sampling realizes only ~3×, leaving significant room for sampler improvements.

Check out the Paper, Model Weights and Technical details. Also, feel free to follow us on Twitter and don’t forget to join our 150k+ ML SubReddit and Subscribe to our Newsletter. Wait! are you on telegram? now you can join us on telegram as well.

Need to partner with us for promoting your GitHub Repo OR Hugging Face Page OR Product Release OR Webinar etc.? Connect with us

Source link

NVIDIA AI Releases Nemotron-Labs-Diffusion: A Tri-Mode Language Model with 6× Tokens Per Forward Over Qwen3-8B

Sequential Decoding Limits Throughput

What Is a Tri-Mode Language Model?

Training

LoRA-Enhanced Linear Self-Speculation

Speed-of-Light Analysis

Benchmark Results

Marktechpost’s Visual Explainer

What is Nemotron-Labs-Diffusion?

The Three Decoding Modes

Installation

Basic Usage — All Three Modes

Self-Speculation + LoRA Drafter

Production Serving: vLLM & SGLang

When to Use Each Mode

Key Takeaways

GitHub confirms 3,800 internal repos stolen through poisoned VS Code extension as supply chain worm hits Microsoft’s Python SDK

Corti's new Symphony for Speech-to-Text model beats OpenAI at medical terminology accuracy, highlighting the value of specialized AI

Alibaba’s AI Agent Chip Strategy Goes Beyond Nvidia

Alibaba Qwen Team Introduces Qwen3.5-LiveTranslate-Flash: Real-Time Multimodal Interpretation Across 60 Languages at 2.8-Second Latency

Google just redesigned the search box for the first time in 25 years — here’s why it matters more than you think.

AWS nabs white hot gen AI media creation startup fal, becoming its preferred cloud provider

Leave a Reply Cancel reply

You may have missed

Ethereum retests $2,100, but could ETH crash amid technical breakdown?

Zcash Foundation Patches 2 Critical Zebra Flaws, Reports $817K in Q1 Spending

Never hold your assets on Exchanges!! #cryptonews #crypto #digitalassets #xrpnews #xrp #llc #shorts

GitHub confirms 3,800 internal repos stolen through poisoned VS Code extension as supply chain worm hits Microsoft’s Python SDK

Sitemap

Legal Information

Pin It on Pinterest

Sequential Decoding Limits Throughput

What Is a Tri-Mode Language Model?

Training

LoRA-Enhanced Linear Self-Speculation

Speed-of-Light Analysis

Benchmark Results

Marktechpost’s Visual Explainer

What is Nemotron-Labs-Diffusion?

The Three Decoding Modes

Installation

Basic Usage — All Three Modes

Self-Speculation + LoRA Drafter

Production Serving: vLLM & SGLang

When to Use Each Mode

Key Takeaways

More Stories

Leave a Reply Cancel reply

You may have missed

Sitemap

Legal Information

Categories

Pin It on Pinterest