NVIDIA DGX Spark Now Scales to 4 Nodes for 700B Parameter AI Agents
Rebeca Moen
Mar 16, 2026 21:42
NVIDIA expands DGX Spark to support 4-node configurations, enabling local inference of 700B parameter models and near-linear fine-tuning performance scaling.
NVIDIA has expanded its DGX Spark desktop AI platform to support up to four nodes, quadrupling available memory to 512 GB and enabling local inference of models up to 700 billion parameters. The upgrade, announced alongside the NemoClaw agent toolkit, positions DGX Spark as a serious contender for enterprises wanting to run autonomous AI agents without cloud dependencies.
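A quick back-of-the-envelope check shows what the 512 GB figure implies for a 700B-parameter model. The sketch below is only arithmetic on the article's numbers: it assumes decimal gigabytes and that the weights must fit entirely in the pooled memory. The 4-bit case is an illustrative assumption, not a format the announcement is quoted as using.

```python
# Back-of-the-envelope memory budget using the figures quoted in the article.
TOTAL_MEMORY_GB = 512      # four nodes combined (from the article)
NODES = 4                  # maximum supported configuration (from the article)
PARAMS_BILLIONS = 700      # largest model size cited (from the article)

memory_per_node_gb = TOTAL_MEMORY_GB / NODES

# GB divided by billions of parameters gives bytes per parameter
# (assuming 1 GB = 1e9 bytes). This is an upper bound: it ignores the
# KV cache, activations, and runtime overhead.
max_bytes_per_param = TOTAL_MEMORY_GB / PARAMS_BILLIONS
max_bits_per_param = max_bytes_per_param * 8


def weights_gb(params_billions: float, bits_per_param: float) -> float:
    """Approximate weight footprint in GB at a given precision."""
    return params_billions * bits_per_param / 8


# Illustrative assumption: weights stored at 4 bits per parameter.
four_bit_gb = weights_gb(PARAMS_BILLIONS, 4)
headroom_gb = TOTAL_MEMORY_GB - four_bit_gb

print(f"Memory per node: {memory_per_node_gb:.0f} GB")
print(f"Max average bits per parameter: {max_bits_per_param:.2f}")
print(f"4-bit weights: {four_bit_gb:.0f} GB, headroom: {headroom_gb:.0f} GB")
```

Under these assumptions, a 700B model only fits if its weights average under about 5.85 bits per parameter, which suggests the headline figure relies on low-precision weights rather than 16-bit ones.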
The scaling numbers tell the story. Fine-tuning throughput rises from 18,400 tokens per second on a single node to 74,600 on four nodes, roughly a 4x improvement. For inference, time per output token drops from 269 ms to 72 ms, about 3.7x faster, when scaling from one to four nodes using tensor parallelism.
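To make the scaling claims concrete, the short sketch below computes speedup and parallel efficiency from the four figures quoted above. The function names are illustrative, and the only inputs are the article's numbers.

```python
def speedup_from_throughput(single: float, scaled: float) -> float:
    """Higher throughput is better, so speedup = scaled / single."""
    return scaled / single


def speedup_from_latency(single_ms: float, scaled_ms: float) -> float:
    """Lower latency is better, so speedup = single / scaled."""
    return single_ms / scaled_ms


def parallel_efficiency(speedup: float, nodes: int) -> float:
    """Fraction of ideal linear scaling achieved (1.0 means perfectly linear)."""
    return speedup / nodes


NODES = 4

# Fine-tuning throughput in tokens per second (from the article).
finetune_speedup = speedup_from_throughput(18_400, 74_600)
finetune_efficiency = parallel_efficiency(finetune_speedup, NODES)

# Inference time per output token in milliseconds (from the article).
inference_speedup = speedup_from_latency(269, 72)
inference_efficiency = parallel_efficiency(inference_speedup, NODES)

print(f"Fine-tuning: {finetune_speedup:.2f}x, efficiency {finetune_efficiency:.0%}")
print(f"Inference:   {inference_speedup:.2f}x, efficiency {inference_efficiency:.0%}")
```

The fine-tuning figure works out to slightly more than 4x, which may reflect rounding in the reported numbers or differences in the measurement setup. The inference figure lands a little below linear, at about 93% efficiency.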
Why This Matters for AI Agent Development
Autonomous agents are memory-hungry. NVIDIA’s benchmarks show agents routinely processing context windows of 30K to 120K tokens, with complex requests hitting 250K tokens. That’s roughly equivalent to reading two full novels before responding to a single query.
DGX Spark handles this with what NVIDIA calls the Grace Blackwell Superchip, which runs multiple subagents in parallel. Running four concurrent subagents takes only 2.6x as long as running one, while prompt processing throughput triples. For developers building multi-agent systems, that is about 35% less wall-clock time than running the same four subagents back to back, a saving that compounds across long reasoning chains.
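The concurrency figure is easier to judge as arithmetic. The sketch below assumes the 2.6x number means that four concurrent subagents finish in 2.6 times the wall-clock time of a single subagent, and that running them one after another would take four times as long. Both are interpretations of the quoted figure, not details from NVIDIA.

```python
def aggregate_gain(n_agents: int, time_ratio: float) -> float:
    """Work completed per unit time, relative to running one agent alone."""
    return n_agents / time_ratio


def time_saved_vs_sequential(n_agents: int, time_ratio: float) -> float:
    """Fraction of wall-clock time saved versus running agents back to back.

    Assumes sequential execution takes n_agents times the single-agent time.
    """
    return 1 - time_ratio / n_agents


N_AGENTS = 4
TIME_RATIO = 2.6  # concurrent run time / single-agent run time (from the article)

gain = aggregate_gain(N_AGENTS, TIME_RATIO)
saved = time_saved_vs_sequential(N_AGENTS, TIME_RATIO)

print(f"Aggregate throughput gain: {gain:.2f}x")
print(f"Time saved vs. sequential: {saved:.0%}")
```

Under that reading, concurrency delivers about 1.5x the aggregate work per unit time and cuts total wall-clock time by roughly a third.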
Four Topology Options
NVIDIA outlined specific use cases for each configuration. A single node handles inference up to 120B parameters and local agentic workloads. Two nodes support models up to 400B parameters. Three nodes in a ring topology optimize for fine-tuning larger models. The full four-node setup with a RoCE 200 GbE switch creates what NVIDIA calls a “local AI factory” capable of running state-of-the-art 700B parameter models.
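The configuration limits above can be read as a simple lookup. The sketch below encodes only the inference limits quoted in this section. The helper name is hypothetical, and the three-node ring is left out because it is described for fine-tuning rather than given an inference limit.

```python
from typing import Optional

# (nodes, largest model in billions of parameters) as quoted in the article.
# The three-node ring is omitted: the article positions it for fine-tuning
# and does not state an inference limit for it.
INFERENCE_LIMITS_B = [(1, 120), (2, 400), (4, 700)]


def min_nodes_for_inference(params_billions: float) -> Optional[int]:
    """Smallest listed configuration whose quoted limit covers the model.

    Returns None if the model exceeds the largest quoted limit.
    """
    for nodes, limit in INFERENCE_LIMITS_B:
        if params_billions <= limit:
            return nodes
    return None


for size in (70, 230, 397, 700, 1000):
    print(f"{size}B parameters -> {min_nodes_for_inference(size)} node(s)")
```

Actual fit also depends on quantization, context length, and KV-cache size, so these thresholds are a starting point rather than a guarantee.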
Models explicitly called out as benefiting from multi-node stacking include Qwen3.5 397B, GLM 5, and MiniMax M2.5 230B—all popular choices for the OpenClaw autonomous agent runtime that ships with NemoClaw.
The Cloud Bridge
Perhaps the most practical addition is Tile IR, a kernel portability layer letting developers write code once on DGX Spark and deploy to Blackwell B200/B300 data center GPUs with minimal changes. Roofline analysis shows kernels scale effectively relative to each platform’s theoretical peak, meaning optimizations made locally translate to cloud deployments.
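For readers unfamiliar with roofline analysis, the sketch below shows the standard model: a kernel's attainable throughput is capped by either peak compute or memory bandwidth times arithmetic intensity. The hardware and kernel numbers are made-up placeholders, not specifications of DGX Spark or any Blackwell GPU, and the code does not use Tile IR or cuTile.

```python
def attainable_tflops(peak_tflops: float, bandwidth_tbs: float,
                      intensity_flop_per_byte: float) -> float:
    """Roofline bound: min(compute roof, bandwidth * arithmetic intensity).

    TB/s multiplied by FLOP/byte gives TFLOP/s, so the units line up.
    """
    return min(peak_tflops, bandwidth_tbs * intensity_flop_per_byte)


def fraction_of_roofline(measured_tflops: float, bound_tflops: float) -> float:
    """How close a measured kernel gets to its platform's roofline bound."""
    return measured_tflops / bound_tflops


# Placeholder numbers for illustration only; NOT real hardware specifications.
INTENSITY = 64.0  # FLOPs per byte for a hypothetical kernel

desktop_bound = attainable_tflops(peak_tflops=100.0, bandwidth_tbs=0.25,
                                  intensity_flop_per_byte=INTENSITY)
datacenter_bound = attainable_tflops(peak_tflops=1000.0, bandwidth_tbs=8.0,
                                     intensity_flop_per_byte=INTENSITY)

# Hypothetical measured throughput of the same kernel on each platform.
desktop_fraction = fraction_of_roofline(12.0, desktop_bound)
datacenter_fraction = fraction_of_roofline(384.0, datacenter_bound)

print(f"Desktop:     bound {desktop_bound:.0f} TFLOP/s, achieved {desktop_fraction:.0%}")
print(f"Data center: bound {datacenter_bound:.0f} TFLOP/s, achieved {datacenter_fraction:.0%}")
```

In this made-up case the kernel reaches the same fraction of its bound on both platforms, which is the sense in which a locally tuned kernel can be said to scale relative to each platform's peak.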
This addresses a real pain point. Teams prototype on local hardware, then spend weeks rewriting for production cloud infrastructure. The cuTile Python DSL and TileGym’s preoptimized transformer kernels aim to eliminate that friction.
For enterprises weighing AI infrastructure investments, the expanded DGX Spark capabilities offer a middle path between pure cloud dependency and building out dedicated data center capacity. The ability to run 700B parameter models locally—with a clear upgrade path to cloud scale—makes the economic calculation more interesting than it was six months ago.
