NVIDIA and Google infrastructure cuts AI inference costs
At the Google Cloud Next conference, Google and NVIDIA outlined their hardware roadmap designed to address the cost of AI inference at scale.
The companies detailed the new A5X bare-metal instances, which run on NVIDIA Vera Rubin NVL72 rack-scale systems. Through hardware and software codesign, this architecture aims to deliver up to ten times lower inference cost per token compared to previous generations, while concurrently achieving ten times higher token throughput per megawatt.
Connecting thousands of processors requires massive bandwidth to prevent processing delays. The A5X instances address this hardware challenge by pairing NVIDIA ConnectX-9 SuperNICs with Google Virgo networking technology.
This configuration scales to 80,000 NVIDIA Rubin GPUs within a single-site cluster, and up to 960,000 GPUs across a multisite deployment. Operating at this scale requires sophisticated workload management, as routing data across nearly a million parallel processors demands precise synchronisation to avoid idle compute time.
Mark Lohmeyer, VP and GM of AI and Computing Infrastructure at Google Cloud, said: “At Google Cloud, we believe the next decade of AI will be shaped by customers’ ability to run their most demanding workloads on a truly integrated, AI‑optimised infrastructure stack.
“By combining Google Cloud’s scalable infrastructure and managed AI services with NVIDIA’s industry‑leading platforms, systems and software, we’re giving customers flexibility to train, tune, and serve everything from frontier and open models to agentic and physical AI workloads—while optimising for performance, cost, and sustainability.”
Sovereign data governance and cloud security requirements
Beyond raw processing capabilities, data governance remains a primary issue for enterprise deployments. Highly regulated sectors, including finance and healthcare, often stall machine learning initiatives due to data sovereignty requirements and the risks of exposing proprietary information.
To address these compliance mandates, Google Gemini models running on NVIDIA Blackwell and Blackwell Ultra GPUs are entering preview on Google Distributed Cloud. This deployment method allows organisations to retain frontier models entirely within their controlled environments, alongside their most sensitive data stores.
The architecture incorporates NVIDIA Confidential Computing, a hardware-level security capability that keeps prompts and fine-tuning data encrypted even while a model is running. This prevents unauthorised parties, including the cloud infrastructure operator itself, from viewing or altering the underlying data.
For multi-tenant public cloud environments, a preview of Confidential G4 VMs equipped with NVIDIA RTX PRO 6000 Blackwell GPUs introduces these same cryptographic protections, giving regulated industries access to high-performance hardware without violating data privacy standards. This release represents the first cloud-based confidential computing offering for NVIDIA Blackwell GPUs.
Operational overhead in agentic AI training
Building multi-step agentic systems requires connecting large language models to complex application programming interfaces, maintaining continuous vector database synchronisation, and actively mitigating algorithmic hallucinations during execution.
To reduce this engineering burden, NVIDIA Nemotron 3 Super is now available on the Gemini Enterprise Agent Platform. The platform provides developers with tools to customise and deploy reasoning and multimodal models specifically designed for agentic tasks. The broader NVIDIA platform on Google Cloud is optimised for various models – including Google’s Gemini and Gemma families – giving developers the tools to construct systems that reason, plan, and act.
Training these models at scale introduces heavy operational overhead, particularly when managing cluster sizing and hardware failures during long reinforcement learning cycles.
Google Cloud and NVIDIA introduced Managed Training Clusters on the Gemini Enterprise Agent Platform, which includes a managed reinforcement learning API built with NVIDIA NeMo RL. This system automates cluster sizing, failure recovery, and job execution, allowing data science teams to concentrate on model quality rather than low-level infrastructure management.
CrowdStrike actively utilises NVIDIA NeMo open libraries, including NeMo Data Designer and NeMo Megatron Bridge, to generate synthetic data and fine-tune models for domain-specific cybersecurity applications. Operating these models on Managed Training Clusters with Blackwell GPUs accelerates their automated threat detection and response capabilities.
Legacy architecture integration and physical simulations
The integration of machine learning into heavy industry and manufacturing presents a different class of engineering challenges. Connecting digital models to physical factory floors requires exact physical simulations, massive compute power, and standardisation across legacy data formats. NVIDIA’s AI infrastructure and physical AI libraries are now available on Google Cloud, providing the foundation for organisations to simulate and automate real-world manufacturing workflows.
Major industrial software providers – such as Cadence and Siemens – have made their solutions available on Google Cloud, accelerated by NVIDIA infrastructure. These tools power the engineering and manufacturing of heavy machinery, aerospace platforms, and autonomous vehicles.
Manufacturing firms often run on decades-old product lifecycle management systems, making the translation of geometry and physics data difficult. By utilising NVIDIA Omniverse libraries and the open-source NVIDIA Isaac Sim framework via the Google Cloud Marketplace, developers can bypass some of these translation issues to construct physically accurate digital twins and train robotics simulation pipelines prior to physical deployment.
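In practice, Isaac Sim is driven from Python: a script creates a `SimulationApp`, opens a USD stage, and steps the simulation frame by frame. A minimal sketch follows; the config dict is standard, but the stage path is a placeholder for an exported factory-line model, and the launch itself only works inside an Isaac Sim installation:

```python
# Standard headless configuration for running Isaac Sim on a cloud GPU node
# with no attached display. Resolution values here are illustrative.
SIM_CONFIG = {
    "headless": True,   # render off-screen -- appropriate for cloud training nodes
    "width": 1280,
    "height": 720,
}

# Placeholder path to a digital-twin stage exported from a PLM/CAD pipeline.
STAGE_PATH = "/workspace/factory_line.usd"

# The actual launch requires an Isaac Sim install, so it is sketched in comments:
# from isaacsim import SimulationApp      # older releases: from omni.isaac.kit
# app = SimulationApp(SIM_CONFIG)
# import omni.usd
# omni.usd.get_context().open_stage(STAGE_PATH)
# while app.is_running():
#     app.update()                        # advance one simulation frame
```

The headless flag is what makes the same script usable both on a workstation and on a Google Cloud GPU instance provisioned from the Marketplace image.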
Deploying NVIDIA NIM microservices, such as the Cosmos Reason 2 model, to Google Vertex AI and Google Kubernetes Engine enables vision-based agents and robots to interpret and navigate their physical surroundings. Together, these platforms help developers advance from computer-aided design directly to living industrial digital twins.
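NIM microservices expose an OpenAI-compatible HTTP API, so an agent running on GKE or Vertex AI calls a deployed model with a standard chat-completions payload. A minimal sketch, in which the in-cluster service URL and the model identifier are assumptions about a hypothetical deployment:

```python
import json
import urllib.request  # used by the commented request at the bottom

# Hypothetical in-cluster endpoint; a real deployment exposes the NIM
# container's OpenAI-compatible API behind a Kubernetes Service.
NIM_URL = "http://cosmos-reason-nim:8000/v1/chat/completions"

def build_chat_request(model: str, prompt: str, max_tokens: int = 256) -> dict:
    """Assemble an OpenAI-style chat-completions payload as accepted by NIM."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

payload = build_chat_request(
    "nvidia/cosmos-reason-2",              # model name is an assumption
    "Is the forklift's path to bay 3 clear?",
)

# Sending the request requires a live NIM service, so it is left commented:
# req = urllib.request.Request(
#     NIM_URL,
#     data=json.dumps(payload).encode(),
#     headers={"Content-Type": "application/json"},
# )
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

Because the wire format matches the OpenAI API, existing agent frameworks can usually point at a NIM endpoint by changing only the base URL.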
Impacts across the accelerated compute ecosystem
Translating these hardware specifications into quantifiable financial returns requires examining how early adopters are using the infrastructure.
The broad portfolio includes options scaling from full NVL72 racks down to fractional G4 VMs offering just one-eighth of a GPU. This allows customers to precisely provision acceleration capabilities for mixture-of-experts reasoning and data processing tasks.
Thinking Machines Lab scales its Tinker API on A4X Max VMs to accelerate training. OpenAI uses large-scale inference on NVIDIA GB300 and GB200 NVL72 systems on Google Cloud to handle demanding workloads, including ChatGPT operations.
Snap transitioned its data pipelines to GPU-accelerated Spark on Google Cloud to cut the extensive costs associated with large-scale A/B testing. In the pharmaceutical sector, Schrödinger leverages NVIDIA accelerated computing on Google Cloud to compress drug discovery simulations that previously took weeks into a matter of hours.
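Moving a Spark workload onto GPUs of this kind typically involves no code changes to the queries themselves: the RAPIDS Accelerator for Apache Spark is enabled through configuration. The keys below are its documented settings, while the resource fractions are illustrative and depend on cluster shape:

```python
# Documented RAPIDS Accelerator settings; values are illustrative.
RAPIDS_CONF = {
    "spark.plugins": "com.nvidia.spark.SQLPlugin",   # load the GPU query planner
    "spark.rapids.sql.enabled": "true",              # run SQL/DataFrame ops on GPU
    "spark.executor.resource.gpu.amount": "1",       # one GPU per executor
    "spark.task.resource.gpu.amount": "0.25",        # four concurrent tasks per GPU
}

def apply_rapids_conf(builder):
    """Fold the settings into a SparkSession builder (pyspark not required here)."""
    for key, value in RAPIDS_CONF.items():
        builder = builder.config(key, value)
    return builder

# Typical use on a GPU cluster (requires pyspark and the RAPIDS plugin jar):
# from pyspark.sql import SparkSession
# spark = apply_rapids_conf(SparkSession.builder.appName("ab-test")).getOrCreate()
```

Unsupported operations fall back to the CPU automatically, which is what makes this a low-risk migration path for existing A/B-testing pipelines.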
The developer ecosystem scaling these tools has expanded quickly. Over 90,000 developers joined the joint NVIDIA and Google Cloud developer community within a year.
Startups like CodeRabbit and Factory apply NVIDIA Nemotron-based models on Google Cloud to execute code reviews and run autonomous software development agents. Aible, Mantis AI, Photoroom, and Baseten build enterprise data, video intelligence, and generative imagery solutions using the full-stack platform.
Together, NVIDIA and Google Cloud aim to provide a computing foundation designed to advance experimental agents and simulations into production systems that secure fleets and optimise factories in the physical world.
