
Huang’s most interesting offering at the show was the new Groq 3 LPX, a custom rack built for inference workloads.
Nvidia says the server, made up of 72 Vera Rubin chips and 256 Language Processing Units (LPUs) from the recently acquired Groq, can handle 700 million tokens per second, or 350 times the throughput of the Hopper platform, The Wall Street Journal reports.
“The inference inflection has arrived,” Huang said. “This is the secret sauce.”
Nvidia’s offerings have so far been geared toward powerful, and power-hungry, chips for model training. By adding an inference-optimized server, built for the workloads that come after training, the company is closing a gap previously filled by custom chips from the likes of Google, Meta, and Amazon.
The Groq 3 LPX should also lay the groundwork for “intelligent agentic swarms,” Nvidia says, by speeding up high-token-volume tasks at low latency thanks to an SRAM bandwidth of a whopping 150 TB/s.
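For context, here is a minimal back-of-the-envelope sketch of why that bandwidth figure is the lever being pulled: autoregressive decoding tends to be memory-bound, since each generated token requires streaming the model’s weights through the memory system, so the per-stream token rate is roughly bandwidth divided by bytes moved per token. The model size and precision below are hypothetical assumptions; only the 150 TB/s figure comes from Nvidia’s announcement.

```python
# Rough roofline estimate for memory-bound autoregressive decoding:
# tokens/s per stream is capped by bandwidth / bytes read per token.
# Only the 150 TB/s SRAM bandwidth is from the announcement; the
# model size and weight precision are illustrative assumptions.

SRAM_BANDWIDTH_BYTES_S = 150e12  # 150 TB/s, per Nvidia's stated figure

params = 70e9          # hypothetical 70B-parameter model (assumption)
bytes_per_param = 1    # 8-bit weights (assumption)
weight_bytes = params * bytes_per_param

# Upper bound on decode steps per second for a single stream,
# ignoring KV-cache traffic, interconnect, and compute limits.
decode_steps_per_s = SRAM_BANDWIDTH_BYTES_S / weight_bytes
print(f"~{decode_steps_per_s:,.0f} tokens/s per stream")  # ~2,143
```

Under these assumptions a single unit sustains on the order of a couple of thousand tokens per second per stream, which is why low-latency, high-token-volume agent workloads are the pitch: the headline 700 million tokens per second would come from running many such streams in parallel across the rack.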
Read more: Nvidia (Groq 3 LPX), The Wall Street Journal, CNBC, and Tom’s Hardware.