In recent years, the focus of AI has shifted from training to inference, and NVIDIA is aiming to reshape this space with the LPU (Language Processing Unit) chips it announced at last week's GTC conference.
During the event, NVIDIA’s Chief Scientist Bill Dally sat down with Google’s Chief Scientist Jeff Dean for a deep technical discussion. Dally highlighted that the real bottleneck in AI inference today isn’t raw compute power—it’s communication overhead.
- On-chip communication: Current designs suffer delays of several hundred nanoseconds when signals travel across the chip. NVIDIA's new approach uses static scheduling for on-chip communication, eliminating routing, queuing, and arbitration. This could cut latency to around 30 nanoseconds, close to the lower bound set by how fast signals can physically propagate across the die (see the latency sketch after this list).
- Off-chip communication: Link speeds have been pushed to 400 Gbps and even 800 Gbps, but those rates require heavy signal processing and error correction that add latency. Dally suggested that dropping back to 200 Gbps simplifies the link dramatically, leaving only a serialization delay of a few clock cycles (see the serialization sketch after this list).
- Performance vision: Dally expressed confidence that future AI inference could reach 10,000–20,000 tokens per second per user. For context, today’s large language models typically deliver 60–100 tokens per second, with 100 tokens/sec already considered fast.
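
To make the on-chip gap concrete, here is a minimal Python sketch of the two schemes. The hop count and per-hop delays are assumptions picked only to land in the ballpark Dally described (a few hundred nanoseconds with dynamic routing, roughly 30 ns once only wire delay remains); they do not describe any actual NVIDIA design.

```python
# Toy model of on-chip communication latency: a dynamically routed network-on-chip
# versus a statically scheduled path. All numbers are illustrative assumptions
# chosen only to reproduce the orders of magnitude quoted above.

HOPS = 20             # assumed number of router hops to cross the die
ROUTING_NS = 2.0      # assumed per-hop route computation
ARBITRATION_NS = 3.0  # assumed per-hop arbitration for the output port
QUEUING_NS = 10.0     # assumed average per-hop queuing delay under load
LINK_NS = 1.5         # assumed wire/pipeline delay per hop


def dynamic_noc_latency_ns(hops: int) -> float:
    """Every hop pays routing, arbitration, queuing, and wire delay."""
    return hops * (ROUTING_NS + ARBITRATION_NS + QUEUING_NS + LINK_NS)


def static_schedule_latency_ns(hops: int) -> float:
    """With a compile-time schedule, only the wire/pipeline delay remains."""
    return hops * LINK_NS


if __name__ == "__main__":
    print(f"dynamically routed  : {dynamic_noc_latency_ns(HOPS):.0f} ns")    # ~330 ns
    print(f"statically scheduled: {static_schedule_latency_ns(HOPS):.0f} ns")  # ~30 ns
```

At these assumed numbers, per-hop queuing and arbitration dominate the total; removing them by fixing the schedule at compile time is what collapses hundreds of nanoseconds down to tens.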
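A similar back-of-the-envelope sketch shows why the slower link can be the lower-latency one. The 32-byte packet size and the DSP/FEC latency figures are purely assumed for illustration; only the 200/400/800 Gbps rates come from the discussion.

```python
# Back-of-the-envelope link latency at different line rates. The packet size and
# the DSP/FEC latency figures are illustrative assumptions, not vendor numbers.

PACKET_BITS = 32 * 8  # assumed 32-byte packet

# Assumed extra latency from the heavy signal processing (equalization, forward
# error correction) that the higher line rates require.
ASSUMED_DSP_FEC_NS = {200: 0.0, 400: 60.0, 800: 100.0}


def serialization_ns(bits: int, gbit_per_s: float) -> float:
    """Time to clock the bits onto the wire: 1 Gbit/s is exactly 1 bit/ns."""
    return bits / gbit_per_s


if __name__ == "__main__":
    for rate in (200, 400, 800):
        ser = serialization_ns(PACKET_BITS, rate)
        total = ser + ASSUMED_DSP_FEC_NS[rate]
        print(f"{rate} Gbps: serialize {ser:5.2f} ns + DSP/FEC "
              f"{ASSUMED_DSP_FEC_NS[rate]:5.1f} ns = {total:6.2f} ns")
```

Under these assumptions the 200 Gbps link pays only its serialization time, while the faster links trade a shorter serialization window for a much larger signal-processing overhead.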
Such a leap would be a massive acceleration in AI responsiveness, making real-time, high-throughput inference practical for everyday use; the rough arithmetic below shows what it would mean for a single answer.
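
Assuming a 500-token answer (an arbitrary illustrative length; the rates are the figures quoted above), the wait per response shrinks from several seconds to a few hundredths of a second:

```python
# Rough arithmetic: how long a user waits for one answer at different generation
# rates. The 500-token answer length is an illustrative assumption.

ANSWER_TOKENS = 500

RATES = [
    ("today, typical", 80),
    ("today, fast", 100),
    ("Dally's projection, low end", 10_000),
    ("Dally's projection, high end", 20_000),
]

for label, tokens_per_sec in RATES:
    seconds = ANSWER_TOKENS / tokens_per_sec
    print(f"{label:28s}: {tokens_per_sec:>6,} tok/s -> {seconds:6.3f} s per answer")
```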