The world of artificial intelligence is accelerating at an unprecedented pace. Recently, industry giants like Nvidia, Oracle, Google, and Dell showcased groundbreaking advancements in AI training performance. They reported the time their systems took to train key neural networks, revealing significant leaps in speed and efficiency.
Benchmarking the Future of AI
The latest round of benchmark tests, MLPerf Training v4.1, encompasses six tasks:
- Recommendation Systems
- Pre-training of Large Language Models (GPT-3 and BERT-large)
- Fine-tuning of Llama 2 70B
- Object Detection
- Graph Node Classification
- Image Generation
These tasks reflect evolving priorities in AI, especially with the surge in generative AI applications. Notably, training a model like GPT-3 is such a colossal undertaking that the benchmark measures training to an agreed-upon checkpoint rather than to full convergence. For Llama 2 70B, the focus is instead on fine-tuning an existing pre-trained model so that it specializes in a domain such as government documents.
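To give a concrete sense of what the fine-tuning task involves, here is a minimal sketch of parameter-efficient (LoRA) fine-tuning using Hugging Face Transformers and PEFT. The model identifier, adapter settings, and dataset are illustrative assumptions, not the MLPerf reference implementation.

```python
# Minimal LoRA fine-tuning sketch (illustrative; not the MLPerf reference code).
# Assumes access to the Llama 2 70B weights and a corpus of government documents.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_model = "meta-llama/Llama-2-70b-hf"  # assumed model identifier
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(
    base_model, torch_dtype=torch.bfloat16, device_map="auto"
)

# LoRA trains small low-rank adapter matrices instead of all 70B parameters.
lora_config = LoraConfig(
    r=16,                                  # adapter rank (illustrative)
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of weights are trainable

# A training loop over tokenized government documents would go here,
# e.g. with transformers.Trainer or a custom optimizer loop.
```

Because only the small adapter matrices are updated, this style of fine-tuning is far cheaper than pre-training, which is part of why the benchmark can be completed in minutes on large systems.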
Nvidia’s Blackwell Architecture: The B200 GPU
Nvidia’s new B200 GPU, based on the Blackwell architecture, made a remarkable debut. It doubled the performance of its predecessor, the H100, in tasks like GPT-3 training and LLM fine-tuning. Even in recommendation systems and image generation, it achieved performance gains of 64% and 62%, respectively.
Google’s Trillium: The 6th Generation TPU
Google submitted results for its 6th-generation TPU, known as Trillium, which it unveiled just last month. Compared with the previous v5p variant, Trillium delivered up to a 3.8-fold performance boost on the GPT-3 training task.
However, when stacked against Nvidia’s offerings, the competition tightens. A system of 6,144 TPU v5ps trained GPT-3 to the checkpoint in 11.77 minutes, while an 11,616-GPU Nvidia H100 system did it in 3.44 minutes. Interestingly, Google’s Trillium systems were paired with AMD Epyc CPUs instead of the Intel Xeons used with the v5p.
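One rough way to compare these two results across very different system sizes is to multiply chip count by training time. This is only a coarse proxy, since training time does not scale linearly with chip count; the figures below are simply the ones quoted above.

```python
# Rough per-chip normalization of the GPT-3 results quoted above.
# Caveat: scaling is not linear, so chip-minutes is only a coarse
# efficiency proxy, not an official MLPerf metric.
results = {
    "Google TPU v5p": {"chips": 6_144, "minutes": 11.77},
    "Nvidia H100": {"chips": 11_616, "minutes": 3.44},
}

for name, r in results.items():
    chip_minutes = r["chips"] * r["minutes"]
    print(f"{name}: {chip_minutes:,.0f} chip-minutes to checkpoint")

# Expected output:
# Google TPU v5p: 72,315 chip-minutes to checkpoint
# Nvidia H100: 39,959 chip-minutes to checkpoint
```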
In the image-generation task using Stable Diffusion, a model with 2.6 billion parameters, Google’s 1,024-TPU system completed training in 2 minutes and 26 seconds, coming in about a minute behind a similarly sized Nvidia H100 system.
The Energy Equation
Training AI models is energy-intensive. While the MLPerf benchmarks are beginning to measure power consumption, transparency is still limited. Dell Technologies was the sole participant to report energy usage: its system consumed 16.4 megajoules over a roughly 5-minute LLM fine-tuning run, which works out to an average power draw of about 55 kilowatts and roughly 75 cents’ worth of electricity at average U.S. rates.
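The power and cost figures follow directly from the reported energy and runtime; the sketch below redoes the arithmetic, with the electricity rate (about 16.5 cents per kilowatt-hour) as an assumed average U.S. price.

```python
# Back-of-the-envelope check on Dell's reported energy figure.
# The electricity rate is an assumption (roughly the average U.S. rate).
energy_joules = 16.4e6   # 16.4 megajoules, as reported
duration_s = 5 * 60      # about 5 minutes of fine-tuning

avg_power_kw = energy_joules / duration_s / 1_000
energy_kwh = energy_joules / 3.6e6     # 1 kWh = 3.6 MJ
cost_usd = energy_kwh * 0.165          # assumed ~16.5 cents per kWh

print(f"Average power: {avg_power_kw:.1f} kW")  # ~54.7 kW
print(f"Energy: {energy_kwh:.2f} kWh")          # ~4.56 kWh
print(f"Cost: ${cost_usd:.2f}")                 # ~$0.75
```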
Understanding energy consumption is crucial for assessing the true cost and sustainability of AI advancements. As models grow larger and more complex, efficient energy use becomes as important as raw performance.