On today's computing platforms, large AI models can take months to train, a pace too slow for business needs.
AI, HPC, and data analytics are becoming increasingly complex as some models, such as large language models, reach trillions of parameters.
The NVIDIA Hopper architecture was built from the ground up to accelerate these next-generation AI workloads with massive computing power and fast memory to handle ever-growing networks and datasets.
The Transformer Engine, part of the new Hopper architecture, will significantly improve AI performance and capabilities, and help train large models in days or hours.
Train AI Models with Transformer Engine
Transformer models are the backbone of today's widely used language models, such as BERT and GPT-3. Originally developed for natural language processing, they are now increasingly used in computer vision, drug discovery, and more because of their versatility.
At the same time, model sizes continue to grow exponentially, now reaching trillions of parameters. The enormous amount of computation required stretches training times to several months, which cannot meet business requirements.
The Transformer Engine combines 16-bit floating-point precision with a new 8-bit floating-point data format, plus advanced software algorithms, to further boost AI performance and capability.
AI training relies on floating-point numbers, which can represent fractional values such as 3.14. The TensorFloat32 (TF32) floating-point format was introduced with the NVIDIA Ampere architecture and is now the default 32-bit format in the TensorFlow and PyTorch frameworks.
Most AI floating-point arithmetic uses 16-bit "half" precision (FP16) and 32-bit "single" precision (FP32), with 64-bit "double" precision (FP64) reserved for specialized operations. The Transformer Engine shrinks operations down to 8 bits, enabling larger networks to be trained faster.
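To make the formats concrete, here is a minimal Python sketch of rounding a value to the FP8 E4M3 layout (4 exponent bits, 3 mantissa bits), one of the two FP8 layouts Hopper supports. The `quantize_e4m3` helper is hypothetical, written only to illustrate the coarseness of 8-bit floats, not NVIDIA's hardware cast:

```python
import math

FP8_E4M3_MAX = 448.0  # largest normal value representable in E4M3

def quantize_e4m3(x: float) -> float:
    """Round x to the nearest FP8 E4M3 value (4 exponent bits,
    3 mantissa bits). Illustrative sketch only."""
    if x == 0.0:
        return 0.0
    sign = math.copysign(1.0, x)
    mag = min(abs(x), FP8_E4M3_MAX)            # clamp to the FP8 range
    exp = max(math.floor(math.log2(mag)), -6)  # -6 is the smallest normal exponent
    step = 2.0 ** (exp - 3)                    # value spacing given 3 mantissa bits
    return sign * round(mag / step) * step

print(quantize_e4m3(3.14))  # 3.25: only about 2 significant digits survive
```

The coarse spacing is the trade-off: FP8 halves memory traffic relative to FP16, but each power-of-two interval holds only eight representable steps.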
When combined with other new features of the Hopper architecture, such as the NVLink Switch system that provides a direct high-speed interconnect between nodes, clusters of H100-accelerated servers can train enormous networks that previously could barely be trained at the speeds enterprises require.
A Closer Look at the Transformer Engine
The Transformer Engine uses software and custom NVIDIA Hopper Tensor Core technology designed to accelerate training for models built from the common AI building block known as the Transformer. These Tensor Cores can apply mixed FP8 and FP16 precision to dramatically accelerate AI calculations for Transformer models; FP8 Tensor Core operations deliver twice the throughput of 16-bit operations.
The challenge is to intelligently manage precision: maintaining accuracy while gaining the performance of smaller, faster numerical formats. The Transformer Engine addresses this with custom, NVIDIA-tuned heuristics that dynamically choose between FP8 and FP16 calculations and automatically handle the re-casting and scaling between these precisions in each layer.
The Transformer Engine uses per-layer statistical analysis to determine the optimal precision (FP16 or FP8) for each layer of the model, achieving the best performance while preserving model accuracy.
Building on fourth-generation Tensor Cores, the NVIDIA Hopper architecture also triples the floating-point operations per second of the previous generation at TF32, FP64, FP16, and INT8 precisions. Combined with the Transformer Engine and fourth-generation NVLink, Hopper Tensor Cores enable an order-of-magnitude speedup for HPC and AI workloads.
Accelerating Training with the Transformer Engine
Much of the cutting-edge work in AI revolves around large language models like Megatron 530B. The chart below shows the trend in model size growth over recent years, which is widely expected to continue. Many researchers are already working on multi-trillion-parameter models for natural language understanding and other applications, suggesting an unabated demand for AI computing power.
Natural language understanding models are still growing rapidly.
To meet the demands of these growing models, high computing power and large amounts of high-speed memory are essential. The combination of the NVIDIA H100 Tensor Core GPU and the acceleration achieved by the Transformer engine can take AI training to the next level.
Through these innovations, throughput increases by a factor of 9, reducing training time from 7 days to just 20 hours:
Compared to the previous generation, NVIDIA H100 Tensor Core GPUs provide 9x the training throughput, enabling training of large models in a reasonable amount of time.
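As a back-of-envelope check on those figures, the quoted drop from 7 days to 20 hours corresponds to roughly an 8.4x end-to-end reduction, consistent with the roughly 9x throughput gain:

```python
baseline_hours = 7 * 24   # previous-generation training time: 7 days
h100_hours = 20           # quoted H100 training time
speedup = baseline_hours / h100_hours
print(speedup)  # 8.4
```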
The Transformer Engine can also be used for inference without any data-format conversion. Previously, INT8 was the precision of choice for optimal inference performance, but it requires converting the trained network to INT8 as part of the optimization process, something the NVIDIA TensorRT inference optimizer makes easy.
When using models trained with FP8 precision, developers can skip this conversion step entirely and run inference at the same precision. As with INT8-formatted networks, deployments using the Transformer Engine run with a smaller memory footprint.
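The footprint saving is easy to estimate: FP8 stores one byte per value versus two for FP16, so weight memory alone is halved. A rough sketch using the 530-billion-parameter Megatron model mentioned above (weights only; activations and other runtime state are ignored):

```python
params = 530e9                      # Megatron 530B parameter count
fp16_weight_gb = params * 2 / 1e9   # 2 bytes per FP16 weight
fp8_weight_gb = params * 1 / 1e9    # 1 byte per FP8 weight
print(fp16_weight_gb, fp8_weight_gb)  # 1060.0 530.0
```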
On Megatron 530B, the NVIDIA H100 delivers 30x higher per-GPU inference throughput than the NVIDIA A100 with a 1-second response latency, showing it is an excellent platform for AI deployments:
For low-latency applications, the Transformer Engine also boosts inference throughput by 30x.