Keywords: LLM, Parallelization, GPUs, AI Development, Data Parallelism, Model Parallelism, Tensor Parallelism, Pipeline Parallelism, ZeRO, Mixture of Experts, Cloud Computing, NVIDIA H100.
Leveraging the full potential of your compute resources is the number one goal when developing AI-based software, especially when training large deep learning models. Major bottlenecks arise when the available hardware forces you to limit parameter counts and dataset sizes. Even with the greatest care, it is sometimes impossible to train or even run the desired neural network architecture on the hardware at hand. Take Stable Diffusion, for example, which has over a billion parameters: try to fine-tune it and you will find yourself limited by hardware very quickly. These challenges are even more pronounced when you need to complete tasks quickly. In both scenarios, it pays off to employ multiple GPUs that synchronize their work and share the model and optimizer states.
To distribute computational load and required memory efficiently, experts have developed several parallelization techniques. Here is a comprehensive overview of these approaches.
GPUs are designed to handle computations in parallel, which makes them ideal for training and running large-scale deep learning models, particularly language models. By the nature of deep learning architectures, some computations depend on each other and thus have to be performed sequentially, but others can run in parallel, which speeds up training at the cost of additional compute units, or cores. A common constraint, however, is the limited GPU memory available to store all the intermediate results of a single pass of data through the model. Updating the model’s parameters requires all of those intermediate results, and when memory becomes a constraint, distributing the memory load across multiple GPUs becomes necessary. Several techniques exist for completing training and inference within a target time, each trading off speed against memory usage. Let’s explore the different use cases and the parallelization approaches best suited to each scenario.
If memory is not a constraint, or if you have already addressed it with other parallelization techniques, the most straightforward way to accelerate training is data parallelism. This method runs the model on multiple GPUs or machines, each processing a different batch of data simultaneously. The only synchronization required is aggregating the computed gradients at the end of each step, which adds negligible computational load. Iterating through the same training data can be up to twice as fast with double the number of GPUs in use, depending on the batch size.
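As a rough illustration, here is a minimal sketch of data parallelism using PyTorch’s DistributedDataParallel. The model, dataset, and hyperparameters are placeholders, and the script assumes it is launched with torchrun with one process per GPU on a single node.

```python
# Minimal data parallelism sketch with PyTorch DistributedDataParallel (DDP).
# Launch with: torchrun --nproc_per_node=<num_gpus> train.py
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset, DistributedSampler

def main():
    dist.init_process_group("nccl")              # one process per GPU
    rank = dist.get_rank()                       # on a single node, rank == local GPU index
    torch.cuda.set_device(rank)

    model = torch.nn.Linear(512, 10).cuda(rank)  # placeholder model
    model = DDP(model, device_ids=[rank])        # wraps the model for gradient syncing

    dataset = TensorDataset(torch.randn(4096, 512), torch.randint(0, 10, (4096,)))
    sampler = DistributedSampler(dataset)        # each rank sees a different shard of the data
    loader = DataLoader(dataset, batch_size=64, sampler=sampler)

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
    loss_fn = torch.nn.CrossEntropyLoss()

    for epoch in range(3):
        sampler.set_epoch(epoch)                 # reshuffle shards each epoch
        for x, y in loader:
            x, y = x.cuda(rank), y.cuda(rank)
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()                      # DDP all-reduces the gradients here
            optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Each process computes gradients on its own shard of the data, and DDP averages them across GPUs during the backward pass, which is exactly the synchronization step described above.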
Does it speed up the training by the same amount? Generally not. Because gradients are computed in parallel and then aggregated, this technique is equivalent to increasing the effective batch size: you take fewer, larger optimization steps per epoch, so convergence does not necessarily scale linearly with the number of GPUs. It can still significantly speed up the training process, and if the goal is inference, this caveat does not apply at all.
Let’s say you are running Stable Diffusion on a single NVIDIA™ H100 node with an inference time of 2.6 seconds per batch. Adding a second H100 node lets you process two batches in those 2.6 seconds, effectively reducing the inference time to 1.3 seconds per batch. This improvement translates into faster service delivery, enabling you to scale your AI applications and grow your customer base without frustrating waiting times.
When memory becomes a constraint, model parallelism is the key. This technique distributes the model itself across multiple GPUs. It can be done at the level of individual layers (Tensor Parallelism), by assigning each GPU a set of layers (Pipeline Parallelism), or by combining both, which is effective for many architectures. The advantage of model parallelism is the distributed memory demand, which enables the training of large models that would otherwise be impossible to fit on a single GPU.
However, because the GPUs need to communicate their intermediate outputs to each other, this overhead significantly affects the runtime. In general, regardless of the specific method employed, communication and idle time grow with the number of GPUs involved, so the total compute time tends to be noticeably higher than a hypothetical single-GPU run would require. This comparison is purely theoretical if the model does not fit into a single GPU’s memory in the first place, but for developers looking to maximize cost-efficiency, the overhead is an essential factor to consider.
Pipeline parallelism is the simplest form of model parallelism. Each sequential part of the model (e.g., each layer or group of layers) is loaded onto a separate GPU. Running the model means instructing the GPUs, one after another, to execute their part and pass the output on to the next GPU. This distributes the model across many GPUs and is straightforward to implement in frameworks like PyTorch and TensorFlow. Enhancements of this technique exist that increase the efficiency of running the model, for example splitting each batch into micro-batches so that several GPUs can work at the same time. The required memory per GPU is divided roughly by the number of GPUs in use. Let’s say you want to run Llama 3 70B on your hardware and it needs about 40 GB of GPU memory. That’s more than your average high-performance GPU offers. Using pipeline parallelism across two GPUs reduces that to roughly 20 GB per GPU at the cost of some execution speed. In practice, the exact numbers depend on a multitude of factors, but this basic calculation gives a good estimate.
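The following is a minimal sketch of this idea in PyTorch, assuming a machine with two GPUs: the first half of a toy model lives on cuda:0, the second half on cuda:1, and the activations are handed over between the two. The layer sizes are arbitrary placeholders.

```python
# Naive pipeline (model) parallelism: layers are placed on different GPUs
# and the activations are moved from one device to the next.
import torch
import torch.nn as nn

class TwoStageModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Stage 0 on the first GPU, stage 1 on the second GPU.
        self.stage0 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
        self.stage1 = nn.Sequential(nn.Linear(4096, 1024), nn.ReLU()).to("cuda:1")

    def forward(self, x):
        x = self.stage0(x.to("cuda:0"))   # compute the first half on GPU 0
        x = self.stage1(x.to("cuda:1"))   # hand the activations to GPU 1 and continue there
        return x

model = TwoStageModel()
out = model(torch.randn(32, 1024))
print(out.device)                          # cuda:1
```

In this naive form only one GPU is busy at a time; practical pipeline implementations split each batch into micro-batches so that the stages can overlap.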
Tensor parallelism further refines model parallelism by splitting computations at the individual layer level. Different parts of the output are computed by different GPUs and then concatenated. This technique, although conceptually different from pipeline parallelism, is similar in implementation complexity and effect.
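For comparison, here is a toy sketch of splitting a single linear layer column-wise across two GPUs (again assuming a two-GPU machine): each device computes one half of the output features, and the halves are concatenated at the end.

```python
# Toy tensor parallelism: one weight matrix split column-wise across two GPUs.
import torch

in_features, out_features, batch = 1024, 4096, 32
x = torch.randn(batch, in_features)

# Full weight matrix, then one half of the output columns per GPU.
w = torch.randn(out_features, in_features)
w0 = w[: out_features // 2].to("cuda:0")
w1 = w[out_features // 2 :].to("cuda:1")

y0 = x.to("cuda:0") @ w0.T                     # first half of the output features on GPU 0
y1 = x.to("cuda:1") @ w1.T                     # second half of the output features on GPU 1

y = torch.cat([y0, y1.to("cuda:0")], dim=1)    # gather the pieces on one device
assert y.shape == (batch, out_features)
```

In a real transformer layer, this gathering step (a concatenation or a summation, depending on how the weights are split) is implemented with collective communication such as all-gather or all-reduce.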
Depending on the model at hand, pipeline stages, layers, or components can be distributed in various ways. This distribution requires knowledge of the model's exact architecture and a redistribution of the computational load. For many architectures, existing techniques can either redistribute the training effort or slightly restructure the model to achieve higher throughput and a lower memory footprint.
In 2020, researchers at Microsoft developed ZeRO, which combines data parallelism with a set of memory-saving techniques to enable the training of large models on existing hardware. Its main mechanism is partitioning the optimizer states—one of the primary consumers of memory during training—across the data-parallel GPUs, so that each GPU stores and updates only its own shard instead of a full copy. This reduces the active memory consumption significantly. Along with multiple other, less impactful but still beneficial improvements, ZeRO provides an easy-to-adopt and effective way to scale up models to impressive dimensions. In their 2020 paper, the authors demonstrated roughly tenfold increases in training throughput over state-of-the-art methods and argued that the approach could scale toward trillion-parameter models.
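As one concrete example of this idea, PyTorch ships a ZeroRedundancyOptimizer that shards optimizer states across data-parallel workers in the spirit of ZeRO stage 1. The sketch below assumes the same torchrun launch as the data parallelism example above; the model is again a placeholder.

```python
# Optimizer-state sharding in the spirit of ZeRO stage 1,
# using PyTorch's ZeroRedundancyOptimizer on top of DDP.
import torch
import torch.distributed as dist
from torch.distributed.optim import ZeroRedundancyOptimizer
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank)

model = DDP(torch.nn.Linear(2048, 2048).cuda(rank), device_ids=[rank])

# Each rank keeps the Adam moments only for its own shard of the parameters
# instead of a full copy, cutting optimizer memory roughly by the world size.
optimizer = ZeroRedundancyOptimizer(
    model.parameters(),
    optimizer_class=torch.optim.Adam,
    lr=1e-3,
)

x = torch.randn(64, 2048).cuda(rank)
loss = model(x).pow(2).mean()   # dummy loss for illustration
loss.backward()
optimizer.step()
dist.destroy_process_group()
```

The full ZeRO stages 2 and 3, which additionally shard gradients and parameters, are available through libraries such as DeepSpeed and PyTorch's fully sharded data parallel (FSDP) implementation.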
Based on the foundational idea published in the 1991 paper “Adaptive Mixtures of Local Experts,” the past decade has seen a surge of advancements built on this concept. In 2023, the company Mistral AI released its “Mixtral” models, which have since emerged as serious competitors to state-of-the-art large language models (LLMs). The Apache 2.0 licensed Mixtral 8x7B, for instance, outperforms Llama 2 70B on most benchmarks with significantly lower compute requirements. Mixture of Experts (MoE) models route incoming tokens to several so-called experts, meaning that each token is processed by only a part of the model. This increases overall throughput and facilitates parallelization, because the experts can be distributed across multiple GPUs.
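To make the routing concrete, here is a toy Mixture-of-Experts layer with top-1 gating, written in plain PyTorch. Real MoE models such as Mixtral use top-2 routing with weighted combinations and place the experts on different devices; the sizes and expert count here are purely illustrative.

```python
# Toy Mixture-of-Experts layer: a gating network routes each token to one
# expert MLP, so each token only activates a fraction of the parameters.
import torch
import torch.nn as nn

class TopOneMoE(nn.Module):
    def __init__(self, d_model=512, d_hidden=2048, num_experts=4):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts)      # scores one expert per token
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):                                # x: (num_tokens, d_model)
        scores = self.gate(x)                            # (num_tokens, num_experts)
        expert_idx = scores.argmax(dim=-1)               # pick the top-1 expert per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = expert_idx == i
            if mask.any():
                out[mask] = expert(x[mask])              # only routed tokens hit this expert
        return out

moe = TopOneMoE()
tokens = torch.randn(128, 512)
print(moe(tokens).shape)                                 # torch.Size([128, 512])
```

Because each expert only sees the tokens routed to it, the experts can be placed on different GPUs and run in parallel, which is what makes this architecture comparatively easy to scale out.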
Take a hypothetical 16-expert model that would normally require a cluster of 16 NVIDIA™ H100 GPUs. By placing each expert on its own GPU, inference throughput could increase roughly eightfold: if you currently process 10,000 tokens a day, that number could jump to 80,000. These efficiency gains also extend to training such models, although the calculations there are more complex and precise results require experimentation.
You have now seen the most important tools to leverage the full potential of your compute resources. Depending on your specific bottleneck, you can reduce memory usage or time consumption by utilizing techniques based on model and data parallelism. By effectively applying these strategies, you can ensure shorter development cycles, lower production costs, and expand your company’s ability to deploy larger models in your AI applications.
The Genesis Cloud team
Never miss out again on Genesis Cloud news and our special deals: follow us on Twitter, LinkedIn, or Reddit.
Sign up for an account with Genesis Cloud here. If you want to find out more, please write to contact@genesiscloud.com.