This post is part of a series about distributed AI across multiple GPUs:
- Part 1: Understanding the Host and Device Paradigm (this article)
- Part 2: Point-to-Point and Collective Operations (coming soon)
- Part 3: How GPUs Communicate (coming soon)
- Part 4: Gradient Accumulation & Distributed Data Parallelism (DDP) (coming soon)
- Part 5: ZeRO (coming soon)
- Part 6: Tensor Parallelism (coming soon)
Introduction
This guide explains the foundational concepts of how a CPU and a discrete graphics card (GPU) work together. It is a high-level introduction designed to help you build a mental model of the host-device paradigm. We'll focus specifically on NVIDIA GPUs, which are the most commonly used for AI workloads.
For integrated GPUs, such as those found in Apple Silicon chips, the architecture is slightly different, and it won't be covered in this post.
The Big Picture: The Host and The Device
The most important concept to grasp is the relationship between the Host and the Device.
- The Host: This is your CPU. It runs the operating system and executes your Python script line by line. The Host is the commander; it is in charge of the overall logic and tells the Device what to do.
- The Device: This is your GPU. It is a powerful but specialized coprocessor designed for massively parallel computations. The Device is the accelerator; it doesn't do anything until the Host gives it a job.
Your program always starts on the CPU. When you want the GPU to perform a task, like multiplying two large matrices, the CPU sends the instructions and the data over to the GPU.
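For example, a minimal PyTorch sketch of this flow (assuming a CUDA-capable GPU is available) might look like this:
import torch

device = torch.device('cuda')               # the Device (GPU)

# The Host enqueues these commands; the data is allocated directly in GPU VRAM
a = torch.randn(4096, 4096, device=device)
b = torch.randn(4096, 4096, device=device)

# The Host asks the Device to multiply the two matrices
c = a @ b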
The CPU-GPU Interaction
The Host talks to the Device through a queuing system.
- CPU Initiates Commands: Your script, running on the CPU, encounters a line of code intended for the GPU (e.g., tensor.to('cuda')).
- Commands are Queued: The CPU doesn't wait. It simply places this command onto a special to-do list for the GPU called a CUDA Stream (more on this in the next section).
- Asynchronous Execution: The CPU doesn't wait for the actual operation to be completed by the GPU; the host moves on to the next line of your script. This is called asynchronous execution, and it is key to achieving high performance. While the GPU is busy crunching numbers, the CPU can work on other tasks, like preparing the next batch of data (see the sketch after this list).
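You can observe this asynchrony yourself with a rough, non-rigorous timing sketch (the matrix sizes are arbitrary): the multiplication call returns almost immediately, and the CPU only waits when we explicitly ask it to.
import time
import torch

device = torch.device('cuda')
a = torch.randn(8192, 8192, device=device)
b = torch.randn(8192, 8192, device=device)

start = time.time()
c = a @ b                                    # enqueued on the GPU; the CPU moves on
print(f"enqueue took {time.time() - start:.4f} s")   # tiny: the CPU did not wait

torch.cuda.synchronize()                     # block the CPU until the GPU finishes
print(f"total took {time.time() - start:.4f} s")     # includes the actual compute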
CUDA Streams
A CUDA Stream is an ordered queue of GPU operations. Operations submitted to a single stream execute in order, one after another. However, operations across different streams can execute concurrently; the GPU can juggle multiple independent workloads at the same time.
By default, every PyTorch GPU operation is enqueued on the currently active stream (usually the default stream, which is created automatically). This is simple and predictable: every operation waits for the previous one to finish before starting. For most code, you never notice this. But it leaves performance on the table when you have work that could overlap.
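If you're curious, a quick sketch to see this: you can inspect which stream PyTorch is currently enqueuing work on, and in ordinary code it is the automatically created default stream.
import torch

stream = torch.cuda.current_stream()          # the stream operations are enqueued on
print(stream)
print(stream == torch.cuda.default_stream())  # True unless you switched streams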
Multiple Streams: Concurrency
The classic use case for multiple streams is overlapping computation with data transfers. While the GPU processes batch N, you can simultaneously copy batch N+1 from CPU RAM to GPU VRAM:
Stream 0 (compute): [process batch 0]────[process batch 1]───
Stream 1 (data):    ────[copy batch 1]────[copy batch 2]───
This pipeline is possible because compute and data transfer happen on separate hardware units inside the GPU, enabling true parallelism. In PyTorch, you create streams and schedule work onto them with context managers:
compute_stream = torch.cuda.Stream()
transfer_stream = torch.cuda.Stream()
with torch.cuda.stream(transfer_stream):
    # Enqueue the transfer on transfer_stream
    next_batch = next_batch_cpu.to('cuda', non_blocking=True)
with torch.cuda.stream(compute_stream):
    # This runs concurrently with the transfer above
    output = model(current_batch)
Note the non_blocking=True flag on .to(). Without it, the transfer would still block the CPU thread even when you intend it to run asynchronously.
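One related detail worth adding (not shown in the snippet above): for the copy to truly overlap with compute, the source CPU tensor generally needs to live in pinned (page-locked) host memory, which you can request with .pin_memory():
# Pin the host tensor so the async copy doesn't need an extra staging step
next_batch_cpu = next_batch_cpu.pin_memory()

with torch.cuda.stream(transfer_stream):
    next_batch = next_batch_cpu.to('cuda', non_blocking=True)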
Synchronization Between Streams
Since streams are independent, you need to explicitly signal when one depends on another. The blunt instrument is:
torch.cuda.synchronize()  # waits for ALL streams on the device to finish
A more surgical approach uses CUDA Events. An event marks a specific point in a stream, and another stream can wait on it without halting the CPU thread:
event = torch.cuda.Event()
with torch.cuda.stream(transfer_stream):
    next_batch = next_batch_cpu.to('cuda', non_blocking=True)
    event.record()  # mark: the transfer is done
with torch.cuda.stream(compute_stream):
    compute_stream.wait_event(event)  # don't start until the transfer completes
    output = model(next_batch)
This is more efficient than stream.synchronize() because it only stalls the dependent stream on the GPU side; the CPU thread stays free to keep queuing work.
For day-to-day PyTorch training code you won't need to manage streams manually. But features like DataLoader(pin_memory=True) and prefetching rely heavily on this mechanism under the hood. Understanding streams helps you recognize why these settings exist and gives you the tools to diagnose subtle performance bottlenecks when they appear.
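As a hypothetical illustration (the dataset name below is made up), a training loop that benefits from this mechanism might configure its DataLoader along these lines:
from torch.utils.data import DataLoader

loader = DataLoader(
    train_dataset,        # assumed: any map-style Dataset you already have
    batch_size=32,
    num_workers=4,        # CPU workers prepare upcoming batches in parallel
    pin_memory=True,      # batches land in page-locked RAM, enabling fast async copies
)

for inputs, targets in loader:
    inputs = inputs.to('cuda', non_blocking=True)    # overlaps with ongoing GPU work
    targets = targets.to('cuda', non_blocking=True)
    # ... forward/backward pass on the GPU ...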
PyTorch Tensors
PyTorch is a powerful framework that abstracts away many details, but this abstraction can sometimes obscure what is happening under the hood.
When you create a PyTorch tensor, it has two parts: metadata (like its shape and data type) and the actual numerical data. So when you run something like t = torch.randn(100, 100, device=device), the tensor's metadata is stored in the host's RAM, while its data is stored in the GPU's VRAM.
This distinction matters. When you run print(t.shape), the CPU can access this information immediately because the metadata is already in its own RAM. But what happens if you run print(t), which requires the actual data residing in VRAM?
Host-Device Synchronization
Accessing GPU data from the CPU can trigger a Host-Device synchronization, a common performance bottleneck. This happens whenever the CPU needs a result from the GPU that isn't yet available in the CPU's RAM.
For example, consider the line print(gpu_tensor), which prints a tensor that is still being computed by the GPU. The CPU can't print the tensor's values until the GPU has finished all the calculations needed to produce the final result. When the script reaches this line, the CPU is forced to block, i.e. it stops and waits for the GPU to finish. Only after the GPU completes its work and copies the data from its VRAM to the CPU's RAM can the CPU proceed.
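Here is a minimal sketch of such implicit synchronization points:
import torch

device = torch.device('cuda')
t = torch.randn(100, 100, device=device)   # metadata in host RAM, data in VRAM

print(t.shape)        # instant: the shape is already in the CPU's RAM
print(t)              # blocks: the CPU waits for the GPU, then copies data from VRAM

loss = (t ** 2).mean()
value = loss.item()   # another common sync point: .item() pulls a scalar back to the CPU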
As another example, what is the difference between torch.randn(100, 100).to(device) and torch.randn(100, 100, device=device)? The first method is less efficient because it creates the data on the CPU and then transfers it to the GPU. The second method is more efficient because it creates the tensor directly on the GPU; the CPU only sends the creation command.
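Side by side, the two versions look like this; only the second keeps the data generation entirely on the GPU:
import torch

device = torch.device('cuda')

# Less efficient: generate the numbers in host RAM, then copy them to VRAM
t1 = torch.randn(100, 100).to(device)

# More efficient: the CPU only enqueues the command; the GPU fills VRAM directly
t2 = torch.randn(100, 100, device=device)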
These synchronization points can severely impact performance. Effective GPU programming involves minimizing them to ensure both the Host and Device stay as busy as possible. After all, you want your GPUs to go brrrrr.
Scaling Up: Distributed Computing and Ranks
Training large models, such as Large Language Models (LLMs), often requires more compute power than a single GPU can offer. Coordinating work across multiple GPUs brings you into the world of distributed computing.
In this context, a new and important concept emerges: the Rank.
- Each rank is a CPU process that gets assigned a single device (GPU) and a unique ID. When you launch a training script across two GPUs, you create two processes: one with rank=0 and another with rank=1.
This means you are launching two separate instances of your Python script. On a single machine with multiple GPUs (a single node), these processes run on the same CPU but remain independent, without sharing memory or state. Rank 0 commands its assigned GPU (cuda:0), while Rank 1 commands another GPU (cuda:1). Although both ranks run the same code, you can use a variable that holds the rank ID to assign different tasks to each GPU, like having each one process a different portion of the data (we'll see examples of this in the next blog post of this series).
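As a rough sketch of what this looks like in practice (assuming a single node with two GPUs and a launch command such as torchrun --nproc_per_node=2 train.py):
import torch
import torch.distributed as dist

# torchrun starts one process per GPU and sets RANK / WORLD_SIZE for each of them
dist.init_process_group(backend='nccl')

rank = dist.get_rank()                  # 0 for the first process, 1 for the second
device = torch.device(f'cuda:{rank}')   # on a single node, the rank matches the local GPU index
torch.cuda.set_device(device)

print(f"Rank {rank} of {dist.get_world_size()} is using {device}")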
Conclusion
Congratulations for reading all the way to the end! In this post, you learned about:
- The Host/Device relationship
- Asynchronous execution
- CUDA Streams and how they enable concurrent GPU work
- Host-Device synchronization
In the next blog post, we will dive deeper into Point-to-Point and Collective Operations, which enable multiple GPUs to coordinate complex workflows such as distributed neural network training.
