What is Distributed Training?
Distributed training allows you to train deep learning models across multiple GPUs and machines, reducing training time. PyTorch provides two widely used strategies for distributed training:
- DDP (DistributedDataParallel) - Data parallelism for general models
- FSDP (FullyShardedDataParallel) - Sharded data parallelism for large models
When to Use Each Strategy
Use DDP When:
- Your model fits in a single GPU’s memory
- You want simple, efficient data parallelism, where each GPU processes a different part of the batch (see the sketch below)
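A minimal DDP training script might look like the following. This is a sketch rather than a full recipe: the model, data, and hyperparameters are placeholders, and it assumes launch via `torchrun` (e.g. `torchrun --nproc_per_node=4 train.py`), which sets the `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` environment variables for each process.

```python
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun starts one process per GPU and sets the env vars read below.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = nn.Linear(128, 10).cuda(local_rank)  # placeholder model
    ddp_model = DDP(model, device_ids=[local_rank])

    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=1e-3)
    criterion = nn.CrossEntropyLoss()

    for _ in range(10):  # placeholder loop with random data
        inputs = torch.randn(32, 128, device=local_rank)
        labels = torch.randint(0, 10, (32,), device=local_rank)
        optimizer.zero_grad()
        loss = criterion(ddp_model(inputs), labels)
        loss.backward()  # DDP all-reduces gradients across GPUs here
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```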
Use FSDP When:
- Your model is too large for a single GPU
- You’re training very large models (GPT, LLaMA, large vision models)
- You want to maximize memory efficiency
- You need to scale to billions of parameters (a minimal setup sketch follows this list)
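A minimal FSDP setup might look like the sketch below. The stack of linear layers is a placeholder, and the size-based wrapping policy shown is one common choice rather than the only option; as with the DDP sketch, launch via `torchrun` is assumed.

```python
import functools
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = nn.Sequential(*[nn.Linear(1024, 1024) for _ in range(8)])  # placeholder

# Treat any submodule with more than ~100k parameters as its own shardable unit.
wrap_policy = functools.partial(size_based_auto_wrap_policy, min_num_params=100_000)

fsdp_model = FSDP(model, auto_wrap_policy=wrap_policy, device_id=local_rank)
optimizer = torch.optim.AdamW(fsdp_model.parameters(), lr=1e-4)
# Training then proceeds as usual: forward, loss, backward, optimizer.step().
```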
Quick Comparison
| Feature | DDP | FSDP |
|---|---|---|
| Model Size | Fits in single GPU | Can exceed single GPU memory |
| Memory Efficiency | Model replicated per GPU | Model sharded across GPUs |
| Speed | Fastest for small/medium models | Better for very large models |
| Gradient Sync | All-reduce, overlapped with backward | Reduce-scatter during backward |
Basic Architecture
DDP Architecture
- Each GPU holds a full copy of the model
- Each GPU processes a different slice of the data (see the sampler sketch below)
- Gradients are averaged across all GPUs
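The per-GPU data slicing is typically handled by `DistributedSampler`. A brief sketch, assuming the process group has already been initialized as in the earlier examples; the dataset here is random placeholder data:

```python
import torch
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

dataset = TensorDataset(torch.randn(1000, 128), torch.randint(0, 10, (1000,)))

# Each rank draws a disjoint 1/world_size portion of the dataset per epoch.
sampler = DistributedSampler(dataset)
loader = DataLoader(dataset, batch_size=32, sampler=sampler)

for epoch in range(3):
    sampler.set_epoch(epoch)  # reshuffles consistently across ranks each epoch
    for inputs, labels in loader:
        pass  # forward/backward as usual; DDP averages gradients across ranks
```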
FSDP Architecture
- Each GPU holds only a shard of the model (part of the parameters)
- Each GPU processes a different slice of the data
- Full parameters are gathered on demand from peer GPUs during the forward and backward passes, then freed; this trades extra GPU-to-GPU communication for a much smaller per-GPU memory footprint (a sketch of the sharding options follows)
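How aggressively FSDP shards is configurable. A brief sketch, again assuming an initialized process group; the model and device id are placeholders:

```python
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy

model = nn.Linear(1024, 1024)  # placeholder model

fsdp_model = FSDP(
    model,
    # FULL_SHARD shards parameters, gradients, and optimizer state (most
    # memory-efficient); SHARD_GRAD_OP keeps full parameters resident after
    # forward in exchange for fewer gathers.
    sharding_strategy=ShardingStrategy.FULL_SHARD,
    device_id=0,  # placeholder; use LOCAL_RANK in a real multi-process launch
)
```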
What You’ll Learn
DDP Training
Learn how to use DistributedDataParallel for efficient data parallelism
FSDP Training
Train large models with FullyShardedDataParallel
Data Loading
Set up DistributedSampler and data loaders correctly
Best Practices
Tips and tricks for optimal distributed training