What is Distributed Training?
Distributed training allows you to train deep learning models across multiple GPUs and machines, reducing training time. PyTorch provides two widely used strategies for distributed training:
- DDP (DistributedDataParallel) - Data parallelism for general models
- FSDP (FullyShardedDataParallel) - Sharded data parallelism for large models
When to Use Each Strategy
Use DDP When:
- Your model fits in a single GPU’s memory
- You want simple, efficient data parallelism, where each GPU processes a different part of the batch
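The sketch below shows a minimal DDP setup. It assumes a single node launched with `torchrun --nproc_per_node=<num_gpus> train.py`, and uses a toy `nn.Linear` in place of a real model.

```python
# Minimal DDP sketch. Assumes a single-node launch via:
#   torchrun --nproc_per_node=<num_gpus> train.py
# A toy nn.Linear stands in for a real model.
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Every process builds (and keeps) a full replica of the model.
    model = torch.nn.Linear(1024, 1024).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])

    # ... build the optimizer, dataloader, and training loop as usual ...

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```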
Use FSDP When:
- Your model is too large for a single GPU
- You’re training very large models (GPT, LLaMA, large vision models)
- You want to maximize memory efficiency
- You need to scale to billions of parameters
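A minimal FSDP sketch, under the same `torchrun` launch assumptions as the DDP example above. The size-based auto-wrap policy is just one common way to decide which submodules become their own shardable units.

```python
# Minimal FSDP sketch (same torchrun launch assumptions as the DDP example).
import functools
import os

import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy


def main():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Sequential(
        torch.nn.Linear(4096, 4096),
        torch.nn.ReLU(),
        torch.nn.Linear(4096, 4096),
    )

    # Parameters, gradients, and optimizer state are sharded across ranks;
    # each rank permanently stores only its own shard.
    model = FSDP(
        model,
        auto_wrap_policy=functools.partial(
            size_based_auto_wrap_policy, min_num_params=1_000_000
        ),
        device_id=local_rank,
    )

    # ... optimizer, dataloader, and training loop as usual ...

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```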
Quick Comparison
| Feature | DDP | FSDP |
|---|---|---|
| Model Size | Fits in single GPU | Can exceed single GPU memory |
| Memory per GPU | Full model replica | Shard of parameters, gradients, and optimizer state |
| Speed | Fastest for small/medium models | Better for very large models |
| Gradient Sync | All-reduce of full gradients (overlapped with backward) | Reduce-scatter of gradient shards during backward |
Basic Architecture
DDP Architecture
- Each GPU holds a full copy of the model
- Each GPU processes a different slice of the data
- Gradients are averaged across all GPUs during the backward pass
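The sketch below shows how data slicing and gradient averaging fit into a DDP training loop. It continues the DDP setup above (`model` and `local_rank` come from there); the random dataset, loss, and optimizer are purely illustrative.

```python
# How data slicing and gradient averaging fit into a DDP training step.
# Continues the DDP sketch above: `model` is the DDP-wrapped nn.Linear(1024, 1024)
# and `local_rank` is this process's GPU index.
import torch
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

dataset = TensorDataset(torch.randn(10_000, 1024), torch.randn(10_000, 1024))

# DistributedSampler hands each rank a disjoint slice of the dataset.
sampler = DistributedSampler(dataset)
loader = DataLoader(dataset, batch_size=32, sampler=sampler)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

for epoch in range(3):
    sampler.set_epoch(epoch)  # reshuffle so each epoch uses a different split
    for x, y in loader:
        x, y = x.cuda(local_rank), y.cuda(local_rank)
        loss = torch.nn.functional.mse_loss(model(x), y)
        optimizer.zero_grad()
        # DDP all-reduces (averages) gradient buckets across GPUs while
        # backward runs, so every replica ends up with identical gradients.
        loss.backward()
        optimizer.step()
```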
FSDP Architecture
- Each GPU holds only a shard of the model's parameters
- Each GPU processes a different slice of the data
- Full parameters are gathered on demand from the other GPUs during the forward and backward passes, then freed again; keeping the shards in GPU memory means they move over fast GPU-to-GPU links rather than being fetched from host RAM, which is much slower
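A small illustration of this on-demand gathering, continuing the FSDP sketch above (`model` is the FSDP-wrapped module). `summon_full_params` issues essentially the same all-gather that FSDP performs automatically inside forward and backward.

```python
# Outside forward/backward each rank stores only its shard of the parameters;
# summon_full_params temporarily materializes the full parameters on every rank.
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

rank = dist.get_rank()
world_size = dist.get_world_size()

sharded_numel = sum(p.numel() for p in model.parameters())
print(f"rank {rank}: ~{sharded_numel} parameter elements stored locally")

with FSDP.summon_full_params(model):
    full_numel = sum(p.numel() for p in model.parameters())
    print(
        f"rank {rank}: {full_numel} parameter elements while gathered "
        f"(roughly {world_size}x the sharded count)"
    )
```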