What is DDP?
DistributedDataParallel (DDP) is PyTorch’s recommended module for multi-GPU training. It works as follows:
- The model is replicated on each GPU
- Each process is fed a different part of the batch (e.g., with DistributedSampler)
- Gradients are computed independently on each GPU
- Gradients are averaged across all GPUs
- Model parameters are updated synchronously
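For intuition, the averaging step is equivalent to an all-reduce of each gradient followed by division by the world size. DDP does this for you (in buckets, overlapped with the backward pass); the sketch below only illustrates the math:

```python
import torch.distributed as dist

def average_gradients(model):
    # Conceptual illustration only; DDP performs this automatically
    # during loss.backward(), overlapping communication with compute.
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size
```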
When to Use DDP
✅ Use DDP for:
- Models that fit in a single GPU’s memory
❌ Don’t use DDP for:
- Models too large for a single GPU (use FSDP instead)
Basic Setup with SF Tensor
Here’s minimal code to use DDP with the convenience functions provided by the Python library:
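A minimal sketch, assuming torchrun-style environment variables; sft.get_device() is the convenience helper described below, and everything else is standard PyTorch:

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

import sft  # convenience library; only get_device() is used here

def main():
    # torchrun (run for you on the Tensor Cloud) sets RANK/LOCAL_RANK/WORLD_SIZE
    dist.init_process_group(backend="nccl")
    device = sft.get_device()          # this process's GPU
    torch.cuda.set_device(device)

    model = torch.nn.Linear(128, 10).to(device)   # placeholder model
    model = DDP(model, device_ids=[device])       # wrap for multi-GPU training

    # ... training loop (see the complete example below) ...

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```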
Complete Training Example
Lei Mao has a good full example of distributed training. Note that, when using the Tensor Cloud, we run the torchrun commands for you automatically, and you can use our convenience functions if you wish.
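Not a reproduction of that example, but a rough end-to-end skeleton with a toy dataset and model, so the moving parts (process group, DistributedSampler, set_epoch, DDP wrap) are visible in one place:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

def train():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])    # set by torchrun
    torch.cuda.set_device(local_rank)

    # Toy data and model; replace with your own
    dataset = TensorDataset(torch.randn(1024, 128), torch.randint(0, 10, (1024,)))
    sampler = DistributedSampler(dataset)         # gives each rank a distinct shard
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    model = DDP(torch.nn.Linear(128, 10).cuda(local_rank), device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = torch.nn.CrossEntropyLoss()

    for epoch in range(3):
        sampler.set_epoch(epoch)                  # reshuffle consistently across ranks
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()                       # DDP averages gradients across ranks here
            optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    train()
```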
Key DDP Parameters
When wrapping your model in DDP, you can configure several parameters:
Important Parameters
device_ids
Specifies which GPU(s) to use for this process. Usually a single device obtained from sft.get_device().
broadcast_buffers
Controls whether to synchronize buffers (like BatchNorm running statistics) during the forward pass.
- True (default): Buffers are synced at every forward pass
- False: Buffers are not synced (faster, but may affect BatchNorm)
find_unused_parameters
Set to True if some model parameters don’t receive gradients (e.g., in conditional architectures).
- False (default): Assumes all parameters receive gradients
- True: Handles unused parameters (slower)
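Put together, a typical wrap looks like this; the device value and model are placeholders (in practice the device comes from sft.get_device() or LOCAL_RANK):

```python
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

# Assumes the process group is already initialized
device = torch.device("cuda", 0)              # placeholder; use your process's GPU
model = torch.nn.Linear(128, 10).to(device)   # placeholder model

model = DDP(
    model,
    device_ids=[device],            # the single GPU this process drives
    broadcast_buffers=True,         # default: sync buffers (e.g., BatchNorm stats) each forward
    find_unused_parameters=False,   # default: assume every parameter receives a gradient
)
```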
Saving and Loading Models
When using DDP, the model is wrapped, so you need to access the underlying module:
Saving
Use model.module.state_dict() instead of model.state_dict() to save the underlying model without the DDP wrapper.
Loading
Load the checkpoint into the underlying module (model.module) on each process, mapping tensors to that process’s device.
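A minimal sketch covering both cases, assuming model is the DDP-wrapped module, local_rank is this process's GPU index, and checkpoint.pt is just an example path; saving from rank 0 only is a common convention:

```python
import torch
import torch.distributed as dist

CKPT = "checkpoint.pt"  # example path

# Saving: unwrap with .module so the keys have no "module." prefix
if dist.get_rank() == 0:
    torch.save(model.module.state_dict(), CKPT)
dist.barrier()  # make sure the checkpoint exists before other ranks read it

# Loading: map tensors to this process's GPU, then load into the underlying module
state = torch.load(CKPT, map_location=f"cuda:{local_rank}")
model.module.load_state_dict(state)
```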
Performance Tips
1. Use Gradient Accumulation
For larger effective batch sizes without increasing memory:
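A sketch using the placeholder names from the skeleton above (model, optimizer, loss_fn, loader, device) and an example accumulation factor of 4; model.no_sync() is DDP's context manager for skipping the gradient all-reduce on intermediate micro-batches:

```python
import contextlib

accum_steps = 4                                    # example accumulation factor
optimizer.zero_grad()
for step, (x, y) in enumerate(loader):
    x, y = x.to(device), y.to(device)
    is_update_step = (step + 1) % accum_steps == 0
    # Only synchronize gradients on the final micro-batch of each effective batch
    ctx = contextlib.nullcontext() if is_update_step else model.no_sync()
    with ctx:
        loss = loss_fn(model(x), y) / accum_steps  # scale so the accumulated grad is an average
        loss.backward()
    if is_update_step:
        optimizer.step()
        optimizer.zero_grad()
```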
2. Use Mixed Precision Training
Reduce memory usage and speed up training:
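A sketch with torch.cuda.amp, using the same placeholder names:

```python
from torch.cuda.amp import GradScaler, autocast

scaler = GradScaler()
for x, y in loader:
    x, y = x.to(device), y.to(device)
    optimizer.zero_grad()
    with autocast():                      # run the forward pass in mixed precision
        loss = loss_fn(model(x), y)
    scaler.scale(loss).backward()         # scale the loss to avoid fp16 gradient underflow
    scaler.step(optimizer)                # unscales gradients, then steps the optimizer
    scaler.update()
```
3. Pin Memory and Non-Blocking Transfer
Pinned (page-locked) host memory lets host-to-device copies run asynchronously, and non_blocking=True lets those copies overlap with computation. A sketch, again with placeholder names:

```python
from torch.utils.data import DataLoader

loader = DataLoader(dataset, batch_size=32, sampler=sampler,
                    pin_memory=True, num_workers=4)

for x, y in loader:
    x = x.to(device, non_blocking=True)   # async copy from pinned host memory
    y = y.to(device, non_blocking=True)
    # ... forward/backward as usual ...
```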
Common Issues and Solutions
RuntimeError: Expected to mark a variable ready only once
Cause: DDP expects all parameters to receive gradients in every backward pass.
Solution: Set find_unused_parameters=True if some parameters don’t always get gradients:
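For example (placeholder model and device, as above):

```python
from torch.nn.parallel import DistributedDataParallel as DDP

# Tell DDP to tolerate parameters that receive no gradient on some iterations
model = DDP(model, device_ids=[device], find_unused_parameters=True)
```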
Gradients not synchronizing properly
Cause: Not calling backward() on all processes or using different computational graphs.
Solution: Ensure all processes call backward() with the same model structure:
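The usual culprit is rank-dependent control flow. Keep the per-step graph identical on every rank; a sketch with placeholder names:

```python
# Every rank must run the same forward and backward each step
for x, y in loader:
    x, y = x.to(device), y.to(device)
    loss = loss_fn(model(x), y)
    loss.backward()            # called on all ranks, with the same graph structure
    optimizer.step()
    optimizer.zero_grad()

# Anti-pattern: guarding backward() behind rank checks (e.g., only on rank 0)
# desynchronizes DDP's gradient all-reduce and can hang or corrupt gradients.
```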
Different results on different GPUs
Cause: Different random seeds or not using DistributedSampler.
Solution: Set seeds and use DistributedSampler:
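A sketch; the seed value, dataset, and epoch count are placeholders:

```python
import random
import numpy as np
import torch
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

def set_seed(seed: int = 42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

set_seed(42)                                          # same seed on every rank
sampler = DistributedSampler(dataset, shuffle=True)   # distinct shard per rank
loader = DataLoader(dataset, batch_size=32, sampler=sampler)

for epoch in range(num_epochs):
    sampler.set_epoch(epoch)   # vary shuffling per epoch, consistently across ranks
    # ... training loop ...
```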
BatchNorm behaving differently
Cause: BatchNorm statistics not synchronized across GPUs.
Solution: Use SyncBatchNorm for synchronized batch normalization:
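Convert BatchNorm layers before wrapping the model in DDP; a sketch with placeholder names:

```python
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)  # swap BatchNorm* for SyncBatchNorm
model = model.to(device)
model = DDP(model, device_ids=[device])
```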
Prefer DistributedDataParallel (DDP) over nn.DataParallel for multi-GPU training: PyTorch recommends DDP, and it is significantly faster and more scalable. (nn.DataParallel remains available but is not recommended.)