Overview
Proper data loading is crucial for distributed training. Each GPU process must:
- Load different data (to avoid redundant computation)
- Synchronize data downloads (to avoid race conditions)
- Shuffle data correctly (for reproducible training)
The Three Essential Components
1. @dataDownload Decorator
Ensures datasets are downloaded only once per node:
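A usage sketch. The decorator's import path depends on your training framework, and CIFAR10 via torchvision is only an illustrative stand-in for your dataset:

```python
from torchvision import datasets
# assumed: your framework exports the decorator, e.g.
# from mytrainer.data import dataDownload

@dataDownload
def download_data():
    # torchvision skips the download when the files already exist
    datasets.CIFAR10(root="./data", download=True)
```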
2. Barrier Synchronization
Wait for downloads to complete before all processes access data:
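A sketch, assuming the process group is already initialized and `download_data()` from above:

```python
import torch.distributed as dist
from torchvision import datasets

download_data()   # decorated above, so only one process per node downloads
dist.barrier()    # block until the download has finished everywhere
dataset = datasets.CIFAR10(root="./data")  # now safe for every rank to read
```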
3. DistributedSampler
Partition data across GPUs so each sees different samples:
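A minimal sketch, reusing the dataset from above:

```python
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

# num_replicas and rank are inferred from the default process group
sampler = DistributedSampler(dataset)
loader = DataLoader(dataset, batch_size=32, sampler=sampler)
```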
DistributedSampler in Detail
Why Use DistributedSampler?
Without DistributedSampler, all GPUs would see the same data, wasting compute:
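```python
# Anti-pattern under DDP: every rank iterates the full dataset
loader = DataLoader(dataset, batch_size=32, shuffle=True)
```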
With DistributedSampler, each GPU sees different data:
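A standalone illustration of the partitioning; passing num_replicas and rank explicitly means no process group is needed for this demo:

```python
from torch.utils.data.distributed import DistributedSampler

data = list(range(10))  # anything with __len__ works for this demo
for r in range(2):
    sampler = DistributedSampler(data, num_replicas=2, rank=r, shuffle=False)
    print(r, list(sampler))
# 0 [0, 2, 4, 6, 8]
# 1 [1, 3, 5, 7, 9]
```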
Key Parameters
num_replicas
Total number of processes (GPUs) in distributed training.
- 8 GPUs → num_replicas=8
- Single GPU → num_replicas=1
rank
Rank of the current process (which GPU this is).
- GPU 0 → rank=0
- GPU 1 → rank=1
- etc.
shuffle
Whether to shuffle data at each epoch.
Important: Must call sampler.set_epoch(epoch) for proper shuffling!
seed
Random seed for reproducible shuffling.
drop_last
Drop the tail of the data if the dataset size is not evenly divisible by the world size.
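Pulling these parameters together, a sketch assuming an initialized process group and the dataset from earlier:

```python
import torch.distributed as dist
from torch.utils.data.distributed import DistributedSampler

sampler = DistributedSampler(
    dataset,
    num_replicas=dist.get_world_size(),  # total number of GPU processes
    rank=dist.get_rank(),                # which process this is
    shuffle=True,                        # reshuffle each epoch (see set_epoch below)
    seed=42,                             # identical seed on every rank
    drop_last=False,                     # pad rather than drop the tail
)
```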
Setting Epoch for Shuffling
Critical: Call set_epoch() at the start of each epoch for proper shuffling:
- Different shuffle order each epoch
- Same shuffle order across all GPUs (so each rank takes a disjoint slice of the same permutation)
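A sketch of the training loop; num_epochs and train_step are placeholders:

```python
for epoch in range(num_epochs):
    sampler.set_epoch(epoch)  # reseeds the permutation with seed + epoch
    for batch in loader:
        train_step(batch)
```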
DataLoader Configuration
Good Settings
batch_size
Batch size per GPU, not the total batch size. The effective global batch size is batch_size × world_size.
sampler vs shuffle
Use sampler for distributed training, not shuffle; DataLoader treats the two options as mutually exclusive.
num_workers
Number of worker processes for data loading.
- Too low: CPU bottleneck, GPU underutilized
- Too high: Memory overhead, slower startup
pin_memory
Pin memory for faster CPU→GPU transfer: pinned (page-locked) host memory enables faster, asynchronous copies to the device.
drop_last
Drop the last incomplete batch. This ensures all GPUs process the same number of batches.
persistent_workers
Keep worker processes alive between epochs (PyTorch 1.7+). This avoids the overhead of restarting workers each epoch.
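All the recommended settings in one sketch; values are illustrative, with dataset and sampler as above:

```python
from torch.utils.data import DataLoader

loader = DataLoader(
    dataset,
    batch_size=64,            # per-GPU batch size
    sampler=sampler,          # mutually exclusive with shuffle=True
    num_workers=4,            # tune to CPU cores and I/O throughput
    pin_memory=True,          # page-locked host memory for faster H2D copies
    drop_last=True,           # equal batch counts on every rank
    persistent_workers=True,  # reuse workers across epochs (PyTorch 1.7+)
)
```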
Best Practices
Always Use @dataDownload
Always use the @dataDownload decorator for dataset downloads:
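A sketch, again assuming the framework-provided decorator and the illustrative CIFAR10 download:

```python
@dataDownload          # assumed framework decorator: runs once per node
def download_data():
    datasets.CIFAR10(root="./data", download=True)

# Without the decorator, every rank would race to write the same files,
# risking corrupted or partially written data.
```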
Always Use Barrier After Download
Always synchronize after downloading:
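Continuing the sketch from above:

```python
download_data()   # decorated, so only one process per node downloads
dist.barrier()    # no rank proceeds until the files exist on disk
```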
Always Call set_epoch()
Always set epoch for proper shuffling:
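As in the training-loop sketch earlier (num_epochs is a placeholder):

```python
for epoch in range(num_epochs):
    sampler.set_epoch(epoch)  # must happen before iterating the loader
    for batch in loader:
        ...
```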
Use pin_memory=True
Always enable pinned memory for GPU training:
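A sketch assuming each batch is a single tensor; pinning is what makes the non-blocking copy effective:

```python
loader = DataLoader(dataset, batch_size=64, sampler=sampler, pin_memory=True)
for batch in loader:
    batch = batch.cuda(non_blocking=True)  # async copy, enabled by pinning
```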
Use drop_last=True for Training
Drop incomplete batches in training:
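A sketch; train_set, val_set, and train_sampler are placeholders:

```python
# Training: drop the ragged final batch so every rank steps in lockstep
train_loader = DataLoader(train_set, batch_size=64,
                          sampler=train_sampler, drop_last=True)
# Evaluation: keep drop_last=False so no samples are silently skipped
val_loader = DataLoader(val_set, batch_size=64, drop_last=False)
```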