In distributed training, multiple processes run simultaneously. When downloading datasets, you need to ensure the download happens only once per node to avoid:
- Race conditions and file corruption
- Wasted bandwidth from duplicate downloads
- Slower initialization times
SF Tensor provides the @dataDownload decorator to solve this problem.
The @dataDownload decorator ensures that data-loading operations execute only on the primary process of each node (LOCAL_RANK=0). All other processes skip the function entirely.
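Conceptually, the decorator acts as a rank guard. The sketch below is illustrative only, not SF Tensor's actual implementation; the name `rank_zero_only` is hypothetical, and it assumes the `LOCAL_RANK` environment variable set by launchers such as `torchrun`:

```python
import os
from functools import wraps

def rank_zero_only(fn):
    """Illustrative stand-in for @dataDownload: run fn only on LOCAL_RANK 0."""
    @wraps(fn)
    def wrapper(*args, **kwargs):
        # LOCAL_RANK is set by launchers such as torchrun; default to 0 for single-process runs
        if int(os.environ.get("LOCAL_RANK", 0)) == 0:
            return fn(*args, **kwargs)
        return None  # all other local ranks skip the call
    return wrapper
```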
After downloading data, use a barrier to ensure all processes wait before accessing the data:
```python
import torch.distributed as dist

# Download on rank 0 only
@dataDownload
def download_data():
    # Download logic here
    pass

download_data()

# Wait for download to complete on all processes
if dist.is_initialized():
    dist.barrier()

# Now all processes can safely access the data
```
Always use a barrier after @dataDownload: Without a barrier, non-rank-0 processes might try to load data before the download completes, causing errors.
After loading data, use DistributedSampler to partition it across GPUs:
```python
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler
import torch.distributed as dist

# Create dataset (after download and barrier)
dataset = YourDataset()

# Create sampler for distributed training
sampler = DistributedSampler(
    dataset,
    num_replicas=dist.get_world_size() if dist.is_initialized() else 1,
    rank=dist.get_rank() if dist.is_initialized() else 0,
    shuffle=True,
    seed=42,  # For reproducibility
)

# Create data loader
dataloader = DataLoader(
    dataset,
    batch_size=32,
    sampler=sampler,  # Use sampler instead of shuffle
    num_workers=4,
    pin_memory=True,
    drop_last=True,  # Recommended for distributed training
)

# In your training loop, set epoch for proper shuffling
for epoch in range(num_epochs):
    sampler.set_epoch(epoch)  # Important for reproducible shuffling
    for batch in dataloader:
        # Training step...
        pass
```
Forgetting the Barrier: If you don’t call dist.barrier() after downloading, non-rank-0 processes might try to load data before it’s ready, causing FileNotFoundError.
Not Using DistributedSampler: Without DistributedSampler, all processes will see the same data, effectively wasting compute resources and not achieving true data parallelism.
Using shuffle=True with DistributedSampler: Don’t pass shuffle=True to the DataLoader when you supply a DistributedSampler; the two options are mutually exclusive, and the sampler already handles shuffling, as shown below.
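For example, reusing the `dataset` and `sampler` defined above:

```python
# Wrong: DataLoader raises "ValueError: sampler option is mutually exclusive with shuffle"
# dataloader = DataLoader(dataset, batch_size=32, sampler=sampler, shuffle=True)

# Right: DistributedSampler(shuffle=True) already shuffles, so leave shuffle unset
dataloader = DataLoader(dataset, batch_size=32, sampler=sampler)
```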
Batch Size Scales with Number of GPUs: When using distributed training, your effective batch size is batch_size × num_gpus. If you set batch_size=32 in your DataLoader and train on 8 GPUs, your effective global batch size is 256. This affects:
- Learning rate scaling: you typically need to scale your learning rate proportionally (e.g., if the batch size increases 8×, scale the LR by 8×)
- Memory usage: each GPU processes only its local batch size (32 in this example)
- Convergence behavior: larger effective batch sizes change training dynamics
```python
# If training on 8 GPUs with batch_size=32
# Effective global batch size = 32 × 8 = 256
dataloader = DataLoader(
    dataset,
    batch_size=32,  # Per-GPU batch size
    sampler=sampler,
)

# Adjust learning rate accordingly
base_lr = 0.1
num_gpus = dist.get_world_size()
lr = base_lr * num_gpus  # Linear scaling rule
```