
What is Distributed Training?

Distributed training allows you to train deep learning models across multiple GPUs and machines, reducing training time. PyTorch provides two widely used strategies for distributed training:
  1. DDP (DistributedDataParallel) - Data parallelism for general models
  2. FSDP (FullyShardedDataParallel) - Sharded data parallelism for large models
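
Both strategies run on top of torch.distributed with one process per GPU, typically launched with torchrun. Below is a minimal sketch of that shared setup; the script name and GPU count are illustrative.

import os
import torch
import torch.distributed as dist

def setup_distributed():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each process it spawns
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    return local_rank

# Example launch on one machine with 4 GPUs:
#   torchrun --nproc_per_node=4 train.py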

When to Use Each Strategy

Use DDP When:

  • Your model fits in a single GPU’s memory
  • You want simple, efficient data parallelism, where each GPU processes a different slice of the batch

Use FSDP When:

  • Your model is too large for a single GPU
  • You’re training very large models (GPT, LLaMA, large vision models)
  • You want to maximize memory efficiency
  • You need to scale to billions of parameters

Quick Comparison

Feature              DDP                                FSDP
Model Size           Fits in a single GPU               Can exceed single GPU memory
Memory Efficiency    Model replicated on every GPU      Model sharded across GPUs
Speed                Fastest for small/medium models    Better for very large models
Gradient Sync        After the backward pass            During the backward pass

Basic Architecture

DDP Architecture

┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐
│   GPU 0     │  │   GPU 1     │  │   GPU 2     │  │   GPU 3     │
│             │  │             │  │             │  │             │
│   Model     │  │   Model     │  │   Model     │  │   Model     │
│  (Copy 1)   │  │  (Copy 2)   │  │  (Copy 3)   │  │  (Copy 4)   │
│             │  │             │  │             │  │             │
│   Data 1    │  │   Data 2    │  │   Data 3    │  │   Data 4    │
└─────────────┘  └─────────────┘  └─────────────┘  └─────────────┘
       ↓                ↓                ↓                ↓
       └────────────────┴────────────────┴────────────────┘
                    Gradient Synchronization
Each GPU has:
  • Full copy of the model
  • Different slice of the data
  • Gradients that are averaged across all GPUs after the backward pass
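
As a concrete illustration, the sketch below wraps a toy model in DistributedDataParallel and uses a DistributedSampler so each GPU sees a different slice of the data. The model, dataset, and hyperparameters are placeholders, not part of any official example.

import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

dist.init_process_group(backend="nccl")          # one process per GPU, launched with torchrun
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Toy model and dataset as placeholders
model = DDP(nn.Linear(128, 10).cuda(local_rank), device_ids=[local_rank])  # full model copy on each GPU
dataset = TensorDataset(torch.randn(1024, 128), torch.randint(0, 10, (1024,)))
sampler = DistributedSampler(dataset)            # each rank gets a different slice of the data
loader = DataLoader(dataset, batch_size=32, sampler=sampler)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(2):
    sampler.set_epoch(epoch)                     # reshuffle across ranks every epoch
    for x, y in loader:
        x, y = x.cuda(local_rank), y.cuda(local_rank)
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()                          # gradients are averaged across all GPUs here
        optimizer.step()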

FSDP Architecture

┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐
│   GPU 0     │  │   GPU 1     │  │   GPU 2     │  │   GPU 3     │
│             │  │             │  │             │  │             │
│ Model Shard │  │ Model Shard │  │ Model Shard │  │ Model Shard │
│  (Shard 1)  │  │  (Shard 2)  │  │  (Shard 3)  │  │  (Shard 4)  │
│             │  │             │  │             │  │             │
│   Data 1    │  │   Data 2    │  │   Data 3    │  │   Data 4    │
└─────────────┘  └─────────────┘  └─────────────┘  └─────────────┘
       ↑                ↑                ↑                ↑
       └────────────────┴────────────────┴────────────────┘
              Gather params across GPUs during Forward/Backward
Each GPU has:
  • Shard of the model (only part of parameters)
  • Different slice of the data
  • Parameters gathered on demand from the other GPUs during the forward and backward passes; high-bandwidth GPU-to-GPU communication makes this far faster than offloading parameters to CPU RAM and reloading them
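
For comparison, here is a minimal sketch of wrapping a model with FullyShardedDataParallel. The layer sizes and the size-based auto-wrap threshold are illustrative choices, not prescriptions.

import functools
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy

dist.init_process_group(backend="nccl")          # one process per GPU, launched with torchrun
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Placeholder model; in practice this is the model that does not fit on a single GPU
model = nn.Sequential(
    nn.Linear(4096, 4096), nn.ReLU(),
    nn.Linear(4096, 4096), nn.ReLU(),
    nn.Linear(4096, 10),
)

# Parameters are sharded across all ranks; large submodules become their own FSDP units
model = FSDP(
    model,
    device_id=local_rank,
    auto_wrap_policy=functools.partial(size_based_auto_wrap_policy, min_num_params=1_000_000),
)

# Build the optimizer after wrapping, so it tracks the sharded parameters
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

x = torch.randn(8, 4096, device=f"cuda:{local_rank}")
loss = model(x).sum()     # parameters are all-gathered per FSDP unit for the forward pass
loss.backward()           # gathered again for backward, then gradients are reduce-scattered
optimizer.step()          # each rank updates only its own shard of the parameters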

What You’ll Learn