
Welcome to Tensor Cloud

This documentation covers training deep learning models across multiple GPUs and nodes on the Tensor Cloud, and includes an introduction to distributed training with PyTorch.

What You’ll Learn

This documentation is organized into two main sections:

1. SF Tensor Library

Documentation for the SF Tensor Python library

2. Distributed Training in PyTorch

A basic introduction to DDP and FSDP for training models in PyTorch across multiple GPUs.

Quick Start

Install the SF Tensor library:
pip install sf-tensor
Here’s a minimal example to get started:
import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler
import sf_tensor as sft
from sf_tensor.persist import dataDownload

# Initialize distributed training
sft.initialize_distributed_training()

# Download data (only once per node)
@dataDownload
def download_data():
    # Your download logic here
    pass

download_data()

# Get device for this process
device = sft.get_device()

# Create and wrap model in DDP
model = YourModel().to(device)
model = DDP(model, device_ids=[device])

# Create data loader with DistributedSampler
train_sampler = DistributedSampler(train_dataset)
train_loader = DataLoader(train_dataset, sampler=train_sampler)

# Training loop
for epoch in range(num_epochs):
    train_sampler.set_epoch(epoch)  # Required so each epoch reshuffles data across ranks

    for batch in train_loader:
        loss = train_step(model, batch, device)

    sft.log(f"Epoch {epoch} complete")  # Only prints once
If you’re using the Tensor Cloud, we launch the processes on all nodes and GPUs for you automatically. If you’re not, run the training script with torchrun on each node, for example:
# Single node, 8 GPUs
torchrun --nproc_per_node=8 train.py

# Multi-node (2 nodes × 8 GPUs)
# Node 0:
torchrun --nproc_per_node=8 --nnodes=2 --node_rank=0 \
         --rdzv_endpoint=10.0.0.1:29500 train.py

# Node 1:
torchrun --nproc_per_node=8 --nnodes=2 --node_rank=1 \
         --rdzv_endpoint=10.0.0.1:29500 train.py
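
For reference, torchrun sets RANK, LOCAL_RANK and WORLD_SIZE in each worker's environment. The snippet below is a rough sketch of the kind of initialization that sft.initialize_distributed_training() and sft.get_device() are assumed to perform for you; init_distributed is a hypothetical name, not part of the library:
import os
import torch
import torch.distributed as dist

def init_distributed():
    # torchrun exports RANK, LOCAL_RANK, WORLD_SIZE and the rendezvous address
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    return torch.device(f"cuda:{local_rank}")

device = init_distributed()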

Features

Simple Setup

Initialize distributed training with a single function call

Automatic Sync

Data download decorator prevents race conditions and duplicate downloads (a sketch of the pattern follows below)

Clean Logging

Rank-aware logging ensures messages print only once
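
The pattern behind the data download decorator mentioned above is roughly the following; this is an illustration of the idea, not the sf_tensor implementation, and download_once is a hypothetical name. Local rank 0 on each node performs the download while the other ranks wait at a barrier:
import os
import functools
import torch.distributed as dist

def download_once(fn):
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        result = None
        # Only local rank 0 on each node does the actual work
        if int(os.environ.get("LOCAL_RANK", "0")) == 0:
            result = fn(*args, **kwargs)
        # Everyone waits here until the download has finished
        # (requires an initialized process group)
        dist.barrier()
        return result
    return wrapper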

Contents

SF Tensor Library Features

  • One-line initialization - sft.initialize_distributed_training()
  • Automatic device management - sft.get_device()
  • Data downloading - @dataDownload decorator
  • Persisted logging - sft.log() and sft.logAccuracy()

Distributed Training Guide

  • DDP support - Fast training for standard models
  • FSDP support - Memory-efficient training for large models (see the sketch after this list)
  • DistributedSampler - Automatic data partitioning
  • Mixed precision - FP16/BF16 for faster training
  • Multi-node scaling - Training across multiple machines
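
For the FSDP and mixed-precision items above, here is a minimal sketch. It assumes the process group is already initialized (for example by sft.initialize_distributed_training()) and that device is this rank's GPU; the model is just a placeholder:
import torch
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision

# Placeholder model; any nn.Module works here
model = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 10)).to(device)

# Compute and communicate in BF16 to save memory and bandwidth
bf16 = MixedPrecision(param_dtype=torch.bfloat16,
                      reduce_dtype=torch.bfloat16,
                      buffer_dtype=torch.bfloat16)

# FSDP shards parameters, gradients and optimizer state across ranks
model = FSDP(model, mixed_precision=bf16)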

Getting Started

Choose your starting point based on your needs:
  1. New to distributed training? Start with SF Tensor Library Overview
  2. Ready to train models? Jump to DDP Training
  3. Training large models? Check out FSDP Training
  4. Need data loading help? See Data Loading Guide