
Welcome to Tensor Cloud

This documentation covers training deep learning models across multiple GPUs and nodes on the Tensor Cloud, and includes an introduction to distributed training with PyTorch.

What You’ll Learn

This documentation is organized into two main sections:

1. SF Tensor Library

Documentation for the SF Tensor Python library

2. Distributed Training in PyTorch

A basic introduction to DDP and FSDP for training models in PyTorch across multiple GPUs.

Quick Start

Install the SF Tensor library:
pip install sf-tensor
Here’s a minimal example to get started:
import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler
import sf_tensor as sft
from sf_tensor.persist import dataDownload

# Initialize distributed training
sft.initialize_distributed_training()

# Download data (only once per node)
@dataDownload
def download_data():
    # Your download logic here
    pass

download_data()

# Get device for this process
device = sft.get_device()

# Create and wrap model in DDP
model = YourModel().to(device)
model = DDP(model, device_ids=[device])

# Create data loader with DistributedSampler
train_sampler = DistributedSampler(train_dataset)
train_loader = DataLoader(train_dataset, sampler=train_sampler)

# Training loop
for epoch in range(num_epochs):
    train_sampler.set_epoch(epoch)  # Required so each epoch reshuffles data across ranks

    for batch in train_loader:
        loss = train_step(model, batch, device)

    sft.log(f"Epoch {epoch} complete")  # Only prints once
If you’re using the Tensor Cloud, we launch the processes on all nodes and GPUs for you automatically. If you’re not, run the training script with torchrun on each node, for example:
# Single node, 8 GPUs
torchrun --nproc_per_node=8 train.py

# Multi-node (2 nodes × 8 GPUs)
# Node 0:
torchrun --nproc_per_node=8 --nnodes=2 --node_rank=0 \
         --rdzv_endpoint=10.0.0.1:29500 train.py

# Node 1:
torchrun --nproc_per_node=8 --nnodes=2 --node_rank=1 \
         --rdzv_endpoint=10.0.0.1:29500 train.py
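
For reference, torchrun sets RANK, LOCAL_RANK and WORLD_SIZE in each worker's environment. The snippet below is a rough sketch of the kind of initialization that sft.initialize_distributed_training() and sft.get_device() are assumed to perform for you; init_distributed is a hypothetical name, not part of the library:
import os
import torch
import torch.distributed as dist

def init_distributed():
    # torchrun exports RANK, LOCAL_RANK, WORLD_SIZE and the rendezvous address
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    return torch.device(f"cuda:{local_rank}")

device = init_distributed()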

Features

Simple Setup

Initialize distributed training with a single function call

Automatic Sync

Data download decorator prevents race conditions and duplicate downloads (a sketch of the pattern follows below)

Clean Logging

Rank-aware logging ensures messages print only once
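
The pattern behind the data download decorator mentioned above is roughly the following; this is an illustration of the idea, not the sf_tensor implementation, and download_once is a hypothetical name. Local rank 0 on each node performs the download while the other ranks wait at a barrier:
import os
import functools
import torch.distributed as dist

def download_once(fn):
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        result = None
        # Only local rank 0 on each node does the actual work
        if int(os.environ.get("LOCAL_RANK", "0")) == 0:
            result = fn(*args, **kwargs)
        # Everyone waits here until the download has finished
        # (requires an initialized process group)
        dist.barrier()
        return result
    return wrapper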

Contents

SF Tensor Library Features

  • One-line initialization - sft.initialize_distributed_training()
  • Automatic device management - sft.get_device()
  • Data downloading - @dataDownload decorator
  • Persisted logging - sft.log() and sft.logAccuracy()

Distributed Training Guide

  • DDP support - Fast training for standard models
  • FSDP support - Memory-efficient training for large models (see the sketch after this list)
  • DistributedSampler - Automatic data partitioning
  • Mixed precision - FP16/BF16 for faster training
  • Multi-node scaling - Training across multiple machines
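
For the FSDP and mixed-precision items above, here is a minimal sketch. It assumes the process group is already initialized (for example by sft.initialize_distributed_training()) and that device is this rank's GPU; the model is just a placeholder:
import torch
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision

# Placeholder model; any nn.Module works here
model = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 10)).to(device)

# Compute and communicate in BF16 to save memory and bandwidth
bf16 = MixedPrecision(param_dtype=torch.bfloat16,
                      reduce_dtype=torch.bfloat16,
                      buffer_dtype=torch.bfloat16)

# FSDP shards parameters, gradients and optimizer state across ranks
model = FSDP(model, mixed_precision=bf16)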

Getting Started

Choose your starting point based on your needs:
  1. New to distributed training? Start with SF Tensor Library Overview
  2. Ready to train models? Jump to DDP Training
  3. Training large models? Check out FSDP Training
  4. Need data loading help? See Data Loading Guide