Features

The library currently provides logging functions that automatically persist logs to the cloud, where you can view them in the Tensor Cloud web interface. It also provides the following convenience functions, which act as thin wrappers around PyTorch:
  • Automatic Distributed Setup: initializes multi-GPU training
  • Once-Per-Node Downloads: a decorator that ensures data downloading happens only once per node, eliminating race conditions (see the sketch after this list)
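
For example, the download guard is typically applied as a decorator on a data-preparation function. The decorator's actual name is not documented in this section, so sft.download_once_per_node below is a hypothetical placeholder for illustration:

import urllib.request
import sf_tensor as sft

# NOTE: hypothetical decorator name; the library provides a decorator that
# ensures data downloading happens only once per node, but this section
# does not show its exact symbol.
@sft.download_once_per_node
def prepare_dataset(path: str) -> None:
    # Executed by a single process per node; sibling local ranks wait for
    # it to finish, so they never race on the same files.
    urllib.request.urlretrieve("https://example.com/data.tar.gz", path)

prepare_dataset("/tmp/data.tar.gz")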

Installation

Install the SF Tensor library using pip:
pip install sf-tensor

Requirements

  • Python >= 3.8
  • PyTorch with CUDA support
  • NVIDIA GPUs with NCCL support

Quick Start

Here’s a minimal example to get started with distributed training:
import torch
import sf_tensor as sft

# Initialize distributed training
sft.initialize_distributed_training()

# Get the appropriate device for this process
device = sft.get_device()

# Create your model
model = YourModel().to(device)

# Log training progress (only prints on rank 0)
sft.log("Ready to start training!")
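
From here, the usual next step is plain PyTorch: wrap the model in DistributedDataParallel before training. The sketch below extends the Quick Start using standard PyTorch, not library-specific API; it assumes get_device() returns the CUDA device assigned to this process (so device.index identifies its GPU), and it substitutes a trivial linear layer for YourModel:

import torch
import sf_tensor as sft
from torch.nn.parallel import DistributedDataParallel as DDP

sft.initialize_distributed_training()
device = sft.get_device()

# Trivial stand-in model; replace with your own module.
model = torch.nn.Linear(128, 10).to(device)

# Standard PyTorch DDP wrapping, assuming device is a CUDA device whose
# index identifies the GPU owned by this process.
model = DDP(model, device_ids=[device.index])

sft.log("Model wrapped in DistributedDataParallel")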

Architecture

The SF Tensor library consists of two main modules:
  1. Training Module
    • Distributed training initialization
    • Device management
  2. Persistence Module
    • Logging utilities (see the rank-aware logging sketch after this list)
    • Data download decorator
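
To illustrate what the logging utilities do, rank-aware logging in distributed PyTorch is conventionally implemented along these lines. This is a generic sketch of the pattern, not sf_tensor's actual source:

import torch.distributed as dist

def log(message: str) -> None:
    # Print only on the global rank-0 process, so a multi-GPU run emits
    # one copy of each message instead of one per worker.
    if not dist.is_initialized() or dist.get_rank() == 0:
        print(message, flush=True)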

Next Steps