Features

The library currently provides logging functions that automatically persist logs to the cloud, where you can view them in the Tensor Cloud web interface. It also provides the following convenience functions, which act as thin wrappers around PyTorch:
  • Automatic Distributed Setup: initializes multi-GPU training
  • Once-Per-Node Downloads: a decorator that ensures data downloading happens only once per node, eliminating race conditions (see the sketch after this list)
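
For example, the download guard is typically applied as a decorator on a data-preparation function. The decorator's actual name is not documented in this section, so sft.download_once_per_node below is a hypothetical placeholder for illustration:

import urllib.request
import sf_tensor as sft

# NOTE: hypothetical decorator name; the library provides a decorator that
# ensures data downloading happens only once per node, but this section
# does not show its exact symbol.
@sft.download_once_per_node
def prepare_dataset(path: str) -> None:
    # Executed by a single process per node; sibling local ranks wait for
    # it to finish, so they never race on the same files.
    urllib.request.urlretrieve("https://example.com/data.tar.gz", path)

prepare_dataset("/tmp/data.tar.gz")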

Installation

Install the SF Tensor library using pip:
pip install sf-tensor

Requirements

  • Python >= 3.8
  • PyTorch with CUDA support
  • NVIDIA GPUs with NCCL support

Quick Start

Here’s a minimal example to get started with distributed training:
import torch
import sf_tensor as sft

# Initialize distributed training
sft.initialize_distributed_training()

# Get the appropriate device for this process
device = sft.get_device()

# Create your model
model = YourModel().to(device)

# Log training progress (only prints on rank 0)
sft.log("Ready to start training!")
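
From here, the usual next step is plain PyTorch: wrap the model in DistributedDataParallel before training. The sketch below extends the Quick Start using standard PyTorch, not library-specific API; it assumes get_device() returns the CUDA device assigned to this process (so device.index identifies its GPU), and it substitutes a trivial linear layer for YourModel:

import torch
import sf_tensor as sft
from torch.nn.parallel import DistributedDataParallel as DDP

sft.initialize_distributed_training()
device = sft.get_device()

# Trivial stand-in model; replace with your own module.
model = torch.nn.Linear(128, 10).to(device)

# Standard PyTorch DDP wrapping, assuming device is a CUDA device whose
# index identifies the GPU owned by this process.
model = DDP(model, device_ids=[device.index])

sft.log("Model wrapped in DistributedDataParallel")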

Architecture

The SF Tensor library consists of two main modules:
  1. Training Module
    • Distributed training initialization
    • Device management
  2. Persistence Module
    • Logging utilities (see the rank-aware logging sketch after this list)
    • Data download decorator
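
To illustrate what the logging utilities do, rank-aware logging in distributed PyTorch is conventionally implemented along these lines. This is a generic sketch of the pattern, not sf_tensor's actual source:

import torch.distributed as dist

def log(message: str) -> None:
    # Print only on the global rank-0 process, so a multi-GPU run emits
    # one copy of each message instead of one per worker.
    if not dist.is_initialized() or dist.get_rank() == 0:
        print(message, flush=True)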

Next Steps