Overview

In distributed training, each GPU runs a separate process. Without proper handling, every logging statement prints once per process, which clutters your output and makes progress hard to track. SF Tensor provides rank-aware logging utilities that print each message only once, from the rank 0 process, and persist it to durable storage that you can access later, including from the web interface.

Available Functions

log()

Log general messages during training.
def log(string: str) -> None
Parameters:
  • string: The message to log
Example:
import sf_tensor as sft

sft.log("Starting training...")
sft.log(f"Epoch 1 completed with loss: {loss:.4f}")
sft.log("Training finished!")

logAccuracy()

Log accuracy metrics during training.
def logAccuracy(accuracy: Union[int, float]) -> None
Parameters:
  • accuracy: Classification accuracy value (int or float)
Example:
import sf_tensor as sft

# Log accuracy after validation
val_accuracy = 0.8542
sft.logAccuracy(val_accuracy)
Both functions check whether they are running on rank 0 before logging. Only rank 0 prints the message; all other ranks silently skip the call.
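
To make the rank check concrete, here is a minimal sketch of rank-aware printing. It is not SF Tensor's implementation: the helper names `is_rank_zero` and `rank_zero_log` are hypothetical, and the sketch assumes the launcher exports a `RANK` environment variable for each process (as `torchrun` does). It also does not persist messages to storage.

```python
import os


def is_rank_zero() -> bool:
    """Return True on rank 0, or when no distributed launcher set RANK."""
    # Assumption: the launcher exports RANK per process (torchrun does this).
    # A missing RANK is treated as a single-process run, i.e. rank 0.
    return int(os.environ.get("RANK", "0")) == 0


def rank_zero_log(message: str) -> bool:
    """Print the message only on rank 0; return whether it was printed."""
    if not is_rank_zero():
        return False  # Other ranks skip the call silently.
    print(message, flush=True)
    return True


if __name__ == "__main__":
    rank_zero_log("Starting training...")
```

With sft.log() and sft.logAccuracy() this check happens for you, so training code does not need its own rank guard around logging calls.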

Basic Usage

import torch
import sf_tensor as sft

if __name__ == "__main__":
    # Initialize distributed training
    sft.initialize_distributed_training()

    # This message prints only once (from rank 0)
    sft.log("Training initialized")

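    # Create model (YourModel, train_one_epoch, validate, and the data
    # loaders below are placeholders for your own code)
    model = YourModel()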
    device = sft.get_device()
    model = model.to(device)

    # Training loop
    for epoch in range(10):
        train_loss = train_one_epoch(model, train_loader)

        # Only prints once, even with multiple GPUs
        sft.log(f"Epoch {epoch + 1}: train_loss={train_loss:.4f}")

        # Validation
        val_acc = validate(model, val_loader)
        sft.logAccuracy(val_acc)

    sft.log("Training complete!")
Output:
Training initialized
Epoch 1: train_loss=2.3054
Epoch 2: train_loss=1.8923
Epoch 3: train_loss=1.5432
...
Training complete!

Structured Logging

Because log() accepts any string, you can serialize a dictionary of metrics to JSON and log it as a single message.

import sf_tensor as sft
import json

def log_metrics(metrics: dict, prefix: str = ""):
    """Log dictionary of metrics in structured format"""
    sft.log(f"{prefix}{json.dumps(metrics, indent=2)}")

# Usage
metrics = {
    "epoch": 5,
    "train_loss": 1.234,
    "val_loss": 1.456,
    "train_acc": 0.876,
    "val_acc": 0.854
}

log_metrics(metrics, prefix="Metrics: ")
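
If you plan to parse the persisted logs later, a one-line variant may be easier to work with than indented JSON. The sketch below uses only the standard library; `format_metrics` is a hypothetical helper, not part of SF Tensor, and its result is a plain string that you would pass to sft.log().

```python
import json


def format_metrics(metrics: dict, prefix: str = "") -> str:
    """Serialize metrics as compact single-line JSON with sorted keys."""
    # sort_keys gives a stable field order; separators removes extra spaces.
    payload = json.dumps(metrics, sort_keys=True, separators=(",", ":"))
    return f"{prefix}{payload}"


line = format_metrics(
    {"epoch": 5, "val_acc": 0.854, "train_loss": 1.234},
    prefix="METRICS ",
)
print(line)  # In training code, pass `line` to sft.log() instead of print().
# → METRICS {"epoch":5,"train_loss":1.234,"val_acc":0.854}
```

Keeping each record on one line with a fixed prefix means a later script can filter for lines starting with the prefix and decode the remainder with json.loads.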