Overview

In distributed training, each process needs to be assigned to a specific GPU. SF Tensor provides the get_device() function to automatically determine the correct device for each process.

The get_device() Function

def get_device() -> torch.device
Returns the appropriate torch.device for the current process:
  • Distributed training: Returns cuda:{LOCAL_RANK}
  • Single GPU: Returns cuda:0
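
Under the hood, helpers like this typically read the LOCAL_RANK environment variable set by the launcher. The snippet below is a minimal sketch of that logic, not SF Tensor's actual source; the helper name and the CPU fallback are illustrative assumptions:

import os
import torch

def get_device_sketch() -> torch.device:
    # Illustrative sketch only -- not SF Tensor's implementation.
    if torch.cuda.is_available():
        # Launchers such as torchrun export LOCAL_RANK for each process;
        # it indexes the GPU assigned to this process on its node.
        local_rank = int(os.environ.get("LOCAL_RANK", "0"))
        return torch.device(f"cuda:{local_rank}")
    return torch.device("cpu")  # assumed fallback when no GPU is available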

Basic Usage

import torch
import sf_tensor as sft

if __name__ == "__main__":
    # Initialize distributed training
    sft.initialize_distributed_training()

    # Get the device for this process
    device = sft.get_device()

    # Create model and move to device
    model = YourModel()
    model = model.to(device)

    # Training loop (dataloader, criterion, and optimizer are assumed
    # to be defined elsewhere in your training script)
    for batch in dataloader:
        data, target = batch
        # Move data to device
        data = data.to(device)
        target = target.to(device)

        # Forward pass
        output = model(data)
        loss = criterion(output, target)

        # Backward pass and optimizer step
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

Multi-Node Device Assignment

On multi-node setups, device assignment works seamlessly. For example, with 2 nodes of 4 GPUs each (8 GPUs in total):

Node  Process  RANK  LOCAL_RANK  Device
0     0        0     0           cuda:0
0     1        1     1           cuda:1
0     2        2     2           cuda:2
0     3        3     3           cuda:3
1     4        4     0           cuda:0
1     5        5     1           cuda:1
1     6        6     2           cuda:2
1     7        7     3           cuda:3

Each node has 4 GPUs (0-3), and get_device() returns the correct local GPU for each process by using LOCAL_RANK rather than the global RANK.
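
To verify this mapping on your own cluster, you can print the relevant values from every process. The example below uses only the sft calls shown above; RANK and LOCAL_RANK are the standard environment variables set by torchrun-style launchers, and the print format is just for illustration:

import os
import sf_tensor as sft

if __name__ == "__main__":
    sft.initialize_distributed_training()
    device = sft.get_device()

    # RANK is the global process index; LOCAL_RANK is the index within the node.
    rank = os.environ.get("RANK", "0")
    local_rank = os.environ.get("LOCAL_RANK", "0")
    print(f"rank={rank} local_rank={local_rank} device={device}")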

Best Practices

Always use sft.get_device() instead of hardcoding device assignments:
# Good
device = sft.get_device()
model = model.to(device)

# Bad (in distributed training, every process would use cuda:0)
model = model.to('cuda:0')
Use non_blocking=True when moving data to the GPU; combined with pinned host memory, the transfer can overlap with computation:
# Faster (overlaps transfer with computation)
data = data.to(device, non_blocking=True)

# Slower (blocks until transfer completes)
data = data.to(device)
When possible, create tensors directly on the device to avoid unnecessary transfers:
# Good (created directly on GPU)
tensor = torch.randn(100, 100, device=device)

# Less efficient (created on CPU, then moved)
tensor = torch.randn(100, 100).to(device)
Enable pin_memory=True in DataLoader for faster CPU-to-GPU transfer:
train_loader = DataLoader(
    dataset,
    batch_size=32,
    pin_memory=True  # Faster transfer to GPU
)
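
These last two practices work together: pinned host memory is what allows non_blocking=True copies to run asynchronously. A combined sketch, assuming dataset and device are defined as in the earlier examples:

from torch.utils.data import DataLoader

loader = DataLoader(dataset, batch_size=32, pin_memory=True)

for data, target in loader:
    # Because the host tensors are pinned, these copies can overlap
    # with GPU computation instead of blocking the Python loop.
    data = data.to(device, non_blocking=True)
    target = target.to(device, non_blocking=True)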

Troubleshooting

Device Mismatch Error: If you see “Expected tensor to be on device X but was on device Y”, ensure all tensors and model parameters are on the same device:
# Check model device
model_device = next(model.parameters()).device

# Check tensor device
tensor_device = tensor.device

# They must match!
assert model_device == tensor_device
Out of Memory: If you run out of GPU memory, try the following (a sketch of gradient accumulation with mixed precision appears after this list):
  1. Reducing batch size
  2. Using gradient accumulation
  3. Enabling mixed precision training
  4. Using gradient checkpointing for large models
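
The sketch below combines options 2 and 3, gradient accumulation with mixed precision, in a generic PyTorch training loop. It is not an SF Tensor-specific API; model, criterion, optimizer, dataloader, and device are assumed to come from the Basic Usage example, and accumulation_steps is an illustrative setting:

import torch

accumulation_steps = 4  # effective batch size = batch_size * accumulation_steps
scaler = torch.cuda.amp.GradScaler()

for step, (data, target) in enumerate(dataloader):
    data = data.to(device, non_blocking=True)
    target = target.to(device, non_blocking=True)

    # Mixed-precision forward pass
    with torch.cuda.amp.autocast():
        output = model(data)
        loss = criterion(output, target) / accumulation_steps

    # Gradients accumulate across micro-batches
    scaler.scale(loss).backward()

    # Step the optimizer only once per accumulation_steps micro-batches
    if (step + 1) % accumulation_steps == 0:
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad()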