Overview

In distributed training, each process needs to be assigned to a specific GPU. SF Tensor provides the get_device() function to automatically determine the correct device for each process.

The get_device() Function

def get_device() -> torch.device
Returns the appropriate torch.device for the current process:
  • Distributed training: Returns cuda:{LOCAL_RANK}
  • Single GPU: Returns cuda:0
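
Under the hood, helpers like this typically read the LOCAL_RANK environment variable set by the launcher. The snippet below is a minimal sketch of that logic, not SF Tensor's actual source; the helper name and the CPU fallback are illustrative assumptions:

import os
import torch

def get_device_sketch() -> torch.device:
    # Illustrative sketch only -- not SF Tensor's implementation.
    if torch.cuda.is_available():
        # Launchers such as torchrun export LOCAL_RANK for each process;
        # it indexes the GPU assigned to this process on its node.
        local_rank = int(os.environ.get("LOCAL_RANK", "0"))
        return torch.device(f"cuda:{local_rank}")
    return torch.device("cpu")  # assumed fallback when no GPU is available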

Basic Usage

import torch
import sf_tensor as sft

if __name__ == "__main__":
    # Initialize distributed training
    sft.initialize_distributed_training()

    # Get the device for this process
    device = sft.get_device()

    # Create model and move to device
    model = YourModel()
    model = model.to(device)

    # Training loop (dataloader, criterion, and optimizer are assumed
    # to be defined elsewhere in your training script)
    for batch in dataloader:
        data, target = batch
        # Move data to device
        data = data.to(device)
        target = target.to(device)

        # Forward pass
        output = model(data)
        loss = criterion(output, target)

        # Backward pass and optimizer step
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

Multi-Node Device Assignment

On multi-node setups, device assignment works seamlessly. For example, with 2 nodes of 4 GPUs each (8 GPUs in total):

Node  Process  RANK  LOCAL_RANK  Device
0     0        0     0           cuda:0
0     1        1     1           cuda:1
0     2        2     2           cuda:2
0     3        3     3           cuda:3
1     4        4     0           cuda:0
1     5        5     1           cuda:1
1     6        6     2           cuda:2
1     7        7     3           cuda:3

Each node has 4 GPUs (0-3), and get_device() returns the correct local GPU for each process by using LOCAL_RANK rather than the global RANK.
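
To verify this mapping on your own cluster, you can print the relevant values from every process. The example below uses only the sft calls shown above; RANK and LOCAL_RANK are the standard environment variables set by torchrun-style launchers, and the print format is just for illustration:

import os
import sf_tensor as sft

if __name__ == "__main__":
    sft.initialize_distributed_training()
    device = sft.get_device()

    # RANK is the global process index; LOCAL_RANK is the index within the node.
    rank = os.environ.get("RANK", "0")
    local_rank = os.environ.get("LOCAL_RANK", "0")
    print(f"rank={rank} local_rank={local_rank} device={device}")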

Best Practices

Always use sft.get_device() instead of hardcoding device assignments:
# Good
device = sft.get_device()
model = model.to(device)

# Bad (in distributed training, every process would use cuda:0)
model = model.to('cuda:0')
Use non_blocking=True when moving data to the GPU; combined with pinned host memory, the transfer can overlap with computation:
# Faster (overlaps transfer with computation)
data = data.to(device, non_blocking=True)

# Slower (blocks until transfer completes)
data = data.to(device)
When possible, create tensors directly on the device to avoid unnecessary transfers:
# Good (created directly on GPU)
tensor = torch.randn(100, 100, device=device)

# Less efficient (created on CPU, then moved)
tensor = torch.randn(100, 100).to(device)
Enable pin_memory=True in DataLoader for faster CPU-to-GPU transfer:
train_loader = DataLoader(
    dataset,
    batch_size=32,
    pin_memory=True  # Faster transfer to GPU
)
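
These last two practices work together: pinned host memory is what allows non_blocking=True copies to run asynchronously. A combined sketch, assuming dataset and device are defined as in the earlier examples:

from torch.utils.data import DataLoader

loader = DataLoader(dataset, batch_size=32, pin_memory=True)

for data, target in loader:
    # Because the host tensors are pinned, these copies can overlap
    # with GPU computation instead of blocking the Python loop.
    data = data.to(device, non_blocking=True)
    target = target.to(device, non_blocking=True)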

Troubleshooting

Device Mismatch Error: If you see “Expected tensor to be on device X but was on device Y”, ensure all tensors and model parameters are on the same device:
# Check model device
model_device = next(model.parameters()).device

# Check tensor device
tensor_device = tensor.device

# They must match!
assert model_device == tensor_device
Out of Memory: If you run out of GPU memory, try the following (a sketch of gradient accumulation with mixed precision appears after this list):
  1. Reducing batch size
  2. Using gradient accumulation
  3. Enabling mixed precision training
  4. Using gradient checkpointing for large models
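
The sketch below combines options 2 and 3, gradient accumulation with mixed precision, in a generic PyTorch training loop. It is not an SF Tensor-specific API; model, criterion, optimizer, dataloader, and device are assumed to come from the Basic Usage example, and accumulation_steps is an illustrative setting:

import torch

accumulation_steps = 4  # effective batch size = batch_size * accumulation_steps
scaler = torch.cuda.amp.GradScaler()

for step, (data, target) in enumerate(dataloader):
    data = data.to(device, non_blocking=True)
    target = target.to(device, non_blocking=True)

    # Mixed-precision forward pass
    with torch.cuda.amp.autocast():
        output = model(data)
        loss = criterion(output, target) / accumulation_steps

    # Gradients accumulate across micro-batches
    scaler.scale(loss).backward()

    # Step the optimizer only once per accumulation_steps micro-batches
    if (step + 1) % accumulation_steps == 0:
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad()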