## Overview

In distributed training, each process needs to be assigned to a specific GPU. SF Tensor provides the `get_device()` function to automatically determine the correct device for each process.
## The get_device() Function

`get_device()` returns the appropriate `torch.device` for the current process:

- Distributed training: returns `cuda:{LOCAL_RANK}`
- Single GPU: returns `cuda:0`
## Basic Usage
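The original usage sample is not shown here, so the sketch below illustrates the intended pattern. The `sft.get_device()` call is replaced by an inline stand-in with the same documented behavior (`cuda:{LOCAL_RANK}` under a launcher, `cuda:0` otherwise), plus a CPU fallback so the snippet runs on machines without a GPU; the `Linear` model is a placeholder.

```python
import os

import torch
import torch.nn as nn

# In real code: device = sft.get_device()
# Stand-in with the documented behavior, plus a CPU fallback:
local_rank = int(os.environ.get("LOCAL_RANK", "0"))
device = (
    torch.device(f"cuda:{local_rank}")
    if torch.cuda.is_available()
    else torch.device("cpu")
)

model = nn.Linear(128, 10).to(device)        # move the model to this process's device
batch = torch.randn(32, 128, device=device)  # create inputs on the same device
out = model(batch)
print(out.shape)  # torch.Size([32, 10])
```

Every process runs this same code; only the value of `LOCAL_RANK` differs, so each process lands on its own GPU.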
## Multi-Node Device Assignment

On multi-node setups, device assignment works seamlessly.

**2 Nodes × 4 GPUs Each (8 GPUs Total)**

| Node | Process | RANK | LOCAL_RANK | Device |
|---|---|---|---|---|
| 0 | 0 | 0 | 0 | cuda:0 |
| 0 | 1 | 1 | 1 | cuda:1 |
| 0 | 2 | 2 | 2 | cuda:2 |
| 0 | 3 | 3 | 3 | cuda:3 |
| 1 | 4 | 4 | 0 | cuda:0 |
| 1 | 5 | 5 | 1 | cuda:1 |
| 1 | 6 | 6 | 2 | cuda:2 |
| 1 | 7 | 7 | 3 | cuda:3 |
`get_device()` returns the correct local GPU for each process using `LOCAL_RANK`.
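The mapping in the table follows directly from `RANK = node * gpus_per_node + LOCAL_RANK`, which a short loop can verify (the variable names here are illustrative, not part of SF Tensor's API):

```python
gpus_per_node = 4

# Reproduce the table: 2 nodes x 4 GPUs = 8 global ranks.
for rank in range(8):
    node = rank // gpus_per_node        # which machine this process is on
    local_rank = rank % gpus_per_node   # which GPU on that machine
    print(f"node {node}, RANK {rank} -> cuda:{local_rank}")
```

For example, global RANK 5 sits on node 1 with `LOCAL_RANK` 1, so `get_device()` returns `cuda:1` there, exactly as in the table.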
## Best Practices

### Always Use get_device()

Always use `sft.get_device()` instead of hardcoding device assignments:
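A sketch of why hardcoding fails: with a hardcoded `cuda:0`, every process in a multi-GPU run piles onto GPU 0, while the `LOCAL_RANK`-based logic that `get_device()` encapsulates (reconstructed below as a hypothetical `resolve_device` helper) gives each process its own GPU.

```python
# BAD: every process lands on GPU 0, contending for its memory and compute.
# device = torch.device("cuda:0")

# GOOD: one call, correct on single-GPU and multi-GPU runs alike.
# device = sft.get_device()

# Hypothetical expansion of the selection logic, for illustration only:
def resolve_device(env: dict) -> str:
    local_rank = env.get("LOCAL_RANK")
    return f"cuda:{local_rank}" if local_rank is not None else "cuda:0"

print(resolve_device({}))                   # cuda:0  (single-GPU run)
print(resolve_device({"LOCAL_RANK": "2"}))  # cuda:2  (third process on its node)
```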
### Move Data with non_blocking=True

Use `non_blocking=True` when moving data to GPU for better performance:
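A minimal sketch of the pattern in plain PyTorch (the `sft.get_device()` call is swapped for a CPU-safe stand-in so the snippet runs anywhere). `non_blocking=True` lets the host-to-device copy overlap with other host work, but only when the source tensor lives in pinned (page-locked) memory; on a CPU-only machine the flag is simply a no-op.

```python
import torch

# In real code: device = sft.get_device()
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

batch = torch.randn(32, 128)
if torch.cuda.is_available():
    # Pin the source so the copy below can be truly asynchronous.
    batch = batch.pin_memory()

# Returns immediately; the CUDA stream orders the copy before dependent kernels.
batch = batch.to(device, non_blocking=True)
print(batch.device)
```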
### Create Tensors Directly on Device
When possible, create tensors directly on the device to avoid unnecessary transfers:
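A small illustration of the difference, with a CPU-safe stand-in for `sft.get_device()`: the first form allocates on the CPU and then pays a host-to-device copy, while the second allocates once, already on the right device.

```python
import torch

# In real code: device = sft.get_device()
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# BAD: allocates on the CPU, then copies the whole tensor to the device.
mask_slow = torch.zeros(1024, 1024).to(device)

# GOOD: a single allocation directly on the target device.
mask_fast = torch.zeros(1024, 1024, device=device)

print(mask_fast.device == mask_slow.device)  # True
```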
### Use pin_memory for DataLoader

Enable `pin_memory=True` in `DataLoader` for faster CPU-to-GPU transfer:
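A minimal sketch with a synthetic `TensorDataset` (the dataset contents and sizes are placeholders). With `pin_memory=True`, the loader hands out batches in page-locked memory, so a subsequent `.to(device, non_blocking=True)` copy can run asynchronously; the flag is enabled conditionally here so the snippet also runs on GPU-less machines.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset: 256 samples of 128 features with integer labels.
dataset = TensorDataset(torch.randn(256, 128), torch.randint(0, 10, (256,)))

loader = DataLoader(
    dataset,
    batch_size=32,
    shuffle=True,
    pin_memory=torch.cuda.is_available(),  # page-locked batches for async H2D copies
)

x, y = next(iter(loader))
print(x.shape, y.shape)  # torch.Size([32, 128]) torch.Size([32])
```

Pinning pairs naturally with the `non_blocking=True` transfer shown above: pinned source plus non-blocking copy is what actually overlaps data movement with computation.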