benchmark_weight_loading_comparison
A/B benchmark: independent-read vs rank-0-broadcast weight loading.
Compares two strategies:
- "before" (independent): every rank reads safetensors from disk to GPU
- "after" (broadcast): rank 0 reads from disk, broadcasts to other ranks
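The difference between the two strategies can be illustrated with a toy cost model. This is only a sketch, not the benchmark's code: `LoadCost`, `independent_cost`, and `broadcast_cost` are hypothetical names, and the model assumes one broadcast call per tensor (a real loader might batch tensors differently).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoadCost:
    disk_reads: int  # tensor reads issued against storage, summed over all ranks
    broadcasts: int  # collective broadcast calls (assumed: one per tensor)


def independent_cost(world_size: int, num_tensors: int) -> LoadCost:
    """'before': every rank reads every tensor itself, so reads scale with ranks."""
    return LoadCost(disk_reads=world_size * num_tensors, broadcasts=0)


def broadcast_cost(world_size: int, num_tensors: int) -> LoadCost:
    """'after': only rank 0 touches the disk; other ranks receive via broadcast."""
    # With a single rank there is nobody to broadcast to.
    broadcasts = num_tensors if world_size > 1 else 0
    return LoadCost(disk_reads=num_tensors, broadcasts=broadcasts)


if __name__ == "__main__":
    for world_size in (1, 2, 4):
        print(world_size,
              independent_cost(world_size, 100),
              broadcast_cost(world_size, 100))
```

Under this model the broadcast strategy trades repeated disk reads for interconnect traffic, which is the trade-off the benchmark measures.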
Usage
1 GPU
torchrun --nproc_per_node=1 fastvideo/models/loader/benchmarks/benchmark_weight_loading_comparison.py --model-path /path/to/model --subfolder transformer
2 GPUs
torchrun --nproc_per_node=2 fastvideo/models/loader/benchmarks/benchmark_weight_loading_comparison.py --model-path /path/to/model --subfolder transformer
4 GPUs
torchrun --nproc_per_node=4 fastvideo/models/loader/benchmarks/benchmark_weight_loading_comparison.py --model-path /path/to/model --subfolder transformer
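The three invocations differ only in `--nproc_per_node`, so a sweep over GPU counts can be generated rather than typed by hand. The helper below is a hypothetical convenience (`build_command` is not part of the benchmark); it only assembles the command lines shown above and prints them, leaving execution to the caller.

```python
import shlex

SCRIPT = "fastvideo/models/loader/benchmarks/benchmark_weight_loading_comparison.py"


def build_command(num_gpus: int, model_path: str, subfolder: str = "transformer") -> list[str]:
    """Return the torchrun argv for one benchmark run (hypothetical helper)."""
    return [
        "torchrun",
        f"--nproc_per_node={num_gpus}",
        SCRIPT,
        "--model-path", model_path,
        "--subfolder", subfolder,
    ]


if __name__ == "__main__":
    for num_gpus in (1, 2, 4):
        # shlex.join quotes arguments where needed, so the line is copy-paste safe.
        print(shlex.join(build_command(num_gpus, "/path/to/model")))
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` to launch the runs one after another.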
Functions
fastvideo.models.loader.benchmarks.benchmark_weight_loading_comparison.load_broadcast
After-PR behavior: rank 0 reads each tensor from disk and broadcasts it to the other ranks.
Source code in fastvideo/models/loader/benchmarks/benchmark_weight_loading_comparison.py
fastvideo.models.loader.benchmarks.benchmark_weight_loading_comparison.load_independent
Before-PR behavior: every rank reads every tensor from disk to GPU.
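The function bodies are not reproduced on this page. As a generic illustration of how two loaders could be timed against each other, here is a minimal sketch; `time_loader` and the stand-in loaders are hypothetical and do not reflect how the benchmark itself measures time.

```python
import statistics
import time
from typing import Callable


def time_loader(load_fn: Callable[[], object], warmup: int = 1, repeats: int = 3) -> float:
    """Return the median wall-clock seconds of `load_fn` over `repeats` runs."""
    for _ in range(warmup):
        load_fn()  # warm-up runs are executed but not recorded
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        load_fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


if __name__ == "__main__":
    # Stand-ins for the real loaders; they only sleep for illustration.
    def fake_independent() -> None:
        time.sleep(0.02)

    def fake_broadcast() -> None:
        time.sleep(0.01)

    print(f"independent: {time_loader(fake_independent):.3f}s")
    print(f"broadcast:   {time_loader(fake_broadcast):.3f}s")
```

When timing real GPU loads, two caveats apply: CUDA work is asynchronous, so the device should be synchronized before each clock read, and warm-up runs can populate the OS page cache, which tends to favor whichever strategy runs second.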