grad_norm_regression
Layer-0 grad-norm regression for the per-method training smoke tests.
Phase 2 / 5a-ii: layers a device-keyed grad-norm check on top of the
finite/non-zero grad assertions established in 5a-i. After one
single_train_step + backward, the L2 norm of transformer block 0's
trainable gradients is compared against a reference value pinned per GPU in
grad_norm_refs.json (next to this module).
Determinism: the harness seeds both the global RNG and the method's
cuda_generator via method.on_train_start() (training.data.seed in the
fixture), and the synthetic raw_batch is built after that call, so the
forward/backward is reproducible within bf16 reduction noise on a given GPU.
Why device-keyed: grad norms differ across GPU architectures (kernels,
accumulation order), so a single golden value can't cover every runner. The
JSON currently carries refs for the two GPUs we actually run on — L40S (CI)
and GB200 (our Blackwell dev box; B200 maps to the same key).
Seeding a reference for the current device:
- CI / L40S: invoke modal run against seed_grad_norm_references in fastvideo/tests/modal/pr_test.py (pinned to gpu="L40S:1"), then copy the recorded value from the log into grad_norm_refs.json.
- Local / non-L40S GPUs: on that workstation::

      FASTVIDEO_GRADNORM_UPDATE=1 \
          pytest fastvideo/tests/train/methods -vs -rs

  The harness writes the measured norm into grad_norm_refs.json under the device's key and skips the assertion for that run. Append a new substring entry to _DEVICE_MAPPINGS first for any device not already listed.
Functions:
fastvideo.tests.train.methods.grad_norm_regression.check_grad_norm_regression
Assert block-0 grad norm matches the device-keyed reference within rtol.
- Skips when the current GPU has no reference (unsupported device, or not yet seeded) so a new runner never hard-fails before its golden exists.
- With FASTVIDEO_GRADNORM_UPDATE=1, records/updates the reference for the current device instead of asserting.
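The skip / update / assert behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the module's code: the function name check_grad_norm_sketch, its arguments, the flat {device_key: norm} JSON layout, and the default rtol are all assumptions, and the real helper presumably calls pytest.skip where this sketch returns "skipped".

```python
import json
import os
from pathlib import Path
from typing import Optional


def check_grad_norm_sketch(
    measured: float,
    device_key: Optional[str],
    refs_path: Path,
    rtol: float = 0.05,  # assumed default; the real tolerance is not stated
) -> str:
    """Return "skipped", "updated", or "passed"; raise AssertionError on mismatch."""
    # Assumed layout: a flat JSON object mapping device key -> reference norm.
    refs = json.loads(refs_path.read_text()) if refs_path.exists() else {}

    # Unsupported device: nothing to compare against and no key to record under.
    if device_key is None:
        return "skipped"

    # Update mode: record the measured norm and do not assert on this run.
    if os.environ.get("FASTVIDEO_GRADNORM_UPDATE") == "1":
        refs[device_key] = measured
        refs_path.write_text(json.dumps(refs, indent=2, sort_keys=True) + "\n")
        return "updated"

    # Supported device whose reference has not been seeded yet.
    reference = refs.get(device_key)
    if reference is None:
        return "skipped"

    # Relative tolerance measured against the pinned reference value.
    assert abs(measured - reference) <= rtol * abs(reference), (
        f"grad norm {measured} deviates from reference {reference} "
        f"for {device_key} by more than rtol={rtol}"
    )
    return "passed"
```

Checking for an unsupported device before the update branch reflects the note that a new device needs a _DEVICE_MAPPINGS entry before it can be seeded.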
Source code in fastvideo/tests/train/methods/grad_norm_regression.py
fastvideo.tests.train.methods.grad_norm_regression.layer0_grad_norm
layer0_grad_norm(transformer) -> float
Global L2 norm of transformer block 0's trainable gradients.
Block 0 is the reference surface 5a-i already isolates: its grad is the last one produced during backprop, so a healthy value implies the whole forward + chain-rule path is intact.
Accumulates the squared sums on the GPU and does a single CPU-GPU sync
(.item()) at the end, rather than one per parameter.
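A minimal sketch of that accumulation pattern, assuming PyTorch. Unlike the real function, which takes the whole transformer, this hypothetical layer0_grad_norm_sketch takes block 0 directly, because how the module locates block 0 is not shown here; the float32 upcast is likewise an assumption.

```python
import torch
from torch import nn


def layer0_grad_norm_sketch(block0: nn.Module) -> float:
    """Global L2 norm over the trainable gradients of a single block."""
    total = None
    for param in block0.parameters():
        # Ignore frozen parameters and parameters that received no gradient.
        if not param.requires_grad or param.grad is None:
            continue
        # Upcast before squaring (an assumption) so bf16 grads do not lose precision.
        sq_sum = param.grad.detach().float().pow(2).sum()
        # Keep the running total as a tensor on the device; no host sync yet.
        total = sq_sum if total is None else total + sq_sum
    if total is None:
        return 0.0
    # The only device-to-host synchronization happens in this .item() call.
    return total.sqrt().item()
```

Calling .item() inside the loop would force one synchronization per parameter; keeping the running sum as a tensor defers it to the final line.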
fastvideo.tests.train.methods.grad_norm_regression.resolve_device_key
Map a CUDA device name to its reference key, or None if unsupported.
The substring match is case-insensitive so it survives driver/environment
differences in how torch.cuda.get_device_name capitalizes the model.
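The matching rule can be sketched as follows. The mapping contents, the key strings, and the helper name resolve_device_key_sketch are illustrative assumptions; the real _DEVICE_MAPPINGS table lives in the module and may differ, and the real function may query torch.cuda.get_device_name itself rather than take the name as an argument.

```python
from typing import Optional

# Hypothetical (lowercase substring, reference key) pairs; the actual entries
# in _DEVICE_MAPPINGS are not shown in this page.
_DEVICE_MAPPINGS_SKETCH = (
    ("l40s", "L40S"),
    ("b200", "GB200"),  # "b200" also matches names that contain "GB200"
)


def resolve_device_key_sketch(device_name: str) -> Optional[str]:
    """Return the reference key for a device name, or None if unsupported."""
    lowered = device_name.lower()  # case-insensitive comparison
    for substring, key in _DEVICE_MAPPINGS_SKETCH:
        if substring in lowered:
            return key
    return None
```

For example, resolve_device_key_sketch("NVIDIA L40S") and resolve_device_key_sketch("nvidia l40s") both resolve to the same key, while a name with no matching substring yields None, which is the case the regression check treats as a skip.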