nvfp4_qat_config ¶

NVFP4 quantization-aware training (QAT) linear method, inference path.

Quantizes the weight of every targeted linear layer to NVFP4 once at load time and runs each forward pass as a registered flashinfer-backed FP4 matmul. The original fp16/bf16 weight is popped immediately after quantization so the half-precision copy no longer occupies GPU memory. This is what lets a Wan-2.1 pipeline stay fully resident on a single GPU without any CPU offloading.

The quantize / matmul custom ops are owned by :mod:`fastvideo.layers.quantization.nvfp4_config` and registered under the `fastvideo_fp4::` namespace. We reuse them here for two reasons:

1. Re-registering the same op name in a second module would raise.
2. The registered ops have `register_fake` shape/dtype kernels, which is what lets the inference pipeline's per-block `torch.compile` trace through without graph breaks. Calling raw flashinfer functions (the old behavior of this file, plus a `@torch.compile` on `apply`) graph-breaks at every quantize and every matmul.

For QAT, see nvfp4_qat_train_config, which keeps the weight trainable and fake-quantizes it on the fly via a straight-through estimator.

Classes¶

fastvideo.layers.quantization.nvfp4_qat_config.NVFP4QATConfig ¶

NVFP4QATConfig(target_layers: tuple[str, ...] | None = None)

Bases: QuantizationConfig

NVFP4 (Wan-style) linear quantization, inference.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `target_layers` | `tuple[str, ...] \| None` | Substrings matched against each linear layer's prefix. A layer is quantized if any substring is contained in its prefix. Defaults to the standard Wan attention + FFN projections (:data:`DEFAULT_FP4_LAYERS`). | `None` |

Source code in fastvideo/layers/quantization/nvfp4_qat_config.py

def __init__(self, target_layers: tuple[str, ...] | None = None) -> None:
    super().__init__()
    self.target_layers = (tuple(target_layers) if target_layers else DEFAULT_FP4_LAYERS)
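
The matching rule and the fallback can be sketched in plain Python. The contents of `DEFAULT_FP4_LAYERS` below, the helper names `resolve_targets` and `is_targeted`, and the layer prefixes are all hypothetical; only the truthiness check and the substring rule come from the text above.

```python
# Hypothetical stand-in; the real DEFAULT_FP4_LAYERS is defined in FastVideo.
DEFAULT_FP4_LAYERS = ("to_q", "to_k", "to_v", "to_out", "ffn")


def resolve_targets(target_layers=None):
    # Mirrors the truthiness check in __init__: None and () both fall back
    # to the defaults, and any other iterable is frozen into a tuple.
    return tuple(target_layers) if target_layers else DEFAULT_FP4_LAYERS


def is_targeted(prefix, target_layers):
    # A layer is quantized if any substring occurs anywhere in its prefix.
    return any(sub in prefix for sub in target_layers)


targets = resolve_targets(["attn1.to_q", "ffn"])
```

Note that, under this check, an empty tuple cannot be used to disable quantization, because it selects the defaults.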

fastvideo.layers.quantization.nvfp4_qat_config.NVFP4QATQuantizeMethod ¶

NVFP4QATQuantizeMethod()

Bases: QuantizeMethodBase

Inference-only NVFP4 linear method with weight popping.

The dense weight parameter is materialized at load time only so that :func:`convert_model_to_fp4` can read it once; the loader then removes it via `mod._parameters.pop('weight')`. From that point forward, `apply` reads only `_fp4_weight` / `_fp4_weight_scale` / `_weight_global_sf`.
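
The effect of the pop on a module's bookkeeping can be seen on a small CPU layer. This is a sketch only: the zero-filled `uint8` tensor is a stand-in for the packed weight, which in the real path comes from the registered quantize op.

```python
import torch
from torch import nn

layer = nn.Linear(8, 4, bias=False)

# Stand-in for the packed NVFP4 weight (two 4-bit values per byte, so 8 -> 4).
packed = torch.zeros(4, 4, dtype=torch.uint8)
layer.register_buffer("_fp4_weight", packed, persistent=False)

# Remove the dense parameter from the module's registry and drop its grad.
removed = layer._parameters.pop("weight", None)
if removed is not None:
    removed.grad = None
del removed

param_names = [name for name, _ in layer.named_parameters()]
buffer_names = [name for name, _ in layer.named_buffers()]
state_keys = list(layer.state_dict().keys())
```

After the pop, the layer has no parameters left, and because the buffer is registered with `persistent=False` it does not appear in `state_dict()` either.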

Source code in fastvideo/layers/quantization/nvfp4_qat_config.py

def __init__(self) -> None:
    super().__init__()
    self.weight_fp4 = None
    self.weight_scale = None
    # Static input global scale factor. Matches the FastVideo-Quantization
    # production path; recomputing it per-call via a ``.max()`` reduction
    # (the previous behavior) adds a sync point, costs a kernel launch,
    # and produces a data-dependent value that prevents CUDA-graph
    # capture under ``torch.compile(mode='reduce-overhead')``.
    self.x_global_sf = torch.tensor(1.0, device="cuda", dtype=torch.float32)

Functions¶

fastvideo.layers.quantization.nvfp4_qat_config.convert_model_to_fp4 ¶

convert_model_to_fp4(model: Module) -> None

Prequantize every FP4-tagged linear and drop its dense weight.

Walks the module tree and, for each layer whose `quant_method` is an :class:`NVFP4QATQuantizeMethod`, computes the NVFP4 packed weight / scale / global-scale buffers, then pops the original fp16/bf16 `weight` parameter so it no longer occupies GPU memory.

Source code in fastvideo/layers/quantization/nvfp4_qat_config.py

def convert_model_to_fp4(model: torch.nn.Module) -> None:
    """Prequantize every FP4-tagged linear and drop its dense weight.

    Walks the module tree, and for each layer whose ``quant_method`` is
    an :class:`NVFP4QATQuantizeMethod`, computes the NVFP4 packed weight
    / scale / global-scale buffers, then pops the original fp16/bf16
    ``weight`` parameter so it no longer occupies GPU memory.
    """
    SfLayout, _, _ = _require_flashinfer()
    from torch.distributed.tensor import DTensor  # type: ignore

    with torch.no_grad():
        for mod in model.modules():
            qm = getattr(mod, "quant_method", None)
            if not isinstance(qm, NVFP4QATQuantizeMethod):
                continue

            weight = getattr(mod, "weight", None)
            if weight is None:
                continue

            weight_local = weight.to_local() if isinstance(weight, DTensor) else weight  # type: ignore[arg-type]

            # Only the reduced scalar needs fp32; avoid a full fp32 copy.
            weight_absmax = (weight_local.detach().abs().nan_to_num().amax().to(dtype=torch.float32))
            weight_global_sf = (448 * 6) / weight_absmax
            fp4_w, fp4_s = _nvfp4_quantize(
                weight_local,
                weight_global_sf,
                sfLayout=SfLayout.layout_128x4,
                do_shuffle=False,
            )
            mod.register_buffer("_fp4_weight", fp4_w, persistent=False)
            mod.register_buffer("_fp4_weight_scale", fp4_s, persistent=False)
            mod.register_buffer(
                "_weight_global_sf",
                weight_global_sf.to(dtype=torch.bfloat16),
                persistent=False,
            )

            # Drop the dense weight as soon as the fp4 buffers are installed
            # so it cannot keep occupying GPU memory.
            removed_weight = mod._parameters.pop("weight", None)
            if removed_weight is not None:
                removed_weight.grad = None
            del removed_weight, weight, weight_local, weight_absmax
    gc.collect()
    torch.cuda.empty_cache()
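
The constant `448 * 6` in the global scale factor can be read as the largest finite FP8 E4M3 value (448) times the largest FP4 E2M1 magnitude (6): with that scale, the per-block scale of the block holding the tensor-wide absmax lands exactly at the top of the FP8 range. The pure-Python sketch below illustrates the arithmetic under simplifying assumptions. It keeps block scales in full precision instead of rounding them to FP8, ignores the packed 128x4 scale layout, does not handle an all-zero tensor, and its helper names are hypothetical. It is not the flashinfer kernel.

```python
FP8_E4M3_MAX = 448.0  # largest finite FP8 E4M3 value
FP4_E2M1_MAX = 6.0    # largest FP4 E2M1 magnitude
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)  # representable magnitudes


def global_scale(weights):
    # Same formula as weight_global_sf above: (448 * 6) / absmax.
    absmax = max(abs(w) for w in weights)
    return (FP8_E4M3_MAX * FP4_E2M1_MAX) / absmax


def block_scale(block, global_sf):
    # Per-block scale; at most 448 when the block holds the tensor absmax.
    return max(abs(v) for v in block) * global_sf / FP4_E2M1_MAX


def round_to_e2m1(x):
    # Nearest representable magnitude (ties go to the smaller one here,
    # which is an assumption of this sketch), with the sign restored.
    mag = min(E2M1_GRID, key=lambda g: abs(g - abs(x)))
    return -mag if x < 0 else mag


def fake_quantize_block(block, global_sf):
    # Quantize to the E2M1 grid, then dequantize back to real values.
    scale = block_scale(block, global_sf)
    if scale == 0.0:
        return [0.0 for _ in block]
    return [round_to_e2m1(v * global_sf / scale) * scale / global_sf for v in block]


weights = [0.5, -0.25, 0.1, 0.0]
gsf = global_scale(weights)
deq = fake_quantize_block(weights, gsf)
```

In this example `0.5` and `-0.25` fall exactly on the grid and survive the round trip, while `0.1` is pulled to the nearest grid point.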