Optimizations¶

This page describes the various options for speeding up generation times in FastVideo.

Table of Contents¶

Optimized Attention Backends

Attention Backends¶

Available Backends¶

Torch SDPA: FASTVIDEO_ATTENTION_BACKEND=TORCH_SDPA
Flash Attention 2 and 3: FASTVIDEO_ATTENTION_BACKEND=FLASH_ATTN
Video Sparse Attention: FASTVIDEO_ATTENTION_BACKEND=VIDEO_SPARSE_ATTN
Sage Attention: FASTVIDEO_ATTENTION_BACKEND=SAGE_ATTN
Sage Attention 3: FASTVIDEO_ATTENTION_BACKEND=SAGE_ATTN_THREE
Video MoBA Attention: FASTVIDEO_ATTENTION_BACKEND=VMOBA_ATTN
Sparse Linear Attention: FASTVIDEO_ATTENTION_BACKEND=SLA_ATTN
SageSLA Attention: FASTVIDEO_ATTENTION_BACKEND=SAGE_SLA_ATTN
Sliding Tile Attention (archived branch only): FASTVIDEO_ATTENTION_BACKEND=SLIDING_TILE_ATTN

Configuring Backends¶

There are two ways to configure the attention backend in FastVideo.

1. In Python¶

In python, set the FASTVIDEO_ATTENTION_BACKEND environment variable before instantiating VideoGenerator like this:

os.environ["FASTVIDEO_ATTENTION_BACKEND"] = "VIDEO_SPARSE_ATTN"

2. In CLI¶

You can also set the environment variable on the command line:

FASTVIDEO_ATTENTION_BACKEND=SAGE_ATTN python example.py

Flash Attention¶

FLASH_ATTN

We recommend always installing Flash Attention 2:

pip install flash-attn==2.7.4.post1 --no-build-isolation

And if using a Hopper+ GPU (ie H100), installing Flash Attention 3 by compiling it from source (takes about 10 minutes for me):

git clone https://github.com/Dao-AILab/flash-attention.git && cd flash-attention

cd hopper
pip install ninja
python setup.py install

Sliding Tile Attention (Archived)¶

SLIDING_TILE_ATTN

The full STA integration in fastvideo/ is archived from main and preserved at:

https://github.com/hao-ai-lab/FastVideo/tree/sta_do_not_delete

We keep STA off main because we believe VSA is strictly better than STA for the actively maintained FastVideo path.

Kernel code in fastvideo-kernel is still retained. For mask search and STA inference workflow, see STA docs.

Video Sparse Attention¶

VIDEO_SPARSE_ATTN

Video Sparse Attention is provided by fastvideo-kernel. See VSA docs for installation details.

Sage Attention¶

SAGE_ATTN

To use SageAttention 2.1.1, please compile from source:

git clone https://github.com/thu-ml/SageAttention.git
cd sageattention
python setup.py install  # or pip install -e .

Sage Attention 3¶

SAGE_ATTN_THREE

SageAttention 3 is an advanced attention mechanism that leverages FP4 quantization and Blackwell GPU Tensor Cores for significant performance improvements.

Hardware Requirements¶

RTX5090

Installation¶

Note that Sage Attention 3 requires python>=3.13, torch>=2.8.0, CUDA >=12.8. If you are using uv and using torch==2.8.0 make sure that sentencepiece==0.2.1 in the pyproject.toml file.

To use Sage Attention 3 in FastVideo, follow the README.md in the linked repository to install the package from source.

V-MoBA / SLA / SageSLA¶

These backends are model-specific and require the corresponding kernels and dependencies. Use the support matrix and model examples to confirm compatibility before enabling them.

Benchmarking different optimizations¶

To benchmark backend performance, generate the same prompt with the same seed and compare end-to-end generation times:

import os
import time

for backend in ["TORCH_SDPA", "FLASH_ATTN", "SAGE_ATTN"]:
    os.environ["FASTVIDEO_ATTENTION_BACKEND"] = backend
    generator = VideoGenerator.from_pretrained("your-model-id")
    start_time = time.perf_counter()
    generator.generate_video(
        prompt="Your prompt",
        seed=1024,
    )
    elapsed = time.perf_counter() - start_time
    print(f"{backend}: {elapsed:.2f}s")

Note: reinstantiate VideoGenerator after changing FASTVIDEO_ATTENTION_BACKEND.