Optimizations

This page describes the various options for speeding up generation times in FastVideo.

Attention Backends

Available Backends

  • Torch SDPA: FASTVIDEO_ATTENTION_BACKEND=TORCH_SDPA
  • Flash Attention 2 and 3: FASTVIDEO_ATTENTION_BACKEND=FLASH_ATTN
  • Video Sparse Attention: FASTVIDEO_ATTENTION_BACKEND=VIDEO_SPARSE_ATTN
  • Sage Attention: FASTVIDEO_ATTENTION_BACKEND=SAGE_ATTN
  • Sage Attention 3: FASTVIDEO_ATTENTION_BACKEND=SAGE_ATTN_THREE
  • Video MoBA Attention: FASTVIDEO_ATTENTION_BACKEND=VMOBA_ATTN
  • Sparse Linear Attention: FASTVIDEO_ATTENTION_BACKEND=SLA_ATTN
  • SageSLA Attention: FASTVIDEO_ATTENTION_BACKEND=SAGE_SLA_ATTN
  • Sliding Tile Attention (archived branch only): FASTVIDEO_ATTENTION_BACKEND=SLIDING_TILE_ATTN

Configuring Backends

There are two ways to configure the attention backend in FastVideo.

1. In Python

In Python, set the FASTVIDEO_ATTENTION_BACKEND environment variable before instantiating VideoGenerator:

os.environ["FASTVIDEO_ATTENTION_BACKEND"] = "VIDEO_SPARSE_ATTN"
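For a complete, runnable version, here is a minimal sketch; the model ID is the same placeholder used in the benchmarking example below, and the top-level fastvideo import path is assumed:

import os

# Set the backend before creating the generator; FastVideo reads the
# variable when VideoGenerator is instantiated.
os.environ["FASTVIDEO_ATTENTION_BACKEND"] = "VIDEO_SPARSE_ATTN"

from fastvideo import VideoGenerator

generator = VideoGenerator.from_pretrained("your-model-id")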

2. From the CLI

You can also set the environment variable on the command line:

FASTVIDEO_ATTENTION_BACKEND=SAGE_ATTN python example.py

Flash Attention

FLASH_ATTN

We recommend always installing Flash Attention 2:

pip install flash-attn==2.7.4.post1 --no-build-isolation
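To confirm the install, check that the package imports and reports the expected version:

import flash_attn

# Should print 2.7.4.post1 for the pinned wheel above
print(flash_attn.__version__)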

And if using a Hopper+ GPU (e.g., an H100), install Flash Attention 3 by compiling it from source (the build takes roughly 10 minutes):

git clone https://github.com/Dao-AILab/flash-attention.git && cd flash-attention
cd hopper
pip install ninja
python setup.py install
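To verify the build, a hedged import check; the module name flash_attn_interface is an assumption based on upstream packaging of the hopper build, which has changed over time, so confirm it against the upstream README:

# Hedged check for the FA3 hopper build
try:
    import flash_attn_interface  # noqa: F401
    print("Flash Attention 3 is available")
except ImportError:
    print("Flash Attention 3 not installed")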

Sliding Tile Attention (Archived)

SLIDING_TILE_ATTN

The full STA integration in fastvideo/ has been removed from main and is preserved at:

  • https://github.com/hao-ai-lab/FastVideo/tree/sta_do_not_delete

We keep STA off main because we believe VSA is strictly better than STA for the actively maintained FastVideo path.

The kernel code in fastvideo-kernel is still retained. For mask search and the STA inference workflow, see the STA docs.

Video Sparse Attention

VIDEO_SPARSE_ATTN

Video Sparse Attention is provided by fastvideo-kernel. See the VSA docs for installation details.

Sage Attention

SAGE_ATTN

To use SageAttention 2.1.1, please compile from source:

git clone https://github.com/thu-ml/SageAttention.git
cd SageAttention
python setup.py install  # or pip install -e .
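As a quick sanity check that the build succeeded (the repo installs the sageattention module, whose main entry point is sageattn):

from sageattention import sageattn  # core fused attention kernel

print("SageAttention import OK:", callable(sageattn))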

Sage Attention 3

SAGE_ATTN_THREE

SageAttention 3 is an advanced attention mechanism that leverages FP4 quantization and Blackwell GPU Tensor Cores for significant performance improvements.

Hardware Requirements

  • NVIDIA RTX 5090 (Blackwell)

Installation

Note that Sage Attention 3 requires python>=3.13, torch>=2.8.0, and CUDA>=12.8. If you are using uv with torch==2.8.0, make sure that sentencepiece==0.2.1 is pinned in the pyproject.toml file.
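A minimal preflight sketch for these requirements; packaging is a common transitive dependency but not guaranteed to be installed, and the checks simply mirror the versions listed above:

import sys

import torch
from packaging.version import Version

# Requirements listed above: python>=3.13, torch>=2.8.0, CUDA>=12.8
assert sys.version_info >= (3, 13), "Sage Attention 3 needs Python >= 3.13"
assert Version(torch.__version__.split("+")[0]) >= Version("2.8.0"), \
    "Sage Attention 3 needs torch >= 2.8.0"
print("torch CUDA version:", torch.version.cuda)  # should be >= 12.8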

To use Sage Attention 3 in FastVideo, follow the README.md in the SageAttention repository linked above to install the package from source.

V-MoBA / SLA / SageSLA

These backends are model-specific and require the corresponding kernels and dependencies. Use the support matrix and model examples to confirm compatibility before enabling them.

Benchmarking different optimizations

To benchmark backend performance, generate the same prompt with the same seed and compare end-to-end generation times:

import os
import time

from fastvideo import VideoGenerator

for backend in ["TORCH_SDPA", "FLASH_ATTN", "SAGE_ATTN"]:
    # Set the backend before creating the generator; the variable is
    # read when VideoGenerator is instantiated.
    os.environ["FASTVIDEO_ATTENTION_BACKEND"] = backend
    generator = VideoGenerator.from_pretrained("your-model-id")
    # Time generation only, excluding model loading.
    start_time = time.perf_counter()
    generator.generate_video(
        prompt="Your prompt",
        seed=1024,
    )
    elapsed = time.perf_counter() - start_time
    print(f"{backend}: {elapsed:.2f}s")

Note: FASTVIDEO_ATTENTION_BACKEND is read when VideoGenerator is instantiated, so the generator must be recreated after changing it. For a fairer comparison, consider discarding each backend's first run, which may include one-time kernel compilation and caching.