ssim

Modules

fastvideo.tests.ssim.conftest

Functions:

fastvideo.tests.ssim.conftest.pytest_collection_modifyitems
pytest_collection_modifyitems(config, items)

Optionally keep only tests with a matching model_id parameter.

Source code in fastvideo/tests/ssim/conftest.py
def pytest_collection_modifyitems(config, items):
    """Optionally keep only tests with a matching model_id parameter."""
    model_id = os.environ.get("FASTVIDEO_SSIM_MODEL_ID")
    if not model_id:
        return

    selected = []
    deselected = []
    for item in items:
        callspec = getattr(item, "callspec", None)
        if callspec is None:
            deselected.append(item)
            continue
        if callspec.params.get("model_id") == model_id:
            selected.append(item)
        else:
            deselected.append(item)

    if deselected:
        config.hook.pytest_deselected(items=deselected)
    items[:] = selected
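The selection rule can be exercised without pytest. The sketch below is illustrative only: `partition_by_model_id` and the `SimpleNamespace` stand-ins are hypothetical and not part of FastVideo; they mimic just the `callspec.params` attribute that the hook reads.

```python
from types import SimpleNamespace


def partition_by_model_id(items, model_id):
    """Split items into (selected, deselected) using the same rule as the hook."""
    selected, deselected = [], []
    for item in items:
        callspec = getattr(item, "callspec", None)
        if callspec is not None and callspec.params.get("model_id") == model_id:
            selected.append(item)
        else:
            # Covers non-parametrized tests and tests with a different model_id.
            deselected.append(item)
    return selected, deselected


# Hypothetical stand-ins for collected pytest items.
items = [
    SimpleNamespace(name="test_a[model-x]",
                    callspec=SimpleNamespace(params={"model_id": "model-x"})),
    SimpleNamespace(name="test_a[model-y]",
                    callspec=SimpleNamespace(params={"model_id": "model-y"})),
    SimpleNamespace(name="test_plain"),  # not parametrized: no callspec attribute
]

selected, deselected = partition_by_model_id(items, "model-x")
print([item.name for item in selected])
print([item.name for item in deselected])
```

Note that a test without a `model_id` parameter is deselected as soon as `FASTVIDEO_SSIM_MODEL_ID` is set, so the variable is best left unset when running the whole suite.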

fastvideo.tests.ssim.latent_similarity_utils

Latent-space regression helpers for numerically fragile SSIM tests.

Motivation

Pixel-space SSIM is a poor regression signal for distilled / few-step models (e.g. LTX-2 distilled): a single mis-rounded bf16 accumulator in the VAE decoder can drive mean SSIM from ~0.95 to ~0.50 without any real quality regression.

The approach is inspired by diffusers' "small signature slice + bounded full-tensor distance" testing philosophy, but is applied here to the pre-VAE latent rather than to the decoded pixel/audio output. The corresponding diffusers tests are:

  • tests/pipelines/ltx2/test_ltx2.py (diffusers) compares output_type='pt' (pixel) slices via torch.allclose(generated_slice, expected_slice, atol=1e-4);
  • tests/pipelines/stable_audio/test_stable_audio.py (diffusers) compares decoded audio samples via np.abs(expected - actual).max() < 1.5e-3;
  • tests/pipelines/cogvideo/test_cogvideox.py (diffusers) compares full pixel video tensors via numpy_cosine_similarity_distance(...) < 1e-3.

Diffusers does not assert on latents directly — that is a FastVideo adaptation. Distilled few-step pipelines amplify per-step bf16 noise enough that VAE-decoded comparisons are unreliable across our heterogeneous CI pool, so we move the assertion upstream of the VAE.

Design
  • Inference is run with output_type='latent' so DecodingStage hands back the un-decoded latent on result["samples"].
  • The reference artefact is a .pt bundle (tensor + metadata) hosted on the same HF dataset as the mp4 references, selected by <GPU>_reference_videos/<model_id>/<backend>/<prompt>.pt.
  • Two assertions are performed:
      1. A small signature slice (default video: latent[0, :, 0, :3, :3]; audio: latent[0, :, :8]) is compared via cosine distance with a loose tolerance. This is the primary pass/fail gate.
      2. The full latent is compared via cosine distance with a slightly looser tolerance, guarding against shape-correct but globally drifted outputs.
  • Tolerances default to 5e-3 (slice) and 1e-2 (full). diffusers uses 1e-3 against deterministic CPU dummy components; we relax for cross-GPU-arch bf16 differences on the rented CI pool (A40/L40S/H100/B200).
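The two gates described above can be sketched in plain Python on nested lists. This is a minimal illustration under stated assumptions, not the module's implementation: cosine distance is taken to be `1 - cosine similarity`, the helper names (`flatten`, `video_signature_slice`, `check_latent`, `make_latent`) and the metric keys are hypothetical, and the real code operates on torch tensors.

```python
import math


def flatten(nested):
    """Flatten an arbitrarily nested list of numbers into a flat list of floats."""
    if isinstance(nested, (int, float)):
        return [float(nested)]
    flat = []
    for element in nested:
        flat.extend(flatten(element))
    return flat


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length flat sequences."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def video_signature_slice(latent):
    """Nested-list equivalent of latent[0, :, 0, :3, :3] for a [B, C, T, H, W] latent."""
    return flatten([[row[:3] for row in channel[0][:3]] for channel in latent[0]])


def check_latent(generated, reference, slice_tol=5e-3, full_tol=1e-2):
    """Apply both gates and return the two distances (key names are hypothetical)."""
    metrics = {
        "slice_cosine": cosine_distance(video_signature_slice(generated),
                                        video_signature_slice(reference)),
        "full_cosine": cosine_distance(flatten(generated), flatten(reference)),
    }
    assert metrics["slice_cosine"] <= slice_tol, metrics
    assert metrics["full_cosine"] <= full_tol, metrics
    return metrics


def make_latent(noise=0.0):
    """Build a toy [1, 2, 1, 4, 4] latent with an optional per-element perturbation."""
    return [[[[[1.0 + 0.5 * c + 0.1 * h + 0.01 * w + noise * (-1) ** (h + w)
                for w in range(4)]
               for h in range(4)]
              for _t in range(1)]
             for c in range(2)]]


reference = make_latent()
generated = make_latent(noise=1e-3)
metrics = check_latent(generated, reference)
print(metrics)
```

Because cosine distance ignores overall scale, a small perturbation such as the one above barely moves either metric, while a sign-flipped or otherwise misdirected latent is caught immediately.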

The helper intentionally reuses build_init_kwargs / build_generation_kwargs from inference_similarity_utils so model params (vae tiling, sp_size, flow shift, …) flow through a single source of truth.

Classes

Functions:

fastvideo.tests.ssim.latent_similarity_utils.load_latent_reference
load_latent_reference(path: str) -> dict[str, Any]

Inverse of save_latent_reference; always loads to cpu.

Enforces format_version == LATENT_REFERENCE_FORMAT_VERSION so a schema change forces a deliberate reseed instead of silently misinterpreting old artefacts.

Source code in fastvideo/tests/ssim/latent_similarity_utils.py
def load_latent_reference(path: str) -> dict[str, Any]:
    """Inverse of :func:`save_latent_reference` — always loads to cpu.

    Enforces ``format_version == LATENT_REFERENCE_FORMAT_VERSION`` so a
    schema change forces a deliberate reseed instead of silently
    misinterpreting old artefacts.
    """
    # ``weights_only=False`` is required because the payload is a dict of
    # tensors + plain-Python metadata (slice_spec, prompt, ...). The trust
    # boundary is the controlled HF dataset configured via
    # FASTVIDEO_SSIM_REFERENCE_HF_REPO (default
    # FastVideo/ssim-reference-videos), which is org-write-gated.
    payload = torch.load(path, map_location="cpu", weights_only=False)
    fmt = payload.get("format_version") if isinstance(payload, dict) else None
    if fmt != LATENT_REFERENCE_FORMAT_VERSION:
        raise ValueError(
            f"Latent reference at {path!r} has format_version={fmt!r}; "
            f"expected {LATENT_REFERENCE_FORMAT_VERSION}. Re-seed via the "
            "test that produces this artefact, then re-upload through "
            "fastvideo/tests/ssim/reference_videos_cli.py upload.")
    return payload
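The version gate can be looked at in isolation from `torch.load`. The following sketch copies only the check; the constant's value (`1`) and the helper names `check_format_version` / `is_accepted` are assumptions made for illustration.

```python
LATENT_REFERENCE_FORMAT_VERSION = 1  # assumed value; the real constant lives in the module


def check_format_version(payload, path="<in-memory>"):
    """Reject payloads whose format_version does not match the expected one."""
    fmt = payload.get("format_version") if isinstance(payload, dict) else None
    if fmt != LATENT_REFERENCE_FORMAT_VERSION:
        raise ValueError(
            f"Latent reference at {path!r} has format_version={fmt!r}; "
            f"expected {LATENT_REFERENCE_FORMAT_VERSION}.")
    return payload


def is_accepted(payload):
    """Return True if the payload passes the gate, False if it raises ValueError."""
    try:
        check_format_version(payload)
        return True
    except ValueError:
        return False


current = {"format_version": LATENT_REFERENCE_FORMAT_VERSION, "latent": [0.0]}
stale = {"format_version": LATENT_REFERENCE_FORMAT_VERSION - 1, "latent": [0.0]}
unversioned = {"latent": [0.0]}
not_a_dict = [0.0]  # e.g. a bare tensor saved without the bundle wrapper

for name, payload in [("current", current), ("stale", stale),
                      ("unversioned", unversioned), ("non-dict", not_a_dict)]:
    print(name, is_accepted(payload))
```

Treating a missing key and a non-dict payload the same way as a stale version means every unexpected artefact fails with the same actionable message.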

fastvideo.tests.ssim.latent_similarity_utils.run_text_to_latent_similarity_test
run_text_to_latent_similarity_test(*, logger: Logger, script_dir: str, device_reference_folder: str, prompt: str, attention_backend_name: str, model_id: str, default_params_map: dict[str, dict[str, object]], full_quality_params_map: dict[str, dict[str, object]], slice_cosine_threshold: float = 0.005, full_cosine_threshold: float = 0.01, init_kwargs_override: dict[str, object] | None = None, generation_kwargs_override: dict[str, object] | None = None, slice_spec: dict[str, Any] | None = None) -> dict[str, float]

Run T2V (or T2A) inference with output_type='latent' and compare to a reference latent.

Returns the computed metrics dict on success. Raises AssertionError if any cosine tolerance is exceeded and FileNotFoundError if the reference artefact is missing.

Source code in fastvideo/tests/ssim/latent_similarity_utils.py
def run_text_to_latent_similarity_test(
    *,
    logger: Logger,
    script_dir: str,
    device_reference_folder: str,
    prompt: str,
    attention_backend_name: str,
    model_id: str,
    default_params_map: dict[str, dict[str, object]],
    full_quality_params_map: dict[str, dict[str, object]],
    slice_cosine_threshold: float = 5e-3,
    full_cosine_threshold: float = 1e-2,
    init_kwargs_override: dict[str, object] | None = None,
    generation_kwargs_override: dict[str, object] | None = None,
    slice_spec: dict[str, Any] | None = None,
) -> dict[str, float]:
    """Run T2V (or T2A) inference with ``output_type='latent'`` and
    compare to a reference latent.

    Returns the computed metrics dict on success. Raises
    ``AssertionError`` if any cosine tolerance is exceeded and
    ``FileNotFoundError`` if the reference artefact is missing.
    """
    spec = slice_spec if slice_spec is not None else DEFAULT_SLICE_SPEC
    with attention_backend(attention_backend_name):
        output_dir = build_generated_output_dir(
            script_dir,
            device_reference_folder,
            model_id,
            attention_backend_name,
        )
        prompt_prefix = prompt[:100].strip()
        output_latent_name = f"{prompt_prefix}{LATENT_REFERENCE_EXTENSION}"
        os.makedirs(output_dir, exist_ok=True)

        params_map = select_ssim_params(
            default_params_map,
            full_quality_params_map,
        )
        base_params = params_map[model_id]
        num_inference_steps = int(base_params["num_inference_steps"])

        init_kwargs = build_init_kwargs(base_params)
        if init_kwargs_override:
            init_kwargs.update(init_kwargs_override)
        # Always wins: the helper exists specifically to compare on latents,
        # so an override can never silently turn it back into a pixel run.
        init_kwargs["output_type"] = "latent"

        generation_kwargs = build_generation_kwargs(
            base_params,
            num_inference_steps,
            output_dir,
        )
        # We serialize latents ourselves; skip the RGB encoder path.
        generation_kwargs["save_video"] = False
        generation_kwargs["return_frames"] = True
        if generation_kwargs_override:
            generation_kwargs.update(generation_kwargs_override)

        generator: VideoGenerator | None = None
        try:
            generator = VideoGenerator.from_pretrained(
                model_path=base_params["model_path"],
                **init_kwargs,
            )
            result = generator.generate_video(prompt, **generation_kwargs)
        finally:
            shutdown_executor(generator)

    gen_latent = _extract_latent_from_result(result)

    generated_latent_path = os.path.join(output_dir, output_latent_name)
    save_latent_reference(
        generated_latent_path,
        gen_latent,
        metadata={
            "prompt": prompt,
            "model_id": model_id,
            "attention_backend": attention_backend_name,
            "num_inference_steps": num_inference_steps,
        },
        slice_spec=spec,
    )
    logger.info("Saved generated latent to %s", generated_latent_path)

    reference_folder = build_reference_folder_path(
        script_dir,
        device_reference_folder,
        model_id,
        attention_backend_name,
    )
    if not os.path.exists(reference_folder):
        raise FileNotFoundError(
            f"Reference folder does not exist: {reference_folder}\n"
            f"To download references, run:\n"
            f"  python fastvideo/tests/ssim/reference_videos_cli.py download")

    reference_latent_path = os.path.join(
        reference_folder,
        output_latent_name,
    )
    if not os.path.exists(reference_latent_path):
        raise FileNotFoundError(
            "Reference latent missing for prompt/backend: "
            f"{reference_latent_path}")

    return _assert_latent_similarity(
        logger=logger,
        gen_latent=gen_latent,
        reference_path=reference_latent_path,
        slice_cosine_threshold=slice_cosine_threshold,
        full_cosine_threshold=full_cosine_threshold,
        model_id=model_id,
        attention_backend_name=attention_backend_name,
        output_dir=output_dir,
        generated_path=generated_latent_path,
        num_inference_steps=num_inference_steps,
        prompt=prompt,
    )
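The override precedence in the listing above is easy to misread, so here is a small sketch of just the merge order. The function names are hypothetical; only the ordering of the `update` calls and forced keys is taken from the source.

```python
def merge_init_kwargs(base, override=None):
    """Apply the caller's override first, then force latent output last."""
    merged = dict(base)
    if override:
        merged.update(override)
    merged["output_type"] = "latent"  # set after the override, so it always wins
    return merged


def merge_generation_kwargs(base, override=None):
    """Set the helper's defaults first, then let the caller's override win."""
    merged = dict(base)
    merged["save_video"] = False
    merged["return_frames"] = True
    if override:
        merged.update(override)
    return merged


init_kwargs = merge_init_kwargs(
    {"num_gpus": 1},
    {"num_gpus": 2, "output_type": "pil"},  # the output_type entry is ignored
)
generation_kwargs = merge_generation_kwargs(
    {"seed": 1024},
    {"save_video": True},  # generation overrides are applied last and do win
)
print(init_kwargs)
print(generation_kwargs)
```

The asymmetry is worth keeping in mind: `init_kwargs_override` cannot switch the run back to pixel output, whereas `generation_kwargs_override` can replace `save_video` and `return_frames`.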

fastvideo.tests.ssim.latent_similarity_utils.save_latent_reference
save_latent_reference(path: str, latent: Tensor, *, metadata: dict[str, Any], slice_spec: dict[str, Any] | None = None) -> None

Persist a latent bundle to path.

Storage format (dict pickled via torch.save):

  • latent: full latent as fp16 on cpu
  • shape: original shape (list)
  • dtype_original: str
  • expected_slice: fp32 1-D signature slice
  • slice_spec: dict describing how the slice was built
  • metadata: caller-provided context (prompt, backend, steps, …)
  • format_version: int

fp16 is lossy but bounded; it keeps reference artefacts small (a few MB per prompt) while preserving enough dynamic range for cosine-based regression. Slice values stay fp32 because the primary assertion is computed against them.

Source code in fastvideo/tests/ssim/latent_similarity_utils.py
def save_latent_reference(
    path: str,
    latent: torch.Tensor,
    *,
    metadata: dict[str, Any],
    slice_spec: dict[str, Any] | None = None,
) -> None:
    """Persist a latent bundle to ``path``.

    Storage format (dict pickled via ``torch.save``):

    * ``latent``: full latent as fp16 on cpu
    * ``shape``: original shape (list)
    * ``dtype_original``: str
    * ``expected_slice``: fp32 1-D signature slice
    * ``slice_spec``: dict describing how the slice was built
    * ``metadata``: caller-provided context (prompt, backend, steps, …)
    * ``format_version``: int

    fp16 is lossy but bounded; it keeps ref artefacts small (~a few MB
    per prompt) while preserving enough dynamic range for cosine-based
    regression. Slice values stay fp32 because the primary assertion is
    computed against them.
    """
    spec = slice_spec if slice_spec is not None else DEFAULT_SLICE_SPEC
    parent = os.path.dirname(path)
    if parent:
        os.makedirs(parent, exist_ok=True)

    latent_cpu = latent.detach().to("cpu")
    payload: dict[str, Any] = {
        "latent": latent_cpu.to(torch.float16),
        "shape": list(latent_cpu.shape),
        "dtype_original": str(latent_cpu.dtype),
        "expected_slice": _extract_expected_slice(latent_cpu, spec),
        "slice_spec": spec,
        "metadata": metadata,
        "format_version": LATENT_REFERENCE_FORMAT_VERSION,
    }
    torch.save(payload, path)
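The "lossy but bounded" claim about fp16 storage can be checked with the standard library alone, since `struct` supports IEEE 754 half precision through the `e` format. This is a rough sketch on made-up values, assuming they lie in fp16's normal range; it does not use the module's own helpers.

```python
import math
import struct


def to_fp16_and_back(value):
    """Round a Python float to IEEE 754 half precision and convert it back."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length flat sequences."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Toy "latent" values with alternating sign, all well inside fp16's normal range.
original = [(-1) ** i * (0.05 + 0.1 * i) for i in range(64)]
stored = [to_fp16_and_back(value) for value in original]

max_rel_error = max(abs(s - o) / abs(o) for s, o in zip(stored, original))
distance = cosine_distance(original, stored)

print(f"max relative error: {max_rel_error:.2e}")
print(f"cosine distance:    {distance:.2e}")
```

For normal-range values the per-element relative rounding error is at most 2^-11, and the cosine distance introduced by the round trip scales with the square of that error, which keeps it well below the 5e-3 and 1e-2 defaults.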

fastvideo.tests.ssim.latent_similarity_utils.write_latent_similarity_results
write_latent_similarity_results(output_dir: str, metrics: dict[str, float], *, reference_path: str, generated_path: str, num_inference_steps: int, prompt: str, model_id: str, attention_backend_name: str, slice_spec: dict[str, Any], slice_cosine_threshold: float, full_cosine_threshold: float, passed: bool) -> bool

Persist latent regression metrics next to the generated artefact.

Mirrors fastvideo.tests.utils.write_ssim_results so downstream CI tooling can scrape one schema for both pixel and latent runs. The filename is steps{N}_{prompt[:100]}_latent.json.

Source code in fastvideo/tests/ssim/latent_similarity_utils.py
def write_latent_similarity_results(
    output_dir: str,
    metrics: dict[str, float],
    *,
    reference_path: str,
    generated_path: str,
    num_inference_steps: int,
    prompt: str,
    model_id: str,
    attention_backend_name: str,
    slice_spec: dict[str, Any],
    slice_cosine_threshold: float,
    full_cosine_threshold: float,
    passed: bool,
) -> bool:
    """Persist latent regression metrics next to the generated artefact.

    Mirrors :func:`fastvideo.tests.utils.write_ssim_results` so downstream
    CI tooling can scrape one schema for both pixel and latent runs.
    The filename is ``steps{N}_{prompt[:100]}_latent.json``.
    """
    try:
        os.makedirs(output_dir, exist_ok=True)
        prompt_prefix = prompt[:100].strip()
        filename = f"steps{num_inference_steps}_{prompt_prefix}_latent.json"
        target = os.path.join(output_dir, filename)
        payload = {
            "metrics": metrics,
            "reference_latent": reference_path,
            "generated_latent": generated_path,
            "model_id": model_id,
            "attention_backend": attention_backend_name,
            "slice_spec": slice_spec,
            "thresholds": {
                "slice_cosine": slice_cosine_threshold,
                "full_cosine": full_cosine_threshold,
            },
            "passed": passed,
            "parameters": {
                "num_inference_steps": num_inference_steps,
                "prompt": prompt,
            },
        }
        with open(target, "w", encoding="utf-8") as handle:
            json.dump(payload, handle, indent=2, sort_keys=True)
        return True
    except OSError:
        return False
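As an illustration of how CI tooling might consume these files, the sketch below writes two result files with the filename pattern and a subset of the keys shown above, then scans the directory for failed runs. `failed_latent_runs` and `write_example` are hypothetical helpers, and the metric key inside `metrics` is an assumption.

```python
import json
import tempfile
from pathlib import Path


def failed_latent_runs(output_dir):
    """Return the names of *_latent.json result files whose 'passed' flag is false."""
    failures = []
    for result_file in sorted(Path(output_dir).glob("*_latent.json")):
        with open(result_file, encoding="utf-8") as handle:
            payload = json.load(handle)
        if not payload["passed"]:
            failures.append(result_file.name)
    return failures


def write_example(output_dir, prompt, steps, slice_cosine, passed):
    """Write a result file that follows the filename pattern and key layout above."""
    name = f"steps{steps}_{prompt[:100].strip()}_latent.json"
    payload = {
        "metrics": {"slice_cosine": slice_cosine},  # metric key is an assumption
        "thresholds": {"slice_cosine": 5e-3, "full_cosine": 1e-2},
        "passed": passed,
        "parameters": {"num_inference_steps": steps, "prompt": prompt},
    }
    with open(Path(output_dir) / name, "w", encoding="utf-8") as handle:
        json.dump(payload, handle, indent=2, sort_keys=True)
    return name


with tempfile.TemporaryDirectory() as tmp:
    ok_name = write_example(tmp, "a calm lake at dawn", 8, 1e-4, True)
    bad_name = write_example(tmp, "a busy street at night", 8, 2e-2, False)
    failures = failed_latent_runs(tmp)

print(failures)
```

Since the writer returns False on OSError instead of raising, a scraper like this should treat a missing result file as "unknown" rather than as a pass.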

fastvideo.tests.ssim.reference_videos_cli

Functions:

fastvideo.tests.ssim.reference_videos_cli.ensure_reference_videos_available
ensure_reference_videos_available(*, local_dir: Path | None = None, repo_id: str | None = None, repo_type: str | None = None, quality_tier: str = DEFAULT_OUTPUT_QUALITY_TIER) -> bool

Return True if downloaded from HF, False if already present locally.

Source code in fastvideo/tests/ssim/reference_videos_cli.py
def ensure_reference_videos_available(
    *,
    local_dir: Path | None = None,
    repo_id: str | None = None,
    repo_type: str | None = None,
    quality_tier: str = DEFAULT_OUTPUT_QUALITY_TIER,
) -> bool:
    """Return True if downloaded from HF, False if already present locally."""
    if quality_tier not in QUALITY_TIERS:
        raise ValueError(f"Unsupported quality tier: {quality_tier}")
    target_dir = local_dir or _ssim_dir()
    lock_path = target_dir / ".reference_videos_download.lock"
    with _exclusive_download_lock(lock_path):
        if _has_local_reference_videos(target_dir, quality_tier):
            print(f"Reference videos ({quality_tier}) already available at {target_dir}")
            return False

        resolved_repo_id = repo_id or _default_repo_id()
        resolved_repo_type = repo_type or _default_repo_type()
        if not resolved_repo_id:
            raise RuntimeError(
                f"No local reference videos found and no HF repo configured.\nSet {HF_REPO_ENV_KEY} or pass --repo-id."
            )

        print(f"Repo ID: {resolved_repo_id}")
        print(f"Quality tier: {quality_tier}")
        print(f"No local {quality_tier} reference videos found under {target_dir}. Starting download...")
        try:
            download_reference_videos(
                repo_id=resolved_repo_id,
                repo_type=resolved_repo_type,
                local_dir=target_dir,
                quality_tiers=[quality_tier],
            )
            print(f"Download completed for {quality_tier} reference videos.")
        except Exception as exc:
            print(f"ERROR: Failed to download {quality_tier} reference videos from {resolved_repo_id}.")
            print(
                f"Suggested command to retry: "
                f"python fastvideo/tests/ssim/reference_videos_cli.py download "
                f"--quality-tier {quality_tier}"
            )
            raise

        if not _has_local_reference_videos(target_dir, quality_tier):
            raise RuntimeError(f"HF download completed but no {quality_tier} *_reference_videos content found.")
        return True
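The control flow above is a check-then-download sequence performed entirely under a lock, followed by a post-download verification. A stripped-down sketch of that pattern follows; it swaps the file-based lock for a `threading.Lock`, and `has_references`, `ensure_available`, `fake_download` and the directory and file names are all made up for the example.

```python
import tempfile
import threading
from pathlib import Path

_download_lock = threading.Lock()  # stand-in for the file-based lock used by the CLI


def has_references(target_dir: Path) -> bool:
    """Hypothetical presence check: any non-empty *_reference_videos directory."""
    return any(folder.is_dir() and any(folder.iterdir())
               for folder in target_dir.glob("*_reference_videos"))


def ensure_available(target_dir: Path, download) -> bool:
    """Return True if `download` was invoked, False if references were already present."""
    with _download_lock:
        # The presence check happens inside the lock, so a second caller that
        # waited for the first download sees the files and skips its own.
        if has_references(target_dir):
            return False
        download(target_dir)
        if not has_references(target_dir):
            raise RuntimeError("download completed but no *_reference_videos content found")
        return True


calls = []


def fake_download(target_dir: Path) -> None:
    """Pretend to fetch one reference file (directory and file names are made up)."""
    calls.append(target_dir)
    folder = target_dir / "L40S_reference_videos"
    folder.mkdir(parents=True, exist_ok=True)
    (folder / "example.mp4").write_bytes(b"")


with tempfile.TemporaryDirectory() as tmp:
    first = ensure_available(Path(tmp), fake_download)
    second = ensure_available(Path(tmp), fake_download)

print(first, second, len(calls))
```

Doing the presence check inside the lock is what prevents concurrent test workers from downloading the same references twice.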

fastvideo.tests.ssim.test_flux2_similarity

Latent-slice regression tests for Flux2 text-to-image variants.

Flux2 currently has local parity coverage against the official/reference pipeline, but CI needs a small deterministic regression gate for seeded HF artefacts. Pixel-space comparisons are unnecessarily brittle for this first slot, so the test follows the latent helper pattern used by LTX-2: generate a single-image latent with the production recipe, persist the generated latent, and compare a stable latent signature plus the full tensor against the device reference.

The default and full-quality parameter maps intentionally carry the same recipe values for now. The --ssim-full-quality flag still switches the reference tier through conftest.py; separate full-quality recipes can be introduced after the initial Flux2 references have a stable CI window.

Functions:

fastvideo.tests.ssim.test_gamecraft_similarity

SSIM regression test for HunyuanGameCraft (T2V and I2V).

Generates a video with a deterministic seed and camera trajectory, then compares it against a device-specific reference video via MS-SSIM.

Reference videos must be pre-generated and stored under

reference_videos//_reference_videos/HunyuanGameCraft/ /

To create initial reference videos, run this test once and copy the generated videos into the appropriate reference folder.

Classes

Functions:

fastvideo.tests.ssim.test_gamecraft_similarity.test_gamecraft_i2v_similarity
test_gamecraft_i2v_similarity(prompt, ATTENTION_BACKEND, model_id)

Generate an I2V video with GameCraft and compare to reference via SSIM.

Source code in fastvideo/tests/ssim/test_gamecraft_similarity.py
@pytest.mark.parametrize("prompt", I2V_TEST_PROMPTS)
@pytest.mark.parametrize("ATTENTION_BACKEND", ["FLASH_ATTN"])
@pytest.mark.parametrize("model_id", list(I2V_MODEL_TO_PARAMS.keys()))
def test_gamecraft_i2v_similarity(prompt, ATTENTION_BACKEND, model_id):
    """Generate an I2V video with GameCraft and compare to reference via SSIM."""
    os.environ["FASTVIDEO_ATTENTION_BACKEND"] = ATTENTION_BACKEND

    script_dir = os.path.dirname(os.path.abspath(__file__))
    output_dir = build_generated_output_dir(
        script_dir,
        device_reference_folder,
        model_id,
        ATTENTION_BACKEND,
    )
    output_video_name = f"{prompt[:100].strip()}.mp4"

    os.makedirs(output_dir, exist_ok=True)

    params_map = select_ssim_params(
        I2V_MODEL_TO_PARAMS,
        FULL_QUALITY_I2V_MODEL_TO_PARAMS,
    )
    BASE_PARAMS = params_map[model_id]
    num_inference_steps = BASE_PARAMS["num_inference_steps"]

    # Build camera trajectory
    camera_states = _create_camera_trajectory(
        action=BASE_PARAMS["action"],
        height=BASE_PARAMS["height"],
        width=BASE_PARAMS["width"],
        num_frames=BASE_PARAMS["num_frames"],
        action_speed=BASE_PARAMS["action_speed"],
        dtype=torch.bfloat16,
    )

    init_kwargs = {
        "num_gpus": BASE_PARAMS["num_gpus"],
        "use_fsdp_inference": True,
        "dit_cpu_offload": True,
        "vae_cpu_offload": True,
        "text_encoder_cpu_offload": True,
        "pin_cpu_memory": True,
    }

    generation_kwargs = {
        "num_inference_steps": num_inference_steps,
        "output_path": output_dir,
        "image_path": BASE_PARAMS["image_path"],
        "height": BASE_PARAMS["height"],
        "width": BASE_PARAMS["width"],
        "num_frames": BASE_PARAMS["num_frames"],
        "guidance_scale": BASE_PARAMS["guidance_scale"],
        "seed": BASE_PARAMS["seed"],
        "fps": 24,
        "camera_states": camera_states,
        "negative_prompt": BASE_PARAMS.get("negative_prompt", ""),
        "save_video": True,
    }

    generator: VideoGenerator | None = None
    try:
        generator = VideoGenerator.from_pretrained(
            model_path=BASE_PARAMS["model_path"], **init_kwargs
        )
        generator.generate_video(prompt, **generation_kwargs)
    finally:
        _shutdown_executor(generator)

    assert os.path.exists(output_dir), (
        f"Output video was not generated at {output_dir}"
    )

    # Compare to reference
    reference_folder = build_reference_folder_path(
        script_dir,
        device_reference_folder,
        model_id,
        ATTENTION_BACKEND,
    )

    if not os.path.exists(reference_folder):
        logger.error("Reference folder missing")
        raise FileNotFoundError(
            f"Reference video folder does not exist: {reference_folder}"
        )

    reference_video_name = None
    for filename in os.listdir(reference_folder):
        if filename.endswith(".mp4") and prompt[:100].strip() in filename:
            reference_video_name = filename
            break

    if not reference_video_name:
        logger.error(
            f"Reference video not found for prompt: {prompt} "
            f"with backend: {ATTENTION_BACKEND}"
        )
        raise FileNotFoundError("Reference video missing")

    reference_video_path = os.path.join(reference_folder, reference_video_name)
    generated_video_path = os.path.join(output_dir, output_video_name)

    logger.info(
        f"Computing SSIM between {reference_video_path} and {generated_video_path}"
    )
    ssim_values = compute_video_ssim_torchvision(
        reference_video_path, generated_video_path, use_ms_ssim=True
    )

    mean_ssim = ssim_values[0]
    logger.info(f"SSIM mean value: {mean_ssim}")
    logger.info(f"Writing SSIM results to directory: {output_dir}")

    success = write_ssim_results(
        output_dir,
        ssim_values,
        reference_video_path,
        generated_video_path,
        num_inference_steps,
        prompt,
    )

    if not success:
        logger.error("Failed to write SSIM results to file")

    min_acceptable_ssim = 0.93
    assert mean_ssim >= min_acceptable_ssim, (
        f"SSIM value {mean_ssim} is below threshold {min_acceptable_ssim} "
        f"for {model_id} with backend {ATTENTION_BACKEND}"
    )

fastvideo.tests.ssim.test_gamecraft_similarity.test_gamecraft_t2v_similarity
test_gamecraft_t2v_similarity(prompt, ATTENTION_BACKEND, model_id)

Generate a T2V video with GameCraft and compare to reference via SSIM.

Source code in fastvideo/tests/ssim/test_gamecraft_similarity.py
@pytest.mark.parametrize("prompt", TEST_PROMPTS)
@pytest.mark.parametrize("ATTENTION_BACKEND", ["FLASH_ATTN"])
@pytest.mark.parametrize("model_id", list(MODEL_TO_PARAMS.keys()))
def test_gamecraft_t2v_similarity(prompt, ATTENTION_BACKEND, model_id):
    """Generate a T2V video with GameCraft and compare to reference via SSIM."""
    os.environ["FASTVIDEO_ATTENTION_BACKEND"] = ATTENTION_BACKEND

    script_dir = os.path.dirname(os.path.abspath(__file__))
    output_dir = build_generated_output_dir(
        script_dir,
        device_reference_folder,
        model_id,
        ATTENTION_BACKEND,
    )
    output_video_name = f"{prompt[:100].strip()}.mp4"

    os.makedirs(output_dir, exist_ok=True)

    params_map = select_ssim_params(
        MODEL_TO_PARAMS,
        FULL_QUALITY_MODEL_TO_PARAMS,
    )
    BASE_PARAMS = params_map[model_id]
    num_inference_steps = BASE_PARAMS["num_inference_steps"]

    # Build camera trajectory
    camera_states = _create_camera_trajectory(
        action=BASE_PARAMS["action"],
        height=BASE_PARAMS["height"],
        width=BASE_PARAMS["width"],
        num_frames=BASE_PARAMS["num_frames"],
        action_speed=BASE_PARAMS["action_speed"],
        dtype=torch.bfloat16,
    )

    init_kwargs = {
        "num_gpus": BASE_PARAMS["num_gpus"],
        "use_fsdp_inference": True,
        "dit_cpu_offload": True,
        "vae_cpu_offload": True,
        "text_encoder_cpu_offload": True,
        "pin_cpu_memory": True,
    }

    generation_kwargs = {
        "num_inference_steps": num_inference_steps,
        "output_path": output_dir,
        "height": BASE_PARAMS["height"],
        "width": BASE_PARAMS["width"],
        "num_frames": BASE_PARAMS["num_frames"],
        "guidance_scale": BASE_PARAMS["guidance_scale"],
        "seed": BASE_PARAMS["seed"],
        "fps": 24,
        "camera_states": camera_states,
        "negative_prompt": BASE_PARAMS.get("negative_prompt", ""),
        "save_video": True,
    }

    generator: VideoGenerator | None = None
    try:
        generator = VideoGenerator.from_pretrained(
            model_path=BASE_PARAMS["model_path"], **init_kwargs
        )
        generator.generate_video(prompt, **generation_kwargs)
    finally:
        _shutdown_executor(generator)

    assert os.path.exists(output_dir), (
        f"Output video was not generated at {output_dir}"
    )

    # Compare to reference
    reference_folder = build_reference_folder_path(
        script_dir,
        device_reference_folder,
        model_id,
        ATTENTION_BACKEND,
    )

    if not os.path.exists(reference_folder):
        logger.error("Reference folder missing")
        raise FileNotFoundError(
            f"Reference video folder does not exist: {reference_folder}"
        )

    reference_video_name = None
    for filename in os.listdir(reference_folder):
        if filename.endswith(".mp4") and prompt[:100].strip() in filename:
            reference_video_name = filename
            break

    if not reference_video_name:
        logger.error(
            f"Reference video not found for prompt: {prompt} "
            f"with backend: {ATTENTION_BACKEND}"
        )
        raise FileNotFoundError("Reference video missing")

    reference_video_path = os.path.join(reference_folder, reference_video_name)
    generated_video_path = os.path.join(output_dir, output_video_name)

    logger.info(
        f"Computing SSIM between {reference_video_path} and {generated_video_path}"
    )
    ssim_values = compute_video_ssim_torchvision(
        reference_video_path, generated_video_path, use_ms_ssim=True
    )

    mean_ssim = ssim_values[0]
    logger.info(f"SSIM mean value: {mean_ssim}")
    logger.info(f"Writing SSIM results to directory: {output_dir}")

    success = write_ssim_results(
        output_dir,
        ssim_values,
        reference_video_path,
        generated_video_path,
        num_inference_steps,
        prompt,
    )

    if not success:
        logger.error("Failed to write SSIM results to file")

    min_acceptable_ssim = 0.93
    assert mean_ssim >= min_acceptable_ssim, (
        f"SSIM value {mean_ssim} is below threshold {min_acceptable_ssim} "
        f"for {model_id} with backend {ATTENTION_BACKEND}"
    )
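Both GameCraft tests locate the reference by checking whether the first 100 characters of the prompt (stripped) appear in an `.mp4` filename. The lookup can be sketched as a pure function; `find_reference_video` is a hypothetical name, and sorting the candidates is a deviation from the listings above, which take the first match in `os.listdir` order.

```python
from typing import Iterable, Optional


def find_reference_video(filenames: Iterable[str], prompt: str) -> Optional[str]:
    """Return the first .mp4 whose name contains the stripped 100-character prompt prefix."""
    prefix = prompt[:100].strip()
    for filename in sorted(filenames):  # sorted here only to make the result deterministic
        if filename.endswith(".mp4") and prefix in filename:
            return filename
    return None


short_prompt = "A knight walks through a misty forest"
long_prompt = "word " * 40  # 200 characters, truncated to 100 before matching
long_prefix = long_prompt[:100].strip()

candidates = [
    "notes.txt",
    f"{short_prompt}.mp4",
    f"{long_prefix}.mp4",
    f"{short_prompt}.json",
]

print(find_reference_video(candidates, short_prompt))
print(find_reference_video(candidates, "An unrelated prompt"))
```

Because the match is a substring test, two prompts that share their first 100 characters resolve to the same reference file, so test prompts need to differ within that prefix.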

fastvideo.tests.ssim.test_gen3c_similarity

SSIM regression test for GEN3C video generation.

Compares newly generated GEN3C videos against device-specific reference videos using MS-SSIM to detect quality regressions across code changes.

Usage

Requires at least one GPU and the reference videos.

pytest fastvideo/tests/ssim/test_gen3c_similarity.py -v

Environment variables

GEN3C_MODEL_PATH - Diffusers-format GEN3C model path or repo id. Default: FastVideo/GEN3C-Cosmos-7B-Diffusers. A locally converted path is also supported.

Classes

Functions:

fastvideo.tests.ssim.test_gen3c_similarity.test_gen3c_inference_similarity
test_gen3c_inference_similarity(prompt, ATTENTION_BACKEND, model_id)

Generate a GEN3C video and compare against the reference using MS-SSIM.

Source code in fastvideo/tests/ssim/test_gen3c_similarity.py
@pytest.mark.skipif(
    device_reference_folder is None,
    reason=f"No reference videos for device {device_name}",
)
@pytest.mark.parametrize("prompt", TEST_PROMPTS)
@pytest.mark.parametrize("ATTENTION_BACKEND", ["TORCH_SDPA"])
@pytest.mark.parametrize("model_id", list(MODEL_TO_PARAMS.keys()))
def test_gen3c_inference_similarity(prompt, ATTENTION_BACKEND, model_id):
    """
    Generate a GEN3C video and compare against the reference using MS-SSIM.
    """
    os.environ["FASTVIDEO_ATTENTION_BACKEND"] = ATTENTION_BACKEND

    script_dir = os.path.dirname(os.path.abspath(__file__))
    base_output_dir = os.path.join(script_dir, "generated_videos", model_id)
    output_dir = os.path.join(base_output_dir, ATTENTION_BACKEND)
    output_video_name = CANDIDATE_VIDEO_NAME
    os.makedirs(output_dir, exist_ok=True)

    BASE_PARAMS = MODEL_TO_PARAMS[model_id]
    num_inference_steps = BASE_PARAMS["num_inference_steps"]
    model_path = BASE_PARAMS["model_path"]

    # Guard common misconfigurations to keep CI behavior explicit.
    if model_path.lower() == "nvidia/gen3c-cosmos-7b":
        pytest.skip(
            "nvidia/GEN3C-Cosmos-7B is the official raw checkpoint repo, not Diffusers format. "
            "Use GEN3C_MODEL_PATH=FastVideo/GEN3C-Cosmos-7B-Diffusers or a local converted path."
        )

    local_like = model_path.startswith(("/", "./", "../"))
    if local_like and not os.path.exists(model_path):
        pytest.skip(
            f"Local GEN3C model path not found: {model_path}. "
            "Set GEN3C_MODEL_PATH to a valid local path or HF Diffusers repo id."
        )

    if os.path.exists(model_path):
        model_index_path = os.path.join(model_path, "model_index.json")
        if not os.path.exists(model_index_path):
            pytest.skip(
                f"GEN3C_MODEL_PATH is not Diffusers-format (missing model_index.json): {model_path}"
            )

    init_kwargs = {
        "num_gpus": BASE_PARAMS["num_gpus"],
        "sp_size": BASE_PARAMS["sp_size"],
        "tp_size": BASE_PARAMS["tp_size"],
    }
    if "flow_shift" in BASE_PARAMS:
        init_kwargs["flow_shift"] = BASE_PARAMS["flow_shift"]

    generation_kwargs = {
        "num_inference_steps": num_inference_steps,
        "output_path": os.path.join(output_dir, output_video_name),
        "height": BASE_PARAMS["height"],
        "width": BASE_PARAMS["width"],
        "num_frames": BASE_PARAMS["num_frames"],
        "guidance_scale": BASE_PARAMS["guidance_scale"],
        "embedded_cfg_scale": BASE_PARAMS["embedded_cfg_scale"],
        "seed": BASE_PARAMS["seed"],
        "image_path": BASE_PARAMS["image_path"],
        "fps": BASE_PARAMS["fps"],
    }

    if not os.path.exists(generation_kwargs["image_path"]):
        pytest.skip(
            f"GEN3C test image not found: {generation_kwargs['image_path']}. "
            "Set GEN3C_TEST_IMAGE_PATH to a valid local image."
        )

    # Keep local reruns deterministic: remove prior candidate outputs so
    # VideoGenerator does not auto-suffix (_1, _2, ...).
    stale_pattern = os.path.join(output_dir, "gen3c_ssim_candidate*.mp4")
    for stale_video in glob.glob(stale_pattern):
        os.remove(stale_video)

    generator = VideoGenerator.from_pretrained(
        model_path=model_path, **init_kwargs
    )
    generator.generate_video(prompt, **generation_kwargs)

    if isinstance(generator.executor, MultiprocExecutor):
        generator.executor.shutdown()

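    # Assert on the generated file, not just its parent directory.
    expected_output = generation_kwargs["output_path"]
    assert os.path.exists(expected_output), f"Output not generated at {expected_output}"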

    reference_folder = os.path.join(
        script_dir, device_reference_folder, model_id, ATTENTION_BACKEND
    )
    if not os.path.exists(reference_folder):
        raise FileNotFoundError(
            f"Reference video folder does not exist: {reference_folder}"
        )

    reference_video_path = os.path.join(reference_folder, BASELINE_VIDEO_NAME)
    if not os.path.exists(reference_video_path):
        raise FileNotFoundError(
            f"Reference video not found: {reference_video_path}"
        )

    generated_video_path = os.path.join(output_dir, output_video_name)

    logger.info(f"Computing SSIM: {reference_video_path} vs {generated_video_path}")
    ssim_values = compute_video_ssim_torchvision(
        reference_video_path, generated_video_path, use_ms_ssim=True
    )

    mean_ssim = ssim_values[0]
    logger.info(f"GEN3C SSIM mean: {mean_ssim}")

    write_ssim_results(
        output_dir,
        ssim_values,
        reference_video_path,
        generated_video_path,
        num_inference_steps,
        prompt,
    )

    # GEN3C SSIM threshold for stable L40S reference comparisons.
    min_acceptable_ssim = 0.93
    assert mean_ssim >= min_acceptable_ssim, (
        f"SSIM {mean_ssim:.4f} < {min_acceptable_ssim} for {model_id} / {ATTENTION_BACKEND}"
    )
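
The stale-output cleanup in the GEN3C test above can be sketched in isolation. This is a minimal sketch, not the FastVideo implementation: `remove_stale_candidates` is a hypothetical helper name and the files written to the temporary directory are made up for illustration. Only the glob pattern is taken from the test.

```python
import glob
import os
import tempfile


def remove_stale_candidates(
    output_dir: str, pattern: str = "gen3c_ssim_candidate*.mp4"
) -> list[str]:
    """Delete earlier candidate videos and return the removed file names."""
    removed = []
    for stale_video in sorted(glob.glob(os.path.join(output_dir, pattern))):
        os.remove(stale_video)
        removed.append(os.path.basename(stale_video))
    return removed


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical leftovers from a previous run, plus one unrelated file.
    for name in (
        "gen3c_ssim_candidate.mp4",
        "gen3c_ssim_candidate_1.mp4",
        "results.json",
    ):
        with open(os.path.join(tmp, name), "w"):
            pass

    removed = remove_stale_candidates(tmp)
    remaining = sorted(os.listdir(tmp))
```

Only files matching the candidate pattern are deleted, so result files and reference assets in the same directory are left alone.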

fastvideo.tests.ssim.test_lingbot_similarity

SSIM-based similarity test for LingBotWorld I2V with camera control.

Camera trajectory is loaded from LingBot example npy files (poses.npy/intrinsics.npy), matching the official script workflow.

Note: num_inference_steps is reduced to 4 for faster CI.

fastvideo.tests.ssim.test_longcat_similarity

SSIM-based similarity tests for LongCat video generation.

Tests three LongCat modes:

- T2V (Text-to-Video): 480p video from text prompt
- I2V (Image-to-Video): 480p video from image + text prompt
- VC (Video Continuation): 480p video continuation from input video + text prompt

Sampling parameters are derived from:

- examples/inference/basic/basic_longcat_t2v.py
- examples/inference/basic/basic_longcat_i2v.py
- examples/inference/basic/basic_longcat_vc.py

Note: num_inference_steps is reduced for CI speed (4 steps vs 50 in examples).

Functions:

fastvideo.tests.ssim.test_longcat_similarity.test_longcat_i2v_similarity
test_longcat_i2v_similarity(prompt: str, ATTENTION_BACKEND: str)

Test LongCat I2V inference and compare output to reference videos using SSIM.

Parameters derived from examples/inference/basic/basic_longcat_i2v.py

Source code in fastvideo/tests/ssim/test_longcat_similarity.py
@pytest.mark.parametrize("prompt", I2V_TEST_PROMPTS)
@pytest.mark.parametrize("ATTENTION_BACKEND", ["FLASH_ATTN"])
def test_longcat_i2v_similarity(prompt: str, ATTENTION_BACKEND: str):
    """
    Test LongCat I2V inference and compare output to reference videos using SSIM.

    Parameters derived from examples/inference/basic/basic_longcat_i2v.py
    """
    os.environ["FASTVIDEO_ATTENTION_BACKEND"] = ATTENTION_BACKEND

    params = select_ssim_params(LONGCAT_I2V_PARAMS, LONGCAT_I2V_FULL_QUALITY_PARAMS)

    script_dir = os.path.dirname(os.path.abspath(__file__))
    model_id = "LongCat-Video-I2V"

    output_dir = build_generated_output_dir(
        script_dir,
        device_reference_folder,
        model_id,
        ATTENTION_BACKEND,
    )
    output_video_name = f"{prompt[:100].strip()}.mp4"
    os.makedirs(output_dir, exist_ok=True)

    # Get image path for this prompt
    prompt_idx = I2V_TEST_PROMPTS.index(prompt)
    image_path = _resolve_asset_path(I2V_IMAGE_PATHS[prompt_idx])

    init_kwargs = {
        "num_gpus": params["num_gpus"],
        "use_fsdp_inference": True,
        "dit_cpu_offload": True,
        "vae_cpu_offload": True,
        "text_encoder_cpu_offload": True,
        "enable_bsa": False,
    }

    generation_kwargs = {
        "output_path": output_dir,
        "image_path": image_path,
        "height": params["height"],
        "width": params["width"],
        "num_frames": params["num_frames"],
        "num_inference_steps": params["num_inference_steps"],
        "guidance_scale": params["guidance_scale"],
        "fps": params["fps"],
        "seed": params["seed"],
        "negative_prompt": params["negative_prompt"],
    }

    generator = VideoGenerator.from_pretrained(
        model_path=params["model_path"], **init_kwargs
    )
    generator.generate_video(prompt, **generation_kwargs)
    generator.shutdown()

    generated_video_path = os.path.join(output_dir, output_video_name)
    assert os.path.exists(generated_video_path), (
        f"Output video was not generated at {generated_video_path}"
    )

    # Find reference video
    reference_folder = build_reference_folder_path(
        script_dir,
        device_reference_folder,
        model_id,
        ATTENTION_BACKEND,
    )
    if not os.path.exists(reference_folder):
        raise FileNotFoundError(
            f"Reference video folder does not exist: {reference_folder}"
        )

    reference_video_name = None
    for filename in os.listdir(reference_folder):
        if filename.endswith(".mp4") and prompt[:100].strip() in filename:
            reference_video_name = filename
            break

    if not reference_video_name:
        raise FileNotFoundError(
            f"Reference video not found for prompt: {prompt[:50]}... with backend: {ATTENTION_BACKEND}"
        )

    reference_video_path = os.path.join(reference_folder, reference_video_name)

    logger.info(f"Computing SSIM between {reference_video_path} and {generated_video_path}")
    ssim_values = compute_video_ssim_torchvision(
        reference_video_path, generated_video_path, use_ms_ssim=True
    )

    mean_ssim = ssim_values[0]
    logger.info(f"SSIM mean value: {mean_ssim}")

    write_ssim_results(
        output_dir, ssim_values, reference_video_path, generated_video_path,
        params["num_inference_steps"], prompt
    )

    min_acceptable_ssim = 0.90
    assert mean_ssim >= min_acceptable_ssim, (
        f"SSIM value {mean_ssim} is below threshold {min_acceptable_ssim} "
        f"for {model_id} with backend {ATTENTION_BACKEND}"
    )

fastvideo.tests.ssim.test_longcat_similarity.test_longcat_t2v_similarity
test_longcat_t2v_similarity(prompt: str, ATTENTION_BACKEND: str)

Test LongCat T2V inference and compare output to reference videos using SSIM.

Parameters derived from examples/inference/basic/basic_longcat_t2v.py

Source code in fastvideo/tests/ssim/test_longcat_similarity.py
@pytest.mark.parametrize("prompt", T2V_TEST_PROMPTS)
@pytest.mark.parametrize("ATTENTION_BACKEND", ["FLASH_ATTN"])
def test_longcat_t2v_similarity(prompt: str, ATTENTION_BACKEND: str):
    """
    Test LongCat T2V inference and compare output to reference videos using SSIM.

    Parameters derived from examples/inference/basic/basic_longcat_t2v.py
    """
    os.environ["FASTVIDEO_ATTENTION_BACKEND"] = ATTENTION_BACKEND

    params = select_ssim_params(LONGCAT_T2V_PARAMS, LONGCAT_T2V_FULL_QUALITY_PARAMS)

    script_dir = os.path.dirname(os.path.abspath(__file__))
    model_id = "LongCat-Video-T2V"

    output_dir = build_generated_output_dir(
        script_dir,
        device_reference_folder,
        model_id,
        ATTENTION_BACKEND,
    )
    output_video_name = f"{prompt[:100].strip()}.mp4"
    os.makedirs(output_dir, exist_ok=True)

    init_kwargs = {
        "num_gpus": params["num_gpus"],
        "use_fsdp_inference": True,
        "dit_cpu_offload": True,
        "vae_cpu_offload": True,
        "text_encoder_cpu_offload": True,
        "enable_bsa": False,
    }

    generation_kwargs = {
        "output_path": output_dir,
        "height": params["height"],
        "width": params["width"],
        "num_frames": params["num_frames"],
        "num_inference_steps": params["num_inference_steps"],
        "guidance_scale": params["guidance_scale"],
        "fps": params["fps"],
        "seed": params["seed"],
        "negative_prompt": params["negative_prompt"],
    }

    generator = VideoGenerator.from_pretrained(
        model_path=params["model_path"], **init_kwargs
    )
    generator.generate_video(prompt, **generation_kwargs)
    generator.shutdown()

    generated_video_path = os.path.join(output_dir, output_video_name)
    assert os.path.exists(generated_video_path), (
        f"Output video was not generated at {generated_video_path}"
    )

    # Find reference video
    reference_folder = build_reference_folder_path(
        script_dir,
        device_reference_folder,
        model_id,
        ATTENTION_BACKEND,
    )
    if not os.path.exists(reference_folder):
        raise FileNotFoundError(
            f"Reference video folder does not exist: {reference_folder}"
        )

    reference_video_name = None
    for filename in os.listdir(reference_folder):
        if filename.endswith(".mp4") and prompt[:100].strip() in filename:
            reference_video_name = filename
            break

    if not reference_video_name:
        raise FileNotFoundError(
            f"Reference video not found for prompt: {prompt[:50]}... with backend: {ATTENTION_BACKEND}"
        )

    reference_video_path = os.path.join(reference_folder, reference_video_name)

    logger.info(f"Computing SSIM between {reference_video_path} and {generated_video_path}")
    ssim_values = compute_video_ssim_torchvision(
        reference_video_path, generated_video_path, use_ms_ssim=True
    )

    mean_ssim = ssim_values[0]
    logger.info(f"SSIM mean value: {mean_ssim}")

    write_ssim_results(
        output_dir, ssim_values, reference_video_path, generated_video_path,
        params["num_inference_steps"], prompt
    )

    min_acceptable_ssim = 0.90
    assert mean_ssim >= min_acceptable_ssim, (
        f"SSIM value {mean_ssim} is below threshold {min_acceptable_ssim} "
        f"for {model_id} with backend {ATTENTION_BACKEND}"
    )

fastvideo.tests.ssim.test_longcat_similarity.test_longcat_vc_similarity
test_longcat_vc_similarity(prompt: str, ATTENTION_BACKEND: str)

Test LongCat VC (Video Continuation) inference and compare output to reference videos using SSIM.

Parameters derived from examples/inference/basic/basic_longcat_vc.py

Source code in fastvideo/tests/ssim/test_longcat_similarity.py
@pytest.mark.parametrize("prompt", VC_TEST_PROMPTS)
@pytest.mark.parametrize("ATTENTION_BACKEND", ["FLASH_ATTN"])
def test_longcat_vc_similarity(prompt: str, ATTENTION_BACKEND: str):
    """
    Test LongCat VC (Video Continuation) inference and compare output to reference videos using SSIM.

    Parameters derived from examples/inference/basic/basic_longcat_vc.py
    """
    os.environ["FASTVIDEO_ATTENTION_BACKEND"] = ATTENTION_BACKEND

    params = select_ssim_params(LONGCAT_VC_PARAMS, LONGCAT_VC_FULL_QUALITY_PARAMS)

    script_dir = os.path.dirname(os.path.abspath(__file__))
    model_id = "LongCat-Video-VC"

    output_dir = build_generated_output_dir(
        script_dir,
        device_reference_folder,
        model_id,
        ATTENTION_BACKEND,
    )
    output_video_name = f"{prompt[:100].strip()}.mp4"
    os.makedirs(output_dir, exist_ok=True)

    # Get video path for this prompt
    prompt_idx = VC_TEST_PROMPTS.index(prompt)
    video_path = _resolve_asset_path(VC_VIDEO_PATHS[prompt_idx])

    if not os.path.exists(video_path):
        pytest.skip(f"Input video not found at {video_path}")

    init_kwargs = {
        "num_gpus": params["num_gpus"],
        "use_fsdp_inference": False,
        "dit_cpu_offload": False,
        "vae_cpu_offload": True,
        "text_encoder_cpu_offload": True,
        "pin_cpu_memory": False,
        "enable_bsa": False,
    }

    generation_kwargs = {
        "output_path": output_dir,
        "video_path": video_path,
        "num_cond_frames": params["num_cond_frames"],
        "height": params["height"],
        "width": params["width"],
        "num_frames": params["num_frames"],
        "num_inference_steps": params["num_inference_steps"],
        "guidance_scale": params["guidance_scale"],
        "fps": params["fps"],
        "seed": params["seed"],
        "negative_prompt": params["negative_prompt"],
    }

    generator = VideoGenerator.from_pretrained(
        model_path=params["model_path"], **init_kwargs
    )
    generator.generate_video(prompt, **generation_kwargs)
    generator.shutdown()

    generated_video_path = os.path.join(output_dir, output_video_name)
    assert os.path.exists(generated_video_path), (
        f"Output video was not generated at {generated_video_path}"
    )

    # Find reference video
    reference_folder = build_reference_folder_path(
        script_dir,
        device_reference_folder,
        model_id,
        ATTENTION_BACKEND,
    )
    if not os.path.exists(reference_folder):
        raise FileNotFoundError(
            f"Reference video folder does not exist: {reference_folder}"
        )

    reference_video_name = None
    for filename in os.listdir(reference_folder):
        if filename.endswith(".mp4") and prompt[:100].strip() in filename:
            reference_video_name = filename
            break

    if not reference_video_name:
        raise FileNotFoundError(
            f"Reference video not found for prompt: {prompt[:50]}... with backend: {ATTENTION_BACKEND}"
        )

    reference_video_path = os.path.join(reference_folder, reference_video_name)

    logger.info(f"Computing SSIM between {reference_video_path} and {generated_video_path}")
    ssim_values = compute_video_ssim_torchvision(
        reference_video_path, generated_video_path, use_ms_ssim=True
    )

    mean_ssim = ssim_values[0]
    logger.info(f"SSIM mean value: {mean_ssim}")

    write_ssim_results(
        output_dir, ssim_values, reference_video_path, generated_video_path,
        params["num_inference_steps"], prompt
    )

    min_acceptable_ssim = 0.90
    assert mean_ssim >= min_acceptable_ssim, (
        f"SSIM value {mean_ssim} is below threshold {min_acceptable_ssim} "
        f"for {model_id} with backend {ATTENTION_BACKEND}"
    )
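
All three LongCat tests locate their reference video by checking whether the truncated prompt appears in a file name. The lookup can be sketched as a standalone helper. This is an illustration under stated assumptions: `find_reference_video` is a hypothetical name, the prompts and file names are invented, and `sorted()` is added here for a deterministic choice, whereas the tests iterate `os.listdir()` directly.

```python
import os
import tempfile
from typing import Optional


def find_reference_video(folder: str, prompt: str, max_len: int = 100) -> Optional[str]:
    """Return the first .mp4 whose name contains the truncated prompt, else None."""
    prefix = prompt[:max_len].strip()
    for filename in sorted(os.listdir(folder)):
        if filename.endswith(".mp4") and prefix in filename:
            return filename
    return None


with tempfile.TemporaryDirectory() as folder:
    # Invented reference files for the demonstration.
    for name in ("A cat surfing a wave.mp4", "A dog on a skateboard.mp4", "notes.txt"):
        with open(os.path.join(folder, name), "w"):
            pass

    hit = find_reference_video(folder, "A dog on a skateboard")
    miss = find_reference_video(folder, "A horse in a field")
```

Because the match is a substring test, two prompts that share their first 100 characters would resolve to the same reference file.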

fastvideo.tests.ssim.test_ltx2_similarity

Latent-slice regression test for LTX-2 distilled text-to-video.

Pixel-space SSIM is not a useful signal for this model: 4 distilled steps + bf16 attention + tiled VAE decode produce outputs that pass visual QA but occupy a very wide region in pixel space.

Inspired by diffusers' slice-vs-full regression philosophy — see diffusers/tests/pipelines/ltx2/test_ltx2.py (compares pixel slices via torch.allclose(..., atol=1e-4)) and diffusers/tests/pipelines/cogvideo/test_cogvideox.py (full pixel tensors via numpy_cosine_similarity_distance(...) < 1e-3). Diffusers itself does NOT compare latents; we apply the same "small signature slice + bounded full-tensor distance" idea to the pre-VAE latent because distilled few-step pipelines amplify per-step bf16 noise enough that VAE-decoded comparisons are unreliable.

Parameters are kept identical to the original SSIM run so that reference artefacts generated on Modal L40S remain bit-compatible with production inference.
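
The "small signature slice + bounded full-tensor distance" idea described above can be sketched as follows. This is a minimal sketch, not the code used by the test: NumPy arrays stand in for the pre-VAE latents, the `[B, C, T, H, W]` shape and the slice location are assumptions, the tolerances are illustrative, and `cosine_distance` and `latents_match` are hypothetical names.

```python
import numpy as np


def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Return 1 - cosine similarity of the two flattened tensors."""
    a = a.astype(np.float64).ravel()
    b = b.astype(np.float64).ravel()
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def latents_match(
    expected: np.ndarray,
    actual: np.ndarray,
    slice_atol: float = 1e-2,    # illustrative tolerance, not FastVideo's
    max_cos_dist: float = 1e-3,  # illustrative tolerance, not FastVideo's
) -> bool:
    # Signature slice: a small fixed corner of the latent (assumed location).
    sig_expected = expected[0, 0, 0, :4, :4]
    sig_actual = actual[0, 0, 0, :4, :4]
    slice_ok = np.allclose(sig_actual, sig_expected, atol=slice_atol)
    # Bounded distance over the full tensor.
    full_ok = cosine_distance(expected, actual) < max_cos_dist
    return bool(slice_ok and full_ok)


rng = np.random.default_rng(0)
reference = rng.standard_normal((1, 8, 4, 16, 16)).astype(np.float32)
# A candidate that differs only by small numerical noise.
candidate = reference + 1e-3 * rng.standard_normal(reference.shape).astype(np.float32)
# An unrelated latent, standing in for a real regression.
unrelated = rng.standard_normal(reference.shape).astype(np.float32)
```

The slice check catches localized drift cheaply, while the full-tensor distance guards against changes that happen to leave the slice untouched.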

fastvideo.tests.ssim.test_matrixgame2_similarity

Functions:

fastvideo.tests.ssim.test_matrixgame2_similarity.test_matrixgame2_similarity
test_matrixgame2_similarity(prompt, ATTENTION_BACKEND, model_id)

Test that runs inference with different parameters and compares the output to reference videos using SSIM.

Source code in fastvideo/tests/ssim/test_matrixgame2_similarity.py
@pytest.mark.parametrize("prompt", TEST_PROMPTS)
@pytest.mark.parametrize("ATTENTION_BACKEND", ["FLASH_ATTN"])
@pytest.mark.parametrize("model_id", list(MODEL_TO_PARAMS.keys()))
def test_matrixgame2_similarity(prompt, ATTENTION_BACKEND, model_id):
    """
    Test that runs inference with different parameters and compares the output
    to reference videos using SSIM.
    """
    os.environ["FASTVIDEO_ATTENTION_BACKEND"] = ATTENTION_BACKEND

    script_dir = os.path.dirname(os.path.abspath(__file__))

    output_dir = build_generated_output_dir(
        script_dir,
        device_reference_folder,
        model_id,
        ATTENTION_BACKEND,
    )
    output_video_name = "output.mp4"

    os.makedirs(output_dir, exist_ok=True)

    params_map = select_ssim_params(
        MODEL_TO_PARAMS,
        FULL_QUALITY_MODEL_TO_PARAMS,
    )
    BASE_PARAMS = params_map[model_id]
    num_inference_steps = BASE_PARAMS["num_inference_steps"]

    # Create action conditions for Matrix-Game 2.0
    actions = create_action_presets(
        BASE_PARAMS["num_frames"], keyboard_dim=BASE_PARAMS["keyboard_dim"], seed=BASE_PARAMS["seed"]
    )
    latent_frames = (BASE_PARAMS["num_frames"] - 1) // 4 + 1
    grid_sizes = torch.tensor([latent_frames, 44, 80])

    init_kwargs = {
        "num_gpus": BASE_PARAMS["num_gpus"],
        "use_fsdp_inference": True,
        "dit_layerwise_offload": False,
        "dit_cpu_offload": False,
        "vae_cpu_offload": False,
        "text_encoder_cpu_offload": True,
        "pin_cpu_memory": True,
    }

    generation_kwargs = {
        "num_inference_steps": num_inference_steps,
        "output_path": output_dir,
        "image_path": TEST_IMAGE_PATHS[0],
        "height": BASE_PARAMS["height"],
        "width": BASE_PARAMS["width"],
        "num_frames": BASE_PARAMS["num_frames"],
        "seed": BASE_PARAMS["seed"],
        "mouse_cond": actions["mouse"].unsqueeze(0),
        "keyboard_cond": actions["keyboard"].unsqueeze(0),
        "grid_sizes": grid_sizes,
        "save_video": True,
    }

    generator = VideoGenerator.from_pretrained(model_path=BASE_PARAMS["model_path"], **init_kwargs)
    generator.generate_video(prompt, **generation_kwargs)

    if isinstance(generator.executor, MultiprocExecutor):
        generator.executor.shutdown()

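    # Assert on the video file; output_dir itself was already created by os.makedirs above.
    expected_output = os.path.join(output_dir, output_video_name)
    assert os.path.exists(expected_output), f"Output video was not generated at {expected_output}"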

    reference_folder = build_reference_folder_path(
        script_dir,
        device_reference_folder,
        model_id,
        ATTENTION_BACKEND,
    )

    if not os.path.exists(reference_folder):
        logger.error("Reference folder missing")
        raise FileNotFoundError(f"Reference video folder does not exist: {reference_folder}")

    # Use the first .mp4 found in the reference folder (filenames are not matched against the prompt).
    reference_video_name = None

    for filename in os.listdir(reference_folder):
        if filename.endswith(".mp4"):
            reference_video_name = filename
            break

    if not reference_video_name:
        logger.error(f"Reference video not found for model: {model_id} with backend: {ATTENTION_BACKEND}")
        raise FileNotFoundError("Reference video missing")

    reference_video_path = os.path.join(reference_folder, reference_video_name)
    generated_video_path = os.path.join(output_dir, output_video_name)

    logger.info(f"Computing SSIM between {reference_video_path} and {generated_video_path}")
    ssim_values = compute_video_ssim_torchvision(reference_video_path, generated_video_path, use_ms_ssim=True)

    mean_ssim = ssim_values[0]
    logger.info(f"SSIM mean value: {mean_ssim}")
    logger.info(f"Writing SSIM results to directory: {output_dir}")

    success = write_ssim_results(
        output_dir,
        ssim_values,
        reference_video_path,
        generated_video_path,
        num_inference_steps,
        prompt,
    )

    if not success:
        logger.error("Failed to write SSIM results to file")

    min_acceptable_ssim = 0.98
    assert mean_ssim >= min_acceptable_ssim, (
        f"SSIM value {mean_ssim} is below threshold {min_acceptable_ssim} for {model_id} with backend {ATTENTION_BACKEND}"
    )
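
The `latent_frames` expression in the test above is plain integer arithmetic and can be checked by hand. The sketch below uses a hypothetical helper name, `latent_frame_count`, and illustrative frame counts. Reading the divisor 4 as a temporal stride of the latent representation is an assumption based on the formula, not something the test states.

```python
def latent_frame_count(num_frames: int, temporal_stride: int = 4) -> int:
    """One latent frame for the first frame, plus one per full group of
    `temporal_stride` subsequent frames."""
    if num_frames < 1:
        raise ValueError("num_frames must be >= 1")
    return (num_frames - 1) // temporal_stride + 1


# Illustrative frame counts only; the test reads BASE_PARAMS["num_frames"].
examples = {n: latent_frame_count(n) for n in (1, 5, 57, 81)}

# Same layout as grid_sizes in the test: [latent_frames, 44, 80].
grid_sizes = [latent_frame_count(57), 44, 80]
```

For example, 57 input frames give (57 - 1) // 4 + 1 = 15 latent frames.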

fastvideo.tests.ssim.test_matrixgame3_similarity

Functions:

fastvideo.tests.ssim.test_matrixgame3_similarity.test_matrixgame3_similarity
test_matrixgame3_similarity(prompt, ATTENTION_BACKEND, model_id)

Test that runs MG3 inference (action conditions auto-generated from seed) and compares the output to reference videos using SSIM.

Source code in fastvideo/tests/ssim/test_matrixgame3_similarity.py
@pytest.mark.parametrize("prompt", TEST_PROMPTS)
@pytest.mark.parametrize("ATTENTION_BACKEND", ["FLASH_ATTN"])
@pytest.mark.parametrize("model_id", list(MODEL_TO_PARAMS.keys()))
def test_matrixgame3_similarity(prompt, ATTENTION_BACKEND, model_id):
    """
    Test that runs MG3 inference (action conditions auto-generated from seed)
    and compares the output to reference videos using SSIM.
    """
    os.environ["FASTVIDEO_ATTENTION_BACKEND"] = ATTENTION_BACKEND

    script_dir = os.path.dirname(os.path.abspath(__file__))

    output_dir = build_generated_output_dir(
        script_dir,
        device_reference_folder,
        model_id,
        ATTENTION_BACKEND,
    )

    os.makedirs(output_dir, exist_ok=True)

    params_map = select_ssim_params(
        MODEL_TO_PARAMS,
        FULL_QUALITY_MODEL_TO_PARAMS,
    )
    BASE_PARAMS = params_map[model_id]
    num_inference_steps = BASE_PARAMS["num_inference_steps"]

    init_kwargs = {
        "num_gpus": BASE_PARAMS["num_gpus"],
        "use_fsdp_inference": True,
        "dit_layerwise_offload": False,
        "dit_cpu_offload": False,
        "vae_cpu_offload": False,
        "text_encoder_cpu_offload": True,
        "pin_cpu_memory": True,
    }

    generation_kwargs = {
        "num_inference_steps": num_inference_steps,
        "output_path": output_dir,
        "image_path": TEST_IMAGE_PATHS[0],
        "height": BASE_PARAMS["height"],
        "width": BASE_PARAMS["width"],
        "num_frames": BASE_PARAMS["num_frames"],
        "guidance_scale": BASE_PARAMS["guidance_scale"],
        "seed": BASE_PARAMS["seed"],
        "save_video": True,
    }

    generator = VideoGenerator.from_pretrained(model_path=BASE_PARAMS["model_path"], **init_kwargs)
    generator.generate_video(prompt, **generation_kwargs)

    if isinstance(generator.executor, MultiprocExecutor):
        generator.executor.shutdown()

    assert os.path.isdir(output_dir), f"Output directory does not exist: {output_dir}"

    reference_folder = build_reference_folder_path(
        script_dir,
        device_reference_folder,
        model_id,
        ATTENTION_BACKEND,
    )

    if not os.path.exists(reference_folder):
        logger.error("Reference folder missing")
        raise FileNotFoundError(f"Reference video folder does not exist: {reference_folder}")

    prompt_prefix = prompt[:100].strip().rstrip(".")

    def _find_mp4(folder: str) -> str | None:
        for filename in sorted(os.listdir(folder)):
            if filename.endswith(".mp4") and prompt_prefix in filename:
                return filename
        return None

    reference_video_name = _find_mp4(reference_folder)
    if not reference_video_name:
        logger.error(f"Reference video not found for model: {model_id} with backend: {ATTENTION_BACKEND}")
        raise FileNotFoundError("Reference video missing")

    generated_video_name = _find_mp4(output_dir)
    if not generated_video_name:
        logger.error(f"Generated video not found for model: {model_id} with backend: {ATTENTION_BACKEND}")
        raise FileNotFoundError(f"Generated video missing in {output_dir}")

    reference_video_path = os.path.join(reference_folder, reference_video_name)
    generated_video_path = os.path.join(output_dir, generated_video_name)

    logger.info(f"Computing SSIM between {reference_video_path} and {generated_video_path}")
    ssim_values = compute_video_ssim_torchvision(reference_video_path, generated_video_path, use_ms_ssim=True)

    mean_ssim = ssim_values[0]
    logger.info(f"SSIM mean value: {mean_ssim}")
    logger.info(f"Writing SSIM results to directory: {output_dir}")

    success = write_ssim_results(
        output_dir,
        ssim_values,
        reference_video_path,
        generated_video_path,
        num_inference_steps,
        prompt,
    )

    if not success:
        logger.error("Failed to write SSIM results to file")

    min_acceptable_ssim = 0.98
    assert mean_ssim >= min_acceptable_ssim, (
        f"SSIM value {mean_ssim} is below threshold {min_acceptable_ssim} for {model_id} with backend {ATTENTION_BACKEND}"
    )

fastvideo.tests.ssim.test_stable_audio_similarity

Latent-slice regression test for Stable Audio Open 1.0 text-to-audio.

Companion to test_ltx2_similarity.py — applies the same latent cosine-distance philosophy to 3-D audio latents [B, 64, T_latent].

Why latent-space and not waveform-space SSIM:

- dpmpp-3m-sde (k-diffusion) accumulates per-step bf16 noise; the Oobleck VAE then magnifies any residual drift into the time-domain waveform. A few mis-rounded accumulators drive sample-wise diff well past audible thresholds without indicating a real regression.
- Diffusers' own tests/pipelines/stable_audio/test_stable_audio.py compares decoded audio samples via np.abs(expected - actual).max() < 1.5e-3; that bound holds for CPU dummy components but does not survive cross-architecture bf16 on our CI pool (L40S/A40/H100/B200).
- Comparing the pre-VAE latent moves the assertion upstream of the dominant noise source.

Slice spec: audio_first_8_timesteps returns latent[0, :, :8] (= 64 channels × 8 latent timesteps = 512 elements). The full latent [1, 64, 1024] for SA-1.0 is also compared via cosine distance.
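
The slice spec can be illustrated with a small sketch. NumPy stands in for the latent tensor, the function below only mirrors the indexing `latent[0, :, :8]` quoted above and is not the FastVideo implementation, and `cosine_distance` is a hypothetical helper.

```python
import numpy as np


def audio_first_8_timesteps(latent: np.ndarray) -> np.ndarray:
    """Return latent[0, :, :8]: all channels, first 8 latent timesteps."""
    return latent[0, :, :8]


def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Return 1 - cosine similarity of the two flattened tensors."""
    a = a.astype(np.float64).ravel()
    b = b.astype(np.float64).ravel()
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


rng = np.random.default_rng(0)
# Random stand-in with the [1, 64, 1024] shape given for SA-1.0.
latent = rng.standard_normal((1, 64, 1024)).astype(np.float32)
signature = audio_first_8_timesteps(latent)

# A copy with small numerical noise, compared over the full latent.
noisy = latent + 1e-3 * rng.standard_normal(latent.shape).astype(np.float32)
full_distance = cosine_distance(latent, noisy)
```

The signature has shape `(64, 8)`, which is the 512 elements mentioned in the slice spec.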

Functions: