latent_similarity_utils
Latent-space regression helpers for numerically fragile SSIM tests.
Motivation
Pixel-space SSIM is a poor regression signal for distilled / few-step models (e.g. LTX-2 distilled): a single mis-rounded bf16 accumulator in the VAE decoder can drive mean SSIM from ~0.95 to ~0.50 without any real quality regression.
Inspired by diffusers' "small signature slice + bounded full-tensor distance" testing philosophy, applied here to the pre-VAE latent rather than the decoded pixel/audio output:
- tests/pipelines/ltx2/test_ltx2.py (diffusers) compares output_type='pt' (pixel) slices via torch.allclose(generated_slice, expected_slice, atol=1e-4);
- tests/pipelines/stable_audio/test_stable_audio.py (diffusers) compares decoded audio samples via np.abs(expected - actual).max() < 1.5e-3;
- tests/pipelines/cogvideo/test_cogvideox.py (diffusers) compares full pixel video tensors via numpy_cosine_similarity_distance(...) < 1e-3.
Diffusers does not assert on latents directly — that is a FastVideo adaptation. Distilled few-step pipelines amplify per-step bf16 noise enough that VAE-decoded comparisons are unreliable across our heterogeneous CI pool, so we move the assertion upstream of the VAE.
Design
- Inference is run with output_type='latent' so DecodingStage hands back the un-decoded latent on result["samples"].
- The reference artefact is a .pt bundle (tensor + metadata) hosted on the same HF dataset as the mp4 references, selected by <GPU>_reference_videos/<model_id>/<backend>/<prompt>.pt.
- Two assertions are performed:
    1. A small signature slice (default video: latent[0, :, 0, :3, :3]; audio: latent[0, :, :8]) is compared via cosine distance against the slice tolerance. Primary pass/fail gate.
    2. The full latent is compared via cosine distance against the full-tensor tolerance, guarding against shape-correct but globally drifted outputs.
- Tolerances default to 5e-3 (slice) and 1e-2 (full). diffusers uses 1e-3 against deterministic CPU dummy components; we relax for cross-GPU-arch bf16 differences on the rented CI pool (A40/L40S/H100/B200).
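The two assertions above can be sketched as follows. This is a minimal illustration and not the helper's actual code: it uses NumPy arrays as a stand-in for torch tensors, and the names cosine_distance, check_latent and the metric keys are hypothetical. Only the default video slice and the two default tolerances are taken from the design notes, and the strict < comparison is an assumption.

```python
import numpy as np


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two arrays, flattened to 1-D."""
    a = np.asarray(a, dtype=np.float64).ravel()
    b = np.asarray(b, dtype=np.float64).ravel()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        raise ValueError("cosine distance is undefined for an all-zero input")
    return float(1.0 - np.dot(a, b) / denom)


def check_latent(generated, reference, slice_threshold=5e-3, full_threshold=1e-2):
    """Hypothetical two-stage gate: signature slice first, then the full latent."""
    # Default video signature slice from the design notes: latent[0, :, 0, :3, :3]
    gen_slice = generated[0, :, 0, :3, :3]
    ref_slice = reference[0, :, 0, :3, :3]
    metrics = {
        "slice_cosine_distance": cosine_distance(gen_slice, ref_slice),
        "full_cosine_distance": cosine_distance(generated, reference),
    }
    assert metrics["slice_cosine_distance"] < slice_threshold, metrics
    assert metrics["full_cosine_distance"] < full_threshold, metrics
    return metrics


# Synthetic 5-D "video latent" plus a small perturbation standing in for bf16 noise.
rng = np.random.default_rng(0)
reference = rng.standard_normal((1, 8, 4, 16, 16)).astype(np.float32)
generated = reference + 1e-3 * rng.standard_normal(reference.shape).astype(np.float32)
metrics = check_latent(generated, reference)
```

Cosine distance ignores a global rescaling of the latent, which is one reason the full-tensor check is paired with a fixed-position slice rather than used alone.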
The helper intentionally reuses build_init_kwargs /
build_generation_kwargs from :mod:inference_similarity_utils so
model params (vae tiling, sp_size, flow shift, …) flow through a single
source of truth.
Classes
Functions
fastvideo.tests.ssim.latent_similarity_utils.load_latent_reference
Inverse of :func:save_latent_reference — always loads to cpu.
Enforces format_version == LATENT_REFERENCE_FORMAT_VERSION so a
schema change forces a deliberate reseed instead of silently
misinterpreting old artefacts.
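A version gate of this kind might look like the sketch below. It is an assumption-laden illustration: the constant's value, the function name check_format_version and the choice of ValueError are all hypothetical, and the real loader may report the mismatch differently.

```python
# Placeholder value; the real constant lives in latent_similarity_utils.
LATENT_REFERENCE_FORMAT_VERSION = 1


def check_format_version(bundle):
    """Reject bundles whose schema version differs from the expected one."""
    found = bundle.get("format_version")
    if found != LATENT_REFERENCE_FORMAT_VERSION:
        # Failing loudly forces a deliberate reseed of the reference artefact.
        raise ValueError(
            f"latent reference has format_version={found!r}, "
            f"expected {LATENT_REFERENCE_FORMAT_VERSION}; reseed the reference"
        )
    return bundle
```

An exact equality check (rather than "at least version N") means older and newer artefacts are both rejected, which matches the stated goal of never silently misinterpreting a bundle.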
Source code in fastvideo/tests/ssim/latent_similarity_utils.py
fastvideo.tests.ssim.latent_similarity_utils.run_text_to_latent_similarity_test
run_text_to_latent_similarity_test(*, logger: Logger, script_dir: str, device_reference_folder: str, prompt: str, attention_backend_name: str, model_id: str, default_params_map: dict[str, dict[str, object]], full_quality_params_map: dict[str, dict[str, object]], slice_cosine_threshold: float = 0.005, full_cosine_threshold: float = 0.01, init_kwargs_override: dict[str, object] | None = None, generation_kwargs_override: dict[str, object] | None = None, slice_spec: dict[str, Any] | None = None) -> dict[str, float]
Run T2V (or T2A) inference with output_type='latent' and
compare to a reference latent.
Returns the computed metrics dict on success. Raises
AssertionError if any cosine tolerance is exceeded and
FileNotFoundError if the reference artefact is missing.
fastvideo.tests.ssim.latent_similarity_utils.save_latent_reference
save_latent_reference(path: str, latent: Tensor, *, metadata: dict[str, Any], slice_spec: dict[str, Any] | None = None) -> None
Persist a latent bundle to path.
Storage format (dict pickled via torch.save):
- latent: full latent as fp16 on cpu
- shape: original shape (list)
- dtype_original: str
- expected_slice: fp32 1-D signature slice
- slice_spec: dict describing how the slice was built
- metadata: caller-provided context (prompt, backend, steps, …)
- format_version: int
fp16 is lossy but bounded; it keeps ref artefacts small (~a few MB per prompt) while preserving enough dynamic range for cosine-based regression. Slice values stay fp32 because the primary assertion is computed against them.
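The effect of the fp16 round trip on a cosine-based check can be illustrated with a synthetic latent. This sketch uses NumPy instead of torch and random data instead of a real latent, so the exact figures are not representative of any model; the helper name cosine_distance is hypothetical.

```python
import numpy as np


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two arrays, flattened to 1-D."""
    a = np.asarray(a, dtype=np.float64).ravel()
    b = np.asarray(b, dtype=np.float64).ravel()
    return float(1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


rng = np.random.default_rng(0)
latent = rng.standard_normal((1, 8, 4, 16, 16)).astype(np.float32)

stored = latent.astype(np.float16)    # what would be written to the bundle
restored = stored.astype(np.float32)  # what a comparison would see after loading

# Distance introduced purely by the fp16 storage format.
roundtrip_distance = cosine_distance(latent, restored)
# Storage footprint relative to keeping the latent in fp32.
size_ratio = stored.nbytes / latent.nbytes
```

For unit-scale values like these, the round-trip distance sits far below the 5e-3 and 1e-2 default tolerances, while the stored tensor takes half the space of its fp32 counterpart.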
fastvideo.tests.ssim.latent_similarity_utils.write_latent_similarity_results
write_latent_similarity_results(output_dir: str, metrics: dict[str, float], *, reference_path: str, generated_path: str, num_inference_steps: int, prompt: str, model_id: str, attention_backend_name: str, slice_spec: dict[str, Any], slice_cosine_threshold: float, full_cosine_threshold: float, passed: bool) -> bool
Persist latent regression metrics next to the generated artefact.
Mirrors :func:fastvideo.tests.utils.write_ssim_results so downstream
CI tooling can scrape one schema for both pixel and latent runs.
The filename is steps{N}_{prompt[:100]}_latent.json.
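The filename pattern can be reproduced with a plain f-string, as in the sketch below. The function name latent_results_filename is hypothetical, and whether the real helper sanitises the prompt beyond truncating it to 100 characters is not stated here, so no sanitising is shown.

```python
def latent_results_filename(num_inference_steps, prompt):
    """Build the results filename: steps{N}_{prompt[:100]}_latent.json."""
    # Only the first 100 characters of the prompt are kept.
    return f"steps{num_inference_steps}_{prompt[:100]}_latent.json"


short_name = latent_results_filename(8, "a cat on a skateboard")
long_name = latent_results_filename(8, "x" * 150)
```

Because long prompts are truncated, two prompts that share their first 100 characters and step count would map to the same filename.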