test_ltx2_similarity
Latent-slice regression test for LTX-2 distilled text-to-video.
Pixel-space SSIM is not a useful signal for this model: with 4 distilled steps, bf16 attention, and tiled VAE decode, runs that all pass visual QA can still differ widely in pixel space, so an SSIM threshold cannot separate real regressions from normal run-to-run variation.
Inspired by diffusers' slice-vs-full regression philosophy; see
diffusers/tests/pipelines/ltx2/test_ltx2.py (compares pixel slices
via torch.allclose(..., atol=1e-4)) and
diffusers/tests/pipelines/cogvideo/test_cogvideox.py (compares full
pixel tensors via numpy_cosine_similarity_distance(...) < 1e-3).
Diffusers itself does NOT compare latents; we apply the same "small
signature slice + bounded full-tensor distance" idea to the pre-VAE
latent because distilled few-step pipelines amplify per-step bf16
noise enough that VAE-decoded comparisons are unreliable.
Parameters are kept identical to the original SSIM run so that reference artefacts generated on Modal L40S remain bit-compatible with production inference.
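
The "small signature slice + bounded full-tensor distance" idea described above can be sketched roughly as follows. This is a minimal illustration in NumPy, not the test's actual implementation: the helper names (`cosine_distance`, `latents_match`), the slice length, and both tolerances are hypothetical placeholders, and real thresholds would need to be calibrated against the reference latents.

```python
import numpy as np


def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Return 1 - cosine similarity of the two flattened arrays."""
    a = a.astype(np.float64).ravel()
    b = b.astype(np.float64).ravel()
    return float(1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def latents_match(
    latent: np.ndarray,
    reference: np.ndarray,
    slice_len: int = 8,          # hypothetical signature length
    slice_atol: float = 1e-2,    # hypothetical; looser than a pixel-space 1e-4
    max_cos_dist: float = 1e-3,  # hypothetical bound on full-tensor distance
) -> bool:
    """Compare a pre-VAE latent against a stored reference latent."""
    if latent.shape != reference.shape:
        return False
    # 1) Signature slice: the last few elements of the flattened latent.
    sig = latent.ravel()[-slice_len:]
    ref_sig = reference.ravel()[-slice_len:]
    slice_ok = np.allclose(sig, ref_sig, atol=slice_atol, rtol=0.0)
    # 2) Bounded full-tensor distance over the whole latent.
    full_ok = cosine_distance(latent, reference) < max_cos_dist
    return bool(slice_ok and full_ok)
```

The slice check catches localized drift cheaply and yields a short, readable diff when it fails, while the cosine bound guards against a latent that matches on the slice but diverges elsewhere.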