test_stable_audio_similarity

Latent-slice regression test for Stable Audio Open 1.0 text-to-audio.

Companion to test_ltx2_similarity.py — applies the same latent cosine-distance philosophy to 3-D audio latents [B, 64, T_latent].

Why latent-space and not waveform-space SSIM:

- dpmpp-3m-sde (k-diffusion) accumulates per-step bf16 noise; the Oobleck VAE then magnifies any residual drift into the time-domain waveform. A few mis-rounded accumulators drive sample-wise diff well past audible thresholds without indicating a real regression.
- Diffusers' own tests/pipelines/stable_audio/test_stable_audio.py compares decoded audio samples via np.abs(expected - actual).max() < 1.5e-3; that bound holds for CPU dummy components but does not survive cross-architecture bf16 on our CI pool (L40S/A40/H100/B200).
- Comparing the pre-VAE latent moves the assertion upstream of the dominant noise source.
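To make the contrast between the two kinds of assertion concrete, here is a minimal NumPy sketch that applies both metrics to the same synthetic perturbation. It does not model the sampler or the Oobleck VAE; the helper `cosine_distance`, the noise scale `1e-2`, and the random "latent" are all assumptions chosen for illustration, not values taken from the test.

```python
import numpy as np


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two arrays, flattened to vectors."""
    a = np.asarray(a, dtype=np.float64).ravel()
    b = np.asarray(b, dtype=np.float64).ravel()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - float(np.dot(a, b) / denom)


# Synthetic stand-in for a [B, 64, T_latent] latent (not real model output).
rng = np.random.default_rng(0)
reference = rng.standard_normal((1, 64, 1024))

# Small element-wise perturbation, standing in for accumulated rounding noise.
noisy = reference + 1e-2 * rng.standard_normal(reference.shape)

# Element-wise bound in the style of np.abs(expected - actual).max().
max_abs_diff = float(np.abs(reference - noisy).max())

# Direction-based comparison over the whole tensor.
distance = cosine_distance(reference, noisy)

print(f"max abs diff:    {max_abs_diff:.4f}")
print(f"cosine distance: {distance:.2e}")
```

In this sketch the worst-case element exceeds a 1.5e-3 bound even though the two tensors point in almost the same direction, which is the property a cosine-distance assertion relies on. The actual tolerance used by the test is not stated here and should be read from the test itself.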

Slice spec: audio_first_8_timesteps returns latent[0, :, :8] (= 64 channels × 8 latent timesteps = 512 elements). The full latent [1, 64, 1024] for SA-1.0 is also compared via cosine distance.
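The slice arithmetic can be checked with a short sketch. The function body below is an assumed reimplementation based only on the description above (the real `audio_first_8_timesteps` may accept a different tensor type or perform extra validation), and the zero-filled array is a placeholder for a real latent.

```python
import numpy as np


def audio_first_8_timesteps(latent):
    """Assumed sketch: first 8 latent timesteps of batch item 0, all channels."""
    latent = np.asarray(latent)
    if latent.ndim != 3:
        raise ValueError(f"expected [B, C, T_latent], got shape {latent.shape}")
    return latent[0, :, :8]


# Placeholder with the full-latent shape quoted for SA-1.0: [1, 64, 1024].
latent = np.zeros((1, 64, 1024), dtype=np.float32)
latent_slice = audio_first_8_timesteps(latent)

print(latent_slice.shape)  # 64 channels x 8 latent timesteps
print(latent_slice.size)   # number of elements in the slice
```

Keeping the slice small gives a compact reference to store alongside the test, while the full-latent cosine distance covers the remaining timesteps.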

Classes

Functions