test_stable_audio_similarity
Latent-slice regression test for Stable Audio Open 1.0 text-to-audio.
Companion to test_ltx2_similarity.py: it applies the same latent
cosine-distance approach to 3-D audio latents of shape [B, 64, T_latent].
Why the comparison runs in latent space rather than using waveform-space SSIM:
- The dpmpp-3m-sde sampler (k-diffusion) accumulates bf16 rounding noise
at every step; the Oobleck VAE then magnifies any residual drift in the
decoded time-domain waveform. A few mis-rounded accumulators can push the
sample-wise difference well past audible thresholds without indicating a
real regression.
- Diffusers' own tests/pipelines/stable_audio/test_stable_audio.py compares
decoded audio samples via np.abs(expected - actual).max() < 1.5e-3. That
bound holds for CPU dummy components but does not survive bf16 inference
across the GPU architectures in our CI pool (L40S/A40/H100/B200).
- Comparing the pre-VAE latent moves the assertion upstream of
the dominant noise source.
Slice spec: audio_first_8_timesteps returns latent[0, :, :8], i.e.
64 channels × 8 latent timesteps = 512 elements. The full latent,
[1, 64, 1024] for Stable Audio Open 1.0, is also compared via cosine
distance.
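
The slice spec and the cosine-distance comparison can be sketched as below. This is a minimal illustration, not the test's actual implementation: the body of audio_first_8_timesteps is inferred from the description above, while cosine_distance, TOLERANCE, and the synthetic reference/candidate latents are hypothetical names and values chosen only for the example.

```python
import numpy as np


def audio_first_8_timesteps(latent: np.ndarray) -> np.ndarray:
    """Return latent[0, :, :8]: all channels, first 8 latent timesteps."""
    if latent.ndim != 3:
        raise ValueError(f"expected [B, C, T_latent], got shape {latent.shape}")
    return latent[0, :, :8]


def cosine_distance(a: np.ndarray, b: np.ndarray, eps: float = 1e-12) -> float:
    """1 - cosine similarity of the flattened inputs (hypothetical helper)."""
    a = np.asarray(a, dtype=np.float64).reshape(-1)
    b = np.asarray(b, dtype=np.float64).reshape(-1)
    # Guard against division by zero for all-zero inputs.
    denom = max(np.linalg.norm(a) * np.linalg.norm(b), eps)
    return float(1.0 - np.dot(a, b) / denom)


# Synthetic stand-ins for a stored reference latent and a fresh run.
rng = np.random.default_rng(0)
reference = rng.standard_normal((1, 64, 1024)).astype(np.float32)
noise = rng.standard_normal(reference.shape).astype(np.float32)
candidate = reference + 1e-3 * noise  # small drift, assumed for illustration

slice_distance = cosine_distance(
    audio_first_8_timesteps(reference), audio_first_8_timesteps(candidate)
)
full_distance = cosine_distance(reference, candidate)

TOLERANCE = 1e-3  # hypothetical threshold, not taken from the real test
assert slice_distance < TOLERANCE
assert full_distance < TOLERANCE
```

Accumulating in float64 inside the helper keeps the metric itself from adding rounding noise on top of whatever drift the bf16 pipeline produced.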