stable_audio
Modules
fastvideo.pipelines.basic.stable_audio.presets
Stable Audio presets.
Sampling defaults track the published HF model card (https://huggingface.co/stabilityai/stable-audio-open-1.0): 100 steps, CFG=7, dpmpp-3m-sde, sigma_min=0.3, sigma_max=500, rho=1.0.
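The sigma_min, sigma_max, and rho defaults above can be turned into a concrete noise schedule. The sketch below assumes a polyexponential schedule modeled on k-diffusion's get_sigmas_polyexponential; the source lists the parameters but does not name the schedule, so treat the choice, and the helper name polyexponential_sigmas, as illustrative assumptions.

```python
import math


def polyexponential_sigmas(steps: int = 100,
                           sigma_min: float = 0.3,
                           sigma_max: float = 500.0,
                           rho: float = 1.0) -> list[float]:
    """Hypothetical helper: noise levels from sigma_max down to sigma_min, then 0."""
    log_min, log_max = math.log(sigma_min), math.log(sigma_max)
    sigmas = []
    for i in range(steps):
        # ramp runs from 1 (first step) to 0 (last step); rho bends the curve.
        ramp = (1.0 - i / (steps - 1)) ** rho
        sigmas.append(math.exp(ramp * (log_max - log_min) + log_min))
    # Samplers in this family expect a trailing sigma of exactly zero.
    sigmas.append(0.0)
    return sigmas


sigmas = polyexponential_sigmas()
```

With rho=1.0 the schedule is a straight line in log-sigma space, so each step divides the noise level by the same factor.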
Classes
fastvideo.pipelines.basic.stable_audio.stable_audio_pipeline
Stable Audio Open 1.0 pipeline (T2A + A2A + RePaint inpainting).
Components are loaded via the standard
ComposedPipelineBase.load_modules against the FastVideo-curated
Diffusers-format repo FastVideo/stable-audio-open-1.0-Diffusers
(produced by
scripts/checkpoint_conversion/stable_audio_to_diffusers.py). The DiT
is a BaseDiT subclass loaded by TransformerLoader; the VAE is
loaded by VAELoader; the multi-conditioner (T5 + NumberConditioners)
is loaded by ConditionerLoader (a Stable Audio-specific addition).
Stages:
InputValidationStage
→ StableAudioConditioningStage (T5 + NumberConditioner -> cross-attn + global cond, with CFG)
→ StableAudioLatentPreparationStage (initial Gaussian noise; encodes A2A / inpaint refs)
→ StableAudioDenoisingStage (k-diffusion `dpmpp-3m-sde` over the DiT)
→ StableAudioDecodingStage (OobleckVAE -> waveform)
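The stage order above can be pictured as a chain of callables that each receive and return the batch. This is only a minimal sketch of that control flow: the dict-based batch, make_stage, and run_pipeline are hypothetical stand-ins, not the ComposedPipelineBase implementation.

```python
from typing import Callable

Batch = dict  # stand-in for the real batch object (assumption)
Stage = Callable[[Batch], Batch]

STAGE_ORDER = [
    "InputValidationStage",
    "StableAudioConditioningStage",
    "StableAudioLatentPreparationStage",
    "StableAudioDenoisingStage",
    "StableAudioDecodingStage",
]


def make_stage(name: str) -> Stage:
    """Build a placeholder stage that only records that it ran."""
    def run(batch: Batch) -> Batch:
        batch.setdefault("trace", []).append(name)
        return batch
    return run


def run_pipeline(batch: Batch, stages: list[Stage]) -> Batch:
    """Feed the batch through every stage in order."""
    for stage in stages:
        batch = stage(batch)
    return batch


result = run_pipeline({"prompt": "rain on a tin roof"},
                      [make_stage(name) for name in STAGE_ORDER])
```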
Classes
fastvideo.pipelines.basic.stable_audio.stable_audio_pipeline.StableAudioPipeline
StableAudioPipeline(model_path: str, fastvideo_args: FastVideoArgs | TrainingArgs, required_config_modules: list[str] | None = None, loaded_modules: dict[str, Module] | None = None)
Bases: ComposedPipelineBase
Stable Audio Open 1.0 pipeline.
The mode is selected by the keyword arguments passed to generate_video():
* Text-to-audio (default) -- prompt=..., audio_end_in_s=...
* Audio-to-audio variation -- add init_audio=ref (and optionally
init_noise_level, lower = closer to reference)
* RePaint inpainting / outpainting -- add inpaint_audio=ref and
inpaint_mask (1-D, 1 = keep / 0 = regenerate)
See examples/inference/basic/basic_stable_audio*.py for runnable
examples of each mode.
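The kwargs-driven mode selection described above could look roughly like the following. The helper select_mode, its return labels, the precedence of inpaint_audio over init_audio, and the error on a missing mask are all assumptions made for illustration; the pipeline's actual dispatch may differ.

```python
def select_mode(**kwargs) -> str:
    """Hypothetical helper: infer the generation mode from the kwargs."""
    if kwargs.get("inpaint_audio") is not None:
        # Inpainting needs a mask saying which samples to keep (1) or regenerate (0).
        if kwargs.get("inpaint_mask") is None:
            raise ValueError("inpaint_audio requires inpaint_mask")
        return "inpaint"
    if kwargs.get("init_audio") is not None:
        return "audio-to-audio"
    # No reference audio supplied: plain text-to-audio.
    return "text-to-audio"


mode = select_mode(prompt="warm analog synth pad", audio_end_in_s=10.0)
```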
Source code in fastvideo/pipelines/composed_pipeline_base.py
Functions
fastvideo.pipelines.basic.stable_audio.stable_audio_pipeline.StableAudioPipeline.initialize_pipeline
initialize_pipeline(fastvideo_args: FastVideoArgs) -> None
Apply Stable Audio's process-global numerics overrides BEFORE the standard component loaders run (TF32 off for A2A renoise determinism).
Source code in fastvideo/pipelines/basic/stable_audio/stable_audio_pipeline.py
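For readers unfamiliar with the TF32 switch mentioned above, this is one way to turn it off process-wide in PyTorch. The wrapper disable_tf32 is a hypothetical name, and whether initialize_pipeline sets exactly these two flags is an assumption; the flags themselves are standard PyTorch backend settings.

```python
import importlib.util


def disable_tf32() -> bool:
    """Hypothetical helper: turn TF32 off for matmuls and cuDNN convolutions.

    Returns False without touching anything when PyTorch is not installed.
    """
    if importlib.util.find_spec("torch") is None:
        return False
    import torch

    # Both flags are process-global, so set them before any model is loaded.
    torch.backends.cuda.matmul.allow_tf32 = False
    torch.backends.cudnn.allow_tf32 = False
    return True


tf32_disabled = disable_tf32()
```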
Functions
fastvideo.pipelines.basic.stable_audio.stages
Classes
fastvideo.pipelines.basic.stable_audio.stages.StableAudioConditioningStage
Bases: PipelineStage
Run the conditioner over the prompt + duration and stash the
DiT-ready (cross_attn_cond, cross_attn_mask, global_embed) triple
on batch.extra (plus the negative-prompt triple when CFG is on).
Source code in fastvideo/pipelines/basic/stable_audio/stages/conditioning.py
fastvideo.pipelines.basic.stable_audio.stages.StableAudioDecodingStage
Bases: PipelineStage
Decode the latent into an audio waveform, then slice it to [start, end].
Source code in fastvideo/pipelines/basic/stable_audio/stages/decoding.py
fastvideo.pipelines.basic.stable_audio.stages.StableAudioDenoisingStage
Bases: PipelineStage
k-diffusion dpmpp-3m-sde sampling loop.
Source code in fastvideo/pipelines/basic/stable_audio/stages/denoising.py
fastvideo.pipelines.basic.stable_audio.stages.StableAudioLatentPreparationStage
StableAudioLatentPreparationStage(io_channels: int = 64, sample_size: int = 2097152, vae=None, sample_rate: int = 44100, audio_channels: int = 2)
Bases: PipelineStage
Source code in fastvideo/pipelines/basic/stable_audio/stages/latent_preparation.py
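The constructor defaults imply a maximum clip length and a latent tensor size, worked out below. The downsampling ratio of 2048 and the (batch, channels, length) latent layout are assumptions not stated on this page; the other numbers are the defaults from the signature.

```python
SAMPLE_SIZE = 2_097_152      # default sample_size, in audio samples
SAMPLE_RATE = 44_100         # default sample_rate, in Hz
IO_CHANNELS = 64             # default io_channels (latent channels)
DOWNSAMPLING_RATIO = 2048    # assumption: VAE samples-per-latent-frame

# Longest clip the default window can hold, in seconds.
max_duration_s = SAMPLE_SIZE / SAMPLE_RATE

# Number of latent frames covering the full window.
latent_length = SAMPLE_SIZE // DOWNSAMPLING_RATIO

# Assumed layout of the initial noise tensor for a batch of one.
latent_shape = (1, IO_CHANNELS, latent_length)
```

Under these assumptions the default window covers roughly 47.55 seconds of audio.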
Modules
fastvideo.pipelines.basic.stable_audio.stages.conditioning
Stable Audio conditioning stage.
Classes
fastvideo.pipelines.basic.stable_audio.stages.conditioning.StableAudioConditioningStage
Bases: PipelineStage
Run the conditioner over the prompt + duration and stash the
DiT-ready (cross_attn_cond, cross_attn_mask, global_embed) triple
on batch.extra (plus the negative-prompt triple when CFG is on).
Source code in fastvideo/pipelines/basic/stable_audio/stages/conditioning.py
fastvideo.pipelines.basic.stable_audio.stages.decoding
Stable Audio decoding: latent -> waveform via OobleckVAE.
Slices the output to [audio_start_in_s, audio_end_in_s] and stashes
the result on batch.extra["audio"] and batch.extra["audio_sample_rate"]
for VideoGenerator._mux_audio to pick up.
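Slicing to a time window amounts to converting seconds into sample indices. A minimal single-channel sketch follows; slice_waveform is a hypothetical helper, and the rounding and clamping choices are assumptions rather than the stage's documented behavior.

```python
def slice_waveform(samples: list[float], sample_rate: int,
                   start_s: float, end_s: float) -> list[float]:
    """Hypothetical helper: keep only the samples between start_s and end_s."""
    # Convert seconds to sample indices, clamped to the decoded length.
    start = max(0, int(round(start_s * sample_rate)))
    end = min(len(samples), int(round(end_s * sample_rate)))
    return samples[start:end]


decoded = [0.0] * (44_100 * 5)   # five seconds of silence at 44.1 kHz
clip = slice_waveform(decoded, 44_100, start_s=1.0, end_s=3.0)
```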
Classes
fastvideo.pipelines.basic.stable_audio.stages.decoding.StableAudioDecodingStage
Bases: PipelineStage
Decode the latent into an audio waveform, then slice it to [start, end].
Source code in fastvideo/pipelines/basic/stable_audio/stages/decoding.py
fastvideo.pipelines.basic.stable_audio.stages.denoising
Stable Audio denoising: k-diffusion dpmpp-3m-sde over the DiT.
CFG-batched conditioning is built once outside the sampler loop so the
adapter only does cat([x, x]) + DiT call per step.
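The per-step work described above, duplicating x and combining the two model outputs, can be sketched with plain lists. The function cfg_step, the toy model, the conditional-first ordering of the pair, and the standard classifier-free guidance formula uncond + scale * (cond - uncond) are assumptions for illustration, not the adapter's actual code.

```python
from typing import Callable

Vector = list[float]
Model = Callable[[list[Vector]], list[Vector]]


def cfg_step(x: Vector, model: Model, cfg_scale: float = 7.0) -> Vector:
    """Hypothetical helper: one guided model evaluation on a duplicated input."""
    batch = [x, x]  # plays the role of cat([x, x])
    # Assumed ordering: first row is conditional, second is unconditional.
    cond_out, uncond_out = model(batch)
    return [u + cfg_scale * (c - u) for c, u in zip(cond_out, uncond_out)]


def toy_model(batch: list[Vector]) -> list[Vector]:
    """Toy stand-in for the DiT: the conditional row is shifted by +1."""
    return [[v + 1.0 for v in batch[0]], list(batch[1])]


guided = cfg_step([1.0, 2.0], toy_model, cfg_scale=7.0)
```

Because the conditioning tensors are built once up front, the only per-step cost beyond the model call is the duplication and the final blend.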
Classes
fastvideo.pipelines.basic.stable_audio.stages.denoising.StableAudioDenoisingStage
Bases: PipelineStage
k-diffusion dpmpp-3m-sde sampling loop.
Source code in fastvideo/pipelines/basic/stable_audio/stages/denoising.py
Functions
fastvideo.pipelines.basic.stable_audio.stages.latent_preparation
Stable Audio latent preparation.
Seeds + samples the initial Gaussian noise; encodes init_audio (A2A
variation) or inpaint_audio + inpaint_mask (RePaint inpainting) into
latent-space tensors on batch.extra for the denoising stage.
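In RePaint-style inpainting, the mask decides per position whether the reference or the freshly generated value is kept. The sketch below shows only that blend, using the 1 = keep / 0 = regenerate convention from this page; repaint_blend is a hypothetical helper, and it assumes the reference has already been encoded and brought to the same noise level as the generated latent.

```python
def repaint_blend(generated: list[float], reference: list[float],
                  mask: list[float]) -> list[float]:
    """Hypothetical helper: keep the reference where mask is 1, else the generated value."""
    if not (len(generated) == len(reference) == len(mask)):
        raise ValueError("generated, reference, and mask must have equal length")
    return [m * r + (1.0 - m) * g
            for g, r, m in zip(generated, reference, mask)]


blended = repaint_blend(generated=[10.0, 20.0, 30.0, 40.0],
                        reference=[1.0, 2.0, 3.0, 4.0],
                        mask=[1.0, 1.0, 0.0, 0.0])
```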
Classes
fastvideo.pipelines.basic.stable_audio.stages.latent_preparation.StableAudioLatentPreparationStage
StableAudioLatentPreparationStage(io_channels: int = 64, sample_size: int = 2097152, vae=None, sample_rate: int = 44100, audio_channels: int = 2)
Bases: PipelineStage