stable_audio

Modules

fastvideo.pipelines.basic.stable_audio.presets

Stable Audio presets.

Sampling defaults track the published HF model card (https://huggingface.co/stabilityai/stable-audio-open-1.0): 100 steps, CFG=7, dpmpp-3m-sde, sigma_min=0.3, sigma_max=500, rho=1.0.
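
To make these defaults concrete, the sketch below collects them in a small dataclass and expands the three sigma values into a noise schedule. `StableAudioSamplingDefaults` and `polyexponential_sigmas` are hypothetical names, not FastVideo APIs, and the polyexponential form of the schedule is an assumption; the preset module may build its sigmas differently.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class StableAudioSamplingDefaults:
    """Hypothetical container for the defaults listed above."""
    num_inference_steps: int = 100
    guidance_scale: float = 7.0
    sampler: str = "dpmpp-3m-sde"
    sigma_min: float = 0.3
    sigma_max: float = 500.0
    rho: float = 1.0


def polyexponential_sigmas(steps, sigma_min, sigma_max, rho):
    """Assumed schedule: polynomial in log-sigma, with a trailing 0."""
    log_min, log_max = math.log(sigma_min), math.log(sigma_max)
    sigmas = []
    for i in range(steps):
        # ramp runs linearly from 1 down to 0 across the steps
        ramp = 1.0 - i / (steps - 1) if steps > 1 else 1.0
        sigmas.append(math.exp((ramp ** rho) * (log_max - log_min) + log_min))
    sigmas.append(0.0)  # final step lands on the clean sample
    return sigmas


defaults = StableAudioSamplingDefaults()
sigmas = polyexponential_sigmas(defaults.num_inference_steps,
                                defaults.sigma_min,
                                defaults.sigma_max,
                                defaults.rho)
```

With rho=1.0 the schedule is a straight line in log-sigma, so the noise level falls by a constant factor at each step.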

fastvideo.pipelines.basic.stable_audio.stable_audio_pipeline

Stable Audio Open 1.0 pipeline (T2A + A2A + RePaint inpainting).

Components are loaded via the standard ComposedPipelineBase.load_modules against the FastVideo-curated Diffusers-format repo FastVideo/stable-audio-open-1.0-Diffusers (produced by scripts/checkpoint_conversion/stable_audio_to_diffusers.py). The DiT is a BaseDiT subclass loaded by TransformerLoader; the VAE is loaded by VAELoader; the multi-conditioner (T5 + NumberConditioners) is loaded by ConditionerLoader (a Stable Audio-specific addition).

Stages:

InputValidationStage
  → StableAudioConditioningStage      (T5 + NumberConditioner -> cross-attn + global cond, with CFG)
  → StableAudioLatentPreparationStage (initial Gaussian noise; encodes A2A / inpaint refs)
  → StableAudioDenoisingStage         (k-diffusion `dpmpp-3m-sde` over the DiT)
  → StableAudioDecodingStage          (OobleckVAE -> waveform)
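
The chain above can be read as a plain sequential composition: each stage receives the batch left by the previous one. The following is a toy sketch of that idea only; `ToyBatch`, `ToyStage`, and `run_stages` are hypothetical stand-ins and do not reflect how ComposedPipelineBase registers or executes its stages.

```python
from dataclasses import dataclass, field


@dataclass
class ToyBatch:
    """Hypothetical stand-in for the batch object passed between stages."""
    prompt: str
    extra: dict = field(default_factory=dict)
    trace: list = field(default_factory=list)


class ToyStage:
    """Hypothetical stage: records its name and hands the batch on."""

    def __init__(self, name):
        self.name = name

    def __call__(self, batch):
        batch.trace.append(self.name)
        return batch


def run_stages(stages, batch):
    # Each stage consumes the batch produced by the previous one.
    for stage in stages:
        batch = stage(batch)
    return batch


stage_names = [
    "InputValidationStage",
    "StableAudioConditioningStage",
    "StableAudioLatentPreparationStage",
    "StableAudioDenoisingStage",
    "StableAudioDecodingStage",
]
result = run_stages([ToyStage(name) for name in stage_names],
                    ToyBatch(prompt="rain on a tin roof"))
```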

Classes

fastvideo.pipelines.basic.stable_audio.stable_audio_pipeline.StableAudioPipeline
StableAudioPipeline(model_path: str, fastvideo_args: FastVideoArgs | TrainingArgs, required_config_modules: list[str] | None = None, loaded_modules: dict[str, Module] | None = None)

Bases: ComposedPipelineBase

Stable Audio Open 1.0 pipeline.

Mode is kwargs-driven on generate_video():

* Text-to-audio (default) -- prompt=..., audio_end_in_s=...
* Audio-to-audio variation -- add init_audio=ref (and optionally init_noise_level, lower = closer to reference)
* RePaint inpainting / outpainting -- add inpaint_audio=ref and inpaint_mask (1-D, 1 = keep / 0 = regenerate)

See examples/inference/basic/basic_stable_audio*.py for runnable examples of each mode.
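
As a rough illustration of the kwargs-to-mode mapping and the mask convention, the sketch below uses two hypothetical helpers, `select_mode` and `build_keep_mask`. Neither is part of FastVideo. The precedence of inpaint_audio over init_audio, the error raised for a missing mask, and the unit of the mask index (samples versus latent frames) are all assumptions of this sketch.

```python
def select_mode(init_audio=None, inpaint_audio=None, inpaint_mask=None):
    """Hypothetical helper mirroring the kwargs-to-mode mapping above."""
    if inpaint_audio is not None:
        if inpaint_mask is None:
            # Assumption: inpainting is only meaningful with a mask.
            raise ValueError("inpaint_audio needs an inpaint_mask")
        return "inpaint"
    if init_audio is not None:
        return "audio-to-audio"
    return "text-to-audio"


def build_keep_mask(length, regen_start, regen_end):
    """1-D mask: 1 = keep the reference, 0 = regenerate [regen_start, regen_end)."""
    return [0 if regen_start <= i < regen_end else 1 for i in range(length)]


# Regenerate positions 2, 3 and 4 of an 8-element reference.
mask = build_keep_mask(length=8, regen_start=2, regen_end=5)
```

Outpainting follows the same convention: place the zeros at the end (or start) of the mask rather than in the middle.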

Source code in fastvideo/pipelines/composed_pipeline_base.py
def __init__(self,
             model_path: str,
             fastvideo_args: FastVideoArgs | TrainingArgs,
             required_config_modules: list[str] | None = None,
             loaded_modules: dict[str, torch.nn.Module] | None = None):
    """
    Initialize the pipeline. After __init__, the pipeline should be ready to
    use. The pipeline should be stateless and not hold any batch state.
    """
    self.fastvideo_args = fastvideo_args

    self.model_path: str = model_path
    self._stages: list[PipelineStage] = []
    self._stage_name_mapping: dict[str, PipelineStage] = {}
    self._trace_mgr = None

    if required_config_modules is not None:
        self._required_config_modules = required_config_modules

    if self._required_config_modules is None:
        raise NotImplementedError("Subclass must set _required_config_modules")

    maybe_init_distributed_environment_and_model_parallel(fastvideo_args.tp_size, fastvideo_args.sp_size)

    # Torch profiler. Enabled and configured through env vars:
    # FASTVIDEO_TORCH_PROFILER_DIR=/path/to/save/trace
    trace_dir = envs.FASTVIDEO_TORCH_PROFILER_DIR
    self.profiler_controller = get_or_create_profiler(trace_dir)
    self.profiler = self.profiler_controller.profiler

    self.local_rank = get_world_group().local_rank

    # Load modules directly in initialization
    logger.info("Loading pipeline modules...")
    with self.profiler_controller.region("profiler_region_model_loading"):
        self.modules = self.load_modules(fastvideo_args, loaded_modules)
Functions
fastvideo.pipelines.basic.stable_audio.stable_audio_pipeline.StableAudioPipeline.initialize_pipeline
initialize_pipeline(fastvideo_args: FastVideoArgs) -> None

Apply Stable Audio's process-global numerics overrides BEFORE the standard component loaders run (TF32 off for A2A renoise determinism).

Source code in fastvideo/pipelines/basic/stable_audio/stable_audio_pipeline.py
def initialize_pipeline(self, fastvideo_args: FastVideoArgs) -> None:
    """Apply Stable Audio's process-global numerics overrides BEFORE
    the standard component loaders run (TF32 off for A2A renoise
    determinism)."""
    _disable_tf32_for_stable_audio()
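
The body of `_disable_tf32_for_stable_audio` is not shown on this page. As a rough sketch of what switching TF32 off usually involves in PyTorch, the hypothetical `disable_tf32` helper below clears the two process-wide flags that PyTorch exposes for matmul and cuDNN. Whether FastVideo sets exactly these flags is an assumption.

```python
import importlib.util


def disable_tf32() -> bool:
    """Hypothetical helper: turn off TF32 for the whole process.

    Returns False when PyTorch is not installed, True otherwise.
    """
    if importlib.util.find_spec("torch") is None:
        return False
    import torch

    # Both flags are process-global, so they affect every model in the process.
    torch.backends.cuda.matmul.allow_tf32 = False
    torch.backends.cudnn.allow_tf32 = False
    return True


tf32_disabled = disable_tf32()
```

Because the flags are global, setting them before any component is loaded keeps later matmuls and convolutions in full FP32 precision, which is the property the docstring relies on for reproducible A2A renoising.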

fastvideo.pipelines.basic.stable_audio.stages

Classes

fastvideo.pipelines.basic.stable_audio.stages.StableAudioConditioningStage
StableAudioConditioningStage(conditioner)

Bases: PipelineStage

Run the conditioner over the prompt + duration and stash the DiT-ready (cross_attn_cond, cross_attn_mask, global_embed) triple on batch.extra (plus the negative-prompt triple when CFG is on).

Source code in fastvideo/pipelines/basic/stable_audio/stages/conditioning.py
def __init__(self, conditioner) -> None:
    super().__init__()
    self.conditioner = conditioner
fastvideo.pipelines.basic.stable_audio.stages.StableAudioDecodingStage
StableAudioDecodingStage(vae)

Bases: PipelineStage

Decode latent → audio waveform + slice to [start, end].

Source code in fastvideo/pipelines/basic/stable_audio/stages/decoding.py
def __init__(self, vae) -> None:
    super().__init__()
    self.vae = vae
fastvideo.pipelines.basic.stable_audio.stages.StableAudioDenoisingStage
StableAudioDenoisingStage(transformer)

Bases: PipelineStage

k-diffusion dpmpp-3m-sde sampling loop.

Source code in fastvideo/pipelines/basic/stable_audio/stages/denoising.py
def __init__(self, transformer) -> None:
    super().__init__()
    self.transformer = transformer
fastvideo.pipelines.basic.stable_audio.stages.StableAudioLatentPreparationStage
StableAudioLatentPreparationStage(io_channels: int = 64, sample_size: int = 2097152, vae=None, sample_rate: int = 44100, audio_channels: int = 2)

Bases: PipelineStage

Source code in fastvideo/pipelines/basic/stable_audio/stages/latent_preparation.py
def __init__(self,
             io_channels: int = 64,
             sample_size: int = 2097152,
             vae=None,
             sample_rate: int = 44100,
             audio_channels: int = 2) -> None:
    super().__init__()
    self.io_channels = io_channels
    # Audio-domain length the model was trained for; latent length
    # = sample_size // vae.hop_length (= 2097152 / 2048 = 1024).
    self.sample_size = sample_size
    self.vae = vae  # used to encode init_audio / inpaint_audio
    self.sample_rate = sample_rate
    self.audio_channels = audio_channels
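
The arithmetic in the comment above can be checked directly. The constants below are the constructor defaults plus the hop length of 2048 quoted in that comment; the variable names are illustrative only.

```python
SAMPLE_SIZE = 2_097_152  # default sample_size, in audio samples
SAMPLE_RATE = 44_100     # default sample_rate, in Hz
HOP_LENGTH = 2_048       # VAE hop length quoted in the comment above

# Number of latent frames the DiT sees for a full-length clip.
latent_length = SAMPLE_SIZE // HOP_LENGTH

# Longest clip, in seconds, that fits in the training window.
max_duration_s = SAMPLE_SIZE / SAMPLE_RATE

# Temporal resolution of the latent sequence.
latent_frames_per_second = SAMPLE_RATE / HOP_LENGTH
```

With the defaults this gives 1024 latent frames, a window of roughly 47.55 seconds, and about 21.5 latent frames per second of audio.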

Modules

fastvideo.pipelines.basic.stable_audio.stages.conditioning

Stable Audio conditioning stage.

Classes
fastvideo.pipelines.basic.stable_audio.stages.conditioning.StableAudioConditioningStage
StableAudioConditioningStage(conditioner)

Bases: PipelineStage

Run the conditioner over the prompt + duration and stash the DiT-ready (cross_attn_cond, cross_attn_mask, global_embed) triple on batch.extra (plus the negative-prompt triple when CFG is on).

Source code in fastvideo/pipelines/basic/stable_audio/stages/conditioning.py
def __init__(self, conditioner) -> None:
    super().__init__()
    self.conditioner = conditioner
fastvideo.pipelines.basic.stable_audio.stages.decoding

Stable Audio decoding: latent -> waveform via OobleckVAE.

Slices the output to [audio_start_in_s, audio_end_in_s] and stores the result in batch.extra["audio"] and batch.extra["audio_sample_rate"] for VideoGenerator._mux_audio to pick up.
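
A minimal sketch of the slicing step, using nested lists in place of tensors. `slice_waveform` is a hypothetical helper; the rounding and clamping rules shown here are assumptions, and the real stage may convert seconds to sample indices differently.

```python
def slice_waveform(waveform, sample_rate, start_s, end_s):
    """Keep samples in [start_s, end_s) for every channel.

    waveform is a list of channels, each a list of samples.
    """
    start = max(0, int(round(start_s * sample_rate)))
    end = min(len(waveform[0]), int(round(end_s * sample_rate)))
    return [channel[start:end] for channel in waveform]


# Toy stereo signal: 10 "seconds" at 4 samples per second.
toy = [list(range(40)), list(range(100, 140))]
clip = slice_waveform(toy, sample_rate=4, start_s=2.0, end_s=5.0)

# Same keys the decoding stage is documented to write.
extra = {"audio": clip, "audio_sample_rate": 4}
```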

Classes
fastvideo.pipelines.basic.stable_audio.stages.decoding.StableAudioDecodingStage
StableAudioDecodingStage(vae)

Bases: PipelineStage

Decode latent → audio waveform + slice to [start, end].

Source code in fastvideo/pipelines/basic/stable_audio/stages/decoding.py
def __init__(self, vae) -> None:
    super().__init__()
    self.vae = vae
fastvideo.pipelines.basic.stable_audio.stages.denoising

Stable Audio denoising — k-diffusion dpmpp-3m-sde over the DiT.

CFG-batched conditioning is built once outside the sampler loop so the adapter only does cat([x, x]) + DiT call per step.
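
The idea can be sketched with plain lists standing in for tensors: the conditional and unconditional inputs are concatenated once, and each sampler step only duplicates x and blends the two halves of the output. `make_cfg_step` and `toy_model` are hypothetical; the ordering of the halves (conditional first) and the standard guidance formula uncond + scale * (cond - uncond) are assumptions about the actual adapter.

```python
def make_cfg_step(model, cond, uncond, cfg_scale):
    """Return a per-step function that applies classifier-free guidance."""
    # Built once, before the sampler loop starts.
    batched_cond = cond + uncond

    def step(x):
        n = len(x)
        # List analogue of cat([x, x]) followed by a single model call.
        out = model(x + x, batched_cond)
        out_cond, out_uncond = out[:n], out[n:]
        return [u + cfg_scale * (c - u) for c, u in zip(out_cond, out_uncond)]

    return step


def toy_model(xs, conds):
    # Hypothetical denoiser: output is simply input plus conditioning.
    return [x + c for x, c in zip(xs, conds)]


step = make_cfg_step(toy_model, cond=[0.5, 0.5], uncond=[0.0, 0.0], cfg_scale=7.0)
guided = step([1.0, 2.0])
```

Batching the two passes costs one model call per step at twice the batch size rather than two separate calls.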

Classes
fastvideo.pipelines.basic.stable_audio.stages.denoising.StableAudioDenoisingStage
StableAudioDenoisingStage(transformer)

Bases: PipelineStage

k-diffusion dpmpp-3m-sde sampling loop.

Source code in fastvideo/pipelines/basic/stable_audio/stages/denoising.py
def __init__(self, transformer) -> None:
    super().__init__()
    self.transformer = transformer
fastvideo.pipelines.basic.stable_audio.stages.latent_preparation

Stable Audio latent preparation.

Seeds + samples the initial Gaussian noise; encodes init_audio (A2A variation) or inpaint_audio + inpaint_mask (RePaint inpainting) into latent-space tensors on batch.extra for the denoising stage.
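
The seeding behaviour can be illustrated with the standard library alone. `sample_initial_noise` is a hypothetical helper, and the real stage presumably draws a tensor from a seeded framework generator; the point is only that a fixed seed reproduces the same (io_channels, latent_length) noise.

```python
import random


def sample_initial_noise(seed, io_channels=64, latent_length=1024):
    """Draw unit Gaussian noise of shape (io_channels, latent_length)."""
    rng = random.Random(seed)  # private generator, so global state is untouched
    return [[rng.gauss(0.0, 1.0) for _ in range(latent_length)]
            for _ in range(io_channels)]


noise_a = sample_initial_noise(seed=0, io_channels=2, latent_length=4)
noise_b = sample_initial_noise(seed=0, io_channels=2, latent_length=4)
noise_c = sample_initial_noise(seed=1, io_channels=2, latent_length=4)
```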

Classes
fastvideo.pipelines.basic.stable_audio.stages.latent_preparation.StableAudioLatentPreparationStage
StableAudioLatentPreparationStage(io_channels: int = 64, sample_size: int = 2097152, vae=None, sample_rate: int = 44100, audio_channels: int = 2)

Bases: PipelineStage

Source code in fastvideo/pipelines/basic/stable_audio/stages/latent_preparation.py
def __init__(self,
             io_channels: int = 64,
             sample_size: int = 2097152,
             vae=None,
             sample_rate: int = 44100,
             audio_channels: int = 2) -> None:
    super().__init__()
    self.io_channels = io_channels
    # Audio-domain length the model was trained for; latent length
    # = sample_size // vae.hop_length (= 2097152 / 2048 = 1024).
    self.sample_size = sample_size
    self.vae = vae  # used to encode init_audio / inpaint_audio
    self.sample_rate = sample_rate
    self.audio_channels = audio_channels