stages

Classes

fastvideo.pipelines.basic.stable_audio.stages.StableAudioConditioningStage

StableAudioConditioningStage(conditioner)

Bases: PipelineStage

Runs the conditioner over the prompt and duration, then stores the DiT-ready (cross_attn_cond, cross_attn_mask, global_embed) triple on batch.extra. When classifier-free guidance (CFG) is enabled, the negative-prompt triple is stored as well.

Source code in fastvideo/pipelines/basic/stable_audio/stages/conditioning.py
def __init__(self, conditioner) -> None:
    super().__init__()
    self.conditioner = conditioner

fastvideo.pipelines.basic.stable_audio.stages.StableAudioDecodingStage

StableAudioDecodingStage(vae)

Bases: PipelineStage

Decodes the latent into an audio waveform and slices it to [start, end].

Source code in fastvideo/pipelines/basic/stable_audio/stages/decoding.py
def __init__(self, vae) -> None:
    super().__init__()
    self.vae = vae

fastvideo.pipelines.basic.stable_audio.stages.StableAudioDenoisingStage

StableAudioDenoisingStage(transformer)

Bases: PipelineStage

k-diffusion dpmpp-3m-sde sampling loop.

Source code in fastvideo/pipelines/basic/stable_audio/stages/denoising.py
def __init__(self, transformer) -> None:
    super().__init__()
    self.transformer = transformer

fastvideo.pipelines.basic.stable_audio.stages.StableAudioLatentPreparationStage

StableAudioLatentPreparationStage(io_channels: int = 64, sample_size: int = 2097152, vae=None, sample_rate: int = 44100, audio_channels: int = 2)

Bases: PipelineStage

Source code in fastvideo/pipelines/basic/stable_audio/stages/latent_preparation.py
def __init__(self,
             io_channels: int = 64,
             sample_size: int = 2097152,
             vae=None,
             sample_rate: int = 44100,
             audio_channels: int = 2) -> None:
    super().__init__()
    self.io_channels = io_channels
    # Audio-domain length the model was trained for; latent length
    # = sample_size // vae.hop_length (= 2097152 / 2048 = 1024).
    self.sample_size = sample_size
    self.vae = vae  # used to encode init_audio / inpaint_audio
    self.sample_rate = sample_rate
    self.audio_channels = audio_channels
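
The arithmetic in the comment above can be checked directly. The following is a minimal sketch, assuming a VAE hop length of 2048 samples (the value the comment uses); `latent_length` and `max_duration_s` are hypothetical helper names for illustration, not part of FastVideo.

```python
def latent_length(sample_size: int = 2097152, hop_length: int = 2048) -> int:
    """Number of latent frames for a clip of `sample_size` audio samples."""
    # Assumption: the VAE downsamples by exactly `hop_length` samples per frame.
    return sample_size // hop_length


def max_duration_s(sample_size: int = 2097152, sample_rate: int = 44100) -> float:
    """Longest clip, in seconds, that fits in `sample_size` samples."""
    return sample_size / sample_rate


print(latent_length())             # 1024
print(round(max_duration_s(), 2))  # 47.55
```

With the defaults shown in the signature, the stage therefore works on 1024 latent frames, which corresponds to roughly 47.55 seconds of audio at 44100 Hz.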

Modules

fastvideo.pipelines.basic.stable_audio.stages.conditioning

Stable Audio conditioning stage.

Classes

fastvideo.pipelines.basic.stable_audio.stages.conditioning.StableAudioConditioningStage
StableAudioConditioningStage(conditioner)

Bases: PipelineStage

Runs the conditioner over the prompt and duration, then stores the DiT-ready (cross_attn_cond, cross_attn_mask, global_embed) triple on batch.extra. When classifier-free guidance (CFG) is enabled, the negative-prompt triple is stored as well.

Source code in fastvideo/pipelines/basic/stable_audio/stages/conditioning.py
def __init__(self, conditioner) -> None:
    super().__init__()
    self.conditioner = conditioner

fastvideo.pipelines.basic.stable_audio.stages.decoding

Stable Audio decoding: latent -> waveform via OobleckVAE.

Slices the output to [audio_start_in_s, audio_end_in_s] and stores the waveform on batch.extra["audio"] and its sample rate on batch.extra["audio_sample_rate"], where VideoGenerator._mux_audio picks them up.
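
To illustrate the slicing step, here is a minimal sketch of mapping a window in seconds to sample indices. It is not the stage's actual code: `slice_window` is a hypothetical helper, and the rounding and clamping behaviour are assumptions.

```python
from typing import Tuple


def slice_window(num_samples: int, sample_rate: int,
                 start_s: float, end_s: float) -> Tuple[int, int]:
    """Map a [start_s, end_s] window in seconds to clamped sample indices."""
    # Assumption: round to the nearest sample and clamp to the waveform bounds.
    start = max(0, int(round(start_s * sample_rate)))
    end = min(num_samples, int(round(end_s * sample_rate)))
    # Never return end < start, so the slice is empty rather than invalid.
    return start, max(start, end)


# Hypothetical decoded waveform: one channel of 2097152 samples.
waveform = [0.0] * 2097152
start, end = slice_window(len(waveform), 44100, 0.0, 10.0)
clip = waveform[start:end]
print(len(clip))  # 441000
```

A request that extends past the decoded length is clamped to the last available sample instead of raising an error.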

Classes

fastvideo.pipelines.basic.stable_audio.stages.decoding.StableAudioDecodingStage
StableAudioDecodingStage(vae)

Bases: PipelineStage

Decodes the latent into an audio waveform and slices it to [start, end].

Source code in fastvideo/pipelines/basic/stable_audio/stages/decoding.py
def __init__(self, vae) -> None:
    super().__init__()
    self.vae = vae

fastvideo.pipelines.basic.stable_audio.stages.denoising

Stable Audio denoising — k-diffusion dpmpp-3m-sde over the DiT.

The CFG-batched conditioning is built once, outside the sampler loop, so that at each step the adapter only needs a cat([x, x]) and a single DiT call.
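
The batching pattern described above can be sketched in plain Python, with nested lists standing in for tensors. This is an illustration under stated assumptions, not the module's implementation: `cfg_step` and `toy_model` are hypothetical names, the conditional half is assumed to come first in the batch, and the blend uses the standard classifier-free guidance formula `uncond + scale * (cond - uncond)`.

```python
from typing import Callable, List

Rows = List[List[float]]


def cfg_step(model: Callable[[Rows, Rows], Rows],
             x: Rows, cond_batched: Rows, cfg_scale: float) -> Rows:
    """One guided step: duplicate x, call the model once, blend the two halves."""
    n = len(x)
    # List concatenation plays the role of cat([x, x]) along the batch axis.
    out = model(x + x, cond_batched)
    # Assumption: conditional outputs come first, unconditional second.
    cond_out, uncond_out = out[:n], out[n:]
    return [[u + cfg_scale * (c - u) for c, u in zip(c_row, u_row)]
            for c_row, u_row in zip(cond_out, uncond_out)]


def toy_model(x: Rows, cond: Rows) -> Rows:
    # Stand-in for the DiT: adds the conditioning value to each element.
    return [[xi + ci for xi, ci in zip(x_row, c_row)]
            for x_row, c_row in zip(x, cond)]


x = [[1.0, 2.0]]
cond = [[0.5, 0.5]]            # positive-prompt conditioning
uncond = [[0.0, 0.0]]          # negative-prompt conditioning
cond_batched = cond + uncond   # built once, before the sampler loop

print(cfg_step(toy_model, x, cond_batched, cfg_scale=2.0))  # [[2.0, 3.0]]
```

Because `cond_batched` does not change between steps, only the latent needs to be duplicated inside the loop.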

Classes

fastvideo.pipelines.basic.stable_audio.stages.denoising.StableAudioDenoisingStage
StableAudioDenoisingStage(transformer)

Bases: PipelineStage

k-diffusion dpmpp-3m-sde sampling loop.

Source code in fastvideo/pipelines/basic/stable_audio/stages/denoising.py
def __init__(self, transformer) -> None:
    super().__init__()
    self.transformer = transformer

fastvideo.pipelines.basic.stable_audio.stages.latent_preparation

Stable Audio latent preparation.

Seeds and samples the initial Gaussian noise. When provided, it also encodes init_audio (audio-to-audio variation) or inpaint_audio plus inpaint_mask (RePaint-style inpainting) into latent-space tensors stored on batch.extra for the denoising stage.
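
Seeding matters because it makes the initial noise, and therefore the generated audio, reproducible. The sketch below shows the idea with the standard library only. It assumes a latent shape of (io_channels, latent_length) with the default 64 channels and 1024 frames; `sample_initial_noise` is a hypothetical helper, and the real stage presumably uses a tensor library's random generator rather than `random.Random`.

```python
import random
from typing import List


def sample_initial_noise(seed: int,
                         io_channels: int = 64,
                         latent_length: int = 1024) -> List[List[float]]:
    """Draw standard-normal noise of shape (io_channels, latent_length)."""
    rng = random.Random(seed)  # private generator, so global state is untouched
    return [[rng.gauss(0.0, 1.0) for _ in range(latent_length)]
            for _ in range(io_channels)]


a = sample_initial_noise(seed=0)
b = sample_initial_noise(seed=0)
c = sample_initial_noise(seed=1)

print(len(a), len(a[0]))  # 64 1024
print(a == b)             # True: same seed, same noise
print(a == c)             # False: different seed, different noise
```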

Classes

fastvideo.pipelines.basic.stable_audio.stages.latent_preparation.StableAudioLatentPreparationStage
StableAudioLatentPreparationStage(io_channels: int = 64, sample_size: int = 2097152, vae=None, sample_rate: int = 44100, audio_channels: int = 2)

Bases: PipelineStage

Source code in fastvideo/pipelines/basic/stable_audio/stages/latent_preparation.py
def __init__(self,
             io_channels: int = 64,
             sample_size: int = 2097152,
             vae=None,
             sample_rate: int = 44100,
             audio_channels: int = 2) -> None:
    super().__init__()
    self.io_channels = io_channels
    # Audio-domain length the model was trained for; latent length
    # = sample_size // vae.hop_length (= 2097152 / 2048 = 1024).
    self.sample_size = sample_size
    self.vae = vae  # used to encode init_audio / inpaint_audio
    self.sample_rate = sample_rate
    self.audio_channels = audio_channels