stages
¶
Classes¶
fastvideo.pipelines.basic.magi_human.stages.MagiHumanAudioDecodingStage
¶
MagiHumanAudioDecodingStage(audio_vae, time_stretching: float = _UPSTREAM_AUDIO_TIME_STRETCH)
Bases: PipelineStage
Decode batch.audio_latents to a waveform using Stable Audio's VAE.
The VAE is loaded lazily by SAAudioVAEModel.sa_audio_vae_model; the
first call triggers a snapshot_download, which requires an HF token and
accepted terms on stabilityai/stable-audio-open-1.0.
Source code in fastvideo/pipelines/basic/magi_human/stages/audio_decoding.py
fastvideo.pipelines.basic.magi_human.stages.MagiHumanDenoisingStage
¶
MagiHumanDenoisingStage(transformer, scheduler, patch_size: tuple[int, int, int] = (1, 2, 2), video_in_channels: int = 192, audio_in_channels: int = 64, video_txt_guidance_scale: float = 5.0, audio_txt_guidance_scale: float = 5.0, cfg_number: int = 2, coords_style: str = 'v2', video_guidance_high_t_threshold: int = 500, video_guidance_low_t_value: float = 2.0)
Bases: PipelineStage
UniPC-flow joint denoising with CFG=2 over (video, audio) latents.
Source code in fastvideo/pipelines/basic/magi_human/stages/denoising.py
fastvideo.pipelines.basic.magi_human.stages.MagiHumanLatentPreparationStage
¶
MagiHumanLatentPreparationStage(vae_stride: tuple[int, int, int] = (4, 16, 16), z_dim: int = 48, patch_size: tuple[int, int, int] = (1, 2, 2), fps: int = 25, t5_gemma_target_length: int = 640, coords_style: Literal['v1', 'v2'] = 'v2', text_offset: int = 0, audio_in_channels: int = 64)
Bases: PipelineStage
Prepare latents, coords, modality maps, and padded text embed.
Source code in fastvideo/pipelines/basic/magi_human/stages/latent_preparation.py
fastvideo.pipelines.basic.magi_human.stages.MagiHumanReferenceImageStage
¶
Bases: PipelineStage
Encode a TI2V reference image into the first-frame video latent.
Source code in fastvideo/pipelines/basic/magi_human/stages/reference_image.py
fastvideo.pipelines.basic.magi_human.stages.MagiHumanSRDenoisingStage
¶
MagiHumanSRDenoisingStage(transformer, scheduler, patch_size: tuple[int, int, int] = (1, 2, 2), video_in_channels: int = 192, audio_in_channels: int = 64, sr_num_inference_steps: int = 5, sr_video_txt_guidance_scale: float = 3.5, use_cfg_trick: bool = True, cfg_trick_start_frame: int = 13, cfg_trick_value: float = 2.0, cfg_number: int = 2, coords_style: str = 'v1')
Bases: PipelineStage
Denoise only the SR video latent; audio passes through unchanged.
Source code in fastvideo/pipelines/basic/magi_human/stages/sr_denoising.py
fastvideo.pipelines.basic.magi_human.stages.MagiHumanSRLatentPreparationStage
¶
MagiHumanSRLatentPreparationStage(vae: Any, vae_stride: tuple[int, int, int] = (4, 16, 16), patch_size: tuple[int, int, int] = (1, 2, 2), noise_value: int = 220, sr_audio_noise_scale: float = 0.7, sr_height: int = 512, sr_width: int = 896, vae_scale_factor: int = 16)
Bases: PipelineStage
Upsample base latents, add SR noise, and refresh SR conditioning.
Source code in fastvideo/pipelines/basic/magi_human/stages/sr_latent_preparation.py
Modules¶
fastvideo.pipelines.basic.magi_human.stages.audio_decoding
¶
Audio decoding stage for daVinci-MagiHuman.
Takes the denoised audio latent that MagiHumanDenoisingStage leaves
on batch.audio_latents and decodes it to a waveform using the
Stable Audio Open 1.0 VAE. Mirrors the upstream post-process path
(see MagiEvaluator.post_process in
daVinci-MagiHuman/inference/pipeline/video_generate.py:503):
    latent_audio.squeeze(0)                        # (L, C_latent)
    audio = self.audio_vae.decode(latent_audio.T)  # (1, audio_ch, samples)
    audio = audio.squeeze(0).T.cpu().numpy()       # (samples, audio_ch)
    audio = resample_audio_sinc(audio, _UPSTREAM_AUDIO_TIME_STRETCH)
The stage stores the resampled waveform on batch.extra["audio"]
(shape [samples, audio_channels]) and the sample rate on
batch.extra["audio_sample_rate"]. FastVideo's VideoGenerator._mux_audio
then reads both, writes a temporary WAV file, and muxes it into the output MP4
via PyAV, the same plumbing that LTX-2 and Stable Audio use.
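To make the [samples, audio_channels] layout concrete, here is a minimal sketch of writing such a waveform to a 16-bit WAV file using only the Python standard library. It is not the code in VideoGenerator._mux_audio: the helper write_wav, the dictionary standing in for batch.extra, and the 44100 Hz sample rate are all illustrative assumptions.

```python
import os
import struct
import tempfile
import wave


def write_wav(path, waveform, sample_rate):
    """Write a [samples, channels] float waveform in [-1, 1] as 16-bit PCM."""
    num_channels = len(waveform[0])
    frames = bytearray()
    for sample in waveform:
        for value in sample:
            clipped = max(-1.0, min(1.0, value))  # guard against overshoot
            frames += struct.pack("<h", int(round(clipped * 32767)))
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(num_channels)
        wav_file.setsampwidth(2)  # 2 bytes = 16-bit samples
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(bytes(frames))


# Hypothetical stand-in for batch.extra after the stage has run.
extra = {
    "audio": [[0.0, 0.0], [0.5, -0.5], [1.0, -1.0], [0.0, 0.0]],
    "audio_sample_rate": 44100,  # placeholder value, not read from the source
}

wav_path = os.path.join(tempfile.mkdtemp(), "audio.wav")
write_wav(wav_path, extra["audio"], extra["audio_sample_rate"])
```

Samples are interleaved per frame, which is why the outer loop runs over samples and the inner loop over channels.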
Classes¶
fastvideo.pipelines.basic.magi_human.stages.audio_decoding.MagiHumanAudioDecodingStage
¶
MagiHumanAudioDecodingStage(audio_vae, time_stretching: float = _UPSTREAM_AUDIO_TIME_STRETCH)
Bases: PipelineStage
Decode batch.audio_latents to a waveform using Stable Audio's VAE.
The VAE is loaded lazily by SAAudioVAEModel.sa_audio_vae_model; the
first call triggers a snapshot_download, which requires an HF token and
accepted terms on stabilityai/stable-audio-open-1.0.
Source code in fastvideo/pipelines/basic/magi_human/stages/audio_decoding.py
fastvideo.pipelines.basic.magi_human.stages.denoising
¶
Joint-modality denoising stage for daVinci-MagiHuman base text-to-AV.
Runs the FlowUniPC denoise loop with CFG=2 over video + audio latents
jointly. Text embeddings have already been padded or trimmed to t5_gemma_target_length
by MagiHumanLatentPreparationStage; the original context lengths are
stashed on the batch as magi_original_text_lens / magi_original_neg_text_lens.
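As a rough illustration of the guidance arithmetic, the sketch below applies standard classifier-free guidance to one conditional and one unconditional prediction (which is presumably what CFG=2 / cfg_number = 2 refers to). The timestep rule is an assumption inferred from the constructor defaults video_txt_guidance_scale=5.0, video_guidance_high_t_threshold=500, and video_guidance_low_t_value=2.0; the function names are hypothetical and the real stage operates on tensors.

```python
def guidance_scale_for_timestep(t, high_scale=5.0, high_t_threshold=500, low_t_value=2.0):
    """Assumed rule: full scale while t is above the threshold, low_t_value otherwise."""
    return high_scale if t > high_t_threshold else low_t_value


def cfg_combine(cond, uncond, scale):
    """Standard classifier-free guidance on flat lists of floats."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


cond = [1.0, 2.0]     # prediction with the text prompt
uncond = [0.5, 1.0]   # prediction with the negative / empty prompt

early = cfg_combine(cond, uncond, guidance_scale_for_timestep(900))  # noisy step
late = cfg_combine(cond, uncond, guidance_scale_for_timestep(100))   # near-clean step
```

With these toy values, the early step extrapolates further from the unconditional prediction than the late step does.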
Classes¶
fastvideo.pipelines.basic.magi_human.stages.denoising.MagiHumanDenoisingStage
¶
MagiHumanDenoisingStage(transformer, scheduler, patch_size: tuple[int, int, int] = (1, 2, 2), video_in_channels: int = 192, audio_in_channels: int = 64, video_txt_guidance_scale: float = 5.0, audio_txt_guidance_scale: float = 5.0, cfg_number: int = 2, coords_style: str = 'v2', video_guidance_high_t_threshold: int = 500, video_guidance_low_t_value: float = 2.0)
Bases: PipelineStage
UniPC-flow joint denoising with CFG=2 over (video, audio) latents.
Source code in fastvideo/pipelines/basic/magi_human/stages/denoising.py
fastvideo.pipelines.basic.magi_human.stages.latent_preparation
¶
Latent preparation stage for daVinci-MagiHuman base text-to-AV.
Produces:
- a random video latent of shape [1, z_dim, latent_T, latent_H, latent_W],
- a random audio latent of shape [1, num_frames, 64] (the DiT jointly denoises both modalities),
- a padded T5-Gemma text embedding (target length 640) plus the original (pre-pad) context length, which the UniPC + CFG loop needs so the unconditional path sees the same padded length.
Also builds the per-token coords / modality map that the DiT consumes
(replicates the reference MagiDataProxy.process_input).
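The shapes above can be worked through with a little arithmetic. The sketch below uses the constructor defaults (vae_stride=(4, 16, 16), z_dim=48, patch_size=(1, 2, 2)); the temporal formula (num_frames - 1) // stride + 1 is an assumption based on the usual causal-VAE convention, and the function names and the 101-frame, 256x448 input are illustrative only.

```python
import math


def video_latent_shape(num_frames, height, width, vae_stride=(4, 16, 16), z_dim=48):
    """Assumed causal-VAE sizing: first frame kept, then one latent frame per stride."""
    st, sh, sw = vae_stride
    latent_t = (num_frames - 1) // st + 1
    return (1, z_dim, latent_t, height // sh, width // sw)


def video_token_count(latent_shape, patch_size=(1, 2, 2)):
    """Number of video tokens after patchifying the latent."""
    _, _, t, h, w = latent_shape
    pt, ph, pw = patch_size
    return (t // pt) * (h // ph) * (w // pw)


def video_token_channels(z_dim=48, patch_size=(1, 2, 2)):
    """Channels per video token: every latent value inside one patch."""
    return z_dim * math.prod(patch_size)


shape = video_latent_shape(num_frames=101, height=256, width=448)
tokens = video_token_count(shape)
channels = video_token_channels()
```

Note that 48 * (1 * 2 * 2) = 192, which matches the video_in_channels default of the denoising stages.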
Classes¶
fastvideo.pipelines.basic.magi_human.stages.latent_preparation.MagiHumanLatentPreparationStage
¶
MagiHumanLatentPreparationStage(vae_stride: tuple[int, int, int] = (4, 16, 16), z_dim: int = 48, patch_size: tuple[int, int, int] = (1, 2, 2), fps: int = 25, t5_gemma_target_length: int = 640, coords_style: Literal['v1', 'v2'] = 'v2', text_offset: int = 0, audio_in_channels: int = 64)
Bases: PipelineStage
Prepare latents, coords, modality maps, and padded text embed.
Source code in fastvideo/pipelines/basic/magi_human/stages/latent_preparation.py
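For the padded text embedding, a minimal sketch of padding or trimming to a fixed length while remembering the original length might look like this. Zero padding and the min(...) treatment of over-long inputs are assumptions, and pad_or_trim is a hypothetical helper on plain lists rather than the stage's tensor code.

```python
def pad_or_trim(embedding, target_length=640):
    """Return (fixed_length_embedding, original_length) for a list of token vectors."""
    original_length = len(embedding)
    if original_length >= target_length:
        # Over-long input: keep the first target_length tokens.
        return embedding[:target_length], target_length
    width = len(embedding[0]) if embedding else 0
    padding = [[0.0] * width for _ in range(target_length - original_length)]
    return embedding + padding, original_length


short = [[1.0, 2.0, 3.0, 4.0] for _ in range(3)]
padded, kept = pad_or_trim(short)

long_input = [[1.0] for _ in range(700)]
trimmed, kept_long = pad_or_trim(long_input)
```

Keeping the pre-pad length alongside the padded tensor lets later stages distinguish real tokens from padding.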
fastvideo.pipelines.basic.magi_human.stages.latent_preparation.StaticPackedInputs
¶
StaticPackedInputs(video_tokens: Tensor, audio_tokens: Tensor, video_coords: Tensor, audio_coords: Tensor, video_mm: Tensor, audio_mm: Tensor, max_ch: int)
Step-invariant packed inputs: video+audio tokens, coords, modality map.
Computed once before the denoise loop; reused for every cond/uncond call. Text tokens are NOT included here because cond/uncond have different lengths.
Source code in fastvideo/pipelines/basic/magi_human/stages/latent_preparation.py
fastvideo.pipelines.basic.magi_human.stages.latent_preparation.StaticPackedLayout
¶
StaticPackedLayout(video_coords: Tensor, audio_coords: Tensor, video_mm: Tensor, audio_mm: Tensor, max_ch: int, video_token_num: int, audio_feat_len: int)
Step- and value-invariant portion of the static packed inputs.
Coords, modality maps, and the channel-padding width depend only on the latent shape, audio length, channel widths, and patch sizes — all fixed for a single generation. Precompute once before the denoise loop and reuse on every step. Only the per-step token tensors must be rebuilt.
Source code in fastvideo/pipelines/basic/magi_human/stages/latent_preparation.py
Functions¶
fastvideo.pipelines.basic.magi_human.stages.latent_preparation.assemble_packed_inputs
¶
assemble_packed_inputs(static: StaticPackedInputs, txt_feat: Tensor, txt_feat_len: int, coords_style: Literal['v1', 'v2'] = 'v2', text_offset: int = 0) -> tuple[Tensor, Tensor, Tensor]
Attach per-call text tokens to the precomputed static packed inputs.
Returns (token_seq, coords, modality_map) ready for the DiT.
Source code in fastvideo/pipelines/basic/magi_human/stages/latent_preparation.py
fastvideo.pipelines.basic.magi_human.stages.latent_preparation.build_packed_inputs
¶
build_packed_inputs(video_latent: Tensor, audio_latent: Tensor, audio_feat_len: int, txt_feat: Tensor, txt_feat_len: int, patch_size: tuple[int, int, int], coords_style: Literal['v1', 'v2'] = 'v2', text_offset: int = 0) -> tuple[Tensor, Tensor, Tensor]
Build the full packed token stream in one call (backwards-compat wrapper).
Equivalent to assemble_packed_inputs(build_static_packed_inputs(...), ...). Prefer calling the two helpers separately when the static portion can be reused across multiple calls (e.g. cond/uncond in the denoise loop).
Source code in fastvideo/pipelines/basic/magi_human/stages/latent_preparation.py
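The reuse pattern described above can be sketched with toy stand-ins. None of the functions below are the real helpers: build_static_stub, assemble_stub, and build_all_stub are hypothetical, operate on lists of strings, and assume a video, audio, text token order purely for illustration.

```python
build_calls = {"static": 0}


def build_static_stub(video_tokens, audio_tokens):
    """Stand-in for the text-independent packing step."""
    build_calls["static"] += 1
    return {"video": video_tokens, "audio": audio_tokens}


def assemble_stub(static, text_tokens):
    """Stand-in for attaching per-call text tokens to the static part."""
    return static["video"] + static["audio"] + text_tokens


def build_all_stub(video_tokens, audio_tokens, text_tokens):
    """One-shot wrapper that composes the two helpers."""
    return assemble_stub(build_static_stub(video_tokens, audio_tokens), text_tokens)


# Build the static part once, then attach different text for cond and uncond.
static = build_static_stub(["v0", "v1"], ["a0"])
cond_seq = assemble_stub(static, ["t0", "t1", "t2"])
uncond_seq = assemble_stub(static, ["n0"])
static_builds_for_two_calls = build_calls["static"]
```

Calling the one-shot wrapper twice would have run the static step twice; splitting it runs that step once for both sequences.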
fastvideo.pipelines.basic.magi_human.stages.latent_preparation.build_static_packed_inputs
¶
build_static_packed_inputs(video_latent: Tensor, audio_latent: Tensor, audio_feat_len: int, patch_size: tuple[int, int, int], coords_style: Literal['v1', 'v2'] = 'v2', layout: StaticPackedLayout | None = None) -> StaticPackedInputs
Build the step-invariant portion of the packed token stream.
Returns video+audio tokens (padded to a common channel width), their coords, and their modality slices. Text is excluded because cond/uncond differ in length; call assemble_packed_inputs to attach text per call.
Mirrors SingleData.token_sequence / coords_mapping / modality_mapping in inference/pipeline/data_proxy.py, minus the text portion.
When layout is provided, coords / modality maps / max_ch are taken
from the precomputed values and only the per-step token tensors are
rebuilt; this is the hot-path call from the denoising loop. When
layout is None the function recomputes everything from scratch
(e.g. for one-shot tests via build_packed_inputs).
Source code in fastvideo/pipelines/basic/magi_human/stages/latent_preparation.py
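Padding to a common channel width can be illustrated on plain lists. In this sketch the widths 192 and 64 come from the video_in_channels and audio_in_channels defaults; zero padding on the right, the pad_channels helper, and the 0/1 modality ids are assumptions made for the example.

```python
def pad_channels(tokens, width):
    """Right-pad each token vector with zeros up to `width` channels."""
    return [row + [0.0] * (width - len(row)) for row in tokens]


video_tokens = [[1.0] * 192 for _ in range(4)]  # 4 video tokens, 192 channels
audio_tokens = [[2.0] * 64 for _ in range(3)]   # 3 audio tokens, 64 channels

max_ch = max(len(video_tokens[0]), len(audio_tokens[0]))
packed = pad_channels(video_tokens, max_ch) + pad_channels(audio_tokens, max_ch)

# Assumed modality-map convention: 0 for video tokens, 1 for audio tokens.
modality_map = [0] * len(video_tokens) + [1] * len(audio_tokens)
```

After padding, both modalities can share one rectangular token sequence even though their native channel counts differ.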
fastvideo.pipelines.basic.magi_human.stages.latent_preparation.precompute_static_packed_layout
¶
precompute_static_packed_layout(latent_shape: tuple[int, int, int, int, int], audio_feat_len: int, z_dim: int, audio_in_channels: int, patch_size: tuple[int, int, int], coords_style: Literal['v1', 'v2'] = 'v2', device: device | None = None, dtype: dtype = float32) -> StaticPackedLayout
Precompute the invariant fields used by build_static_packed_inputs.
Arguments are derived from configs and latent shape — none depend on
the current denoising-step values. Call this once in the latent
preparation stage (or any pre-loop site) and pass the result via the
layout= arg of build_static_packed_inputs to skip the
meshgrid/full() work on every step.
Source code in fastvideo/pipelines/basic/magi_human/stages/latent_preparation.py
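To show why the layout can be computed once, the sketch below builds a grid of (t, h, w) patch indices that depends only on the latent shape and patch size. The coordinate format actually used by the 'v1' and 'v2' styles is not documented here, so video_patch_coords is a hypothetical simplification.

```python
from itertools import product


def video_patch_coords(latent_t, latent_h, latent_w, patch_size=(1, 2, 2)):
    """List (t, h, w) indices for every video patch in row-major order."""
    pt, ph, pw = patch_size
    return list(product(range(latent_t // pt), range(latent_h // ph), range(latent_w // pw)))


# Computed once before the loop...
coords = video_patch_coords(latent_t=2, latent_h=4, latent_w=6)

# ...then the same object is reused at every denoising step.
per_step_coords = [coords for _ in range(3)]
```

Nothing in the grid depends on latent values, so rebuilding it on each step would only repeat identical work.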
fastvideo.pipelines.basic.magi_human.stages.latent_preparation.unpack_tokens
¶
unpack_tokens(output: Tensor, video_token_num: int, audio_feat_len: int, video_in_channels: int, audio_in_channels: int, latent_shape: tuple[int, int, int, int, int], patch_size: tuple[int, int, int]) -> tuple[Tensor, Tensor]
Inverse of build_packed_inputs for the DiT output.
Splits the flat output back into a video latent (unpatchified to shape [B, C, T, H, W]) and an audio latent of shape [B, L, 64].
Source code in fastvideo/pipelines/basic/magi_human/stages/latent_preparation.py
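The splitting half of this operation can be sketched on plain lists. The example assumes video rows come first, audio rows follow, and any trailing rows (such as text) are discarded; it also drops channel padding but omits the reshape back to [B, C, T, H, W]. The helper split_output and the tiny channel counts are illustrative only.

```python
def split_output(output, video_token_num, audio_feat_len, video_in_channels, audio_in_channels):
    """Split a [tokens, channels] list into video and audio rows, dropping channel padding."""
    video = [row[:video_in_channels] for row in output[:video_token_num]]
    audio_rows = output[video_token_num:video_token_num + audio_feat_len]
    audio = [row[:audio_in_channels] for row in audio_rows]
    return video, audio


output = [
    [1, 1, 1, 1],  # video token 0
    [2, 2, 2, 2],  # video token 1
    [3, 3, 3, 3],  # video token 2
    [4, 4, 0, 0],  # audio token 0 (last two channels are padding)
    [5, 5, 0, 0],  # audio token 1
    [9, 9, 9, 9],  # trailing row, ignored
]
video, audio = split_output(output, 3, 2, video_in_channels=4, audio_in_channels=2)
```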
fastvideo.pipelines.basic.magi_human.stages.reference_image
¶
Reference-image encoding for MagiHuman TI2V.
The upstream daVinci-MagiHuman TI2V path encodes the user image through the
Wan VAE and overwrites the first denoising latent frame with that clean latent
at every step. This stage mirrors MagiEvaluator.encode_image and stashes the
normalized latent on batch.image_latent for the latent-prep and denoise stages.
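The "overwrite at every step" behaviour can be illustrated with a toy loop. Here the latent is a list of frames, the halving step is a stand-in for a real denoising update, and pin_first_frame is a hypothetical helper rather than the stage's code.

```python
def pin_first_frame(video_latent, image_latent):
    """Return a copy of a [T][...] latent whose frame 0 is the clean image latent."""
    return [image_latent] + list(video_latent[1:])


image_latent = [9.0, 9.0]                       # clean reference-frame latent
latent = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # noisy latent with 3 frames

for step in range(4):
    latent = pin_first_frame(latent, image_latent)
    # Stand-in for one denoising update, which would otherwise drift frame 0.
    latent = [[value * 0.5 for value in frame] for frame in latent]

# Pin once more so the final latent carries the clean frame.
latent = pin_first_frame(latent, image_latent)
```

Re-pinning inside the loop means every model call sees the clean reference frame, regardless of what the previous update did to it.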
Classes¶
fastvideo.pipelines.basic.magi_human.stages.reference_image.MagiHumanReferenceImageStage
¶
Bases: PipelineStage
Encode a TI2V reference image into the first-frame video latent.
Source code in fastvideo/pipelines/basic/magi_human/stages/reference_image.py
fastvideo.pipelines.basic.magi_human.stages.sr_denoising
¶
SR video-only denoising stage for daVinci-MagiHuman SR-540p.
Classes¶
fastvideo.pipelines.basic.magi_human.stages.sr_denoising.MagiHumanSRDenoisingStage
¶
MagiHumanSRDenoisingStage(transformer, scheduler, patch_size: tuple[int, int, int] = (1, 2, 2), video_in_channels: int = 192, audio_in_channels: int = 64, sr_num_inference_steps: int = 5, sr_video_txt_guidance_scale: float = 3.5, use_cfg_trick: bool = True, cfg_trick_start_frame: int = 13, cfg_trick_value: float = 2.0, cfg_number: int = 2, coords_style: str = 'v1')
Bases: PipelineStage
Denoise only the SR video latent; audio passes through unchanged.
Source code in fastvideo/pipelines/basic/magi_human/stages/sr_denoising.py
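The meaning of use_cfg_trick, cfg_trick_start_frame, and cfg_trick_value is not spelled out on this page. One plausible reading, sketched below purely as an assumption, is that latent frames at or after the start frame use a lower guidance scale than sr_video_txt_guidance_scale. The function per_frame_guidance is hypothetical.

```python
def per_frame_guidance(num_latent_frames, scale=3.5, use_cfg_trick=True,
                       start_frame=13, trick_value=2.0):
    """Assumed rule: frames at or after start_frame use the lower of scale and trick_value."""
    scales = [scale] * num_latent_frames
    if use_cfg_trick:
        for frame in range(start_frame, num_latent_frames):
            scales[frame] = min(trick_value, scale)
    return scales


with_trick = per_frame_guidance(16)
without_trick = per_frame_guidance(16, use_cfg_trick=False)
short_clip = per_frame_guidance(10)
```

Under this reading, clips shorter than the start frame are unaffected by the trick.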
fastvideo.pipelines.basic.magi_human.stages.sr_latent_preparation
¶
Super-resolution latent preparation for daVinci-MagiHuman SR-540p.
Classes¶
fastvideo.pipelines.basic.magi_human.stages.sr_latent_preparation.MagiHumanSRLatentPreparationStage
¶
MagiHumanSRLatentPreparationStage(vae: Any, vae_stride: tuple[int, int, int] = (4, 16, 16), patch_size: tuple[int, int, int] = (1, 2, 2), noise_value: int = 220, sr_audio_noise_scale: float = 0.7, sr_height: int = 512, sr_width: int = 896, vae_scale_factor: int = 16)
Bases: PipelineStage
Upsample base latents, add SR noise, and refresh SR conditioning.
Source code in fastvideo/pipelines/basic/magi_human/stages/sr_latent_preparation.py
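As a small worked example of the upsampling target, the sketch below derives a latent height and width from the sr_height, sr_width, and vae_scale_factor defaults and resizes a toy 2-D grid. Dividing the pixel size by the scale factor is an assumption, and nearest-neighbour interpolation is used only for brevity; the interpolation mode used by the stage is not stated here.

```python
def sr_latent_hw(sr_height=512, sr_width=896, vae_scale_factor=16):
    """Assumed sizing: SR latent height/width are the pixel sizes over the VAE scale."""
    return sr_height // vae_scale_factor, sr_width // vae_scale_factor


def upsample_nearest(grid, out_h, out_w):
    """Nearest-neighbour resize of a 2-D list (chosen for brevity only)."""
    in_h, in_w = len(grid), len(grid[0])
    return [
        [grid[y * in_h // out_h][x * in_w // out_w] for x in range(out_w)]
        for y in range(out_h)
    ]


target_hw = sr_latent_hw()
base = [[1.0, 2.0], [3.0, 4.0]]
up = upsample_nearest(base, 4, 4)
```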
fastvideo.pipelines.basic.magi_human.stages.sr_latent_preparation.ZeroSNRDDPMDiscretization
¶
ZeroSNRDDPMDiscretization(linear_start: float = 0.00085, linear_end: float = 0.012, num_timesteps: int = 1000, shift_scale: float = 1.0, keep_start: bool = False, post_shift: bool = False)
Upstream ZeroSNR schedule used to corrupt interpolated SR latents.
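For orientation, here is a sketch of the generic zero-terminal-SNR rescaling that schedules of this name usually implement. It assumes "scaled linear" betas between linear_start and linear_end, ignores the shift_scale, keep_start, and post_shift options, and is not taken from the class's source; the function names are hypothetical.

```python
import math


def zero_snr_sqrt_alphas_cumprod(linear_start=0.00085, linear_end=0.012, num_timesteps=1000):
    """Return sqrt(alpha_bar_t) for each timestep, rescaled so the last entry is 0."""
    # Assumed "scaled linear" betas: linear in sqrt(beta), then squared.
    lo, hi = math.sqrt(linear_start), math.sqrt(linear_end)
    betas = [(lo + (hi - lo) * i / (num_timesteps - 1)) ** 2 for i in range(num_timesteps)]

    sqrt_acp = []
    running = 1.0
    for beta in betas:
        running *= 1.0 - beta
        sqrt_acp.append(math.sqrt(running))

    # Shift so the last value is exactly 0, then rescale so the first is unchanged.
    first, last = sqrt_acp[0], sqrt_acp[-1]
    return [(value - last) * first / (first - last) for value in sqrt_acp]


def corrupt(x, noise, sqrt_acp_t):
    """Mix a clean value with noise at the given sqrt(alpha_bar_t)."""
    return sqrt_acp_t * x + math.sqrt(1.0 - sqrt_acp_t ** 2) * noise


schedule = zero_snr_sqrt_alphas_cumprod()
fully_noised = corrupt(1.0, 5.0, schedule[-1])
```

At the final timestep the signal coefficient is zero, so the corrupted value is pure noise, which is the property the "zero SNR" name refers to.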