magi_human
¶
Modules¶
fastvideo.pipelines.basic.magi_human.magi_human_pipeline
¶
MagiHuman base text-to-AV pipeline.
Top-level composition for the daVinci-MagiHuman base model. Wires:
InputValidationStage -> TextEncodingStage (T5-Gemma)
-> MagiHumanLatentPreparationStage
-> MagiHumanDenoisingStage
-> DecodingStage (Wan 2.2 TI2V-5B VAE decode for video)
-> MagiHumanAudioDecodingStage (Stable Audio Open 1.0 VAE decode)
The base checkpoint is a joint audio-visual generator; both the video and audio paths run in the denoising loop and both are decoded.
load_modules is overridden so the four cross-variant shared components
(text_encoder, tokenizer, audio_vae, video vae) lazy-load from their
canonical upstream HF repos at first build time instead of being
bundled inside every converted MagiHuman variant. This keeps each
variant's converted repo at ~5-30 GB (transformer + scheduler +
model_index.json) instead of ~30-55 GB, and lets all variants share
the same ~25 GB of cached upstream weights.
Classes¶
fastvideo.pipelines.basic.magi_human.magi_human_pipeline.MagiHumanI2VPipeline
¶
MagiHumanI2VPipeline(model_path: str, fastvideo_args: FastVideoArgs | TrainingArgs, required_config_modules: list[str] | None = None, loaded_modules: dict[str, Module] | None = None)
Bases: MagiHumanPipeline
MagiHuman text+image-to-AV pipeline using the T2V DiT weights.
Source code in fastvideo/pipelines/composed_pipeline_base.py
fastvideo.pipelines.basic.magi_human.magi_human_pipeline.MagiHumanPipeline
¶
MagiHumanPipeline(model_path: str, fastvideo_args: FastVideoArgs | TrainingArgs, required_config_modules: list[str] | None = None, loaded_modules: dict[str, Module] | None = None)
Bases: ComposedPipelineBase
Base MagiHuman text-to-AV pipeline (no LoRA, no distill, no SR).
Source code in fastvideo/pipelines/composed_pipeline_base.py
Functions¶
fastvideo.pipelines.basic.magi_human.magi_human_pipeline.MagiHumanPipeline.load_modules
¶
load_modules(fastvideo_args: FastVideoArgs, loaded_modules: dict[str, Any] | None = None) -> dict[str, Any]
Load the variant-specific transformer + scheduler from the converted MagiHuman repo and lazy-load the four cross-variant shared components from their canonical upstream HF repos:
* text_encoder, tokenizer -> google/t5gemma-9b-9b-ul2
(gated, requires HF token with accepted terms of use)
* audio_vae -> stabilityai/stable-audio-open-1.0 (gated)
* vae -> Wan-AI/Wan2.2-TI2V-5B-Diffusers
Backwards-compatible with bundled converted repos: if any of
these subfolders is present locally and listed in
model_index.json, the standard component loader picks it up
via super(). Otherwise the loader is told to skip the entry and
we lazy-load it here.
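The bundled-versus-upstream fallback described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual body of load_modules: the `UPSTREAM_REPOS` dict and `resolve_component_sources` are hypothetical names, and the sketch only decides where each component would come from, without downloading anything. The repo IDs are the ones listed in the docstring.

```python
import json
import tempfile
from pathlib import Path

# Repo IDs are the ones named in the docstring above; the dict and the
# function below are hypothetical names used only for this illustration.
UPSTREAM_REPOS = {
    "text_encoder": "google/t5gemma-9b-9b-ul2",
    "tokenizer": "google/t5gemma-9b-9b-ul2",
    "audio_vae": "stabilityai/stable-audio-open-1.0",
    "vae": "Wan-AI/Wan2.2-TI2V-5B-Diffusers",
}


def resolve_component_sources(model_path: str) -> dict[str, str]:
    """Decide, per shared component, whether to use a bundled copy or the hub."""
    root = Path(model_path)
    index_file = root / "model_index.json"
    index = json.loads(index_file.read_text()) if index_file.exists() else {}
    sources = {}
    for name, repo_id in UPSTREAM_REPOS.items():
        # Bundled only if the subfolder exists AND model_index.json lists it.
        bundled = name in index and (root / name).is_dir()
        sources[name] = "local" if bundled else f"hub:{repo_id}"
    return sources


# Usage: a toy converted repo that bundles only the video VAE.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "vae").mkdir()
    (Path(tmp) / "model_index.json").write_text(
        json.dumps({"transformer": ["placeholder"], "vae": ["placeholder"]}))
    demo_sources = resolve_component_sources(tmp)
```

In this toy repo only `vae` resolves to the local copy; the other three components fall back to their upstream repos.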
Source code in fastvideo/pipelines/basic/magi_human/magi_human_pipeline.py
fastvideo.pipelines.basic.magi_human.magi_human_pipeline.MagiHumanSR1080pI2VPipeline
¶
MagiHumanSR1080pI2VPipeline(model_path: str, fastvideo_args: FastVideoArgs | TrainingArgs, required_config_modules: list[str] | None = None, loaded_modules: dict[str, Module] | None = None)
Bases: MagiHumanSRI2VPipeline
Two-stage MagiHuman base + SR-1080p text+image-to-AV pipeline.
Source code in fastvideo/pipelines/composed_pipeline_base.py
fastvideo.pipelines.basic.magi_human.magi_human_pipeline.MagiHumanSR1080pPipeline
¶
MagiHumanSR1080pPipeline(model_path: str, fastvideo_args: FastVideoArgs | TrainingArgs, required_config_modules: list[str] | None = None, loaded_modules: dict[str, Module] | None = None)
Bases: MagiHumanSRPipeline
Two-stage MagiHuman base + SR-1080p text-to-AV pipeline.
The stage chain is identical to SR-540p. The paired pipeline config enables block-sparse local-window attention on 32 SR-DiT layers and requests the 1080p latent target.
Source code in fastvideo/pipelines/composed_pipeline_base.py
fastvideo.pipelines.basic.magi_human.magi_human_pipeline.MagiHumanSRI2VPipeline
¶
MagiHumanSRI2VPipeline(model_path: str, fastvideo_args: FastVideoArgs | TrainingArgs, required_config_modules: list[str] | None = None, loaded_modules: dict[str, Module] | None = None)
Bases: MagiHumanSRPipeline
Two-stage MagiHuman base + SR-540p text+image-to-AV pipeline.
Source code in fastvideo/pipelines/composed_pipeline_base.py
fastvideo.pipelines.basic.magi_human.magi_human_pipeline.MagiHumanSRPipeline
¶
MagiHumanSRPipeline(model_path: str, fastvideo_args: FastVideoArgs | TrainingArgs, required_config_modules: list[str] | None = None, loaded_modules: dict[str, Module] | None = None)
Bases: MagiHumanPipeline
Two-stage MagiHuman base + SR-540p text-to-AV pipeline.
Source code in fastvideo/pipelines/composed_pipeline_base.py
Functions¶
fastvideo.pipelines.basic.magi_human.pipeline_configs
¶
PipelineConfig for the daVinci-MagiHuman base text-to-AV pipeline.
Classes¶
fastvideo.pipelines.basic.magi_human.pipeline_configs.MagiHumanBaseConfig
dataclass
¶
MagiHumanBaseConfig(model_path: str = '', pipeline_config_path: str | None = None, embedded_cfg_scale: float = 6.0, flow_shift: float | None = 5.0, flow_shift_sr: float | None = None, disable_autocast: bool = False, is_causal: bool = False, dit_config: DiTConfig = MagiHumanVideoConfig(), dit_precision: str = 'bf16', upsampler_config: UpsamplerConfig = UpsamplerConfig(), upsampler_precision: str = 'fp32', vae_config: VAEConfig = WanVAEConfig(), vae_precision: str = 'fp32', vae_tiling: bool = False, vae_sp: bool = False, image_encoder_config: EncoderConfig = EncoderConfig(), image_encoder_precision: str = 'fp32', text_encoder_configs: tuple[EncoderConfig, ...] = (lambda: (T5GemmaEncoderConfig(),))(), text_encoder_precisions: tuple[str, ...] = (lambda: ('bf16',))(), preprocess_text_funcs: tuple[Callable[[str], str], ...] = (lambda: (preprocess_text,))(), postprocess_text_funcs: tuple[Callable[[BaseEncoderOutput], Tensor], ...] = (lambda: (t5gemma_postprocess_text,))(), dmd_denoising_steps: list[int] | None = None, ti2v_task: bool = False, boundary_ratio: float | None = None, audio_vae_config: VAEConfig = OobleckVAEConfig(), precision: str = 'bf16', t5_gemma_target_length: int = 640, fps: int = 25, num_inference_steps: int = 32, video_txt_guidance_scale: float = 5.0, audio_txt_guidance_scale: float = 5.0, cfg_number: int = 2, vae_stride: tuple[int, int, int] = (4, 16, 16), z_dim: int = 48, frame_receptive_field: int = 11, coords_style: str = 'v2', ref_audio_offset: int = 1000, text_offset: int = 0, video_guidance_high_t_threshold: int = 500, video_guidance_low_t_value: float = 2.0)
Bases: PipelineConfig
Base MagiHuman text-to-AV pipeline config (prompt → video + audio).
MagiHuman's base model is a joint audio-visual generator. This config
wires up both the video VAE (Wan 2.2 TI2V-5B) and the audio VAE
(Stable Audio Open 1.0); the pipeline produces an mp4 with a muxed
audio track. The framework's WorkloadType enum has no T2AV
variant yet, so the registry entry uses WorkloadType.T2V as a
placeholder.
fastvideo.pipelines.basic.magi_human.pipeline_configs.MagiHumanBaseI2VConfig
dataclass
¶
MagiHumanBaseI2VConfig(model_path: str = '', pipeline_config_path: str | None = None, embedded_cfg_scale: float = 6.0, flow_shift: float | None = 5.0, flow_shift_sr: float | None = None, disable_autocast: bool = False, is_causal: bool = False, dit_config: DiTConfig = MagiHumanVideoConfig(), dit_precision: str = 'bf16', upsampler_config: UpsamplerConfig = UpsamplerConfig(), upsampler_precision: str = 'fp32', vae_config: VAEConfig = WanVAEConfig(), vae_precision: str = 'fp32', vae_tiling: bool = False, vae_sp: bool = False, image_encoder_config: EncoderConfig = EncoderConfig(), image_encoder_precision: str = 'fp32', text_encoder_configs: tuple[EncoderConfig, ...] = (lambda: (T5GemmaEncoderConfig(),))(), text_encoder_precisions: tuple[str, ...] = (lambda: ('bf16',))(), preprocess_text_funcs: tuple[Callable[[str], str], ...] = (lambda: (preprocess_text,))(), postprocess_text_funcs: tuple[Callable[[BaseEncoderOutput], Tensor], ...] = (lambda: (t5gemma_postprocess_text,))(), dmd_denoising_steps: list[int] | None = None, ti2v_task: bool = False, boundary_ratio: float | None = None, audio_vae_config: VAEConfig = OobleckVAEConfig(), precision: str = 'bf16', t5_gemma_target_length: int = 640, fps: int = 25, num_inference_steps: int = 32, video_txt_guidance_scale: float = 5.0, audio_txt_guidance_scale: float = 5.0, cfg_number: int = 2, vae_stride: tuple[int, int, int] = (4, 16, 16), z_dim: int = 48, frame_receptive_field: int = 11, coords_style: str = 'v2', ref_audio_offset: int = 1000, text_offset: int = 0, video_guidance_high_t_threshold: int = 500, video_guidance_low_t_value: float = 2.0, image_conditioning: bool = True)
Bases: MagiHumanBaseConfig
Base MagiHuman text+image-to-AV pipeline config.
TI2V reuses the T2V DiT weights; the only pipeline-side difference is that a reference image is encoded with the Wan VAE and reinserted into the first video-latent frame before every denoise step.
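The re-insertion pattern can be illustrated with a toy loop. This is a sketch, assuming one scalar per latent frame and a placeholder update rule; `denoise_with_reference` and `toy_step` are hypothetical names and do not correspond to the pipeline's real stages or solver.

```python
from typing import Callable

Frames = list[float]  # toy stand-in: one scalar per latent frame


def denoise_with_reference(frames: Frames, ref_frame: float, num_steps: int,
                           step_fn: Callable[[Frames], Frames]) -> Frames:
    """Overwrite frame 0 with the reference before each denoise step."""
    frames = list(frames)
    for _ in range(num_steps):
        frames[0] = ref_frame      # re-insert the encoded reference first
        frames = step_fn(frames)   # then run one (toy) denoise step
    return frames


seen_first_frames = []


def toy_step(frames: Frames) -> Frames:
    seen_first_frames.append(frames[0])   # what the model saw for frame 0
    return [0.5 * f for f in frames]      # placeholder update, not a real solver


result = denoise_with_reference([9.0, 9.0, 9.0], ref_frame=1.0,
                                num_steps=3, step_fn=toy_step)
```

Every call to `toy_step` sees the clean reference value in frame 0, regardless of what the previous step did to it.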
fastvideo.pipelines.basic.magi_human.pipeline_configs.MagiHumanDistillConfig
dataclass
¶
MagiHumanDistillConfig(model_path: str = '', pipeline_config_path: str | None = None, embedded_cfg_scale: float = 6.0, flow_shift: float | None = 5.0, flow_shift_sr: float | None = None, disable_autocast: bool = False, is_causal: bool = False, dit_config: DiTConfig = MagiHumanVideoConfig(), dit_precision: str = 'bf16', upsampler_config: UpsamplerConfig = UpsamplerConfig(), upsampler_precision: str = 'fp32', vae_config: VAEConfig = WanVAEConfig(), vae_precision: str = 'fp32', vae_tiling: bool = False, vae_sp: bool = False, image_encoder_config: EncoderConfig = EncoderConfig(), image_encoder_precision: str = 'fp32', text_encoder_configs: tuple[EncoderConfig, ...] = (lambda: (T5GemmaEncoderConfig(),))(), text_encoder_precisions: tuple[str, ...] = (lambda: ('bf16',))(), preprocess_text_funcs: tuple[Callable[[str], str], ...] = (lambda: (preprocess_text,))(), postprocess_text_funcs: tuple[Callable[[BaseEncoderOutput], Tensor], ...] = (lambda: (t5gemma_postprocess_text,))(), dmd_denoising_steps: list[int] | None = None, ti2v_task: bool = False, boundary_ratio: float | None = None, audio_vae_config: VAEConfig = OobleckVAEConfig(), precision: str = 'bf16', t5_gemma_target_length: int = 640, fps: int = 25, num_inference_steps: int = 8, video_txt_guidance_scale: float = 5.0, audio_txt_guidance_scale: float = 5.0, cfg_number: int = 1, vae_stride: tuple[int, int, int] = (4, 16, 16), z_dim: int = 48, frame_receptive_field: int = 11, coords_style: str = 'v2', ref_audio_offset: int = 1000, text_offset: int = 0, video_guidance_high_t_threshold: int = 500, video_guidance_low_t_value: float = 2.0)
Bases: MagiHumanBaseConfig
DMD-2 distilled MagiHuman text-to-AV pipeline config.
Same architecture as the base model (identical 331 keys, same shapes, same module tree),
but trained via DMD-2 for 8-step inference without classifier-free
guidance. Weights are stored in fp32 upstream; the conversion script's
--cast-bf16 flag reduces the checkpoint to ~30 GB on disk.
fastvideo.pipelines.basic.magi_human.pipeline_configs.MagiHumanDistillI2VConfig
dataclass
¶
MagiHumanDistillI2VConfig(model_path: str = '', pipeline_config_path: str | None = None, embedded_cfg_scale: float = 6.0, flow_shift: float | None = 5.0, flow_shift_sr: float | None = None, disable_autocast: bool = False, is_causal: bool = False, dit_config: DiTConfig = MagiHumanVideoConfig(), dit_precision: str = 'bf16', upsampler_config: UpsamplerConfig = UpsamplerConfig(), upsampler_precision: str = 'fp32', vae_config: VAEConfig = WanVAEConfig(), vae_precision: str = 'fp32', vae_tiling: bool = False, vae_sp: bool = False, image_encoder_config: EncoderConfig = EncoderConfig(), image_encoder_precision: str = 'fp32', text_encoder_configs: tuple[EncoderConfig, ...] = (lambda: (T5GemmaEncoderConfig(),))(), text_encoder_precisions: tuple[str, ...] = (lambda: ('bf16',))(), preprocess_text_funcs: tuple[Callable[[str], str], ...] = (lambda: (preprocess_text,))(), postprocess_text_funcs: tuple[Callable[[BaseEncoderOutput], Tensor], ...] = (lambda: (t5gemma_postprocess_text,))(), dmd_denoising_steps: list[int] | None = None, ti2v_task: bool = False, boundary_ratio: float | None = None, audio_vae_config: VAEConfig = OobleckVAEConfig(), precision: str = 'bf16', t5_gemma_target_length: int = 640, fps: int = 25, num_inference_steps: int = 8, video_txt_guidance_scale: float = 5.0, audio_txt_guidance_scale: float = 5.0, cfg_number: int = 1, vae_stride: tuple[int, int, int] = (4, 16, 16), z_dim: int = 48, frame_receptive_field: int = 11, coords_style: str = 'v2', ref_audio_offset: int = 1000, text_offset: int = 0, video_guidance_high_t_threshold: int = 500, video_guidance_low_t_value: float = 2.0, image_conditioning: bool = True)
fastvideo.pipelines.basic.magi_human.pipeline_configs.MagiHumanSR1080pConfig
dataclass
¶
MagiHumanSR1080pConfig(model_path: str = '', pipeline_config_path: str | None = None, embedded_cfg_scale: float = 6.0, flow_shift: float | None = 5.0, flow_shift_sr: float | None = None, disable_autocast: bool = False, is_causal: bool = False, dit_config: DiTConfig = MagiHumanVideoConfig(), dit_precision: str = 'bf16', upsampler_config: UpsamplerConfig = UpsamplerConfig(), upsampler_precision: str = 'fp32', vae_config: VAEConfig = WanVAEConfig(), vae_precision: str = 'fp32', vae_tiling: bool = False, vae_sp: bool = False, image_encoder_config: EncoderConfig = EncoderConfig(), image_encoder_precision: str = 'fp32', text_encoder_configs: tuple[EncoderConfig, ...] = (lambda: (T5GemmaEncoderConfig(),))(), text_encoder_precisions: tuple[str, ...] = (lambda: ('bf16',))(), preprocess_text_funcs: tuple[Callable[[str], str], ...] = (lambda: (preprocess_text,))(), postprocess_text_funcs: tuple[Callable[[BaseEncoderOutput], Tensor], ...] = (lambda: (t5gemma_postprocess_text,))(), dmd_denoising_steps: list[int] | None = None, ti2v_task: bool = False, boundary_ratio: float | None = None, audio_vae_config: VAEConfig = OobleckVAEConfig(), precision: str = 'bf16', t5_gemma_target_length: int = 640, fps: int = 25, num_inference_steps: int = 32, video_txt_guidance_scale: float = 5.0, audio_txt_guidance_scale: float = 5.0, cfg_number: int = 2, vae_stride: tuple[int, int, int] = (4, 16, 16), z_dim: int = 48, frame_receptive_field: int = 11, coords_style: str = 'v2', ref_audio_offset: int = 1000, text_offset: int = 0, video_guidance_high_t_threshold: int = 500, video_guidance_low_t_value: float = 2.0, noise_value: int = 220, sr_audio_noise_scale: float = 0.7, sr_num_inference_steps: int = 5, sr_video_txt_guidance_scale: float = 3.5, use_cfg_trick: bool = True, cfg_trick_start_frame: int = 13, cfg_trick_value: float = 2.0, sr_height: int = 1080, sr_width: int = 1920, sr_local_attn_layers: tuple[int, ...] = _SR_1080P_LOCAL_ATTN_LAYERS)
fastvideo.pipelines.basic.magi_human.pipeline_configs.MagiHumanSR1080pI2VConfig
dataclass
¶
MagiHumanSR1080pI2VConfig(model_path: str = '', pipeline_config_path: str | None = None, embedded_cfg_scale: float = 6.0, flow_shift: float | None = 5.0, flow_shift_sr: float | None = None, disable_autocast: bool = False, is_causal: bool = False, dit_config: DiTConfig = MagiHumanVideoConfig(), dit_precision: str = 'bf16', upsampler_config: UpsamplerConfig = UpsamplerConfig(), upsampler_precision: str = 'fp32', vae_config: VAEConfig = WanVAEConfig(), vae_precision: str = 'fp32', vae_tiling: bool = False, vae_sp: bool = False, image_encoder_config: EncoderConfig = EncoderConfig(), image_encoder_precision: str = 'fp32', text_encoder_configs: tuple[EncoderConfig, ...] = (lambda: (T5GemmaEncoderConfig(),))(), text_encoder_precisions: tuple[str, ...] = (lambda: ('bf16',))(), preprocess_text_funcs: tuple[Callable[[str], str], ...] = (lambda: (preprocess_text,))(), postprocess_text_funcs: tuple[Callable[[BaseEncoderOutput], Tensor], ...] = (lambda: (t5gemma_postprocess_text,))(), dmd_denoising_steps: list[int] | None = None, ti2v_task: bool = False, boundary_ratio: float | None = None, audio_vae_config: VAEConfig = OobleckVAEConfig(), precision: str = 'bf16', t5_gemma_target_length: int = 640, fps: int = 25, num_inference_steps: int = 32, video_txt_guidance_scale: float = 5.0, audio_txt_guidance_scale: float = 5.0, cfg_number: int = 2, vae_stride: tuple[int, int, int] = (4, 16, 16), z_dim: int = 48, frame_receptive_field: int = 11, coords_style: str = 'v2', ref_audio_offset: int = 1000, text_offset: int = 0, video_guidance_high_t_threshold: int = 500, video_guidance_low_t_value: float = 2.0, noise_value: int = 220, sr_audio_noise_scale: float = 0.7, sr_num_inference_steps: int = 5, sr_video_txt_guidance_scale: float = 3.5, use_cfg_trick: bool = True, cfg_trick_start_frame: int = 13, cfg_trick_value: float = 2.0, sr_height: int = 1080, sr_width: int = 1920, sr_local_attn_layers: tuple[int, ...] = _SR_1080P_LOCAL_ATTN_LAYERS, image_conditioning: bool = True)
fastvideo.pipelines.basic.magi_human.pipeline_configs.MagiHumanSR540pConfig
dataclass
¶
MagiHumanSR540pConfig(model_path: str = '', pipeline_config_path: str | None = None, embedded_cfg_scale: float = 6.0, flow_shift: float | None = 5.0, flow_shift_sr: float | None = None, disable_autocast: bool = False, is_causal: bool = False, dit_config: DiTConfig = MagiHumanVideoConfig(), dit_precision: str = 'bf16', upsampler_config: UpsamplerConfig = UpsamplerConfig(), upsampler_precision: str = 'fp32', vae_config: VAEConfig = WanVAEConfig(), vae_precision: str = 'fp32', vae_tiling: bool = False, vae_sp: bool = False, image_encoder_config: EncoderConfig = EncoderConfig(), image_encoder_precision: str = 'fp32', text_encoder_configs: tuple[EncoderConfig, ...] = (lambda: (T5GemmaEncoderConfig(),))(), text_encoder_precisions: tuple[str, ...] = (lambda: ('bf16',))(), preprocess_text_funcs: tuple[Callable[[str], str], ...] = (lambda: (preprocess_text,))(), postprocess_text_funcs: tuple[Callable[[BaseEncoderOutput], Tensor], ...] = (lambda: (t5gemma_postprocess_text,))(), dmd_denoising_steps: list[int] | None = None, ti2v_task: bool = False, boundary_ratio: float | None = None, audio_vae_config: VAEConfig = OobleckVAEConfig(), precision: str = 'bf16', t5_gemma_target_length: int = 640, fps: int = 25, num_inference_steps: int = 32, video_txt_guidance_scale: float = 5.0, audio_txt_guidance_scale: float = 5.0, cfg_number: int = 2, vae_stride: tuple[int, int, int] = (4, 16, 16), z_dim: int = 48, frame_receptive_field: int = 11, coords_style: str = 'v2', ref_audio_offset: int = 1000, text_offset: int = 0, video_guidance_high_t_threshold: int = 500, video_guidance_low_t_value: float = 2.0, noise_value: int = 220, sr_audio_noise_scale: float = 0.7, sr_num_inference_steps: int = 5, sr_video_txt_guidance_scale: float = 3.5, use_cfg_trick: bool = True, cfg_trick_start_frame: int = 13, cfg_trick_value: float = 2.0, sr_height: int = 512, sr_width: int = 896, sr_local_attn_layers: tuple[int, ...] = ())
fastvideo.pipelines.basic.magi_human.pipeline_configs.MagiHumanSR540pI2VConfig
dataclass
¶
MagiHumanSR540pI2VConfig(model_path: str = '', pipeline_config_path: str | None = None, embedded_cfg_scale: float = 6.0, flow_shift: float | None = 5.0, flow_shift_sr: float | None = None, disable_autocast: bool = False, is_causal: bool = False, dit_config: DiTConfig = MagiHumanVideoConfig(), dit_precision: str = 'bf16', upsampler_config: UpsamplerConfig = UpsamplerConfig(), upsampler_precision: str = 'fp32', vae_config: VAEConfig = WanVAEConfig(), vae_precision: str = 'fp32', vae_tiling: bool = False, vae_sp: bool = False, image_encoder_config: EncoderConfig = EncoderConfig(), image_encoder_precision: str = 'fp32', text_encoder_configs: tuple[EncoderConfig, ...] = (lambda: (T5GemmaEncoderConfig(),))(), text_encoder_precisions: tuple[str, ...] = (lambda: ('bf16',))(), preprocess_text_funcs: tuple[Callable[[str], str], ...] = (lambda: (preprocess_text,))(), postprocess_text_funcs: tuple[Callable[[BaseEncoderOutput], Tensor], ...] = (lambda: (t5gemma_postprocess_text,))(), dmd_denoising_steps: list[int] | None = None, ti2v_task: bool = False, boundary_ratio: float | None = None, audio_vae_config: VAEConfig = OobleckVAEConfig(), precision: str = 'bf16', t5_gemma_target_length: int = 640, fps: int = 25, num_inference_steps: int = 32, video_txt_guidance_scale: float = 5.0, audio_txt_guidance_scale: float = 5.0, cfg_number: int = 2, vae_stride: tuple[int, int, int] = (4, 16, 16), z_dim: int = 48, frame_receptive_field: int = 11, coords_style: str = 'v2', ref_audio_offset: int = 1000, text_offset: int = 0, video_guidance_high_t_threshold: int = 500, video_guidance_low_t_value: float = 2.0, noise_value: int = 220, sr_audio_noise_scale: float = 0.7, sr_num_inference_steps: int = 5, sr_video_txt_guidance_scale: float = 3.5, use_cfg_trick: bool = True, cfg_trick_start_frame: int = 13, cfg_trick_value: float = 2.0, sr_height: int = 512, sr_width: int = 896, sr_local_attn_layers: tuple[int, ...] = (), image_conditioning: bool = True)
Functions¶
fastvideo.pipelines.basic.magi_human.pipeline_configs.t5gemma_postprocess_text
¶
Return per-prompt last_hidden_state as a batched [B, L, D] tensor.
MagiHuman pads/trims the embedding to a fixed length in its own
pad_or_trim helper at pipeline time. Here we simply pass through
whatever the text encoder produced; the latent-prep stage is responsible
for pad/trim so that the original context length can be preserved.
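A pad-or-trim step of this kind might look like the sketch below. It is not the real pad_or_trim helper: zero padding at the end, truncation from the end, and clamping the reported original length to the target are all assumptions made for illustration, and plain nested lists stand in for tensors.

```python
def pad_or_trim(rows: list[list[float]],
                target_length: int) -> tuple[list[list[float]], int]:
    """Return (rows padded/truncated to target_length, original length kept)."""
    width = len(rows[0]) if rows else 0
    # ASSUMPTION: the reported original length is clamped to the target.
    original_len = min(len(rows), target_length)
    trimmed = [list(r) for r in rows[:target_length]]
    # ASSUMPTION: missing positions are filled with zero rows at the end.
    padding = [[0.0] * width for _ in range(target_length - len(trimmed))]
    return trimmed + padding, original_len


embed = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]    # toy [L=3, D=2] embedding
padded, kept = pad_or_trim(embed, 5)            # pad 3 -> 5
short, kept_short = pad_or_trim(embed, 2)       # trim 3 -> 2
```

Returning the original length alongside the padded rows is what lets a later stage distinguish real tokens from padding.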
Source code in fastvideo/pipelines/basic/magi_human/pipeline_configs.py
fastvideo.pipelines.basic.magi_human.presets
¶
Presets for the daVinci-MagiHuman pipelines.
Classes¶
fastvideo.pipelines.basic.magi_human.stages
¶
Classes¶
fastvideo.pipelines.basic.magi_human.stages.MagiHumanAudioDecodingStage
¶
MagiHumanAudioDecodingStage(audio_vae, time_stretching: float = _UPSTREAM_AUDIO_TIME_STRETCH)
Bases: PipelineStage
Decode batch.audio_latents to a waveform using Stable Audio's VAE.
The VAE is loaded lazily by SAAudioVAEModel.sa_audio_vae_model — the
first call triggers a snapshot_download (requires HF token + accepted
terms on stabilityai/stable-audio-open-1.0).
Source code in fastvideo/pipelines/basic/magi_human/stages/audio_decoding.py
fastvideo.pipelines.basic.magi_human.stages.MagiHumanDenoisingStage
¶
MagiHumanDenoisingStage(transformer, scheduler, patch_size: tuple[int, int, int] = (1, 2, 2), video_in_channels: int = 192, audio_in_channels: int = 64, video_txt_guidance_scale: float = 5.0, audio_txt_guidance_scale: float = 5.0, cfg_number: int = 2, coords_style: str = 'v2', video_guidance_high_t_threshold: int = 500, video_guidance_low_t_value: float = 2.0)
Bases: PipelineStage
UniPC-flow joint denoising with CFG=2 over (video, audio) latents.
Source code in fastvideo/pipelines/basic/magi_human/stages/denoising.py
fastvideo.pipelines.basic.magi_human.stages.MagiHumanLatentPreparationStage
¶
MagiHumanLatentPreparationStage(vae_stride: tuple[int, int, int] = (4, 16, 16), z_dim: int = 48, patch_size: tuple[int, int, int] = (1, 2, 2), fps: int = 25, t5_gemma_target_length: int = 640, coords_style: Literal['v1', 'v2'] = 'v2', text_offset: int = 0, audio_in_channels: int = 64)
Bases: PipelineStage
Prepare latents, coords, modality maps, and padded text embed.
Source code in fastvideo/pipelines/basic/magi_human/stages/latent_preparation.py
fastvideo.pipelines.basic.magi_human.stages.MagiHumanReferenceImageStage
¶
Bases: PipelineStage
Encode a TI2V reference image into the first-frame video latent.
Source code in fastvideo/pipelines/basic/magi_human/stages/reference_image.py
fastvideo.pipelines.basic.magi_human.stages.MagiHumanSRDenoisingStage
¶
MagiHumanSRDenoisingStage(transformer, scheduler, patch_size: tuple[int, int, int] = (1, 2, 2), video_in_channels: int = 192, audio_in_channels: int = 64, sr_num_inference_steps: int = 5, sr_video_txt_guidance_scale: float = 3.5, use_cfg_trick: bool = True, cfg_trick_start_frame: int = 13, cfg_trick_value: float = 2.0, cfg_number: int = 2, coords_style: str = 'v1')
Bases: PipelineStage
Denoise only the SR video latent; audio passes through unchanged.
Source code in fastvideo/pipelines/basic/magi_human/stages/sr_denoising.py
fastvideo.pipelines.basic.magi_human.stages.MagiHumanSRLatentPreparationStage
¶
MagiHumanSRLatentPreparationStage(vae: Any, vae_stride: tuple[int, int, int] = (4, 16, 16), patch_size: tuple[int, int, int] = (1, 2, 2), noise_value: int = 220, sr_audio_noise_scale: float = 0.7, sr_height: int = 512, sr_width: int = 896, vae_scale_factor: int = 16)
Bases: PipelineStage
Upsample base latents, add SR noise, and refresh SR conditioning.
Source code in fastvideo/pipelines/basic/magi_human/stages/sr_latent_preparation.py
Modules¶
fastvideo.pipelines.basic.magi_human.stages.audio_decoding
¶
Audio decoding stage for daVinci-MagiHuman.
Takes the denoised audio latent that MagiHumanDenoisingStage leaves
on batch.audio_latents and decodes it to a waveform using the
Stable Audio Open 1.0 VAE. Mirrors the upstream post-process path
(see MagiEvaluator.post_process in
daVinci-MagiHuman/inference/pipeline/video_generate.py:503):
latent_audio.squeeze(0) # (L, C_latent)
audio = self.audio_vae.decode(latent_audio.T) # (1, audio_ch, samples)
audio = audio.squeeze(0).T.cpu().numpy() # (samples, audio_ch)
audio = resample_audio_sinc(audio, _UPSTREAM_AUDIO_TIME_STRETCH)
The stage stores the resampled waveform on batch.extra["audio"]
(shape [samples, audio_channels]) and the sample rate on
batch.extra["audio_sample_rate"]. FastVideo's VideoGenerator._mux_audio
then reads those, writes a temp wav, and muxes it into the output mp4
via PyAV — same plumbing LTX-2 and Stable Audio use.
Classes¶
fastvideo.pipelines.basic.magi_human.stages.audio_decoding.MagiHumanAudioDecodingStage
¶
MagiHumanAudioDecodingStage(audio_vae, time_stretching: float = _UPSTREAM_AUDIO_TIME_STRETCH)
Bases: PipelineStage
Decode batch.audio_latents to a waveform using Stable Audio's VAE.
The VAE is loaded lazily by SAAudioVAEModel.sa_audio_vae_model — the
first call triggers a snapshot_download (requires HF token + accepted
terms on stabilityai/stable-audio-open-1.0).
Source code in fastvideo/pipelines/basic/magi_human/stages/audio_decoding.py
fastvideo.pipelines.basic.magi_human.stages.denoising
¶
Joint-modality denoising stage for daVinci-MagiHuman base text-to-AV.
Runs the FlowUniPC denoise loop with CFG=2 over video + audio latents
jointly. Text embeddings are already pad-or-trimmed to t5_gemma_target_length
by MagiHumanLatentPreparationStage; the original context lengths are
stashed on the batch as magi_original_text_lens / magi_original_neg_text_lens.
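For orientation, the standard classifier-free guidance combination is sketched below on scalars. Two points are assumptions rather than facts from this page: that "CFG=2" means one conditional plus one unconditional forward pass, and that video_guidance_high_t_threshold / video_guidance_low_t_value select a lower video guidance scale once the timestep drops to the threshold or below (inferred only from the parameter names). The function names are hypothetical.

```python
def cfg_combine(cond: float, uncond: float, scale: float) -> float:
    """Standard CFG: move from the unconditional output toward the conditional one."""
    return uncond + scale * (cond - uncond)


def video_guidance_scale(t: float, high_scale: float = 5.0,
                         high_t_threshold: float = 500,
                         low_t_value: float = 2.0) -> float:
    # ASSUMPTION: full scale above the threshold, reduced scale at or below it.
    return high_scale if t > high_t_threshold else low_t_value


guided_early = cfg_combine(cond=1.0, uncond=0.0, scale=video_guidance_scale(900))
guided_late = cfg_combine(cond=1.0, uncond=0.0, scale=video_guidance_scale(100))
```

The default values used here (5.0, 500, 2.0) are the constructor defaults shown in the stage signature.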
Classes¶
fastvideo.pipelines.basic.magi_human.stages.denoising.MagiHumanDenoisingStage
¶
MagiHumanDenoisingStage(transformer, scheduler, patch_size: tuple[int, int, int] = (1, 2, 2), video_in_channels: int = 192, audio_in_channels: int = 64, video_txt_guidance_scale: float = 5.0, audio_txt_guidance_scale: float = 5.0, cfg_number: int = 2, coords_style: str = 'v2', video_guidance_high_t_threshold: int = 500, video_guidance_low_t_value: float = 2.0)
Bases: PipelineStage
UniPC-flow joint denoising with CFG=2 over (video, audio) latents.
Source code in fastvideo/pipelines/basic/magi_human/stages/denoising.py
Functions¶
fastvideo.pipelines.basic.magi_human.stages.latent_preparation
¶
Latent preparation stage for daVinci-MagiHuman base text-to-AV.
Produces:
- a random video latent of shape [1, z_dim, latent_T, latent_H, latent_W],
- a random audio latent of shape [1, num_frames, 64] (the DiT jointly denoises both modalities),
- a padded T5-Gemma text embedding (target length 640) plus the original (pre-pad) context length, which the UniPC + CFG loop needs so the unconditional path sees the same padded length.
Also builds the per-token coords / modality map that the DiT consumes
(replicates the reference MagiDataProxy.process_input).
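The shape arithmetic implied by the default vae_stride (4, 16, 16), z_dim 48, and patch_size (1, 2, 2) can be worked through as below. This is a sketch: the temporal rule `(num_frames - 1) // stride + 1` is an assumption (a common convention for causal video VAEs, not confirmed on this page), the example resolution and frame count are arbitrary, and `latent_and_token_shapes` is a hypothetical helper.

```python
def latent_and_token_shapes(num_frames, height, width,
                            vae_stride=(4, 16, 16), z_dim=48,
                            patch_size=(1, 2, 2)):
    """Return (video latent shape, video token count, channels per token)."""
    st, sh, sw = vae_stride
    pt, ph, pw = patch_size
    # ASSUMPTION: first frame kept, then one latent frame per `st` pixel frames.
    latent_t = (num_frames - 1) // st + 1
    latent_h, latent_w = height // sh, width // sw
    latent_shape = (1, z_dim, latent_t, latent_h, latent_w)
    # Each token covers a pt x ph x pw patch of the latent grid.
    video_token_num = (latent_t // pt) * (latent_h // ph) * (latent_w // pw)
    channels_per_token = z_dim * pt * ph * pw
    return latent_shape, video_token_num, channels_per_token


shape, n_tokens, ch = latent_and_token_shapes(num_frames=101, height=256, width=448)
```

With the defaults, the per-token channel count is 48 × 1 × 2 × 2 = 192, which agrees with the video_in_channels default of 192 on the denoising stage.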
Classes¶
fastvideo.pipelines.basic.magi_human.stages.latent_preparation.MagiHumanLatentPreparationStage
¶
MagiHumanLatentPreparationStage(vae_stride: tuple[int, int, int] = (4, 16, 16), z_dim: int = 48, patch_size: tuple[int, int, int] = (1, 2, 2), fps: int = 25, t5_gemma_target_length: int = 640, coords_style: Literal['v1', 'v2'] = 'v2', text_offset: int = 0, audio_in_channels: int = 64)
Bases: PipelineStage
Prepare latents, coords, modality maps, and padded text embed.
Source code in fastvideo/pipelines/basic/magi_human/stages/latent_preparation.py
fastvideo.pipelines.basic.magi_human.stages.latent_preparation.StaticPackedInputs
¶
StaticPackedInputs(video_tokens: Tensor, audio_tokens: Tensor, video_coords: Tensor, audio_coords: Tensor, video_mm: Tensor, audio_mm: Tensor, max_ch: int)
Step-invariant packed inputs: video+audio tokens, coords, modality map.
Computed once before the denoise loop; reused for every cond/uncond call. Text tokens are NOT included here because cond/uncond have different lengths.
Source code in fastvideo/pipelines/basic/magi_human/stages/latent_preparation.py
fastvideo.pipelines.basic.magi_human.stages.latent_preparation.StaticPackedLayout
¶
StaticPackedLayout(video_coords: Tensor, audio_coords: Tensor, video_mm: Tensor, audio_mm: Tensor, max_ch: int, video_token_num: int, audio_feat_len: int)
Step- and value-invariant portion of the static packed inputs.
Coords, modality maps, and the channel-padding width depend only on the latent shape, audio length, channel widths, and patch sizes — all fixed for a single generation. Precompute once before the denoise loop and reuse on every step. Only the per-step token tensors must be rebuilt.
Source code in fastvideo/pipelines/basic/magi_human/stages/latent_preparation.py
Functions¶
fastvideo.pipelines.basic.magi_human.stages.latent_preparation.assemble_packed_inputs
¶
assemble_packed_inputs(static: StaticPackedInputs, txt_feat: Tensor, txt_feat_len: int, coords_style: Literal['v1', 'v2'] = 'v2', text_offset: int = 0) -> tuple[Tensor, Tensor, Tensor]
Attach per-call text tokens to the precomputed static packed inputs.
Returns (token_seq, coords, modality_map) ready for the DiT.
Source code in fastvideo/pipelines/basic/magi_human/stages/latent_preparation.py
fastvideo.pipelines.basic.magi_human.stages.latent_preparation.build_packed_inputs
¶
build_packed_inputs(video_latent: Tensor, audio_latent: Tensor, audio_feat_len: int, txt_feat: Tensor, txt_feat_len: int, patch_size: tuple[int, int, int], coords_style: Literal['v1', 'v2'] = 'v2', text_offset: int = 0) -> tuple[Tensor, Tensor, Tensor]
Build the full packed token stream in one call (backwards-compat wrapper).
Equivalent to assemble_packed_inputs(build_static_packed_inputs(...), ...). Prefer calling the two helpers separately when the static portion can be reused across multiple calls (e.g. cond/uncond in the denoise loop).
Source code in fastvideo/pipelines/basic/magi_human/stages/latent_preparation.py
fastvideo.pipelines.basic.magi_human.stages.latent_preparation.build_static_packed_inputs
¶
build_static_packed_inputs(video_latent: Tensor, audio_latent: Tensor, audio_feat_len: int, patch_size: tuple[int, int, int], coords_style: Literal['v1', 'v2'] = 'v2', layout: StaticPackedLayout | None = None) -> StaticPackedInputs
Build the step-invariant portion of the packed token stream.
Returns video+audio tokens (padded to a common channel width), their coords, and their modality slices. Text is excluded because cond/uncond differ in length; call assemble_packed_inputs to attach text per call.
Mirrors SingleData.token_sequence / coords_mapping / modality_mapping in inference/pipeline/data_proxy.py, minus the text portion.
When layout is provided, coords / modality maps / max_ch are taken
from the precomputed values and only the per-step token tensors are
rebuilt; this is the hot-path call from the denoising loop. When
layout is None the function recomputes everything from scratch
(e.g. for one-shot tests via build_packed_inputs).
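Padding two token streams to a common channel width can be pictured with the sketch below. It uses nested lists instead of tensors, and it assumes zero padding on the right and video tokens placed before audio tokens; neither detail is confirmed here. `pad_channels` and `pack_video_audio` are hypothetical names.

```python
def pad_channels(tokens: list[list[float]], width: int) -> list[list[float]]:
    """Right-pad every token row with zeros up to `width` channels."""
    return [row + [0.0] * (width - len(row)) for row in tokens]


def pack_video_audio(video_tokens, audio_tokens):
    """Concatenate video and audio tokens after padding to the wider width."""
    max_ch = max(len(video_tokens[0]), len(audio_tokens[0]))
    # ASSUMPTION: video tokens come first, audio tokens second.
    packed = pad_channels(video_tokens, max_ch) + pad_channels(audio_tokens, max_ch)
    return packed, max_ch


video = [[1.0] * 192 for _ in range(4)]   # 4 video tokens, 192 channels each
audio = [[2.0] * 64 for _ in range(3)]    # 3 audio tokens, 64 channels each
packed, max_ch = pack_video_audio(video, audio)
```

With the default widths (192 for video, 64 for audio), the audio rows carry 128 padding channels each.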
Source code in fastvideo/pipelines/basic/magi_human/stages/latent_preparation.py
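Padding to a common channel width can be sketched as follows. This is an illustration only: `pad_channels` is a hypothetical helper, zero-filling is an assumption about how the padding is done, and the widths 192 and 64 are taken from the `video_in_channels` and `audio_in_channels` defaults listed elsewhere on this page.

```python
def pad_channels(tokens, width):
    """Right-pad each token (a list of floats) with zeros up to `width`."""
    return [tok + [0.0] * (width - len(tok)) for tok in tokens]


video_tokens = [[1.0] * 192 for _ in range(4)]  # 4 video tokens, 192 channels
audio_tokens = [[2.0] * 64 for _ in range(3)]   # 3 audio tokens, 64 channels

# The wider modality sets the common width (max_ch).
max_ch = max(len(video_tokens[0]), len(audio_tokens[0]))

# One flat token stream, plus slices recording where each modality lives.
packed = pad_channels(video_tokens, max_ch) + pad_channels(audio_tokens, max_ch)
modality_slices = {
    "video": slice(0, len(video_tokens)),
    "audio": slice(len(video_tokens), len(packed)),
}
```

After padding, every token has the same width, so video and audio can be stacked into a single sequence and recovered later through the stored slices.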
fastvideo.pipelines.basic.magi_human.stages.latent_preparation.precompute_static_packed_layout
¶precompute_static_packed_layout(latent_shape: tuple[int, int, int, int, int], audio_feat_len: int, z_dim: int, audio_in_channels: int, patch_size: tuple[int, int, int], coords_style: Literal['v1', 'v2'] = 'v2', device: device | None = None, dtype: dtype = float32) -> StaticPackedLayout
Precompute the invariant fields used by build_static_packed_inputs.
All arguments come from configs and the latent shape; none depend on the current denoising-step values. Call this once in the latent preparation stage (or at any other pre-loop site) and pass the result via the layout= argument of build_static_packed_inputs to skip the meshgrid/full() work on every step.
Source code in fastvideo/pipelines/basic/magi_human/stages/latent_preparation.py
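The benefit of precomputing the layout can be shown with a small counting sketch. The names `make_coords` and `build_static` are hypothetical, the coordinate grid is a plain list of `(t, h, w)` tuples rather than a tensor, and the latent size is chosen only for illustration.

```python
from itertools import product

calls = {"coords": 0}  # counts how often the coordinate grid is rebuilt


def make_coords(grid):
    """Enumerate (t, h, w) patch coordinates; stands in for the meshgrid work."""
    calls["coords"] += 1
    gt, gh, gw = grid
    return list(product(range(gt), range(gh), range(gw)))


def build_static(step_tokens, grid, layout=None):
    """Reuse `layout` when given, otherwise recompute the coords from scratch."""
    coords = layout if layout is not None else make_coords(grid)
    return step_tokens, coords


# Latent (T, H, W) = (5, 32, 56) with patch_size (1, 2, 2) gives 5 x 16 x 28 patches.
grid = (5 // 1, 32 // 2, 56 // 2)
layout = make_coords(grid)  # computed once, before the loop

for step in range(3):
    tokens, coords = build_static([f"tok@{step}"], grid, layout=layout)
```

With `layout` passed in, the grid is built a single time regardless of how many steps run; omitting it would rebuild the grid on every call.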
fastvideo.pipelines.basic.magi_human.stages.latent_preparation.unpack_tokens
¶unpack_tokens(output: Tensor, video_token_num: int, audio_feat_len: int, video_in_channels: int, audio_in_channels: int, latent_shape: tuple[int, int, int, int, int], patch_size: tuple[int, int, int]) -> tuple[Tensor, Tensor]
Inverse of build_packed_inputs for the DiT output.
Splits the flat output back into a video latent (un-patched into (B, C, T, H, W)) and an audio latent of shape (B, L, 64).
Source code in fastvideo/pipelines/basic/magi_human/stages/latent_preparation.py
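A rough, single-sample sketch of the split and un-patch steps is given below using nested lists. The helper names are hypothetical, the batch dimension is omitted, and the within-token ordering (channel, then patch t, h, w) is an assumption; the real function may lay out channels differently.

```python
def split_output(output, video_token_num, audio_feat_len, video_ch, audio_ch):
    """Slice the flat token list by modality and drop channel padding."""
    video = [tok[:video_ch] for tok in output[:video_token_num]]
    audio_end = video_token_num + audio_feat_len
    audio = [tok[:audio_ch] for tok in output[video_token_num:audio_end]]
    return video, audio


def unpatchify(tokens, channels, grid, patch):
    """Scatter patch tokens back into a [C][T][H][W] nested list."""
    gt, gh, gw = grid
    pt, ph, pw = patch
    T, H, W = gt * pt, gh * ph, gw * pw
    out = [[[[0.0] * W for _ in range(H)] for _ in range(T)] for _ in range(channels)]
    for n, tok in enumerate(tokens):
        t, rem = divmod(n, gh * gw)          # patch position on the grid
        h, w = divmod(rem, gw)
        for k, val in enumerate(tok):
            ch, rem_k = divmod(k, pt * ph * pw)  # assumed channel-major layout
            i, rem_k = divmod(rem_k, ph * pw)
            j, l = divmod(rem_k, pw)
            out[ch][t * pt + i][h * ph + j][w * pw + l] = val
    return out


grid, patch, channels = (1, 1, 2), (1, 2, 2), 2
video_ch = channels * patch[0] * patch[1] * patch[2]  # 8 values per video token
audio_ch = 3
# 2 video tokens followed by 2 audio tokens, all padded to width 8.
output = [[n * 100 + k for k in range(video_ch)] for n in range(4)]

video_tokens, audio_tokens = split_output(output, 2, 2, video_ch, audio_ch)
video_latent = unpatchify(video_tokens, channels, grid, patch)
```

Here the value `n * 100 + k` encodes the token index and channel index, which makes it easy to check where each entry lands in the un-patched latent.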
fastvideo.pipelines.basic.magi_human.stages.reference_image
¶
Reference-image encoding for MagiHuman TI2V.
The upstream daVinci-MagiHuman TI2V path encodes the user image through the
Wan VAE and overwrites the first denoising latent frame with that clean latent
at every step. This stage mirrors MagiEvaluator.encode_image and stashes the
normalized latent on batch.image_latent for the latent-prep and denoise stages.
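The "overwrite the first latent frame at every step" behaviour can be sketched with scalars standing in for latent frames. `denoise_with_reference` and `toy_step` are hypothetical names, and pinning the frame before each model call (rather than after) is an assumption.

```python
def denoise_with_reference(frames, image_latent, num_steps, step_fn):
    """Run `step_fn` num_steps times, re-pinning frame 0 before every call."""
    frames = list(frames)
    for step in range(num_steps):
        frames[0] = image_latent  # the clean reference replaces whatever was there
        frames = step_fn(frames, step)
    return frames


seen_first = []  # records what the toy "model" saw in frame 0


def toy_step(frames, step):
    seen_first.append(frames[0])
    return [0.5 * f for f in frames]  # placeholder update, not a real sampler


result = denoise_with_reference(
    [8.0, 8.0, 8.0], image_latent=1.0, num_steps=3, step_fn=toy_step
)
```

The model never sees a noisy first frame: `seen_first` holds the clean value at every step, even though the update modifies frame 0 each time.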
Classes¶
fastvideo.pipelines.basic.magi_human.stages.reference_image.MagiHumanReferenceImageStage
¶
Bases: PipelineStage
Encode a TI2V reference image into the first-frame video latent.
Source code in fastvideo/pipelines/basic/magi_human/stages/reference_image.py
Functions¶
fastvideo.pipelines.basic.magi_human.stages.sr_denoising
¶
SR video-only denoising stage for daVinci-MagiHuman SR-540p.
Classes¶
fastvideo.pipelines.basic.magi_human.stages.sr_denoising.MagiHumanSRDenoisingStage
¶MagiHumanSRDenoisingStage(transformer, scheduler, patch_size: tuple[int, int, int] = (1, 2, 2), video_in_channels: int = 192, audio_in_channels: int = 64, sr_num_inference_steps: int = 5, sr_video_txt_guidance_scale: float = 3.5, use_cfg_trick: bool = True, cfg_trick_start_frame: int = 13, cfg_trick_value: float = 2.0, cfg_number: int = 2, coords_style: str = 'v1')
Bases: PipelineStage
Denoise only the SR video latent; audio passes through unchanged.
Source code in fastvideo/pipelines/basic/magi_human/stages/sr_denoising.py
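A video-only step with audio pass-through might look like the sketch below. It assumes standard classifier-free guidance over a conditional and an unconditional prediction and a plain Euler-style update; `cfg_combine`, `sr_step`, and `toy_predict` are hypothetical, and the `use_cfg_trick` parameters are not modelled here.

```python
def cfg_combine(cond, uncond, scale):
    """Classifier-free guidance: push the prediction away from uncond."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


def sr_step(video, audio, predict, scale=3.5, dt=0.2):
    """Update only the video latent; hand the audio latent back untouched."""
    v_cond = predict(video, text="prompt")
    v_uncond = predict(video, text="")
    velocity = cfg_combine(v_cond, v_uncond, scale)
    new_video = [x - dt * v for x, v in zip(video, velocity)]  # assumed Euler step
    return new_video, audio


def toy_predict(video, text):
    # Placeholder model: the prompt simply changes the output magnitude.
    return [x * (1.0 if text else 0.5) for x in video]


video, audio = [2.0, 4.0], [0.1, 0.2, 0.3]
new_video, new_audio = sr_step(video, audio, toy_predict)
```

The default `scale=3.5` mirrors `sr_video_txt_guidance_scale`, and `dt=0.2` corresponds to five equal steps, matching the `sr_num_inference_steps` default.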
Functions¶
fastvideo.pipelines.basic.magi_human.stages.sr_latent_preparation
¶
Super-resolution latent preparation for daVinci-MagiHuman SR-540p.
Classes¶
fastvideo.pipelines.basic.magi_human.stages.sr_latent_preparation.MagiHumanSRLatentPreparationStage
¶MagiHumanSRLatentPreparationStage(vae: Any, vae_stride: tuple[int, int, int] = (4, 16, 16), patch_size: tuple[int, int, int] = (1, 2, 2), noise_value: int = 220, sr_audio_noise_scale: float = 0.7, sr_height: int = 512, sr_width: int = 896, vae_scale_factor: int = 16)
Bases: PipelineStage
Upsample base latents, add SR noise, and refresh SR conditioning.
Source code in fastvideo/pipelines/basic/magi_human/stages/sr_latent_preparation.py
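As a worked example of the default sizes, the sketch below derives the SR latent grid and video token count. It assumes the latent grid is the pixel size divided by the spatial VAE stride and that tokens are non-overlapping patches; `sr_latent_grid` is a hypothetical helper and the frame count is illustrative.

```python
def sr_latent_grid(num_latent_frames, sr_height, sr_width, vae_stride, patch_size):
    """Return the latent (H, W) and the number of video patch tokens."""
    _, stride_h, stride_w = vae_stride
    pt, ph, pw = patch_size
    lat_h, lat_w = sr_height // stride_h, sr_width // stride_w
    tokens = (num_latent_frames // pt) * (lat_h // ph) * (lat_w // pw)
    return (lat_h, lat_w), tokens


# Defaults from the signature above, with 5 latent frames as an example.
latent_hw, video_tokens = sr_latent_grid(
    5, sr_height=512, sr_width=896, vae_stride=(4, 16, 16), patch_size=(1, 2, 2)
)
```

Under these assumptions a 512 x 896 target maps to a 32 x 56 latent grid, or 16 x 28 patches per latent frame.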
fastvideo.pipelines.basic.magi_human.stages.sr_latent_preparation.ZeroSNRDDPMDiscretization
¶ZeroSNRDDPMDiscretization(linear_start: float = 0.00085, linear_end: float = 0.012, num_timesteps: int = 1000, shift_scale: float = 1.0, keep_start: bool = False, post_shift: bool = False)
Upstream ZeroSNR schedule used to corrupt interpolated SR latents.
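A minimal sketch of a zero-terminal-SNR schedule is shown below. It assumes scaled-linear betas and the common rescaling that shifts the final sqrt(alpha_bar) to zero while keeping the first value fixed; the `shift_scale`, `keep_start`, and `post_shift` options are not modelled, and both function names are hypothetical.

```python
import math


def zero_snr_sqrt_alphas_cumprod(linear_start=0.00085, linear_end=0.012,
                                 num_timesteps=1000):
    """Return sqrt(alpha_bar_t) for each t, rescaled so the last entry is 0."""
    lo, hi = math.sqrt(linear_start), math.sqrt(linear_end)
    # Assumed scaled-linear schedule: linear in sqrt(beta), then squared.
    betas = [(lo + (hi - lo) * i / (num_timesteps - 1)) ** 2
             for i in range(num_timesteps)]
    sqrt_acp, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        sqrt_acp.append(math.sqrt(prod))
    first, last = sqrt_acp[0], sqrt_acp[-1]
    # Shift so the final value is exactly 0, then rescale to keep the first value.
    return [(a - last) * first / (first - last) for a in sqrt_acp]


def corrupt(x, noise, sqrt_acp_t):
    """Forward-diffuse one value: sqrt(a) * x + sqrt(1 - a) * noise."""
    return sqrt_acp_t * x + math.sqrt(1.0 - sqrt_acp_t ** 2) * noise


schedule = zero_snr_sqrt_alphas_cumprod()
# Treating noise_value=220 as an index into this schedule is an assumption.
noisy = corrupt(1.0, 0.5, schedule[220])
```

At the final timestep the signal coefficient is exactly zero, so the corrupted sample is pure noise; that is the property that gives the schedule its name.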