pipeline_configs
PipelineConfig for the daVinci-MagiHuman base text-to-AV pipeline.
Classes
fastvideo.pipelines.basic.magi_human.pipeline_configs.MagiHumanBaseConfig
dataclass
MagiHumanBaseConfig(model_path: str = '', pipeline_config_path: str | None = None, embedded_cfg_scale: float = 6.0, flow_shift: float | None = 5.0, flow_shift_sr: float | None = None, disable_autocast: bool = False, is_causal: bool = False, dit_config: DiTConfig = MagiHumanVideoConfig(), dit_precision: str = 'bf16', upsampler_config: UpsamplerConfig = UpsamplerConfig(), upsampler_precision: str = 'fp32', vae_config: VAEConfig = WanVAEConfig(), vae_precision: str = 'fp32', vae_tiling: bool = False, vae_sp: bool = False, image_encoder_config: EncoderConfig = EncoderConfig(), image_encoder_precision: str = 'fp32', text_encoder_configs: tuple[EncoderConfig, ...] = (lambda: (T5GemmaEncoderConfig(),))(), text_encoder_precisions: tuple[str, ...] = (lambda: ('bf16',))(), preprocess_text_funcs: tuple[Callable[[str], str], ...] = (lambda: (preprocess_text,))(), postprocess_text_funcs: tuple[Callable[[BaseEncoderOutput], Tensor], ...] = (lambda: (t5gemma_postprocess_text,))(), dmd_denoising_steps: list[int] | None = None, ti2v_task: bool = False, boundary_ratio: float | None = None, audio_vae_config: VAEConfig = OobleckVAEConfig(), precision: str = 'bf16', t5_gemma_target_length: int = 640, fps: int = 25, num_inference_steps: int = 32, video_txt_guidance_scale: float = 5.0, audio_txt_guidance_scale: float = 5.0, cfg_number: int = 2, vae_stride: tuple[int, int, int] = (4, 16, 16), z_dim: int = 48, frame_receptive_field: int = 11, coords_style: str = 'v2', ref_audio_offset: int = 1000, text_offset: int = 0, video_guidance_high_t_threshold: int = 500, video_guidance_low_t_value: float = 2.0)
Bases: PipelineConfig
Base MagiHuman text-to-AV pipeline config (prompt → video + audio).
MagiHuman's base model is a joint audio-visual generator. This config
wires up both the video VAE (Wan 2.2 TI2V-5B) and the audio VAE
(Stable Audio Open 1.0); the pipeline produces an mp4 with a muxed
audio track. The framework's WorkloadType enum has no T2AV
variant yet, so the registry entry uses WorkloadType.T2V as a
placeholder.
fastvideo.pipelines.basic.magi_human.pipeline_configs.MagiHumanBaseI2VConfig
dataclass
MagiHumanBaseI2VConfig(model_path: str = '', pipeline_config_path: str | None = None, embedded_cfg_scale: float = 6.0, flow_shift: float | None = 5.0, flow_shift_sr: float | None = None, disable_autocast: bool = False, is_causal: bool = False, dit_config: DiTConfig = MagiHumanVideoConfig(), dit_precision: str = 'bf16', upsampler_config: UpsamplerConfig = UpsamplerConfig(), upsampler_precision: str = 'fp32', vae_config: VAEConfig = WanVAEConfig(), vae_precision: str = 'fp32', vae_tiling: bool = False, vae_sp: bool = False, image_encoder_config: EncoderConfig = EncoderConfig(), image_encoder_precision: str = 'fp32', text_encoder_configs: tuple[EncoderConfig, ...] = (lambda: (T5GemmaEncoderConfig(),))(), text_encoder_precisions: tuple[str, ...] = (lambda: ('bf16',))(), preprocess_text_funcs: tuple[Callable[[str], str], ...] = (lambda: (preprocess_text,))(), postprocess_text_funcs: tuple[Callable[[BaseEncoderOutput], Tensor], ...] = (lambda: (t5gemma_postprocess_text,))(), dmd_denoising_steps: list[int] | None = None, ti2v_task: bool = False, boundary_ratio: float | None = None, audio_vae_config: VAEConfig = OobleckVAEConfig(), precision: str = 'bf16', t5_gemma_target_length: int = 640, fps: int = 25, num_inference_steps: int = 32, video_txt_guidance_scale: float = 5.0, audio_txt_guidance_scale: float = 5.0, cfg_number: int = 2, vae_stride: tuple[int, int, int] = (4, 16, 16), z_dim: int = 48, frame_receptive_field: int = 11, coords_style: str = 'v2', ref_audio_offset: int = 1000, text_offset: int = 0, video_guidance_high_t_threshold: int = 500, video_guidance_low_t_value: float = 2.0, image_conditioning: bool = True)
Bases: MagiHumanBaseConfig
Base MagiHuman text+image-to-AV pipeline config.
TI2V reuses the T2V DiT weights; the only pipeline-side difference is that a reference image is encoded with the Wan VAE and reinserted into the first video-latent frame before every denoise step.
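The re-insertion described above can be sketched as follows. This is a minimal illustration, not the pipeline's code: plain Python lists stand in for latent tensors, and the names `reinsert_reference`, `denoise_with_reference`, and the `denoise_step` callable are hypothetical.

```python
from typing import Callable, List

Frame = List[float]    # one flattened latent frame (stand-in for a tensor slice)
Latents = List[Frame]  # latent frames ordered along the time axis


def reinsert_reference(latents: Latents, ref_latent: Frame) -> Latents:
    """Return a copy of `latents` whose first frame is the reference latent."""
    if not latents:
        raise ValueError("latents must contain at least one frame")
    return [list(ref_latent)] + [list(frame) for frame in latents[1:]]


def denoise_with_reference(
    latents: Latents,
    ref_latent: Frame,
    denoise_step: Callable[[Latents, int], Latents],
    num_steps: int,
) -> Latents:
    """Overwrite frame 0 with the reference latent before every denoise step.

    `denoise_step` is a hypothetical stand-in for one DiT + scheduler update.
    """
    for step in range(num_steps):
        latents = reinsert_reference(latents, ref_latent)
        latents = denoise_step(latents, step)
    return latents
```

Because the overwrite happens before each step, the denoiser always sees the encoded reference image in the first frame, whatever it predicted for that frame on the previous step.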
fastvideo.pipelines.basic.magi_human.pipeline_configs.MagiHumanDistillConfig
dataclass
MagiHumanDistillConfig(model_path: str = '', pipeline_config_path: str | None = None, embedded_cfg_scale: float = 6.0, flow_shift: float | None = 5.0, flow_shift_sr: float | None = None, disable_autocast: bool = False, is_causal: bool = False, dit_config: DiTConfig = MagiHumanVideoConfig(), dit_precision: str = 'bf16', upsampler_config: UpsamplerConfig = UpsamplerConfig(), upsampler_precision: str = 'fp32', vae_config: VAEConfig = WanVAEConfig(), vae_precision: str = 'fp32', vae_tiling: bool = False, vae_sp: bool = False, image_encoder_config: EncoderConfig = EncoderConfig(), image_encoder_precision: str = 'fp32', text_encoder_configs: tuple[EncoderConfig, ...] = (lambda: (T5GemmaEncoderConfig(),))(), text_encoder_precisions: tuple[str, ...] = (lambda: ('bf16',))(), preprocess_text_funcs: tuple[Callable[[str], str], ...] = (lambda: (preprocess_text,))(), postprocess_text_funcs: tuple[Callable[[BaseEncoderOutput], Tensor], ...] = (lambda: (t5gemma_postprocess_text,))(), dmd_denoising_steps: list[int] | None = None, ti2v_task: bool = False, boundary_ratio: float | None = None, audio_vae_config: VAEConfig = OobleckVAEConfig(), precision: str = 'bf16', t5_gemma_target_length: int = 640, fps: int = 25, num_inference_steps: int = 8, video_txt_guidance_scale: float = 5.0, audio_txt_guidance_scale: float = 5.0, cfg_number: int = 1, vae_stride: tuple[int, int, int] = (4, 16, 16), z_dim: int = 48, frame_receptive_field: int = 11, coords_style: str = 'v2', ref_audio_offset: int = 1000, text_offset: int = 0, video_guidance_high_t_threshold: int = 500, video_guidance_low_t_value: float = 2.0)
Bases: MagiHumanBaseConfig
DMD-2 distilled MagiHuman text-to-AV pipeline config.
Same architecture as the base model (identical 331 keys, same shapes,
same module tree), but trained via DMD-2 for 8-step inference without
classifier-free guidance. Weights are stored in fp32 upstream; the
conversion script's --cast-bf16 flag reduces the checkpoint to ~30 GB on disk.
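Comparing the two signatures above, the distilled config differs from the base config in only two defaults: `num_inference_steps` (32 vs 8) and `cfg_number` (2 vs 1). The sketch below mirrors that with stand-in dataclasses. `BaseSketch`, `DistillSketch`, and `changed_defaults` are hypothetical names, not the FastVideo classes, and reading `cfg_number = 1` as "no classifier-free guidance" is an assumption based on the description above.

```python
from dataclasses import dataclass, fields


@dataclass
class BaseSketch:
    # Defaults copied from the MagiHumanBaseConfig signature (subset of fields).
    num_inference_steps: int = 32
    video_txt_guidance_scale: float = 5.0
    audio_txt_guidance_scale: float = 5.0
    cfg_number: int = 2


@dataclass
class DistillSketch(BaseSketch):
    # Defaults copied from the MagiHumanDistillConfig signature.
    num_inference_steps: int = 8
    cfg_number: int = 1


def changed_defaults(base, derived):
    """Map each field whose value differs to a (base, derived) pair."""
    return {
        f.name: (getattr(base, f.name), getattr(derived, f.name))
        for f in fields(base)
        if getattr(base, f.name) != getattr(derived, f.name)
    }


diff = changed_defaults(BaseSketch(), DistillSketch())
print(diff)
```

Overriding only the changed defaults in a subclass keeps the distilled variant in step with any later changes to the shared fields.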
fastvideo.pipelines.basic.magi_human.pipeline_configs.MagiHumanDistillI2VConfig
dataclass
MagiHumanDistillI2VConfig(model_path: str = '', pipeline_config_path: str | None = None, embedded_cfg_scale: float = 6.0, flow_shift: float | None = 5.0, flow_shift_sr: float | None = None, disable_autocast: bool = False, is_causal: bool = False, dit_config: DiTConfig = MagiHumanVideoConfig(), dit_precision: str = 'bf16', upsampler_config: UpsamplerConfig = UpsamplerConfig(), upsampler_precision: str = 'fp32', vae_config: VAEConfig = WanVAEConfig(), vae_precision: str = 'fp32', vae_tiling: bool = False, vae_sp: bool = False, image_encoder_config: EncoderConfig = EncoderConfig(), image_encoder_precision: str = 'fp32', text_encoder_configs: tuple[EncoderConfig, ...] = (lambda: (T5GemmaEncoderConfig(),))(), text_encoder_precisions: tuple[str, ...] = (lambda: ('bf16',))(), preprocess_text_funcs: tuple[Callable[[str], str], ...] = (lambda: (preprocess_text,))(), postprocess_text_funcs: tuple[Callable[[BaseEncoderOutput], Tensor], ...] = (lambda: (t5gemma_postprocess_text,))(), dmd_denoising_steps: list[int] | None = None, ti2v_task: bool = False, boundary_ratio: float | None = None, audio_vae_config: VAEConfig = OobleckVAEConfig(), precision: str = 'bf16', t5_gemma_target_length: int = 640, fps: int = 25, num_inference_steps: int = 8, video_txt_guidance_scale: float = 5.0, audio_txt_guidance_scale: float = 5.0, cfg_number: int = 1, vae_stride: tuple[int, int, int] = (4, 16, 16), z_dim: int = 48, frame_receptive_field: int = 11, coords_style: str = 'v2', ref_audio_offset: int = 1000, text_offset: int = 0, video_guidance_high_t_threshold: int = 500, video_guidance_low_t_value: float = 2.0, image_conditioning: bool = True)
fastvideo.pipelines.basic.magi_human.pipeline_configs.MagiHumanSR1080pConfig
dataclass
MagiHumanSR1080pConfig(model_path: str = '', pipeline_config_path: str | None = None, embedded_cfg_scale: float = 6.0, flow_shift: float | None = 5.0, flow_shift_sr: float | None = None, disable_autocast: bool = False, is_causal: bool = False, dit_config: DiTConfig = MagiHumanVideoConfig(), dit_precision: str = 'bf16', upsampler_config: UpsamplerConfig = UpsamplerConfig(), upsampler_precision: str = 'fp32', vae_config: VAEConfig = WanVAEConfig(), vae_precision: str = 'fp32', vae_tiling: bool = False, vae_sp: bool = False, image_encoder_config: EncoderConfig = EncoderConfig(), image_encoder_precision: str = 'fp32', text_encoder_configs: tuple[EncoderConfig, ...] = (lambda: (T5GemmaEncoderConfig(),))(), text_encoder_precisions: tuple[str, ...] = (lambda: ('bf16',))(), preprocess_text_funcs: tuple[Callable[[str], str], ...] = (lambda: (preprocess_text,))(), postprocess_text_funcs: tuple[Callable[[BaseEncoderOutput], Tensor], ...] = (lambda: (t5gemma_postprocess_text,))(), dmd_denoising_steps: list[int] | None = None, ti2v_task: bool = False, boundary_ratio: float | None = None, audio_vae_config: VAEConfig = OobleckVAEConfig(), precision: str = 'bf16', t5_gemma_target_length: int = 640, fps: int = 25, num_inference_steps: int = 32, video_txt_guidance_scale: float = 5.0, audio_txt_guidance_scale: float = 5.0, cfg_number: int = 2, vae_stride: tuple[int, int, int] = (4, 16, 16), z_dim: int = 48, frame_receptive_field: int = 11, coords_style: str = 'v2', ref_audio_offset: int = 1000, text_offset: int = 0, video_guidance_high_t_threshold: int = 500, video_guidance_low_t_value: float = 2.0, noise_value: int = 220, sr_audio_noise_scale: float = 0.7, sr_num_inference_steps: int = 5, sr_video_txt_guidance_scale: float = 3.5, use_cfg_trick: bool = True, cfg_trick_start_frame: int = 13, cfg_trick_value: float = 2.0, sr_height: int = 1080, sr_width: int = 1920, sr_local_attn_layers: tuple[int, ...] = _SR_1080P_LOCAL_ATTN_LAYERS)
fastvideo.pipelines.basic.magi_human.pipeline_configs.MagiHumanSR1080pI2VConfig
dataclass
MagiHumanSR1080pI2VConfig(model_path: str = '', pipeline_config_path: str | None = None, embedded_cfg_scale: float = 6.0, flow_shift: float | None = 5.0, flow_shift_sr: float | None = None, disable_autocast: bool = False, is_causal: bool = False, dit_config: DiTConfig = MagiHumanVideoConfig(), dit_precision: str = 'bf16', upsampler_config: UpsamplerConfig = UpsamplerConfig(), upsampler_precision: str = 'fp32', vae_config: VAEConfig = WanVAEConfig(), vae_precision: str = 'fp32', vae_tiling: bool = False, vae_sp: bool = False, image_encoder_config: EncoderConfig = EncoderConfig(), image_encoder_precision: str = 'fp32', text_encoder_configs: tuple[EncoderConfig, ...] = (lambda: (T5GemmaEncoderConfig(),))(), text_encoder_precisions: tuple[str, ...] = (lambda: ('bf16',))(), preprocess_text_funcs: tuple[Callable[[str], str], ...] = (lambda: (preprocess_text,))(), postprocess_text_funcs: tuple[Callable[[BaseEncoderOutput], Tensor], ...] = (lambda: (t5gemma_postprocess_text,))(), dmd_denoising_steps: list[int] | None = None, ti2v_task: bool = False, boundary_ratio: float | None = None, audio_vae_config: VAEConfig = OobleckVAEConfig(), precision: str = 'bf16', t5_gemma_target_length: int = 640, fps: int = 25, num_inference_steps: int = 32, video_txt_guidance_scale: float = 5.0, audio_txt_guidance_scale: float = 5.0, cfg_number: int = 2, vae_stride: tuple[int, int, int] = (4, 16, 16), z_dim: int = 48, frame_receptive_field: int = 11, coords_style: str = 'v2', ref_audio_offset: int = 1000, text_offset: int = 0, video_guidance_high_t_threshold: int = 500, video_guidance_low_t_value: float = 2.0, noise_value: int = 220, sr_audio_noise_scale: float = 0.7, sr_num_inference_steps: int = 5, sr_video_txt_guidance_scale: float = 3.5, use_cfg_trick: bool = True, cfg_trick_start_frame: int = 13, cfg_trick_value: float = 2.0, sr_height: int = 1080, sr_width: int = 1920, sr_local_attn_layers: tuple[int, ...] = _SR_1080P_LOCAL_ATTN_LAYERS, image_conditioning: bool = True)
fastvideo.pipelines.basic.magi_human.pipeline_configs.MagiHumanSR540pConfig
dataclass
MagiHumanSR540pConfig(model_path: str = '', pipeline_config_path: str | None = None, embedded_cfg_scale: float = 6.0, flow_shift: float | None = 5.0, flow_shift_sr: float | None = None, disable_autocast: bool = False, is_causal: bool = False, dit_config: DiTConfig = MagiHumanVideoConfig(), dit_precision: str = 'bf16', upsampler_config: UpsamplerConfig = UpsamplerConfig(), upsampler_precision: str = 'fp32', vae_config: VAEConfig = WanVAEConfig(), vae_precision: str = 'fp32', vae_tiling: bool = False, vae_sp: bool = False, image_encoder_config: EncoderConfig = EncoderConfig(), image_encoder_precision: str = 'fp32', text_encoder_configs: tuple[EncoderConfig, ...] = (lambda: (T5GemmaEncoderConfig(),))(), text_encoder_precisions: tuple[str, ...] = (lambda: ('bf16',))(), preprocess_text_funcs: tuple[Callable[[str], str], ...] = (lambda: (preprocess_text,))(), postprocess_text_funcs: tuple[Callable[[BaseEncoderOutput], Tensor], ...] = (lambda: (t5gemma_postprocess_text,))(), dmd_denoising_steps: list[int] | None = None, ti2v_task: bool = False, boundary_ratio: float | None = None, audio_vae_config: VAEConfig = OobleckVAEConfig(), precision: str = 'bf16', t5_gemma_target_length: int = 640, fps: int = 25, num_inference_steps: int = 32, video_txt_guidance_scale: float = 5.0, audio_txt_guidance_scale: float = 5.0, cfg_number: int = 2, vae_stride: tuple[int, int, int] = (4, 16, 16), z_dim: int = 48, frame_receptive_field: int = 11, coords_style: str = 'v2', ref_audio_offset: int = 1000, text_offset: int = 0, video_guidance_high_t_threshold: int = 500, video_guidance_low_t_value: float = 2.0, noise_value: int = 220, sr_audio_noise_scale: float = 0.7, sr_num_inference_steps: int = 5, sr_video_txt_guidance_scale: float = 3.5, use_cfg_trick: bool = True, cfg_trick_start_frame: int = 13, cfg_trick_value: float = 2.0, sr_height: int = 512, sr_width: int = 896, sr_local_attn_layers: tuple[int, ...] = ())
fastvideo.pipelines.basic.magi_human.pipeline_configs.MagiHumanSR540pI2VConfig
dataclass
MagiHumanSR540pI2VConfig(model_path: str = '', pipeline_config_path: str | None = None, embedded_cfg_scale: float = 6.0, flow_shift: float | None = 5.0, flow_shift_sr: float | None = None, disable_autocast: bool = False, is_causal: bool = False, dit_config: DiTConfig = MagiHumanVideoConfig(), dit_precision: str = 'bf16', upsampler_config: UpsamplerConfig = UpsamplerConfig(), upsampler_precision: str = 'fp32', vae_config: VAEConfig = WanVAEConfig(), vae_precision: str = 'fp32', vae_tiling: bool = False, vae_sp: bool = False, image_encoder_config: EncoderConfig = EncoderConfig(), image_encoder_precision: str = 'fp32', text_encoder_configs: tuple[EncoderConfig, ...] = (lambda: (T5GemmaEncoderConfig(),))(), text_encoder_precisions: tuple[str, ...] = (lambda: ('bf16',))(), preprocess_text_funcs: tuple[Callable[[str], str], ...] = (lambda: (preprocess_text,))(), postprocess_text_funcs: tuple[Callable[[BaseEncoderOutput], Tensor], ...] = (lambda: (t5gemma_postprocess_text,))(), dmd_denoising_steps: list[int] | None = None, ti2v_task: bool = False, boundary_ratio: float | None = None, audio_vae_config: VAEConfig = OobleckVAEConfig(), precision: str = 'bf16', t5_gemma_target_length: int = 640, fps: int = 25, num_inference_steps: int = 32, video_txt_guidance_scale: float = 5.0, audio_txt_guidance_scale: float = 5.0, cfg_number: int = 2, vae_stride: tuple[int, int, int] = (4, 16, 16), z_dim: int = 48, frame_receptive_field: int = 11, coords_style: str = 'v2', ref_audio_offset: int = 1000, text_offset: int = 0, video_guidance_high_t_threshold: int = 500, video_guidance_low_t_value: float = 2.0, noise_value: int = 220, sr_audio_noise_scale: float = 0.7, sr_num_inference_steps: int = 5, sr_video_txt_guidance_scale: float = 3.5, use_cfg_trick: bool = True, cfg_trick_start_frame: int = 13, cfg_trick_value: float = 2.0, sr_height: int = 512, sr_width: int = 896, sr_local_attn_layers: tuple[int, ...] = (), image_conditioning: bool = True)
Functions
fastvideo.pipelines.basic.magi_human.pipeline_configs.t5gemma_postprocess_text
Return the per-prompt last_hidden_state as a batched [B, L, D] tensor.
MagiHuman pads or trims the embedding to a fixed length in its own
pad_or_trim helper at pipeline time. This function passes the encoder
output through at whatever length the tokenizer produced; the latent-prep
stage is responsible for pad/trim so that the original context length can
be preserved.
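A pad-or-trim step of the kind mentioned above might look like the sketch below. It is not MagiHuman's actual `pad_or_trim` helper: the name `pad_or_trim_sketch`, the zero padding value, and returning the valid length alongside the result are all assumptions made for illustration, and nested lists stand in for a single prompt's [L, D] embedding.

```python
from typing import List, Tuple

Vector = List[float]


def pad_or_trim_sketch(
    embedding: List[Vector],
    target_length: int,
    pad_value: float = 0.0,
) -> Tuple[List[Vector], int]:
    """Pad or trim an [L, D] embedding to `target_length` rows.

    Returns the fixed-length embedding and the number of valid (unpadded)
    rows, so the original context length is not lost.
    """
    if target_length < 0:
        raise ValueError("target_length must be non-negative")
    length = len(embedding)
    if length >= target_length:
        # Too long (or exact): keep the first `target_length` token vectors.
        return [list(v) for v in embedding[:target_length]], target_length
    # Too short: append rows of `pad_value` with the same hidden size.
    dim = len(embedding[0]) if embedding else 0
    padding = [[pad_value] * dim for _ in range(target_length - length)]
    return [list(v) for v in embedding] + padding, length
```

With the default `t5_gemma_target_length` of 640 from the signatures above, a 12-token prompt would be padded to 640 rows while the returned valid length stays 12.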