pipeline_configs

PipelineConfig for the daVinci-MagiHuman base text-to-AV pipeline.

Classes

fastvideo.pipelines.basic.magi_human.pipeline_configs.MagiHumanBaseConfig dataclass

MagiHumanBaseConfig(model_path: str = '', pipeline_config_path: str | None = None, embedded_cfg_scale: float = 6.0, flow_shift: float | None = 5.0, flow_shift_sr: float | None = None, disable_autocast: bool = False, is_causal: bool = False, dit_config: DiTConfig = MagiHumanVideoConfig(), dit_precision: str = 'bf16', upsampler_config: UpsamplerConfig = UpsamplerConfig(), upsampler_precision: str = 'fp32', vae_config: VAEConfig = WanVAEConfig(), vae_precision: str = 'fp32', vae_tiling: bool = False, vae_sp: bool = False, image_encoder_config: EncoderConfig = EncoderConfig(), image_encoder_precision: str = 'fp32', text_encoder_configs: tuple[EncoderConfig, ...] = (lambda: (T5GemmaEncoderConfig(),))(), text_encoder_precisions: tuple[str, ...] = (lambda: ('bf16',))(), preprocess_text_funcs: tuple[Callable[[str], str], ...] = (lambda: (preprocess_text,))(), postprocess_text_funcs: tuple[Callable[[BaseEncoderOutput], Tensor], ...] = (lambda: (t5gemma_postprocess_text,))(), dmd_denoising_steps: list[int] | None = None, ti2v_task: bool = False, boundary_ratio: float | None = None, audio_vae_config: VAEConfig = OobleckVAEConfig(), precision: str = 'bf16', t5_gemma_target_length: int = 640, fps: int = 25, num_inference_steps: int = 32, video_txt_guidance_scale: float = 5.0, audio_txt_guidance_scale: float = 5.0, cfg_number: int = 2, vae_stride: tuple[int, int, int] = (4, 16, 16), z_dim: int = 48, frame_receptive_field: int = 11, coords_style: str = 'v2', ref_audio_offset: int = 1000, text_offset: int = 0, video_guidance_high_t_threshold: int = 500, video_guidance_low_t_value: float = 2.0)

Bases: PipelineConfig

Base MagiHuman text-to-AV pipeline config (prompt → video + audio).

MagiHuman's base model is a joint audio-visual generator. This config wires up both the video VAE (Wan 2.2 TI2V-5B) and the audio VAE (Stable Audio Open 1.0); the pipeline produces an mp4 with a muxed audio track. The framework's WorkloadType enum has no T2AV variant yet, so the registry entry uses WorkloadType.T2V as a placeholder.

fastvideo.pipelines.basic.magi_human.pipeline_configs.MagiHumanBaseI2VConfig dataclass

MagiHumanBaseI2VConfig(model_path: str = '', pipeline_config_path: str | None = None, embedded_cfg_scale: float = 6.0, flow_shift: float | None = 5.0, flow_shift_sr: float | None = None, disable_autocast: bool = False, is_causal: bool = False, dit_config: DiTConfig = MagiHumanVideoConfig(), dit_precision: str = 'bf16', upsampler_config: UpsamplerConfig = UpsamplerConfig(), upsampler_precision: str = 'fp32', vae_config: VAEConfig = WanVAEConfig(), vae_precision: str = 'fp32', vae_tiling: bool = False, vae_sp: bool = False, image_encoder_config: EncoderConfig = EncoderConfig(), image_encoder_precision: str = 'fp32', text_encoder_configs: tuple[EncoderConfig, ...] = (lambda: (T5GemmaEncoderConfig(),))(), text_encoder_precisions: tuple[str, ...] = (lambda: ('bf16',))(), preprocess_text_funcs: tuple[Callable[[str], str], ...] = (lambda: (preprocess_text,))(), postprocess_text_funcs: tuple[Callable[[BaseEncoderOutput], Tensor], ...] = (lambda: (t5gemma_postprocess_text,))(), dmd_denoising_steps: list[int] | None = None, ti2v_task: bool = False, boundary_ratio: float | None = None, audio_vae_config: VAEConfig = OobleckVAEConfig(), precision: str = 'bf16', t5_gemma_target_length: int = 640, fps: int = 25, num_inference_steps: int = 32, video_txt_guidance_scale: float = 5.0, audio_txt_guidance_scale: float = 5.0, cfg_number: int = 2, vae_stride: tuple[int, int, int] = (4, 16, 16), z_dim: int = 48, frame_receptive_field: int = 11, coords_style: str = 'v2', ref_audio_offset: int = 1000, text_offset: int = 0, video_guidance_high_t_threshold: int = 500, video_guidance_low_t_value: float = 2.0, image_conditioning: bool = True)

Bases: MagiHumanBaseConfig

Base MagiHuman text+image-to-AV pipeline config.

TI2V reuses the T2V DiT weights; the only pipeline-side difference is that a reference image is encoded with the Wan VAE and reinserted into the first video-latent frame before every denoise step.
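The reinsertion described above can be sketched roughly as follows. This is a minimal illustration in pure Python, not the pipeline's actual code: `reinsert_reference`, `denoise_with_reference`, and `step_fn` are hypothetical names, and plain lists of floats stand in for the latent tensors the real pipeline would use.

```python
from typing import Callable, List

Frame = List[float]    # one flattened latent frame (stand-in for a tensor slice)
Latents = List[Frame]  # a latent video as a list of frames


def reinsert_reference(latents: Latents, ref_latent: Frame) -> Latents:
    """Return a copy of `latents` whose first frame is the reference latent."""
    if not latents:
        raise ValueError("latents must contain at least one frame")
    if len(ref_latent) != len(latents[0]):
        raise ValueError("reference latent must match the frame size")
    return [list(ref_latent)] + [list(frame) for frame in latents[1:]]


def denoise_with_reference(
    latents: Latents,
    ref_latent: Frame,
    step_fn: Callable[[Latents, int], Latents],
    num_steps: int,
) -> Latents:
    """Run `num_steps` denoise steps, restoring frame 0 before each one.

    `step_fn` is a hypothetical stand-in for one DiT denoise step.
    """
    for step in range(num_steps):
        # The reference frame is written back *before* the step, so the
        # denoiser always sees the clean encoded image in frame 0.
        latents = reinsert_reference(latents, ref_latent)
        latents = step_fn(latents, step)
    return latents
```

Because the reference frame is restored before each step rather than after it, whatever the denoiser does to frame 0 in one step is overwritten before the next step begins.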

fastvideo.pipelines.basic.magi_human.pipeline_configs.MagiHumanDistillConfig dataclass

MagiHumanDistillConfig(model_path: str = '', pipeline_config_path: str | None = None, embedded_cfg_scale: float = 6.0, flow_shift: float | None = 5.0, flow_shift_sr: float | None = None, disable_autocast: bool = False, is_causal: bool = False, dit_config: DiTConfig = MagiHumanVideoConfig(), dit_precision: str = 'bf16', upsampler_config: UpsamplerConfig = UpsamplerConfig(), upsampler_precision: str = 'fp32', vae_config: VAEConfig = WanVAEConfig(), vae_precision: str = 'fp32', vae_tiling: bool = False, vae_sp: bool = False, image_encoder_config: EncoderConfig = EncoderConfig(), image_encoder_precision: str = 'fp32', text_encoder_configs: tuple[EncoderConfig, ...] = (lambda: (T5GemmaEncoderConfig(),))(), text_encoder_precisions: tuple[str, ...] = (lambda: ('bf16',))(), preprocess_text_funcs: tuple[Callable[[str], str], ...] = (lambda: (preprocess_text,))(), postprocess_text_funcs: tuple[Callable[[BaseEncoderOutput], Tensor], ...] = (lambda: (t5gemma_postprocess_text,))(), dmd_denoising_steps: list[int] | None = None, ti2v_task: bool = False, boundary_ratio: float | None = None, audio_vae_config: VAEConfig = OobleckVAEConfig(), precision: str = 'bf16', t5_gemma_target_length: int = 640, fps: int = 25, num_inference_steps: int = 8, video_txt_guidance_scale: float = 5.0, audio_txt_guidance_scale: float = 5.0, cfg_number: int = 1, vae_stride: tuple[int, int, int] = (4, 16, 16), z_dim: int = 48, frame_receptive_field: int = 11, coords_style: str = 'v2', ref_audio_offset: int = 1000, text_offset: int = 0, video_guidance_high_t_threshold: int = 500, video_guidance_low_t_value: float = 2.0)

Bases: MagiHumanBaseConfig

DMD-2 distilled MagiHuman text-to-AV pipeline config.

Same architecture as the base model (identical 331 keys, same shapes, same module tree), but trained with DMD-2 for 8-step inference without classifier-free guidance. The upstream weights are stored in fp32; the conversion script's --cast-bf16 flag reduces the checkpoint to ~30 GB on disk.
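Comparing the two signatures listed on this page, the distilled config appears to differ from the base config only in two defaults: `num_inference_steps` (32 to 8) and `cfg_number` (2 to 1). The sketch below shows how such an override looks with dataclass inheritance. `BaseConfigSketch`, `DistillConfigSketch`, and `changed_defaults` are hypothetical stand-ins carrying only those two fields, not the real FastVideo classes, and reading `cfg_number = 1` as "no classifier-free guidance" is an assumption based on the description above.

```python
from dataclasses import dataclass, fields


@dataclass
class BaseConfigSketch:
    """Hypothetical stand-in for MagiHumanBaseConfig (two fields only)."""
    num_inference_steps: int = 32
    cfg_number: int = 2


@dataclass
class DistillConfigSketch(BaseConfigSketch):
    """Hypothetical stand-in for MagiHumanDistillConfig."""
    num_inference_steps: int = 8  # 8-step DMD-2 inference
    cfg_number: int = 1           # assumed to mean a single forward pass per step


def changed_defaults(base_cls, derived_cls) -> dict:
    """Map each field whose default differs to a (base, derived) pair."""
    base = {f.name: f.default for f in fields(base_cls)}
    return {
        f.name: (base[f.name], f.default)
        for f in fields(derived_cls)
        if f.name in base and base[f.name] != f.default
    }
```

Calling `changed_defaults(BaseConfigSketch, DistillConfigSketch)` lists exactly the overridden fields, which is a convenient way to audit a subclass config against its parent.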

fastvideo.pipelines.basic.magi_human.pipeline_configs.MagiHumanDistillI2VConfig dataclass

MagiHumanDistillI2VConfig(model_path: str = '', pipeline_config_path: str | None = None, embedded_cfg_scale: float = 6.0, flow_shift: float | None = 5.0, flow_shift_sr: float | None = None, disable_autocast: bool = False, is_causal: bool = False, dit_config: DiTConfig = MagiHumanVideoConfig(), dit_precision: str = 'bf16', upsampler_config: UpsamplerConfig = UpsamplerConfig(), upsampler_precision: str = 'fp32', vae_config: VAEConfig = WanVAEConfig(), vae_precision: str = 'fp32', vae_tiling: bool = False, vae_sp: bool = False, image_encoder_config: EncoderConfig = EncoderConfig(), image_encoder_precision: str = 'fp32', text_encoder_configs: tuple[EncoderConfig, ...] = (lambda: (T5GemmaEncoderConfig(),))(), text_encoder_precisions: tuple[str, ...] = (lambda: ('bf16',))(), preprocess_text_funcs: tuple[Callable[[str], str], ...] = (lambda: (preprocess_text,))(), postprocess_text_funcs: tuple[Callable[[BaseEncoderOutput], Tensor], ...] = (lambda: (t5gemma_postprocess_text,))(), dmd_denoising_steps: list[int] | None = None, ti2v_task: bool = False, boundary_ratio: float | None = None, audio_vae_config: VAEConfig = OobleckVAEConfig(), precision: str = 'bf16', t5_gemma_target_length: int = 640, fps: int = 25, num_inference_steps: int = 8, video_txt_guidance_scale: float = 5.0, audio_txt_guidance_scale: float = 5.0, cfg_number: int = 1, vae_stride: tuple[int, int, int] = (4, 16, 16), z_dim: int = 48, frame_receptive_field: int = 11, coords_style: str = 'v2', ref_audio_offset: int = 1000, text_offset: int = 0, video_guidance_high_t_threshold: int = 500, video_guidance_low_t_value: float = 2.0, image_conditioning: bool = True)

Bases: MagiHumanDistillConfig

DMD-2 distilled MagiHuman text+image-to-AV pipeline config.

fastvideo.pipelines.basic.magi_human.pipeline_configs.MagiHumanSR1080pConfig dataclass

MagiHumanSR1080pConfig(model_path: str = '', pipeline_config_path: str | None = None, embedded_cfg_scale: float = 6.0, flow_shift: float | None = 5.0, flow_shift_sr: float | None = None, disable_autocast: bool = False, is_causal: bool = False, dit_config: DiTConfig = MagiHumanVideoConfig(), dit_precision: str = 'bf16', upsampler_config: UpsamplerConfig = UpsamplerConfig(), upsampler_precision: str = 'fp32', vae_config: VAEConfig = WanVAEConfig(), vae_precision: str = 'fp32', vae_tiling: bool = False, vae_sp: bool = False, image_encoder_config: EncoderConfig = EncoderConfig(), image_encoder_precision: str = 'fp32', text_encoder_configs: tuple[EncoderConfig, ...] = (lambda: (T5GemmaEncoderConfig(),))(), text_encoder_precisions: tuple[str, ...] = (lambda: ('bf16',))(), preprocess_text_funcs: tuple[Callable[[str], str], ...] = (lambda: (preprocess_text,))(), postprocess_text_funcs: tuple[Callable[[BaseEncoderOutput], Tensor], ...] = (lambda: (t5gemma_postprocess_text,))(), dmd_denoising_steps: list[int] | None = None, ti2v_task: bool = False, boundary_ratio: float | None = None, audio_vae_config: VAEConfig = OobleckVAEConfig(), precision: str = 'bf16', t5_gemma_target_length: int = 640, fps: int = 25, num_inference_steps: int = 32, video_txt_guidance_scale: float = 5.0, audio_txt_guidance_scale: float = 5.0, cfg_number: int = 2, vae_stride: tuple[int, int, int] = (4, 16, 16), z_dim: int = 48, frame_receptive_field: int = 11, coords_style: str = 'v2', ref_audio_offset: int = 1000, text_offset: int = 0, video_guidance_high_t_threshold: int = 500, video_guidance_low_t_value: float = 2.0, noise_value: int = 220, sr_audio_noise_scale: float = 0.7, sr_num_inference_steps: int = 5, sr_video_txt_guidance_scale: float = 3.5, use_cfg_trick: bool = True, cfg_trick_start_frame: int = 13, cfg_trick_value: float = 2.0, sr_height: int = 1080, sr_width: int = 1920, sr_local_attn_layers: tuple[int, ...] = _SR_1080P_LOCAL_ATTN_LAYERS)

Bases: MagiHumanSR540pConfig

Two-stage MagiHuman base + SR-1080p text-to-AV pipeline config.

fastvideo.pipelines.basic.magi_human.pipeline_configs.MagiHumanSR1080pI2VConfig dataclass

MagiHumanSR1080pI2VConfig(model_path: str = '', pipeline_config_path: str | None = None, embedded_cfg_scale: float = 6.0, flow_shift: float | None = 5.0, flow_shift_sr: float | None = None, disable_autocast: bool = False, is_causal: bool = False, dit_config: DiTConfig = MagiHumanVideoConfig(), dit_precision: str = 'bf16', upsampler_config: UpsamplerConfig = UpsamplerConfig(), upsampler_precision: str = 'fp32', vae_config: VAEConfig = WanVAEConfig(), vae_precision: str = 'fp32', vae_tiling: bool = False, vae_sp: bool = False, image_encoder_config: EncoderConfig = EncoderConfig(), image_encoder_precision: str = 'fp32', text_encoder_configs: tuple[EncoderConfig, ...] = (lambda: (T5GemmaEncoderConfig(),))(), text_encoder_precisions: tuple[str, ...] = (lambda: ('bf16',))(), preprocess_text_funcs: tuple[Callable[[str], str], ...] = (lambda: (preprocess_text,))(), postprocess_text_funcs: tuple[Callable[[BaseEncoderOutput], Tensor], ...] = (lambda: (t5gemma_postprocess_text,))(), dmd_denoising_steps: list[int] | None = None, ti2v_task: bool = False, boundary_ratio: float | None = None, audio_vae_config: VAEConfig = OobleckVAEConfig(), precision: str = 'bf16', t5_gemma_target_length: int = 640, fps: int = 25, num_inference_steps: int = 32, video_txt_guidance_scale: float = 5.0, audio_txt_guidance_scale: float = 5.0, cfg_number: int = 2, vae_stride: tuple[int, int, int] = (4, 16, 16), z_dim: int = 48, frame_receptive_field: int = 11, coords_style: str = 'v2', ref_audio_offset: int = 1000, text_offset: int = 0, video_guidance_high_t_threshold: int = 500, video_guidance_low_t_value: float = 2.0, noise_value: int = 220, sr_audio_noise_scale: float = 0.7, sr_num_inference_steps: int = 5, sr_video_txt_guidance_scale: float = 3.5, use_cfg_trick: bool = True, cfg_trick_start_frame: int = 13, cfg_trick_value: float = 2.0, sr_height: int = 1080, sr_width: int = 1920, sr_local_attn_layers: tuple[int, ...] = _SR_1080P_LOCAL_ATTN_LAYERS, image_conditioning: bool = True)

Bases: MagiHumanSR1080pConfig

Two-stage MagiHuman base + SR-1080p text+image-to-AV config.

fastvideo.pipelines.basic.magi_human.pipeline_configs.MagiHumanSR540pConfig dataclass

MagiHumanSR540pConfig(model_path: str = '', pipeline_config_path: str | None = None, embedded_cfg_scale: float = 6.0, flow_shift: float | None = 5.0, flow_shift_sr: float | None = None, disable_autocast: bool = False, is_causal: bool = False, dit_config: DiTConfig = MagiHumanVideoConfig(), dit_precision: str = 'bf16', upsampler_config: UpsamplerConfig = UpsamplerConfig(), upsampler_precision: str = 'fp32', vae_config: VAEConfig = WanVAEConfig(), vae_precision: str = 'fp32', vae_tiling: bool = False, vae_sp: bool = False, image_encoder_config: EncoderConfig = EncoderConfig(), image_encoder_precision: str = 'fp32', text_encoder_configs: tuple[EncoderConfig, ...] = (lambda: (T5GemmaEncoderConfig(),))(), text_encoder_precisions: tuple[str, ...] = (lambda: ('bf16',))(), preprocess_text_funcs: tuple[Callable[[str], str], ...] = (lambda: (preprocess_text,))(), postprocess_text_funcs: tuple[Callable[[BaseEncoderOutput], Tensor], ...] = (lambda: (t5gemma_postprocess_text,))(), dmd_denoising_steps: list[int] | None = None, ti2v_task: bool = False, boundary_ratio: float | None = None, audio_vae_config: VAEConfig = OobleckVAEConfig(), precision: str = 'bf16', t5_gemma_target_length: int = 640, fps: int = 25, num_inference_steps: int = 32, video_txt_guidance_scale: float = 5.0, audio_txt_guidance_scale: float = 5.0, cfg_number: int = 2, vae_stride: tuple[int, int, int] = (4, 16, 16), z_dim: int = 48, frame_receptive_field: int = 11, coords_style: str = 'v2', ref_audio_offset: int = 1000, text_offset: int = 0, video_guidance_high_t_threshold: int = 500, video_guidance_low_t_value: float = 2.0, noise_value: int = 220, sr_audio_noise_scale: float = 0.7, sr_num_inference_steps: int = 5, sr_video_txt_guidance_scale: float = 3.5, use_cfg_trick: bool = True, cfg_trick_start_frame: int = 13, cfg_trick_value: float = 2.0, sr_height: int = 512, sr_width: int = 896, sr_local_attn_layers: tuple[int, ...] = ())

Bases: MagiHumanBaseConfig

Two-stage MagiHuman base + SR-540p text-to-AV pipeline config.

fastvideo.pipelines.basic.magi_human.pipeline_configs.MagiHumanSR540pI2VConfig dataclass

MagiHumanSR540pI2VConfig(model_path: str = '', pipeline_config_path: str | None = None, embedded_cfg_scale: float = 6.0, flow_shift: float | None = 5.0, flow_shift_sr: float | None = None, disable_autocast: bool = False, is_causal: bool = False, dit_config: DiTConfig = MagiHumanVideoConfig(), dit_precision: str = 'bf16', upsampler_config: UpsamplerConfig = UpsamplerConfig(), upsampler_precision: str = 'fp32', vae_config: VAEConfig = WanVAEConfig(), vae_precision: str = 'fp32', vae_tiling: bool = False, vae_sp: bool = False, image_encoder_config: EncoderConfig = EncoderConfig(), image_encoder_precision: str = 'fp32', text_encoder_configs: tuple[EncoderConfig, ...] = (lambda: (T5GemmaEncoderConfig(),))(), text_encoder_precisions: tuple[str, ...] = (lambda: ('bf16',))(), preprocess_text_funcs: tuple[Callable[[str], str], ...] = (lambda: (preprocess_text,))(), postprocess_text_funcs: tuple[Callable[[BaseEncoderOutput], Tensor], ...] = (lambda: (t5gemma_postprocess_text,))(), dmd_denoising_steps: list[int] | None = None, ti2v_task: bool = False, boundary_ratio: float | None = None, audio_vae_config: VAEConfig = OobleckVAEConfig(), precision: str = 'bf16', t5_gemma_target_length: int = 640, fps: int = 25, num_inference_steps: int = 32, video_txt_guidance_scale: float = 5.0, audio_txt_guidance_scale: float = 5.0, cfg_number: int = 2, vae_stride: tuple[int, int, int] = (4, 16, 16), z_dim: int = 48, frame_receptive_field: int = 11, coords_style: str = 'v2', ref_audio_offset: int = 1000, text_offset: int = 0, video_guidance_high_t_threshold: int = 500, video_guidance_low_t_value: float = 2.0, noise_value: int = 220, sr_audio_noise_scale: float = 0.7, sr_num_inference_steps: int = 5, sr_video_txt_guidance_scale: float = 3.5, use_cfg_trick: bool = True, cfg_trick_start_frame: int = 13, cfg_trick_value: float = 2.0, sr_height: int = 512, sr_width: int = 896, sr_local_attn_layers: tuple[int, ...] = (), image_conditioning: bool = True)

Bases: MagiHumanSR540pConfig

Two-stage MagiHuman base + SR-540p text+image-to-AV config.

Functions

fastvideo.pipelines.basic.magi_human.pipeline_configs.t5gemma_postprocess_text

t5gemma_postprocess_text(outputs: BaseEncoderOutput) -> Tensor

Return per-prompt last_hidden_state as a batched [B, L, D] tensor.

MagiHuman pads or trims the embedding to a fixed length in its own pad_or_trim helper at pipeline time. This function simply passes the encoder output through at whatever sequence length the tokenizer produced; the latent-prep stage is responsible for pad/trim so that the original context length can be preserved.

Source code in fastvideo/pipelines/basic/magi_human/pipeline_configs.py
def t5gemma_postprocess_text(outputs: BaseEncoderOutput) -> torch.Tensor:
    """Return per-prompt last_hidden_state as a batched [B, L, D] tensor.

    MagiHuman pads/trims the embedding to a fixed length in its own
    `pad_or_trim` helper at pipeline time. Here we simply pass the encoder
    output through at whatever sequence length the tokenizer produced; the
    latent-prep stage is responsible for pad/trim so that the original
    context length can be preserved.
    """
    hidden = outputs.last_hidden_state
    # Fail fast if the text encoder produced any NaN values.
    assert not torch.isnan(hidden).any(), "NaN in text encoder output"
    # Keep the sequence length the tokenizer emitted; the pipeline stage
    # handles pad-or-trim to t5_gemma_target_length=640.
    return hidden
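The pad-or-trim step that the docstring defers to the pipeline is not shown on this page. As a rough sketch of the idea only, the function below pads or truncates a sequence of token embeddings to a target length. Its signature, the zero pad value, and padding on the right are all assumptions rather than the behavior of the real `pad_or_trim` helper, and nested lists stand in for the `[L, D]` tensor.

```python
from typing import List

Token = List[float]  # one D-dimensional token embedding


def pad_or_trim(
    embedding: List[Token],
    target_length: int,
    pad_value: float = 0.0,  # assumption: the real helper may pad differently
) -> List[Token]:
    """Return a copy of `embedding` with exactly `target_length` tokens."""
    if target_length < 0:
        raise ValueError("target_length must be non-negative")
    if len(embedding) >= target_length:
        # Too long (or exact): keep the first `target_length` tokens.
        return [list(tok) for tok in embedding[:target_length]]
    # Too short: append pad tokens on the right (assumed padding side).
    dim = len(embedding[0]) if embedding else 0
    missing = target_length - len(embedding)
    padding = [[pad_value] * dim for _ in range(missing)]
    return [list(tok) for tok in embedding] + padding
```

With the default `t5_gemma_target_length` of 640 from the signatures above, a 12-token prompt would gain 628 pad tokens under this sketch, while a 700-token prompt would lose its last 60.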