magi_human

Architecture / model config for the daVinci-MagiHuman DiT.

The MagiHuman base DiT is a 15B-parameter single-stream transformer that jointly denoises video, audio, and text tokens in one flat sequence. Layout details verified against GAIR/daVinci-MagiHuman's base/ shards (2026-04-24).

This file captures only configuration. The module implementation lives in fastvideo/models/dits/magi_human.py and the pipeline wiring in fastvideo/pipelines/basic/magi_human/.

Classes

fastvideo.configs.models.dits.magi_human.MagiHumanArchConfig dataclass

MagiHumanArchConfig(
    stacked_params_mapping: list[tuple[str, str, str]] = list(),
    _fsdp_shard_conditions: list = (lambda: [_is_block_layer])(),
    _compile_conditions: list = list(),
    param_names_mapping: dict = dict(),
    reverse_param_names_mapping: dict = dict(),
    lora_param_names_mapping: dict = dict(),
    _supported_attention_backends: tuple[AttentionBackendEnum, ...] = (SAGE_ATTN, FLASH_ATTN, TORCH_SDPA, VIDEO_SPARSE_ATTN, VMOBA_ATTN, SAGE_ATTN_THREE, SLA_ATTN, SAGE_SLA_ATTN),
    hidden_size: int = 5120,
    num_attention_heads: int = 0,
    num_channels_latents: int = 0,
    in_channels: int = 0,
    out_channels: int = 0,
    exclude_lora_layers: list[str] = list(),
    boundary_ratio: float | None = None,
    num_layers: int = 40,
    head_dim: int = 128,
    num_query_groups: int = 8,
    video_in_channels: int = 192,
    audio_in_channels: int = 64,
    text_in_channels: int = 3584,
    mm_layers: tuple[int, ...] = (0, 1, 2, 3, 36, 37, 38, 39),
    local_attn_layers: tuple[int, ...] = (),
    gelu7_layers: tuple[int, ...] = (0, 1, 2, 3),
    post_norm_layers: tuple[int, ...] = (),
    enable_attn_gating: bool = True,
    activation_type: str = 'swiglu7',
    patch_size: tuple[int, int, int] = (1, 2, 2),
    spatial_rope_interpolation: str = 'extra',
    tread_selection_rate: float = 0.5,
    tread_start_layer_idx: int = 2,
    tread_end_layer_idx: int = 25,
    num_heads_kv: int = 0,
)

Bases: DiTArchConfig

MagiHuman base DiT architecture constants.
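
As a reading aid, the defaults in the signature can be turned into a few derived numbers and a per-layer lookup. The sketch below copies the default values from the signature. Everything else is an assumption made for illustration: that the query-head count is `hidden_size // head_dim` (the `num_attention_heads` default of 0 suggests it is derived elsewhere), that `num_query_groups` divides those heads evenly, and that the TREAD layer range is inclusive at both ends. `layer_flags` is a hypothetical helper, not part of FastVideo.

```python
# Defaults copied from the MagiHumanArchConfig signature.
HIDDEN_SIZE = 5120
HEAD_DIM = 128
NUM_QUERY_GROUPS = 8
NUM_LAYERS = 40
MM_LAYERS = (0, 1, 2, 3, 36, 37, 38, 39)
GELU7_LAYERS = (0, 1, 2, 3)
TREAD_START, TREAD_END = 2, 25

# Assumption: the number of query heads is hidden_size // head_dim.
derived_query_heads = HIDDEN_SIZE // HEAD_DIM

# Assumption: query heads are split evenly across the query groups.
heads_per_group = derived_query_heads // NUM_QUERY_GROUPS


def layer_flags(idx: int) -> dict[str, bool]:
    """Hypothetical helper: report which index lists contain layer `idx`."""
    if not 0 <= idx < NUM_LAYERS:
        raise IndexError(f"layer index {idx} outside 0..{NUM_LAYERS - 1}")
    return {
        "mm": idx in MM_LAYERS,
        "gelu7": idx in GELU7_LAYERS,
        # Assumption: the TREAD range includes both endpoints.
        "tread": TREAD_START <= idx <= TREAD_END,
    }


if __name__ == "__main__":
    print("query heads:", derived_query_heads)
    print("heads per group:", heads_per_group)
    for i in (0, 2, 25, 36):
        print(i, layer_flags(i))
```

Under these assumptions, the first four and last four layers are the only ones listed in `mm_layers`, and the `gelu7_layers` default coincides with the first four.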

Scope contract: fields here must match, one-to-one, the transformer/config.json emitted by scripts/checkpoint_conversion/convert_magi_human_to_diffusers.py. Both are sourced from the upstream Python reference, inference/common/config.py::ModelConfig; the HF root config.json is empty, so the Python source is canonical. Pipeline-level knobs (VAE stride, fps, num_inference_steps, CFG scales, flow_shift, t5_gemma_target_length) and data-proxy knobs (coords_style, frame_receptive_field, ref_audio_offset, text_offset) live on MagiHumanBaseConfig, not here.
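
One way to picture the one-to-one contract is a small check that compares dataclass fields against a JSON config. The sketch below is illustrative only: `ToyArchConfig` is a hypothetical stand-in carrying a handful of fields from the signature, and `diff_against_json` is not a FastVideo function. Tuples are normalized through a JSON round trip, because JSON has no tuple type.

```python
import dataclasses
import json


@dataclasses.dataclass
class ToyArchConfig:
    """Hypothetical stand-in with a few fields from MagiHumanArchConfig."""
    hidden_size: int = 5120
    num_layers: int = 40
    head_dim: int = 128
    mm_layers: tuple[int, ...] = (0, 1, 2, 3, 36, 37, 38, 39)


def diff_against_json(cfg, json_text: str):
    """Return (missing_in_json, extra_in_json, mismatched) as sorted key lists."""
    # Round-trip through JSON so tuples compare equal to JSON lists.
    expected = json.loads(json.dumps(dataclasses.asdict(cfg)))
    actual = json.loads(json_text)
    missing = sorted(set(expected) - set(actual))
    extra = sorted(set(actual) - set(expected))
    mismatched = sorted(
        k for k in set(expected) & set(actual) if expected[k] != actual[k]
    )
    return missing, extra, mismatched


if __name__ == "__main__":
    # A pipeline-level knob such as flow_shift shows up as an extra key.
    text = json.dumps({
        "hidden_size": 5120,
        "num_layers": 40,
        "head_dim": 128,
        "mm_layers": [0, 1, 2, 3, 36, 37, 38, 39],
        "flow_shift": 5.0,  # made-up value, for illustration only
    })
    print(diff_against_json(ToyArchConfig(), text))
```

A check of this shape reports a pipeline-level key that leaks into the architecture config as an extra key, which is the separation the scope contract describes.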

param_names_mapping is intentionally empty: the FastVideo implementation keeps the same module tree as the reference (adapter.*, block.layers.<i>.*, final_linear_{video,audio}.*, final_norm_{video,audio}.*), so converted weights load directly.
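
The effect of an empty mapping can be sketched as follows. This assumes, for illustration, that a name mapping is a dict of regex pattern to replacement string applied to checkpoint keys; the actual loader in FastVideo may consume the mapping differently. `remap_key` is a hypothetical helper, and the key suffixes are made up (only the prefixes come from the module tree listed above).

```python
import re


def remap_key(key: str, mapping: dict[str, str]) -> str:
    """Hypothetical helper: apply the first matching regex rule, else keep the key."""
    for pattern, replacement in mapping.items():
        if re.match(pattern, key):
            return re.sub(pattern, replacement, key)
    return key


# Illustrative keys; only the prefixes follow the module tree described above.
keys = [
    "adapter.example.weight",
    "block.layers.0.example.weight",
    "final_linear_video.weight",
    "final_norm_audio.weight",
]

# With an empty mapping, every key passes through unchanged.
remapped = [remap_key(k, {}) for k in keys]

if __name__ == "__main__":
    for before, after in zip(keys, remapped):
        print(before, "->", after)
```

With a non-empty mapping, the same helper would rewrite matching keys; the empty default means the converted checkpoint and the module tree already agree on names.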