models
¶
Classes¶
fastvideo.configs.models.DiTConfig
dataclass
¶
DiTConfig(arch_config: DiTArchConfig = DiTArchConfig(), prefix: str = '', quant_config: QuantizationConfig | None = None)
Bases: ModelConfig
Methods:¶
fastvideo.configs.models.DiTConfig.add_cli_args
staticmethod
¶
Add CLI arguments for DiTConfig fields
Source code in fastvideo/configs/models/dits/base.py
fastvideo.configs.models.VAEConfig
dataclass
¶
VAEConfig(arch_config: VAEArchConfig = VAEArchConfig(), load_encoder: bool = True, load_decoder: bool = True, tile_sample_min_height: int = 256, tile_sample_min_width: int = 256, tile_sample_min_num_frames: int = 16, tile_sample_stride_height: int = 192, tile_sample_stride_width: int = 192, tile_sample_stride_num_frames: int = 12, blend_num_frames: int = 0, use_tiling: bool = True, use_temporal_tiling: bool = True, use_parallel_tiling: bool = True, use_temporal_scaling_frames: bool = True)
Bases: ModelConfig
Methods:¶
fastvideo.configs.models.VAEConfig.add_cli_args
staticmethod
¶
Add CLI arguments for VAEConfig fields
Source code in fastvideo/configs/models/vaes/base.py
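The tiling defaults in the VAEConfig signature (a minimum tile of 256 with a stride of 192) imply that neighbouring tiles overlap. The sketch below shows one common way such a minimum size and stride translate into tile spans along a single axis. The helper `tile_spans` is hypothetical and is not taken from FastVideo; the real tiling logic lives in the VAE implementation and may differ.

```python
def tile_spans(extent: int, tile: int = 256, stride: int = 192) -> list[tuple[int, int]]:
    """Return (start, end) spans that cover `extent` samples.

    Hypothetical helper: tiles start every `stride` samples and are
    clipped at the edge, so neighbours share `tile - stride` samples.
    """
    if extent <= 0 or tile <= 0 or stride <= 0:
        raise ValueError("extent, tile and stride must be positive")
    return [(start, min(start + tile, extent)) for start in range(0, extent, stride)]


# A 512-sample axis with the default height/width values from the signature.
spans = tile_spans(512)
overlap = 256 - 192  # samples shared by two neighbouring full-size tiles
```

With the defaults, each full-size tile shares 64 samples with its neighbour, which is the region a blending step would typically smooth over.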
Modules¶
fastvideo.configs.models.audio
¶
fastvideo.configs.models.dits
¶
Classes¶
fastvideo.configs.models.dits.Cosmos25VideoConfig
dataclass
¶
Cosmos25VideoConfig(arch_config: DiTArchConfig = Cosmos25ArchConfig(), prefix: str = 'Cosmos25', quant_config: QuantizationConfig | None = None)
fastvideo.configs.models.dits.Flux2Config
dataclass
¶
Flux2Config(arch_config: DiTArchConfig = Flux2ArchConfig(), prefix: str = 'Flux', quant_config: QuantizationConfig | None = None)
fastvideo.configs.models.dits.HunyuanGameCraftConfig
dataclass
¶
HunyuanGameCraftConfig(arch_config: DiTArchConfig = HunyuanGameCraftArchConfig(), prefix: str = 'HunyuanGameCraft', quant_config: QuantizationConfig | None = None)
fastvideo.configs.models.dits.LTX2VideoConfig
dataclass
¶
LTX2VideoConfig(arch_config: DiTArchConfig = LTX2VideoArchConfig(), prefix: str = 'ltx2', quant_config: QuantizationConfig | None = None)
fastvideo.configs.models.dits.LongCatVideoConfig
dataclass
¶
LongCatVideoConfig(arch_config: DiTArchConfig = LongCatVideoArchConfig(), prefix: str = 'longcat', quant_config: QuantizationConfig | None = None)
Modules¶
fastvideo.configs.models.dits.base
¶
Classes¶
fastvideo.configs.models.dits.base.DiTConfig
dataclass
¶
DiTConfig(arch_config: DiTArchConfig = DiTArchConfig(), prefix: str = '', quant_config: QuantizationConfig | None = None)
Bases: ModelConfig
fastvideo.configs.models.dits.base.DiTConfig.add_cli_args
staticmethod
¶
Add CLI arguments for DiTConfig fields
Source code in fastvideo/configs/models/dits/base.py
fastvideo.configs.models.dits.cosmos2_5
¶
Classes¶
fastvideo.configs.models.dits.cosmos2_5.Cosmos25ArchConfig
dataclass
¶Cosmos25ArchConfig(stacked_params_mapping: list[tuple[str, str, str]] = list(), _fsdp_shard_conditions: list = (lambda: [is_transformer_blocks])(), _compile_conditions: list = list(), param_names_mapping: dict = (lambda: {'^net\\.x_embedder\\.proj\\.1\\.(.*)$': 'patch_embed.proj.\\1', '^net\\.t_embedder\\.1\\.linear_1\\.(.*)$': 'time_embed.t_embedder.linear_1.\\1', '^net\\.t_embedder\\.1\\.linear_2\\.(.*)$': 'time_embed.t_embedder.linear_2.\\1', '^net\\.t_embedding_norm\\.(.*)$': 'time_embed.norm.\\1', '^net\\.crossattn_proj\\.0\\.weight$': 'crossattn_proj.0.weight', '^net\\.crossattn_proj\\.0\\.bias$': 'crossattn_proj.0.bias', '^net\\.blocks\\.(\\d+)\\.self_attn\\.q_proj\\.(.*)$': 'transformer_blocks.\\1.attn1.to_q.\\2', '^net\\.blocks\\.(\\d+)\\.self_attn\\.k_proj\\.(.*)$': 'transformer_blocks.\\1.attn1.to_k.\\2', '^net\\.blocks\\.(\\d+)\\.self_attn\\.v_proj\\.(.*)$': 'transformer_blocks.\\1.attn1.to_v.\\2', '^net\\.blocks\\.(\\d+)\\.self_attn\\.output_proj\\.(.*)$': 'transformer_blocks.\\1.attn1.to_out.\\2', '^net\\.blocks\\.(\\d+)\\.self_attn\\.q_norm\\.weight$': 'transformer_blocks.\\1.attn1.norm_q.weight', '^net\\.blocks\\.(\\d+)\\.self_attn\\.k_norm\\.weight$': 'transformer_blocks.\\1.attn1.norm_k.weight', '^net\\.blocks\\.(\\d+)\\.self_attn\\.q_norm\\._extra_state$': 'transformer_blocks.\\1.attn1.norm_q._extra_state', '^net\\.blocks\\.(\\d+)\\.self_attn\\.k_norm\\._extra_state$': 'transformer_blocks.\\1.attn1.norm_k._extra_state', '^net\\.blocks\\.(\\d+)\\.cross_attn\\.q_proj\\.(.*)$': 'transformer_blocks.\\1.attn2.to_q.\\2', '^net\\.blocks\\.(\\d+)\\.cross_attn\\.k_proj\\.(.*)$': 'transformer_blocks.\\1.attn2.to_k.\\2', '^net\\.blocks\\.(\\d+)\\.cross_attn\\.v_proj\\.(.*)$': 'transformer_blocks.\\1.attn2.to_v.\\2', '^net\\.blocks\\.(\\d+)\\.cross_attn\\.output_proj\\.(.*)$': 'transformer_blocks.\\1.attn2.to_out.\\2', '^net\\.blocks\\.(\\d+)\\.cross_attn\\.q_norm\\.weight$': 'transformer_blocks.\\1.attn2.norm_q.weight', 
'^net\\.blocks\\.(\\d+)\\.cross_attn\\.k_norm\\.weight$': 'transformer_blocks.\\1.attn2.norm_k.weight', '^net\\.blocks\\.(\\d+)\\.cross_attn\\.q_norm\\._extra_state$': 'transformer_blocks.\\1.attn2.norm_q._extra_state', '^net\\.blocks\\.(\\d+)\\.cross_attn\\.k_norm\\._extra_state$': 'transformer_blocks.\\1.attn2.norm_k._extra_state', '^net\\.blocks\\.(\\d+)\\.mlp\\.layer1\\.(.*)$': 'transformer_blocks.\\1.mlp.fc_in.\\2', '^net\\.blocks\\.(\\d+)\\.mlp\\.layer2\\.(.*)$': 'transformer_blocks.\\1.mlp.fc_out.\\2', '^net\\.blocks\\.(\\d+)\\.adaln_modulation_self_attn\\.1\\.(.*)$': 'transformer_blocks.\\1.adaln_modulation_self_attn.1.\\2', '^net\\.blocks\\.(\\d+)\\.adaln_modulation_self_attn\\.2\\.(.*)$': 'transformer_blocks.\\1.adaln_modulation_self_attn.2.\\2', '^net\\.blocks\\.(\\d+)\\.adaln_modulation_cross_attn\\.1\\.(.*)$': 'transformer_blocks.\\1.adaln_modulation_cross_attn.1.\\2', '^net\\.blocks\\.(\\d+)\\.adaln_modulation_cross_attn\\.2\\.(.*)$': 'transformer_blocks.\\1.adaln_modulation_cross_attn.2.\\2', '^net\\.blocks\\.(\\d+)\\.adaln_modulation_mlp\\.1\\.(.*)$': 'transformer_blocks.\\1.adaln_modulation_mlp.1.\\2', '^net\\.blocks\\.(\\d+)\\.adaln_modulation_mlp\\.2\\.(.*)$': 'transformer_blocks.\\1.adaln_modulation_mlp.2.\\2', '^net\\.blocks\\.(\\d+)\\.layer_norm_self_attn\\._extra_state$': 'transformer_blocks.\\1.norm1.norm._extra_state', '^net\\.blocks\\.(\\d+)\\.layer_norm_cross_attn\\._extra_state$': 'transformer_blocks.\\1.norm2.norm._extra_state', '^net\\.blocks\\.(\\d+)\\.layer_norm_mlp\\._extra_state$': 'transformer_blocks.\\1.norm3.norm._extra_state', '^net\\.final_layer\\.linear\\.(.*)$': 'final_layer.proj_out.\\1', '^net\\.final_layer\\.adaln_modulation\\.1\\.(.*)$': 'final_layer.linear_1.\\1', '^net\\.final_layer\\.adaln_modulation\\.2\\.(.*)$': 'final_layer.linear_2.\\1'})(), reverse_param_names_mapping: dict = dict(), lora_param_names_mapping: dict = (lambda: {'^transformer_blocks\\.(\\d+)\\.attn1\\.to_q\\.(.*)$': 
'transformer_blocks.\\1.attn1.to_q.\\2', '^transformer_blocks\\.(\\d+)\\.attn1\\.to_k\\.(.*)$': 'transformer_blocks.\\1.attn1.to_k.\\2', '^transformer_blocks\\.(\\d+)\\.attn1\\.to_v\\.(.*)$': 'transformer_blocks.\\1.attn1.to_v.\\2', '^transformer_blocks\\.(\\d+)\\.attn1\\.to_out\\.(.*)$': 'transformer_blocks.\\1.attn1.to_out.\\2', '^transformer_blocks\\.(\\d+)\\.attn2\\.to_q\\.(.*)$': 'transformer_blocks.\\1.attn2.to_q.\\2', '^transformer_blocks\\.(\\d+)\\.attn2\\.to_k\\.(.*)$': 'transformer_blocks.\\1.attn2.to_k.\\2', '^transformer_blocks\\.(\\d+)\\.attn2\\.to_v\\.(.*)$': 'transformer_blocks.\\1.attn2.to_v.\\2', '^transformer_blocks\\.(\\d+)\\.attn2\\.to_out\\.(.*)$': 'transformer_blocks.\\1.attn2.to_out.\\2', '^transformer_blocks\\.(\\d+)\\.mlp\\.(.*)$': 'transformer_blocks.\\1.mlp.\\2'})(), cast_prompt_embeds_to_dit_dtype: bool = False, _supported_attention_backends: tuple[AttentionBackendEnum, ...] = (SAGE_ATTN, FLASH_ATTN, TORCH_SDPA, VIDEO_SPARSE_ATTN, VMOBA_ATTN, SAGE_ATTN_THREE, SLA_ATTN, SAGE_SLA_ATTN), hidden_size: int = 0, num_attention_heads: int = 16, num_channels_latents: int = 0, in_channels: int = 16, out_channels: int = 16, exclude_lora_layers: list[str] = (lambda: ['embedder'])(), boundary_ratio: float | None = None, attention_head_dim: int = 128, num_layers: int = 28, mlp_ratio: float = 4.0, text_embed_dim: int = 1024, adaln_lora_dim: int = 256, use_adaln_lora: bool = True, max_size: tuple[int, int, int] = (128, 240, 240), patch_size: tuple[int, int, int] = (1, 2, 2), rope_scale: tuple[float, float, float] = (1.0, 3.0, 3.0), concat_padding_mask: bool = True, extra_pos_embed_type: str | None = None, use_crossattn_projection: bool = False, crossattn_proj_in_channels: int = 100352, rope_enable_fps_modulation: bool = True, qk_norm: str = 'rms_norm', eps: float = 1e-06)
Bases: DiTArchConfig
Configuration for Cosmos 2.5 architecture (MiniTrainDIT).
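The `param_names_mapping` default above is a dictionary of regular expressions mapped to replacement templates with `\1`, `\2` backreferences. A minimal sketch of how such a table can rename checkpoint keys follows. The two entries are copied from the signature above; the `rename_key` helper and its first-match behaviour are assumptions for illustration, not FastVideo's loader.

```python
import re

# Two entries copied from Cosmos25ArchConfig.param_names_mapping.
PARAM_NAMES_MAPPING = {
    r"^net\.blocks\.(\d+)\.self_attn\.q_proj\.(.*)$": r"transformer_blocks.\1.attn1.to_q.\2",
    r"^net\.final_layer\.linear\.(.*)$": r"final_layer.proj_out.\1",
}


def rename_key(key: str, mapping: dict[str, str]) -> str:
    """Rewrite `key` with the first pattern that matches (hypothetical helper)."""
    for pattern, replacement in mapping.items():
        if re.match(pattern, key):
            # Backreferences in the template carry the block index and suffix over.
            return re.sub(pattern, replacement, key)
    return key  # keys that match no pattern are left unchanged


renamed = rename_key("net.blocks.3.self_attn.q_proj.weight", PARAM_NAMES_MAPPING)
```

Here the block index `3` and the `weight` suffix are captured by the pattern and reinserted by the template, so one entry covers every block.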
fastvideo.configs.models.dits.cosmos2_5.Cosmos25VideoConfig
dataclass
¶
Cosmos25VideoConfig(arch_config: DiTArchConfig = Cosmos25ArchConfig(), prefix: str = 'Cosmos25', quant_config: QuantizationConfig | None = None)
fastvideo.configs.models.dits.cosmos2_5.Cosmos25_14BArchConfig
dataclass
¶Cosmos25_14BArchConfig(stacked_params_mapping: list[tuple[str, str, str]] = list(), _fsdp_shard_conditions: list = (lambda: [is_transformer_blocks])(), _compile_conditions: list = list(), param_names_mapping: dict = (lambda: {'^net\\.x_embedder\\.proj\\.1\\.(.*)$': 'patch_embed.proj.\\1', '^net\\.t_embedder\\.1\\.linear_1\\.(.*)$': 'time_embed.t_embedder.linear_1.\\1', '^net\\.t_embedder\\.1\\.linear_2\\.(.*)$': 'time_embed.t_embedder.linear_2.\\1', '^net\\.t_embedding_norm\\.(.*)$': 'time_embed.norm.\\1', '^net\\.crossattn_proj\\.0\\.weight$': 'crossattn_proj.0.weight', '^net\\.crossattn_proj\\.0\\.bias$': 'crossattn_proj.0.bias', '^net\\.blocks\\.(\\d+)\\.self_attn\\.q_proj\\.(.*)$': 'transformer_blocks.\\1.attn1.to_q.\\2', '^net\\.blocks\\.(\\d+)\\.self_attn\\.k_proj\\.(.*)$': 'transformer_blocks.\\1.attn1.to_k.\\2', '^net\\.blocks\\.(\\d+)\\.self_attn\\.v_proj\\.(.*)$': 'transformer_blocks.\\1.attn1.to_v.\\2', '^net\\.blocks\\.(\\d+)\\.self_attn\\.output_proj\\.(.*)$': 'transformer_blocks.\\1.attn1.to_out.\\2', '^net\\.blocks\\.(\\d+)\\.self_attn\\.q_norm\\.weight$': 'transformer_blocks.\\1.attn1.norm_q.weight', '^net\\.blocks\\.(\\d+)\\.self_attn\\.k_norm\\.weight$': 'transformer_blocks.\\1.attn1.norm_k.weight', '^net\\.blocks\\.(\\d+)\\.self_attn\\.q_norm\\._extra_state$': 'transformer_blocks.\\1.attn1.norm_q._extra_state', '^net\\.blocks\\.(\\d+)\\.self_attn\\.k_norm\\._extra_state$': 'transformer_blocks.\\1.attn1.norm_k._extra_state', '^net\\.blocks\\.(\\d+)\\.cross_attn\\.q_proj\\.(.*)$': 'transformer_blocks.\\1.attn2.to_q.\\2', '^net\\.blocks\\.(\\d+)\\.cross_attn\\.k_proj\\.(.*)$': 'transformer_blocks.\\1.attn2.to_k.\\2', '^net\\.blocks\\.(\\d+)\\.cross_attn\\.v_proj\\.(.*)$': 'transformer_blocks.\\1.attn2.to_v.\\2', '^net\\.blocks\\.(\\d+)\\.cross_attn\\.output_proj\\.(.*)$': 'transformer_blocks.\\1.attn2.to_out.\\2', '^net\\.blocks\\.(\\d+)\\.cross_attn\\.q_norm\\.weight$': 'transformer_blocks.\\1.attn2.norm_q.weight', 
'^net\\.blocks\\.(\\d+)\\.cross_attn\\.k_norm\\.weight$': 'transformer_blocks.\\1.attn2.norm_k.weight', '^net\\.blocks\\.(\\d+)\\.cross_attn\\.q_norm\\._extra_state$': 'transformer_blocks.\\1.attn2.norm_q._extra_state', '^net\\.blocks\\.(\\d+)\\.cross_attn\\.k_norm\\._extra_state$': 'transformer_blocks.\\1.attn2.norm_k._extra_state', '^net\\.blocks\\.(\\d+)\\.mlp\\.layer1\\.(.*)$': 'transformer_blocks.\\1.mlp.fc_in.\\2', '^net\\.blocks\\.(\\d+)\\.mlp\\.layer2\\.(.*)$': 'transformer_blocks.\\1.mlp.fc_out.\\2', '^net\\.blocks\\.(\\d+)\\.adaln_modulation_self_attn\\.1\\.(.*)$': 'transformer_blocks.\\1.adaln_modulation_self_attn.1.\\2', '^net\\.blocks\\.(\\d+)\\.adaln_modulation_self_attn\\.2\\.(.*)$': 'transformer_blocks.\\1.adaln_modulation_self_attn.2.\\2', '^net\\.blocks\\.(\\d+)\\.adaln_modulation_cross_attn\\.1\\.(.*)$': 'transformer_blocks.\\1.adaln_modulation_cross_attn.1.\\2', '^net\\.blocks\\.(\\d+)\\.adaln_modulation_cross_attn\\.2\\.(.*)$': 'transformer_blocks.\\1.adaln_modulation_cross_attn.2.\\2', '^net\\.blocks\\.(\\d+)\\.adaln_modulation_mlp\\.1\\.(.*)$': 'transformer_blocks.\\1.adaln_modulation_mlp.1.\\2', '^net\\.blocks\\.(\\d+)\\.adaln_modulation_mlp\\.2\\.(.*)$': 'transformer_blocks.\\1.adaln_modulation_mlp.2.\\2', '^net\\.blocks\\.(\\d+)\\.layer_norm_self_attn\\._extra_state$': 'transformer_blocks.\\1.norm1.norm._extra_state', '^net\\.blocks\\.(\\d+)\\.layer_norm_cross_attn\\._extra_state$': 'transformer_blocks.\\1.norm2.norm._extra_state', '^net\\.blocks\\.(\\d+)\\.layer_norm_mlp\\._extra_state$': 'transformer_blocks.\\1.norm3.norm._extra_state', '^net\\.final_layer\\.linear\\.(.*)$': 'final_layer.proj_out.\\1', '^net\\.final_layer\\.adaln_modulation\\.1\\.(.*)$': 'final_layer.linear_1.\\1', '^net\\.final_layer\\.adaln_modulation\\.2\\.(.*)$': 'final_layer.linear_2.\\1'})(), reverse_param_names_mapping: dict = dict(), lora_param_names_mapping: dict = (lambda: {'^transformer_blocks\\.(\\d+)\\.attn1\\.to_q\\.(.*)$': 
'transformer_blocks.\\1.attn1.to_q.\\2', '^transformer_blocks\\.(\\d+)\\.attn1\\.to_k\\.(.*)$': 'transformer_blocks.\\1.attn1.to_k.\\2', '^transformer_blocks\\.(\\d+)\\.attn1\\.to_v\\.(.*)$': 'transformer_blocks.\\1.attn1.to_v.\\2', '^transformer_blocks\\.(\\d+)\\.attn1\\.to_out\\.(.*)$': 'transformer_blocks.\\1.attn1.to_out.\\2', '^transformer_blocks\\.(\\d+)\\.attn2\\.to_q\\.(.*)$': 'transformer_blocks.\\1.attn2.to_q.\\2', '^transformer_blocks\\.(\\d+)\\.attn2\\.to_k\\.(.*)$': 'transformer_blocks.\\1.attn2.to_k.\\2', '^transformer_blocks\\.(\\d+)\\.attn2\\.to_v\\.(.*)$': 'transformer_blocks.\\1.attn2.to_v.\\2', '^transformer_blocks\\.(\\d+)\\.attn2\\.to_out\\.(.*)$': 'transformer_blocks.\\1.attn2.to_out.\\2', '^transformer_blocks\\.(\\d+)\\.mlp\\.(.*)$': 'transformer_blocks.\\1.mlp.\\2'})(), cast_prompt_embeds_to_dit_dtype: bool = False, _supported_attention_backends: tuple[AttentionBackendEnum, ...] = (SAGE_ATTN, FLASH_ATTN, TORCH_SDPA, VIDEO_SPARSE_ATTN, VMOBA_ATTN, SAGE_ATTN_THREE, SLA_ATTN, SAGE_SLA_ATTN), hidden_size: int = 0, num_attention_heads: int = 40, num_channels_latents: int = 0, in_channels: int = 16, out_channels: int = 16, exclude_lora_layers: list[str] = (lambda: ['embedder'])(), boundary_ratio: float | None = None, attention_head_dim: int = 128, num_layers: int = 36, mlp_ratio: float = 4.0, text_embed_dim: int = 1024, adaln_lora_dim: int = 256, use_adaln_lora: bool = True, max_size: tuple[int, int, int] = (128, 240, 240), patch_size: tuple[int, int, int] = (1, 2, 2), rope_scale: tuple[float, float, float] = (1.0, 3.0, 3.0), concat_padding_mask: bool = True, extra_pos_embed_type: str | None = None, use_crossattn_projection: bool = False, crossattn_proj_in_channels: int = 100352, rope_enable_fps_modulation: bool = True, qk_norm: str = 'rms_norm', eps: float = 1e-06)
fastvideo.configs.models.dits.cosmos2_5.Cosmos25_14BVideoConfig
dataclass
¶
Cosmos25_14BVideoConfig(arch_config: DiTArchConfig = Cosmos25_14BArchConfig(), prefix: str = 'Cosmos25', quant_config: QuantizationConfig | None = None)
fastvideo.configs.models.dits.flux_2
¶
Classes¶
fastvideo.configs.models.dits.flux_2.Flux2ArchConfig
dataclass
¶Flux2ArchConfig(stacked_params_mapping: list[tuple[str, str, str]] = list(), _fsdp_shard_conditions: list = list(), _compile_conditions: list = list(), param_names_mapping: dict = (lambda: {'transformer\\.(\\w*)\\.(.*)$': '\\1.\\2'})(), reverse_param_names_mapping: dict = dict(), lora_param_names_mapping: dict = dict(), cast_prompt_embeds_to_dit_dtype: bool = True, _supported_attention_backends: tuple[AttentionBackendEnum, ...] = (SAGE_ATTN, FLASH_ATTN, TORCH_SDPA, VIDEO_SPARSE_ATTN, VMOBA_ATTN, SAGE_ATTN_THREE, SLA_ATTN, SAGE_SLA_ATTN), hidden_size: int = 0, num_attention_heads: int = 24, num_channels_latents: int = 0, in_channels: int = 64, out_channels: int | None = None, exclude_lora_layers: list[str] = list(), boundary_ratio: float | None = None, patch_size: int = 1, num_layers: int = 19, num_single_layers: int = 38, attention_head_dim: int = 128, joint_attention_dim: int = 4096, timestep_guidance_channels: int = 256, mlp_ratio: float = 3.0, axes_dims_rope: tuple[int, ...] = (32, 32, 32, 32), rope_theta: int = 2000, eps: float = 1e-06, guidance_embeds: bool = True, ff_context_swiglu_fp32: bool = False)
Bases: DiTArchConfig
Architecture configuration for Flux2 transformer model.
fastvideo.configs.models.dits.flux_2.Flux2ArchConfig.update_from_weight_keys
¶
Infer num_layers and num_single_layers from the checkpoint weight keys, so that the model is built with the same number of blocks as the checkpoint contains.
Source code in fastvideo/configs/models/dits/flux_2.py
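The idea of inferring block counts from weight keys can be sketched as follows. This is not the body of `update_from_weight_keys`, which is not shown on this page. The key prefixes `transformer_blocks` and `single_transformer_blocks` and the helper `count_blocks` are assumptions chosen for illustration.

```python
import re


def count_blocks(weight_keys: list[str], prefix: str) -> int:
    """Count blocks as (highest index + 1) among keys like `<prefix>.<i>.<rest>`.

    Hypothetical helper. The prefix must start the key or follow a dot, so
    `transformer_blocks` does not match inside `single_transformer_blocks`.
    """
    pattern = re.compile(rf"(?:^|\.){re.escape(prefix)}\.(\d+)\.")
    highest = -1
    for key in weight_keys:
        match = pattern.search(key)
        if match:
            highest = max(highest, int(match.group(1)))
    return highest + 1


# Illustrative keys only; real checkpoint key names may differ.
keys = [
    "transformer_blocks.0.attn.to_q.weight",
    "transformer_blocks.1.attn.to_q.weight",
    "single_transformer_blocks.0.attn.to_q.weight",
    "single_transformer_blocks.1.attn.to_q.weight",
    "single_transformer_blocks.2.attn.to_q.weight",
]
num_layers = count_blocks(keys, "transformer_blocks")
num_single_layers = count_blocks(keys, "single_transformer_blocks")
```

Deriving the counts from the checkpoint avoids a shape mismatch when the weights hold fewer or more blocks than the defaults of 19 and 38.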
fastvideo.configs.models.dits.flux_2.Flux2Config
dataclass
¶
Flux2Config(arch_config: DiTArchConfig = Flux2ArchConfig(), prefix: str = 'Flux', quant_config: QuantizationConfig | None = None)
Functions:¶
fastvideo.configs.models.dits.gen3c
¶
Classes¶
fastvideo.configs.models.dits.gen3c.Gen3CArchConfig
dataclass
¶Gen3CArchConfig(stacked_params_mapping: list[tuple[str, str, str]] = list(), _fsdp_shard_conditions: list = (lambda: [is_transformer_blocks])(), _compile_conditions: list = list(), param_names_mapping: dict = (lambda: {'^net\\.x_embedder\\.proj\\.1\\.(.*)$': 'patch_embed.proj.\\1', '^net\\.t_embedder\\.0\\.(.*)$': 'time_embed.time_proj.\\1', '^net\\.t_embedder\\.1\\.linear_1\\.(.*)$': 'time_embed.t_embedder.linear_1.\\1', '^net\\.t_embedder\\.1\\.linear_2\\.(.*)$': 'time_embed.t_embedder.linear_2.\\1', '^net\\.augment_sigma_embedder\\.0\\.(.*)$': 'augment_sigma_embed.time_proj.\\1', '^net\\.augment_sigma_embedder\\.1\\.linear_1\\.(.*)$': 'augment_sigma_embed.t_embedder.linear_1.\\1', '^net\\.augment_sigma_embedder\\.1\\.linear_2\\.(.*)$': 'augment_sigma_embed.t_embedder.linear_2.\\1', '^net\\.affline_norm\\.(.*)$': 'affine_norm.\\1', '^net\\.extra_pos_embedder\\.pos_emb_t$': 'learnable_pos_embed.pos_emb_t', '^net\\.extra_pos_embedder\\.pos_emb_h$': 'learnable_pos_embed.pos_emb_h', '^net\\.extra_pos_embedder\\.pos_emb_w$': 'learnable_pos_embed.pos_emb_w', '^net\\.blocks\\.block(\\d+)\\.blocks\\.0\\.block\\.attn\\.to_q\\.0\\.(.*)$': 'transformer_blocks.\\1.attn1.to_q.\\2', '^net\\.blocks\\.block(\\d+)\\.blocks\\.0\\.block\\.attn\\.to_q\\.1\\.(.*)$': 'transformer_blocks.\\1.attn1.norm_q.\\2', '^net\\.blocks\\.block(\\d+)\\.blocks\\.0\\.block\\.attn\\.to_k\\.0\\.(.*)$': 'transformer_blocks.\\1.attn1.to_k.\\2', '^net\\.blocks\\.block(\\d+)\\.blocks\\.0\\.block\\.attn\\.to_k\\.1\\.(.*)$': 'transformer_blocks.\\1.attn1.norm_k.\\2', '^net\\.blocks\\.block(\\d+)\\.blocks\\.0\\.block\\.attn\\.to_v\\.0\\.(.*)$': 'transformer_blocks.\\1.attn1.to_v.\\2', '^net\\.blocks\\.block(\\d+)\\.blocks\\.0\\.block\\.attn\\.to_out\\.0\\.(.*)$': 'transformer_blocks.\\1.attn1.to_out.\\2', '^net\\.blocks\\.block(\\d+)\\.blocks\\.0\\.adaLN_modulation\\.(.*)$': 'transformer_blocks.\\1.adaln_modulation_self_attn.\\2', '^net\\.blocks\\.block(\\d+)\\.blocks\\.1\\.block\\.attn\\.to_q\\.0\\.(.*)$': 
'transformer_blocks.\\1.attn2.to_q.\\2', '^net\\.blocks\\.block(\\d+)\\.blocks\\.1\\.block\\.attn\\.to_q\\.1\\.(.*)$': 'transformer_blocks.\\1.attn2.norm_q.\\2', '^net\\.blocks\\.block(\\d+)\\.blocks\\.1\\.block\\.attn\\.to_k\\.0\\.(.*)$': 'transformer_blocks.\\1.attn2.to_k.\\2', '^net\\.blocks\\.block(\\d+)\\.blocks\\.1\\.block\\.attn\\.to_k\\.1\\.(.*)$': 'transformer_blocks.\\1.attn2.norm_k.\\2', '^net\\.blocks\\.block(\\d+)\\.blocks\\.1\\.block\\.attn\\.to_v\\.0\\.(.*)$': 'transformer_blocks.\\1.attn2.to_v.\\2', '^net\\.blocks\\.block(\\d+)\\.blocks\\.1\\.block\\.attn\\.to_out\\.0\\.(.*)$': 'transformer_blocks.\\1.attn2.to_out.\\2', '^net\\.blocks\\.block(\\d+)\\.blocks\\.1\\.adaLN_modulation\\.(.*)$': 'transformer_blocks.\\1.adaln_modulation_cross_attn.\\2', '^net\\.blocks\\.block(\\d+)\\.blocks\\.2\\.block\\.layer1\\.(.*)$': 'transformer_blocks.\\1.mlp.fc_in.\\2', '^net\\.blocks\\.block(\\d+)\\.blocks\\.2\\.block\\.layer2\\.(.*)$': 'transformer_blocks.\\1.mlp.fc_out.\\2', '^net\\.blocks\\.block(\\d+)\\.blocks\\.2\\.adaLN_modulation\\.(.*)$': 'transformer_blocks.\\1.adaln_modulation_mlp.\\2', '^net\\.final_layer\\.linear\\.(.*)$': 'final_layer.proj_out.\\1', '^net\\.final_layer\\.adaLN_modulation\\.(.*)$': 'final_layer.adaln_modulation.\\1'})(), reverse_param_names_mapping: dict = dict(), lora_param_names_mapping: dict = (lambda: {'^transformer_blocks\\.(\\d+)\\.attn1\\.to_q\\.(.*)$': 'transformer_blocks.\\1.attn1.to_q.\\2', '^transformer_blocks\\.(\\d+)\\.attn1\\.to_k\\.(.*)$': 'transformer_blocks.\\1.attn1.to_k.\\2', '^transformer_blocks\\.(\\d+)\\.attn1\\.to_v\\.(.*)$': 'transformer_blocks.\\1.attn1.to_v.\\2', '^transformer_blocks\\.(\\d+)\\.attn1\\.to_out\\.(.*)$': 'transformer_blocks.\\1.attn1.to_out.\\2', '^transformer_blocks\\.(\\d+)\\.attn2\\.to_q\\.(.*)$': 'transformer_blocks.\\1.attn2.to_q.\\2', '^transformer_blocks\\.(\\d+)\\.attn2\\.to_k\\.(.*)$': 'transformer_blocks.\\1.attn2.to_k.\\2', '^transformer_blocks\\.(\\d+)\\.attn2\\.to_v\\.(.*)$': 
'transformer_blocks.\\1.attn2.to_v.\\2', '^transformer_blocks\\.(\\d+)\\.attn2\\.to_out\\.(.*)$': 'transformer_blocks.\\1.attn2.to_out.\\2', '^transformer_blocks\\.(\\d+)\\.mlp\\.(.*)$': 'transformer_blocks.\\1.mlp.\\2'})(), cast_prompt_embeds_to_dit_dtype: bool = False, _supported_attention_backends: tuple[AttentionBackendEnum, ...] = (SAGE_ATTN, FLASH_ATTN, TORCH_SDPA, VIDEO_SPARSE_ATTN, VMOBA_ATTN, SAGE_ATTN_THREE, SLA_ATTN, SAGE_SLA_ATTN), hidden_size: int = 0, num_attention_heads: int = 32, num_channels_latents: int = 0, in_channels: int = 16, out_channels: int = 16, exclude_lora_layers: list[str] = (lambda: ['embedder'])(), boundary_ratio: float | None = None, CHANNELS_PER_BUFFER: int = 32, frame_buffer_max: int = 2, attention_head_dim: int = 128, num_layers: int = 28, mlp_ratio: float = 4.0, text_embed_dim: int = 1024, adaln_lora_dim: int = 256, use_adaln_lora: bool = True, add_augment_sigma_embedding: bool = False, max_size: tuple[int, int, int] = (128, 240, 240), patch_size: tuple[int, int, int] = (1, 2, 2), rope_scale: tuple[float, float, float] = (2.0, 1.0, 1.0), extra_pos_embed_type: str = 'learnable', concat_padding_mask: bool = True, use_crossattn_projection: bool = False, rope_enable_fps_modulation: bool = True, qk_norm: str = 'rms_norm', eps: float = 1e-06, affine_emb_norm: bool = True, block_x_format: str = 'THWBD')
Bases: DiTArchConfig
Configuration for GEN3C architecture (VideoExtendGeneralDIT).
fastvideo.configs.models.dits.gen3c.Gen3CVideoConfig
dataclass
¶
Gen3CVideoConfig(arch_config: DiTArchConfig = Gen3CArchConfig(), prefix: str = 'Gen3C', quant_config: QuantizationConfig | None = None)
fastvideo.configs.models.dits.hunyuangamecraft
¶
Configuration for HunyuanGameCraft transformer model.
HunyuanGameCraft extends HunyuanVideo with:
1. CameraNet for camera/action conditioning
2. 33 input channels (16 latent + 16 gt_latent + 1 mask)
3. Mask-based conditioning for autoregressive generation
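The 33-channel input can be illustrated at the shape level. The sketch below assumes the three inputs are concatenated along the channel axis in the order latent, gt_latent, mask; that order, the `(C, T, H, W)` layout, and the helper `concat_channel_shape` are assumptions and are not stated on this page.

```python
def concat_channel_shape(*shapes: tuple[int, int, int, int]) -> tuple[int, int, int, int]:
    """Shape of (C, T, H, W) inputs after concatenation along the channel axis.

    Hypothetical helper: all inputs must agree on T, H and W.
    """
    first = shapes[0]
    for shape in shapes[1:]:
        if shape[1:] != first[1:]:
            raise ValueError(f"non-channel dims differ: {shape[1:]} vs {first[1:]}")
    return (sum(shape[0] for shape in shapes), *first[1:])


latent = (16, 9, 32, 32)     # noisy latent
gt_latent = (16, 9, 32, 32)  # ground-truth latent used for conditioning
mask = (1, 9, 32, 32)        # conditioning mask
model_input = concat_channel_shape(latent, gt_latent, mask)
```

The resulting channel count matches the `in_channels: int = 33` default in HunyuanGameCraftArchConfig, while `out_channels` stays at 16 because only the latent is predicted.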
Classes¶
fastvideo.configs.models.dits.hunyuangamecraft.HunyuanGameCraftArchConfig
dataclass
¶HunyuanGameCraftArchConfig(stacked_params_mapping: list[tuple[str, str, str]] = list(), _fsdp_shard_conditions: list = (lambda: [is_double_block, is_single_block, is_refiner_block, is_camera_net])(), _compile_conditions: list = (lambda: [is_double_block, is_single_block, is_txt_in])(), param_names_mapping: dict = (lambda: {'^(.*)\\.img_mlp\\.fc1\\.(.*)$': '\\1.img_mlp.fc_in.\\2', '^(.*)\\.img_mlp\\.fc2\\.(.*)$': '\\1.img_mlp.fc_out.\\2', '^(.*)\\.txt_mlp\\.fc1\\.(.*)$': '\\1.txt_mlp.fc_in.\\2', '^(.*)\\.txt_mlp\\.fc2\\.(.*)$': '\\1.txt_mlp.fc_out.\\2', '^single_blocks\\.(\\d+)\\.mlp\\.fc1\\.(.*)$': 'single_blocks.\\1.mlp.fc_in.\\2', '^single_blocks\\.(\\d+)\\.mlp\\.fc2\\.(.*)$': 'single_blocks.\\1.mlp.fc_out.\\2', '^txt_in\\.individual_token_refiner\\.blocks\\.(\\d+)\\.(.*)$': 'txt_in.refiner_blocks.\\1.\\2', '^vector_in\\.in_layer\\.(.*)$': 'vector_in.fc_in.\\1', '^vector_in\\.out_layer\\.(.*)$': 'vector_in.fc_out.\\1', '^time_in\\.mlp\\.0\\.(.*)$': 'time_in.mlp.fc_in.\\1', '^time_in\\.mlp\\.2\\.(.*)$': 'time_in.mlp.fc_out.\\1', '^guidance_in\\.mlp\\.0\\.(.*)$': 'guidance_in.mlp.fc_in.\\1', '^guidance_in\\.mlp\\.2\\.(.*)$': 'guidance_in.mlp.fc_out.\\1', '^final_layer\\.adaLN_modulation\\.1\\.(.*)$': 'final_layer.adaLN_modulation.linear.\\1', '^txt_in\\.refiner_blocks\\.(\\d+)\\.mlp\\.fc1\\.(.*)$': 'txt_in.refiner_blocks.\\1.mlp.fc_in.\\2', '^txt_in\\.refiner_blocks\\.(\\d+)\\.mlp\\.fc2\\.(.*)$': 'txt_in.refiner_blocks.\\1.mlp.fc_out.\\2'})(), reverse_param_names_mapping: dict = (lambda: {})(), lora_param_names_mapping: dict = dict(), cast_prompt_embeds_to_dit_dtype: bool = False, _supported_attention_backends: tuple[AttentionBackendEnum, ...] 
= (SAGE_ATTN, FLASH_ATTN, TORCH_SDPA, VIDEO_SPARSE_ATTN, VMOBA_ATTN, SAGE_ATTN_THREE, SLA_ATTN, SAGE_SLA_ATTN), hidden_size: int = 0, num_attention_heads: int = 24, num_channels_latents: int = 0, in_channels: int = 33, out_channels: int = 16, exclude_lora_layers: list[str] = (lambda: ['img_in', 'txt_in', 'time_in', 'vector_in', 'camera_net'])(), boundary_ratio: float | None = None, _fastvideo_version: str = '0.1.0', camera_net: bool = True, patch_size: int | tuple[int, int, int] = 2, patch_size_t: int = 1, attention_head_dim: int = 128, mlp_ratio: float = 4.0, num_layers: int = 20, num_single_layers: int = 40, num_refiner_layers: int = 2, rope_axes_dim: tuple[int, int, int] = (16, 56, 56), guidance_embeds: bool = False, dtype: dtype | None = None, text_embed_dim: int = 4096, pooled_projection_dim: int = 768, rope_theta: int = 256, qk_norm: str = 'rms_norm', camera_in_channels: int = 6, camera_downscale_coef: int = 8, camera_out_channels: int = 16)
Bases: DiTArchConfig
Architecture config for HunyuanGameCraft transformer.
fastvideo.configs.models.dits.hunyuangamecraft.HunyuanGameCraftConfig
dataclass
¶
HunyuanGameCraftConfig(arch_config: DiTArchConfig = HunyuanGameCraftArchConfig(), prefix: str = 'HunyuanGameCraft', quant_config: QuantizationConfig | None = None)
fastvideo.configs.models.dits.longcat
¶
LongCat Video DiT configuration for native FastVideo implementation.
Classes¶
fastvideo.configs.models.dits.longcat.LongCatVideoArchConfig
dataclass
¶LongCatVideoArchConfig(stacked_params_mapping: list[tuple[str, str, str]] = list(), _fsdp_shard_conditions: list = (lambda: [is_longcat_blocks])(), _compile_conditions: list = (lambda: [is_longcat_blocks])(), param_names_mapping: dict = (lambda: {'^x_embedder\\.(.*)$': 'patch_embed.\\1', '^t_embedder\\.mlp\\.0\\.(.*)$': 'time_embedder.linear_1.\\1', '^t_embedder\\.mlp\\.2\\.(.*)$': 'time_embedder.linear_2.\\1', '^y_embedder\\.y_proj\\.0\\.(.*)$': 'caption_embedder.linear_1.\\1', '^y_embedder\\.y_proj\\.2\\.(.*)$': 'caption_embedder.linear_2.\\1', '^blocks\\.(\\d+)\\.adaLN_modulation\\.1\\.(.*)$': 'blocks.\\1.adaln_linear_1.\\2', '^blocks\\.(\\d+)\\.mod_norm_attn\\.(.*)$': 'blocks.\\1.norm_attn.\\2', '^blocks\\.(\\d+)\\.mod_norm_ffn\\.(.*)$': 'blocks.\\1.norm_ffn.\\2', '^blocks\\.(\\d+)\\.pre_crs_attn_norm\\.(.*)$': 'blocks.\\1.norm_cross.\\2', '^blocks\\.(\\d+)\\.attn\\.qkv\\.(.*)$': 'blocks.\\1.self_attn.qkv_fused.\\2', '^blocks\\.(\\d+)\\.attn\\.proj\\.(.*)$': 'blocks.\\1.self_attn.to_out.\\2', '^blocks\\.(\\d+)\\.attn\\.q_norm\\.(.*)$': 'blocks.\\1.self_attn.q_norm.\\2', '^blocks\\.(\\d+)\\.attn\\.k_norm\\.(.*)$': 'blocks.\\1.self_attn.k_norm.\\2', '^blocks\\.(\\d+)\\.cross_attn\\.q_linear\\.(.*)$': 'blocks.\\1.cross_attn.to_q.\\2', '^blocks\\.(\\d+)\\.cross_attn\\.kv_linear\\.(.*)$': 'blocks.\\1.cross_attn.kv_fused.\\2', '^blocks\\.(\\d+)\\.cross_attn\\.proj\\.(.*)$': 'blocks.\\1.cross_attn.to_out.\\2', '^blocks\\.(\\d+)\\.cross_attn\\.q_norm\\.(.*)$': 'blocks.\\1.cross_attn.q_norm.\\2', '^blocks\\.(\\d+)\\.cross_attn\\.k_norm\\.(.*)$': 'blocks.\\1.cross_attn.k_norm.\\2', '^blocks\\.(\\d+)\\.ffn\\.w1\\.(.*)$': 'blocks.\\1.ffn.w1.\\2', '^blocks\\.(\\d+)\\.ffn\\.w2\\.(.*)$': 'blocks.\\1.ffn.w2.\\2', '^blocks\\.(\\d+)\\.ffn\\.w3\\.(.*)$': 'blocks.\\1.ffn.w3.\\2', '^final_layer\\.adaLN_modulation\\.1\\.(.*)$': 'final_layer.adaln_linear.\\1', '^final_layer\\.norm_final\\.(.*)$': 'final_layer.norm.\\1', '^final_layer\\.linear\\.(.*)$': 'final_layer.proj.\\1'})(), 
reverse_param_names_mapping: dict = (lambda: {})(), lora_param_names_mapping: dict = (lambda: {})(), cast_prompt_embeds_to_dit_dtype: bool = False, _supported_attention_backends: tuple = (lambda: (FLASH_ATTN, TORCH_SDPA))(), hidden_size: int = 4096, num_attention_heads: int = 32, num_channels_latents: int = 16, in_channels: int = 16, out_channels: int = 16, exclude_lora_layers: list[str] = (lambda: [])(), boundary_ratio: float | None = None, depth: int = 48, attention_head_dim: int = 128, patch_size: tuple[int, int, int] = (1, 2, 2), caption_channels: int = 4096, adaln_tembed_dim: int = 512, frequency_embedding_size: int = 256, mlp_ratio: int = 4, text_tokens_zero_pad: bool = True, enable_bsa: bool = False, bsa_params: dict | None = (lambda: {'sparsity': 0.9375, 'cdf_threshold': None, 'chunk_3d_shape_q': [4, 4, 4], 'chunk_3d_shape_k': [4, 4, 4]})())
Bases: DiTArchConfig
Architecture configuration for native LongCat Video DiT.
fastvideo.configs.models.dits.longcat.LongCatVideoConfig
dataclass
¶
LongCatVideoConfig(arch_config: DiTArchConfig = LongCatVideoArchConfig(), prefix: str = 'longcat', quant_config: QuantizationConfig | None = None)
Functions:¶
fastvideo.configs.models.dits.longcat.is_longcat_blocks
¶
fastvideo.configs.models.dits.ltx2
¶
LTX-2 Transformer configuration for native FastVideo integration.
Classes¶
fastvideo.configs.models.dits.ltx2.LTX2VideoArchConfig
dataclass
¶LTX2VideoArchConfig(stacked_params_mapping: list[tuple[str, str, str]] = list(), _fsdp_shard_conditions: list = (lambda: [is_ltx2_blocks])(), _compile_conditions: list = (lambda: [is_ltx2_blocks])(), param_names_mapping: dict = (lambda: {'^model\\.diffusion_model\\.(.*)$': 'model.\\1', '^diffusion_model\\.(.*)$': 'model.\\1', '^model\\.(.*)$': 'model.\\1', '^(.*)$': 'model.\\1'})(), reverse_param_names_mapping: dict = (lambda: {})(), lora_param_names_mapping: dict = (lambda: {})(), cast_prompt_embeds_to_dit_dtype: bool = False, _supported_attention_backends: tuple[AttentionBackendEnum, ...] = (SAGE_ATTN, FLASH_ATTN, TORCH_SDPA, VIDEO_SPARSE_ATTN, VMOBA_ATTN, SAGE_ATTN_THREE, SLA_ATTN, SAGE_SLA_ATTN), hidden_size: int = 0, num_attention_heads: int = 32, num_channels_latents: int = 128, in_channels: int | None = None, out_channels: int | None = None, exclude_lora_layers: list[str] = list(), boundary_ratio: float | None = None, attention_head_dim: int = 128, num_layers: int = 48, cross_attention_dim: int = 4096, caption_channels: int = 3840, norm_eps: float = 1e-06, attention_type: str = 'default', rope_type: str = 'split', double_precision_rope: bool = True, cross_attention_adaln: bool = False, caption_proj_before_connector: bool = False, positional_embedding_theta: float = 10000.0, positional_embedding_max_pos: list[int] = (lambda: [20, 2048, 2048])(), timestep_scale_multiplier: int = 1000, use_middle_indices_grid: bool = True, patch_size: tuple[int, int, int] = (1, 1, 1), audio_num_attention_heads: int = 32, audio_attention_head_dim: int = 64, audio_in_channels: int = 128, audio_out_channels: int = 128, audio_cross_attention_dim: int = 2048, audio_positional_embedding_max_pos: list[int] = (lambda: [20])(), av_ca_timestep_scale_multiplier: int = 1, apply_gated_attention: bool = False, caption_projection_first_linear: bool = True, caption_proj_input_norm: bool = True, caption_projection_second_linear: bool = True, connector_num_attention_heads: int = 30, 
connector_attention_head_dim: int = 128, connector_num_layers: int = 2, audio_connector_num_attention_heads: int = 30, audio_connector_attention_head_dim: int = 128, audio_connector_num_layers: int = 2, stg_block_idx: int | None = None)
Bases: DiTArchConfig
Architecture configuration for LTX-2 video transformer.
fastvideo.configs.models.dits.ltx2.LTX2VideoConfig
dataclass
¶
LTX2VideoConfig(arch_config: DiTArchConfig = LTX2VideoArchConfig(), prefix: str = 'ltx2', quant_config: QuantizationConfig | None = None)
fastvideo.configs.models.dits.magi_human
¶
Architecture / model config for the daVinci-MagiHuman DiT.
The MagiHuman base DiT is a 15B-parameter single-stream transformer that jointly denoises video, audio, and text tokens in one flat sequence. Layout details verified against GAIR/daVinci-MagiHuman's base/ shards (2026-04-24).
This file captures only configuration. The module implementation lives in fastvideo/models/dits/magi_human.py and the pipeline wiring in fastvideo/pipelines/basic/magi_human/.
Classes¶
fastvideo.configs.models.dits.magi_human.MagiHumanArchConfig
dataclass
¶MagiHumanArchConfig(stacked_params_mapping: list[tuple[str, str, str]] = list(), _fsdp_shard_conditions: list = (lambda: [_is_block_layer])(), _compile_conditions: list = list(), param_names_mapping: dict = dict(), reverse_param_names_mapping: dict = dict(), lora_param_names_mapping: dict = dict(), cast_prompt_embeds_to_dit_dtype: bool = False, _supported_attention_backends: tuple[AttentionBackendEnum, ...] = (SAGE_ATTN, FLASH_ATTN, TORCH_SDPA, VIDEO_SPARSE_ATTN, VMOBA_ATTN, SAGE_ATTN_THREE, SLA_ATTN, SAGE_SLA_ATTN), hidden_size: int = 5120, num_attention_heads: int = 0, num_channels_latents: int = 0, in_channels: int = 0, out_channels: int = 0, exclude_lora_layers: list[str] = list(), boundary_ratio: float | None = None, num_layers: int = 40, head_dim: int = 128, num_query_groups: int = 8, video_in_channels: int = 192, audio_in_channels: int = 64, text_in_channels: int = 3584, mm_layers: tuple[int, ...] = (0, 1, 2, 3, 36, 37, 38, 39), local_attn_layers: tuple[int, ...] = (), gelu7_layers: tuple[int, ...] = (0, 1, 2, 3), post_norm_layers: tuple[int, ...] = (), enable_attn_gating: bool = True, activation_type: str = 'swiglu7', patch_size: tuple[int, int, int] = (1, 2, 2), spatial_rope_interpolation: str = 'extra', tread_selection_rate: float = 0.5, tread_start_layer_idx: int = 2, tread_end_layer_idx: int = 25, num_heads_kv: int = 0)
Bases: DiTArchConfig
MagiHuman base DiT architecture constants.
Scope contract: fields here must match the transformer/config.json
emitted by scripts/checkpoint_conversion/convert_magi_human_to_diffusers.py
1:1, and both are sourced from the upstream Python reference
inference/common/config.py::ModelConfig (the HF root config.json
is empty so the Python source is canonical). Pipeline-level knobs
(VAE stride, fps, num_inference_steps, CFG scales, flow_shift,
t5_gemma_target_length) and data-proxy knobs (coords_style,
frame_receptive_field, ref_audio_offset, text_offset) live on
MagiHumanBaseConfig, NOT here.
param_names_mapping is intentionally empty: the FastVideo implementation
keeps the same module tree as the reference (adapter.*,
block.layers.<i>.*, final_linear_{video,audio}.*,
final_norm_{video,audio}.*), so converted weights load directly.
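The index tuples in the MagiHumanArchConfig signature are easier to read when expanded per layer. The sketch below does that with the default values. The interpretation is an assumption: it treats mm_layers as the layers that get modality-specific handling and gelu7_layers as the layers that swap the default activation_type for 'gelu7'. The helper layer_plan is hypothetical and is not part of FastVideo.

```python
# Defaults copied from the MagiHumanArchConfig signature.
NUM_LAYERS = 40
MM_LAYERS = (0, 1, 2, 3, 36, 37, 38, 39)
GELU7_LAYERS = (0, 1, 2, 3)
ACTIVATION_TYPE = "swiglu7"


def layer_plan(index: int) -> dict:
    """Hypothetical helper: describe one layer under the assumed semantics."""
    if not 0 <= index < NUM_LAYERS:
        raise IndexError(f"layer index {index} outside 0..{NUM_LAYERS - 1}")
    return {
        # Assumption: membership in mm_layers marks a modality-specific layer.
        "mm": index in MM_LAYERS,
        # Assumption: gelu7_layers override the default activation.
        "activation": "gelu7" if index in GELU7_LAYERS else ACTIVATION_TYPE,
    }


plans = [layer_plan(i) for i in range(NUM_LAYERS)]
for i in (0, 20, 37):
    print(i, plans[i])
```

With the defaults, the first four layers are flagged in both tuples, the last four only in mm_layers, and the 32 layers in between in neither.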
fastvideo.configs.models.dits.stable_audio
¶
Config for the Stable Audio Open 1.0 DiT.
Note: the SA pipeline bypasses the standard ComposedPipelineBase
component loader because the published HF repo ships a single monolithic
model.safetensors (no Diffusers-style model_index.json or
per-subfolder layout). The arch fields and param_names_mapping here
document the architecture and key remap so the same conventions used by
the rest of the DiT family apply (FSDP shard conditions, supported
attention backends, future loader integrations) — they are not currently
consumed by fastvideo/models/loader/fsdp_load.py for SA.
Classes¶
fastvideo.configs.models.encoders
¶
Classes¶
fastvideo.configs.models.encoders.Mistral3TextConfig
dataclass
¶
Mistral3TextConfig(arch_config: TextEncoderArchConfig = Mistral3TextArchConfig(), prefix: str = 'mistral3', quant_config: QuantizationConfig | None = None, lora_config: Any | None = None, is_chat_model: bool = True, treat_empty_as_dot: bool = False)
Bases: TextEncoderConfig
Top-level config for the Mistral3 text encoder used by full Flux2.
fastvideo.configs.models.encoders.Qwen3TextConfig
dataclass
¶
Qwen3TextConfig(arch_config: TextEncoderArchConfig = Qwen3TextArchConfig(), prefix: str = 'qwen3', quant_config: QuantizationConfig | None = None, lora_config: Any | None = None, is_chat_model: bool = True, treat_empty_as_dot: bool = False)
Bases: TextEncoderConfig
Top-level config for Qwen3 text encoder.
fastvideo.configs.models.encoders.Reason1ArchConfig
dataclass
¶
Reason1ArchConfig(stacked_params_mapping: list[tuple[str, str, str]] = list(), architectures: list[str] = (lambda: ['Qwen2_5_VLForConditionalGeneration'])(), _supported_attention_backends: tuple[AttentionBackendEnum, ...] = (FLASH_ATTN, TORCH_SDPA), output_hidden_states: bool = True, use_return_dict: bool = True, vocab_size: int = 152064, hidden_size: int = 3584, num_hidden_layers: int = 28, num_attention_heads: int = 28, pad_token_id: int = 151643, eos_token_id: int = 151645, text_len: int = 512, hidden_state_skip_layer: int = 0, decoder_start_token_id: int = 0, output_past: bool = True, scalable_attention: bool = True, tie_word_embeddings: bool = False, tokenizer_kwargs: dict[str, Any] = dict(), _fsdp_shard_conditions: list = (lambda: [])(), require_processor: bool = False, model_type: str = 'qwen2_5_vl', num_key_value_heads: int = 4, intermediate_size: int = 18944, bos_token_id: int = 151643, image_token_id: int = 151655, video_token_id: int = 151656, vision_token_id: int = 151654, vision_start_token_id: int = 151652, vision_end_token_id: int = 151653, vision_config: dict[str, Any] | None = None, rope_theta: float = 1000000.0, rope_scaling: dict[str, Any] | None = (lambda: {'type': 'mrope', 'mrope_section': [16, 24, 24]})(), max_position_embeddings: int = 128000, max_window_layers: int = 28, embedding_concat_strategy: str = 'mean_pooling', n_layers_per_group: int = 5, num_embedding_padding_tokens: int = 512, attention_dropout: float = 0.0, hidden_act: str = 'silu', initializer_range: float = 0.02, rms_norm_eps: float = 1e-06, use_sliding_window: bool = False, sliding_window: int = 32768, use_cache: bool = False, torch_dtype: str = 'bfloat16', _attn_implementation: str = 'flash_attention_2')
Bases: TextEncoderArchConfig
Architecture settings (defaults match Qwen2.5-VL-7B-Instruct).
fastvideo.configs.models.encoders.Reason1Config
dataclass
¶
Reason1Config(arch_config: Reason1ArchConfig = Reason1ArchConfig(), prefix: str = '', quant_config: QuantizationConfig | None = None, lora_config: Any | None = None, is_chat_model: bool = False, treat_empty_as_dot: bool = False, tokenizer_type: str = 'Qwen/Qwen2.5-VL-7B-Instruct')
Bases: TextEncoderConfig
Reason1 text encoder config.
fastvideo.configs.models.encoders.SiglipVisionConfig
dataclass
¶
SiglipVisionConfig(arch_config: ImageEncoderArchConfig = SiglipVisionArchConfig(), prefix: str = 'siglip', quant_config: QuantizationConfig | None = None, lora_config: Any | None = None, num_hidden_layers_override: int | None = None, require_post_norm: bool | None = None, enable_scale: bool = True, is_causal: bool = False)
Bases: ImageEncoderConfig
Configuration for SigLIP vision encoder.
fastvideo.configs.models.encoders.T5LargeConfig
dataclass
¶
T5LargeConfig(arch_config: TextEncoderArchConfig = T5LargeArchConfig(), prefix: str = 't5', quant_config: QuantizationConfig | None = None, lora_config: Any | None = None, is_chat_model: bool = False, treat_empty_as_dot: bool = False)
Bases: TextEncoderConfig
T5 Large configuration for your specific model.
Modules¶
fastvideo.configs.models.encoders.mistral3
¶
Mistral3 text encoder configuration for full Flux2.
Classes¶
fastvideo.configs.models.encoders.mistral3.Mistral3TextArchConfig
dataclass
¶Mistral3TextArchConfig(stacked_params_mapping: list[tuple[str, str, str]] = list(), architectures: list[str] = (lambda: ['Mistral3ForConditionalGeneration'])(), _supported_attention_backends: tuple[AttentionBackendEnum, ...] = (FLASH_ATTN, TORCH_SDPA), output_hidden_states: bool = True, use_return_dict: bool = True, vocab_size: int = 0, hidden_size: int = 5120, num_hidden_layers: int = 40, num_attention_heads: int = 0, pad_token_id: int = 0, eos_token_id: int = 0, text_len: int = 512, hidden_state_skip_layer: int = 0, decoder_start_token_id: int = 0, output_past: bool = True, scalable_attention: bool = True, tie_word_embeddings: bool = False, tokenizer_kwargs: dict[str, Any] = dict(), _fsdp_shard_conditions: list = (lambda: [])(), require_processor: bool = True)
Bases: TextEncoderArchConfig
Architecture config for the Mistral3 text encoder used by full Flux2.
fastvideo.configs.models.encoders.mistral3.Mistral3TextConfig
dataclass
¶Mistral3TextConfig(arch_config: TextEncoderArchConfig = Mistral3TextArchConfig(), prefix: str = 'mistral3', quant_config: QuantizationConfig | None = None, lora_config: Any | None = None, is_chat_model: bool = True, treat_empty_as_dot: bool = False)
Bases: TextEncoderConfig
Top-level config for the Mistral3 text encoder used by full Flux2.
fastvideo.configs.models.encoders.qwen3
¶
Qwen3 text encoder configuration for FastVideo diffusion models (e.g. Flux2 Klein).
Classes¶
fastvideo.configs.models.encoders.qwen3.Qwen3TextArchConfig
dataclass
¶Qwen3TextArchConfig(stacked_params_mapping: list[tuple[str, str, str | int]] = (lambda: [('.qkv_proj', '.q_proj', 'q'), ('.qkv_proj', '.k_proj', 'k'), ('.qkv_proj', '.v_proj', 'v'), ('.gate_up_proj', '.gate_proj', 0), ('.gate_up_proj', '.up_proj', 1)])(), architectures: list[str] = (lambda: [])(), _supported_attention_backends: tuple[AttentionBackendEnum, ...] = (FLASH_ATTN, TORCH_SDPA), output_hidden_states: bool = True, use_return_dict: bool = True, vocab_size: int = 151936, hidden_size: int = 2560, num_hidden_layers: int = 36, num_attention_heads: int = 32, pad_token_id: int = 151643, eos_token_id: int = 151645, text_len: int = 512, hidden_state_skip_layer: int = 0, decoder_start_token_id: int = 0, output_past: bool = True, scalable_attention: bool = True, tie_word_embeddings: bool = True, tokenizer_kwargs: dict[str, Any] = dict(), _fsdp_shard_conditions: list = (lambda: [_is_transformer_layer, _is_embeddings, _is_final_norm])(), require_processor: bool = False, intermediate_size: int = 9728, num_key_value_heads: int = 8, hidden_act: str = 'silu', max_position_embeddings: int = 40960, initializer_range: float = 0.02, rms_norm_eps: float = 1e-06, use_cache: bool = True, bos_token_id: int = 151643, rope_theta: float = 1000000.0, rope_scaling: dict | None = None, attention_bias: bool = False, attention_dropout: float = 0.0, mlp_bias: bool = False, head_dim: int = 128)
Bases: TextEncoderArchConfig
Architecture config for Qwen3 text encoder.
Qwen3 is similar to LLaMA but with QK-Norm (RMSNorm on Q and K before attention). Used by Flux2 Klein.
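The stacked_params_mapping default above lists (fused parameter, checkpoint parameter, shard id) triples. A minimal sketch of how such a table can route separate q/k/v and gate/up checkpoint weights into fused modules is shown here. The helper resolve_stacked and the substring-matching rule are assumptions for illustration, not the FastVideo loader itself.

```python
# Triples copied from the Qwen3TextArchConfig.stacked_params_mapping default.
STACKED_PARAMS_MAPPING = [
    (".qkv_proj", ".q_proj", "q"),
    (".qkv_proj", ".k_proj", "k"),
    (".qkv_proj", ".v_proj", "v"),
    (".gate_up_proj", ".gate_proj", 0),
    (".gate_up_proj", ".up_proj", 1),
]


def resolve_stacked(name: str):
    """Hypothetical helper: return (target parameter name, shard id).

    Assumption: a checkpoint key containing the unfused fragment is loaded
    into the fused parameter at the slot identified by the shard id. Keys
    that match no triple are returned unchanged with a shard id of None.
    """
    for fused, unfused, shard_id in STACKED_PARAMS_MAPPING:
        if unfused in name:
            return name.replace(unfused, fused), shard_id
    return name, None


for key in (
    "model.layers.0.self_attn.q_proj.weight",
    "model.layers.0.mlp.up_proj.weight",
    "model.norm.weight",
):
    print(key, "->", resolve_stacked(key))
```

The string shard ids select a slice of the fused QKV weight, while the integer ids index the two halves of the fused gate/up projection, which matches the str | int element type in the signature.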
fastvideo.configs.models.encoders.qwen3.Qwen3TextConfig
dataclass
¶Qwen3TextConfig(arch_config: TextEncoderArchConfig = Qwen3TextArchConfig(), prefix: str = 'qwen3', quant_config: QuantizationConfig | None = None, lora_config: Any | None = None, is_chat_model: bool = True, treat_empty_as_dot: bool = False)
Bases: TextEncoderConfig
Top-level config for Qwen3 text encoder.
fastvideo.configs.models.encoders.reason1
¶
Config for Reason1 (Qwen2.5-VL) text encoder.
Classes¶
fastvideo.configs.models.encoders.reason1.Reason1ArchConfig
dataclass
¶Reason1ArchConfig(stacked_params_mapping: list[tuple[str, str, str]] = list(), architectures: list[str] = (lambda: ['Qwen2_5_VLForConditionalGeneration'])(), _supported_attention_backends: tuple[AttentionBackendEnum, ...] = (FLASH_ATTN, TORCH_SDPA), output_hidden_states: bool = True, use_return_dict: bool = True, vocab_size: int = 152064, hidden_size: int = 3584, num_hidden_layers: int = 28, num_attention_heads: int = 28, pad_token_id: int = 151643, eos_token_id: int = 151645, text_len: int = 512, hidden_state_skip_layer: int = 0, decoder_start_token_id: int = 0, output_past: bool = True, scalable_attention: bool = True, tie_word_embeddings: bool = False, tokenizer_kwargs: dict[str, Any] = dict(), _fsdp_shard_conditions: list = (lambda: [])(), require_processor: bool = False, model_type: str = 'qwen2_5_vl', num_key_value_heads: int = 4, intermediate_size: int = 18944, bos_token_id: int = 151643, image_token_id: int = 151655, video_token_id: int = 151656, vision_token_id: int = 151654, vision_start_token_id: int = 151652, vision_end_token_id: int = 151653, vision_config: dict[str, Any] | None = None, rope_theta: float = 1000000.0, rope_scaling: dict[str, Any] | None = (lambda: {'type': 'mrope', 'mrope_section': [16, 24, 24]})(), max_position_embeddings: int = 128000, max_window_layers: int = 28, embedding_concat_strategy: str = 'mean_pooling', n_layers_per_group: int = 5, num_embedding_padding_tokens: int = 512, attention_dropout: float = 0.0, hidden_act: str = 'silu', initializer_range: float = 0.02, rms_norm_eps: float = 1e-06, use_sliding_window: bool = False, sliding_window: int = 32768, use_cache: bool = False, torch_dtype: str = 'bfloat16', _attn_implementation: str = 'flash_attention_2')
Bases: TextEncoderArchConfig
Architecture settings (defaults match Qwen2.5-VL-7B-Instruct).
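Several of the defaults above are tied together by simple arithmetic, which makes a quick consistency check possible when overriding them. The sketch below works only from the numbers in the signature. It relies on the Qwen2.5-VL convention that mrope_section splits the rotary frequency pairs across the temporal, height, and width axes. The variable names are illustrative.

```python
# Defaults copied from the Reason1ArchConfig signature.
HIDDEN_SIZE = 3584
NUM_ATTENTION_HEADS = 28
NUM_KEY_VALUE_HEADS = 4
MROPE_SECTION = [16, 24, 24]  # temporal, height, width

# Per-head width of the attention projections.
head_dim = HIDDEN_SIZE // NUM_ATTENTION_HEADS

# Rotary embeddings rotate channel pairs, so half of head_dim is available.
rotary_pairs = head_dim // 2

# Grouped-query attention: how many query heads share one KV head.
queries_per_kv = NUM_ATTENTION_HEADS // NUM_KEY_VALUE_HEADS

assert HIDDEN_SIZE % NUM_ATTENTION_HEADS == 0
assert NUM_ATTENTION_HEADS % NUM_KEY_VALUE_HEADS == 0
# The three M-RoPE sections must cover every rotary pair exactly once.
assert sum(MROPE_SECTION) == rotary_pairs

print(f"head_dim={head_dim}, rotary_pairs={rotary_pairs}, "
      f"queries_per_kv={queries_per_kv}")
```

If hidden_size, num_attention_heads, or mrope_section is changed independently, the final assertion is the one most likely to fail first.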
fastvideo.configs.models.encoders.reason1.Reason1Config
dataclass
¶Reason1Config(arch_config: Reason1ArchConfig = Reason1ArchConfig(), prefix: str = '', quant_config: QuantizationConfig | None = None, lora_config: Any | None = None, is_chat_model: bool = False, treat_empty_as_dot: bool = False, tokenizer_type: str = 'Qwen/Qwen2.5-VL-7B-Instruct')
Bases: TextEncoderConfig
Reason1 text encoder config.
fastvideo.configs.models.encoders.siglip
¶
SigLIP vision encoder configuration for FastVideo.
Classes¶
fastvideo.configs.models.encoders.siglip.SiglipVisionArchConfig
dataclass
¶SiglipVisionArchConfig(stacked_params_mapping: list = (lambda: [('qkv_proj', 'q_proj', 'q'), ('qkv_proj', 'k_proj', 'k'), ('qkv_proj', 'v_proj', 'v')])(), architectures: list[str] = (lambda: ['SiglipVisionModel'])(), _supported_attention_backends: tuple[AttentionBackendEnum, ...] = (FLASH_ATTN, TORCH_SDPA), output_hidden_states: bool = False, use_return_dict: bool = True, attention_dropout: float = 0.0, dtype: str | None = None, hidden_act: str = 'gelu_pytorch_tanh', hidden_size: int = 1152, image_size: int = 384, intermediate_size: int = 4304, layer_norm_eps: float = 1e-06, model_type: str = 'siglip_vision_model', num_attention_heads: int = 16, num_channels: int = 3, num_hidden_layers: int = 27, patch_size: int = 14)
Bases: ImageEncoderArchConfig
Architecture configuration for SigLIP vision encoder.
Fields match the config.json from HuggingFace SigLIP checkpoints.
fastvideo.configs.models.encoders.siglip.SiglipVisionConfig
dataclass
¶SiglipVisionConfig(arch_config: ImageEncoderArchConfig = SiglipVisionArchConfig(), prefix: str = 'siglip', quant_config: QuantizationConfig | None = None, lora_config: Any | None = None, num_hidden_layers_override: int | None = None, require_post_norm: bool | None = None, enable_scale: bool = True, is_causal: bool = False)
Bases: ImageEncoderConfig
Configuration for SigLIP vision encoder.
fastvideo.configs.models.encoders.stable_audio_conditioner
¶
Config for the Stable Audio Open 1.0 multi-conditioner.
The conditioner bundles three sub-conditioners — a T5 text encoder
(prompt) and two NumberConditioners (seconds_start / seconds_total)
— into the (cross_attn_cond, cross_attn_mask, global_embed) triple the
DiT consumes. The architecture is fully specified by the official
stable_audio_tools MultiConditioner config; the constants here
mirror that.
fastvideo.configs.models.encoders.t5
¶
Classes¶
fastvideo.configs.models.encoders.t5.T5LargeArchConfig
dataclass
¶T5LargeArchConfig(stacked_params_mapping: list[tuple[str, str, str]] = (lambda: [('.qkv_proj', '.q', 'q'), ('.qkv_proj', '.k', 'k'), ('.qkv_proj', '.v', 'v')])(), architectures: list[str] = (lambda: [])(), _supported_attention_backends: tuple[AttentionBackendEnum, ...] = (FLASH_ATTN, TORCH_SDPA), output_hidden_states: bool = False, use_return_dict: bool = True, vocab_size: int = 32128, hidden_size: int = 0, num_hidden_layers: int = 0, num_attention_heads: int = 0, pad_token_id: int = 0, eos_token_id: int = 1, text_len: int = 512, hidden_state_skip_layer: int = 0, decoder_start_token_id: int = 0, output_past: bool = True, scalable_attention: bool = True, tie_word_embeddings: bool = False, tokenizer_kwargs: dict[str, Any] = dict(), _fsdp_shard_conditions: list = (lambda: [_is_transformer_layer, _is_embeddings, _is_final_layernorm])(), require_processor: bool = False, d_model: int = 1024, d_kv: int = 128, d_ff: int = 65536, num_layers: int = 24, num_decoder_layers: int | None = 24, num_heads: int = 128, relative_attention_num_buckets: int = 32, relative_attention_max_distance: int = 128, dropout_rate: float = 0.1, layer_norm_epsilon: float = 1e-06, initializer_factor: float = 1.0, feed_forward_proj: str = 'relu', dense_act_fn: str = '', is_gated_act: bool = False, is_encoder_decoder: bool = True, use_cache: bool = True, classifier_dropout: float = 0.0, dtype: str | None = None, gradient_checkpointing: bool = False, n_positions: int = 512, task_specific_params: dict | None = None)
Bases: T5ArchConfig
T5 Large architecture config with parameters for your specific model.
fastvideo.configs.models.encoders.t5.T5LargeConfig
dataclass
¶T5LargeConfig(arch_config: TextEncoderArchConfig = T5LargeArchConfig(), prefix: str = 't5', quant_config: QuantizationConfig | None = None, lora_config: Any | None = None, is_chat_model: bool = False, treat_empty_as_dot: bool = False)
Bases: TextEncoderConfig
T5 Large configuration for your specific model.
fastvideo.configs.models.encoders.t5gemma
¶
Config for the T5-Gemma encoder used by daVinci-MagiHuman.
The reference pipeline uses transformers.models.t5gemma.T5GemmaEncoderModel
on google/t5gemma-9b-9b-ul2. That is a gated Google repository, so the
encoder weights are not bundled inside GAIR/daVinci-MagiHuman; they are
loaded from the T5-Gemma HF repo directly.
Encoder shape (verified from google/t5gemma-9b-9b-ul2/config.json): layers=42, hidden=3584, heads=16, kv_heads=8, head_dim=256, intermediate=14336, rope_theta=10000.0, max_pos=8192, layer_types alternate sliding_attention / full_attention.
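The encoder shape quoted above has two properties that are easy to miss: the attention projection width differs from the hidden size, and the layers alternate between two attention types. The sketch below derives both from the listed numbers. Starting the alternation with sliding_attention is an assumption based on the order in which the two types are named; the variable names are illustrative.

```python
# Values quoted from the encoder shape description.
NUM_LAYERS = 42
HIDDEN = 3584
HEADS = 16
KV_HEADS = 8
HEAD_DIM = 256

# Query projection width is heads * head_dim, which is wider than HIDDEN here.
q_width = HEADS * HEAD_DIM
# Key/value projections use the smaller KV head count (grouped-query attention).
kv_width = KV_HEADS * HEAD_DIM
queries_per_kv = HEADS // KV_HEADS

# Assumption: even-indexed layers use sliding attention, odd-indexed use full.
layer_types = [
    "sliding_attention" if i % 2 == 0 else "full_attention"
    for i in range(NUM_LAYERS)
]

print(f"q_width={q_width}, kv_width={kv_width}, queries_per_kv={queries_per_kv}")
print("sliding layers:", layer_types.count("sliding_attention"))
print("full layers:", layer_types.count("full_attention"))
```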
fastvideo.configs.models.vaes
¶
Classes¶
fastvideo.configs.models.vaes.Cosmos25VAEConfig
dataclass
¶
Cosmos25VAEConfig(arch_config: Cosmos25VAEArchConfig = Cosmos25VAEArchConfig(), load_encoder: bool = True, load_decoder: bool = True, tile_sample_min_height: int = 256, tile_sample_min_width: int = 256, tile_sample_min_num_frames: int = 16, tile_sample_stride_height: int = 192, tile_sample_stride_width: int = 192, tile_sample_stride_num_frames: int = 12, blend_num_frames: int = 0, use_tiling: bool = False, use_temporal_tiling: bool = False, use_parallel_tiling: bool = False, use_temporal_scaling_frames: bool = True, use_feature_cache: bool = True)
fastvideo.configs.models.vaes.Flux2VAEConfig
dataclass
¶
Flux2VAEConfig(arch_config: Flux2VAEArchConfig = Flux2VAEArchConfig(), load_encoder: bool = True, load_decoder: bool = True, tile_sample_min_height: int = 256, tile_sample_min_width: int = 256, tile_sample_min_num_frames: int = 16, tile_sample_stride_height: int = 192, tile_sample_stride_width: int = 192, tile_sample_stride_num_frames: int = 12, blend_num_frames: int = 0, use_tiling: bool = False, use_temporal_tiling: bool = False, use_parallel_tiling: bool = False, use_temporal_scaling_frames: bool = True)
fastvideo.configs.models.vaes.GameCraftVAEConfig
dataclass
¶
GameCraftVAEConfig(arch_config: VAEArchConfig = GameCraftVAEArchConfig(), load_encoder: bool = True, load_decoder: bool = True, tile_sample_min_height: int = 256, tile_sample_min_width: int = 256, tile_sample_min_num_frames: int = 16, tile_sample_stride_height: int = 192, tile_sample_stride_width: int = 192, tile_sample_stride_num_frames: int = 12, blend_num_frames: int = 0, use_tiling: bool = True, use_temporal_tiling: bool = True, use_parallel_tiling: bool = True, use_temporal_scaling_frames: bool = True)
fastvideo.configs.models.vaes.Gen3CVAEConfig
dataclass
¶
Gen3CVAEConfig(arch_config: CosmosVAEArchConfig = CosmosVAEArchConfig(), load_encoder: bool = True, load_decoder: bool = True, tile_sample_min_height: int = 256, tile_sample_min_width: int = 256, tile_sample_min_num_frames: int = 16, tile_sample_stride_height: int = 192, tile_sample_stride_width: int = 192, tile_sample_stride_num_frames: int = 12, blend_num_frames: int = 0, use_tiling: bool = False, use_temporal_tiling: bool = False, use_parallel_tiling: bool = False, use_temporal_scaling_frames: bool = True, use_feature_cache: bool = True)
Bases: CosmosVAEConfig
GEN3C VAE config placeholder.
GEN3C uses tokenizer-backed VAE loading logic at runtime, but we keep a model-specific config class so pipeline/model configs stay model-scoped.
fastvideo.configs.models.vaes.OobleckVAEArchConfig
dataclass
¶
OobleckVAEArchConfig(stacked_params_mapping: list[tuple[str, str, str]] = list(), scaling_factor: float | Tensor = 0, temporal_compression_ratio: int = 4, spatial_compression_ratio: int = 8, architectures: list[str] = (lambda: ['AutoencoderOobleck'])(), encoder_hidden_size: int = 128, downsampling_ratios: list[int] = (lambda: [2, 4, 4, 8, 8])(), channel_multiples: list[int] = (lambda: [1, 2, 4, 8, 16])(), decoder_channels: int = 128, decoder_input_channels: int = 64, audio_channels: int = 2, sampling_rate: int = 44100)
Bases: VAEArchConfig
Stable Audio Open 1.0 VAE architecture constants.
fastvideo.configs.models.vaes.OobleckVAEConfig
dataclass
¶
OobleckVAEConfig(arch_config: VAEArchConfig = OobleckVAEArchConfig(), load_encoder: bool = True, load_decoder: bool = True, tile_sample_min_height: int = 256, tile_sample_min_width: int = 256, tile_sample_min_num_frames: int = 16, tile_sample_stride_height: int = 192, tile_sample_stride_width: int = 192, tile_sample_stride_num_frames: int = 12, blend_num_frames: int = 0, use_tiling: bool = False, use_temporal_tiling: bool = False, use_parallel_tiling: bool = False, use_temporal_scaling_frames: bool = True, pretrained_path: str = 'stabilityai/stable-audio-open-1.0', pretrained_subfolder: str = 'vae', pretrained_dtype: str = 'float16')
Bases: VAEConfig
FastVideo VAE config wrapping the Oobleck arch.
Audio VAEs don't use the temporal/spatial tiling defaults that the base VAEConfig is shaped for (those exist for video VAEs); they are retained but irrelevant for audio.
Modules¶
fastvideo.configs.models.vaes.base
¶
Classes¶
fastvideo.configs.models.vaes.base.VAEConfig
dataclass
¶VAEConfig(arch_config: VAEArchConfig = VAEArchConfig(), load_encoder: bool = True, load_decoder: bool = True, tile_sample_min_height: int = 256, tile_sample_min_width: int = 256, tile_sample_min_num_frames: int = 16, tile_sample_stride_height: int = 192, tile_sample_stride_width: int = 192, tile_sample_stride_num_frames: int = 12, blend_num_frames: int = 0, use_tiling: bool = True, use_temporal_tiling: bool = True, use_parallel_tiling: bool = True, use_temporal_scaling_frames: bool = True)
Bases: ModelConfig
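The tile_sample_min_* and tile_sample_stride_* defaults imply overlapping tiles, because each stride is smaller than the matching tile size. The sketch below shows the resulting tile spans along one axis. It assumes a diffusers-style scheme in which tiles start every stride samples and are clipped at the edge; the helper tile_spans is hypothetical, and the FastVideo VAE implementations may tile differently.

```python
# Defaults copied from the VAEConfig signature.
TILE_SAMPLE_MIN_HEIGHT = 256
TILE_SAMPLE_STRIDE_HEIGHT = 192
TILE_SAMPLE_MIN_NUM_FRAMES = 16
TILE_SAMPLE_STRIDE_NUM_FRAMES = 12


def tile_spans(length: int, tile: int, stride: int) -> list:
    """Hypothetical helper: (start, end) spans of tiles along one axis.

    Assumption: a tile starts every `stride` samples and is clipped to
    `length`, so neighbouring tiles share up to `tile - stride` samples
    that can be blended.
    """
    if length <= 0 or tile <= 0 or stride <= 0:
        raise ValueError("length, tile and stride must be positive")
    return [(s, min(s + tile, length)) for s in range(0, length, stride)]


spatial_overlap = TILE_SAMPLE_MIN_HEIGHT - TILE_SAMPLE_STRIDE_HEIGHT
temporal_overlap = TILE_SAMPLE_MIN_NUM_FRAMES - TILE_SAMPLE_STRIDE_NUM_FRAMES

print("spatial overlap:", spatial_overlap)
print("temporal overlap:", temporal_overlap)
print(tile_spans(720, TILE_SAMPLE_MIN_HEIGHT, TILE_SAMPLE_STRIDE_HEIGHT))
```

Raising a stride toward its tile size reduces the number of tiles and the overlap available for blending; lowering it does the opposite.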
fastvideo.configs.models.vaes.base.VAEConfig.add_cli_args
staticmethod
¶Add CLI arguments for VAEConfig fields
Source code in fastvideo/configs/models/vaes/base.py
fastvideo.configs.models.vaes.cosmos2_5vae
¶
Cosmos 2.5 (Wan2.1-style) VAE config and checkpoint-key mapping.
Classes¶
fastvideo.configs.models.vaes.cosmos2_5vae.Cosmos25VAEArchConfig
dataclass
¶Cosmos25VAEArchConfig(stacked_params_mapping: list[tuple[str, str, str]] = list(), scaling_factor: float | Tensor = 0, temporal_compression_ratio: int = 4, spatial_compression_ratio: int = 8, _name_or_path: str = '', base_dim: int = 96, decoder_base_dim: int | None = None, z_dim: int = 16, dim_mult: tuple[int, ...] = (1, 2, 4, 4), num_res_blocks: int = 2, attn_scales: tuple[float, ...] = (), temperal_downsample: tuple[bool, ...] = (False, True, True), dropout: float = 0.0, is_residual: bool = False, in_channels: int = 3, out_channels: int = 3, patch_size: int | None = None, scale_factor_temporal: int = 4, scale_factor_spatial: int = 8, clip_output: bool = True, latents_mean: tuple[float, ...] = (-0.7571, -0.7089, -0.9113, 0.1075, -0.1745, 0.9653, -0.1517, 1.5508, 0.4134, -0.0715, 0.5517, -0.3632, -0.1922, -0.9497, 0.2503, -0.2921), latents_std: tuple[float, ...] = (2.8184, 1.4541, 2.3275, 2.6558, 1.2196, 1.7708, 2.6052, 2.0743, 3.2687, 2.1526, 2.8652, 1.5579, 1.6382, 1.1253, 2.8251, 1.916), param_names_mapping: dict[str, str] = (lambda: {'^conv1\\.(.*)$': 'quant_conv.\\1', '^conv2\\.(.*)$': 'post_quant_conv.\\1', '^encoder\\.conv1\\.(.*)$': 'encoder.conv_in.\\1', '^decoder\\.conv1\\.(.*)$': 'decoder.conv_in.\\1', '^encoder\\.head\\.0\\.gamma$': 'encoder.norm_out.gamma', '^encoder\\.head\\.2\\.(.*)$': 'encoder.conv_out.\\1', '^decoder\\.head\\.0\\.gamma$': 'decoder.norm_out.gamma', '^decoder\\.head\\.2\\.(.*)$': 'decoder.conv_out.\\1'})())
Bases: VAEArchConfig
fastvideo.configs.models.vaes.cosmos2_5vae.Cosmos25VAEArchConfig.map_official_key
staticmethod
¶Map a single official checkpoint key into FastVideo key space.
Source code in fastvideo/configs/models/vaes/cosmos2_5vae.py
fastvideo.configs.models.vaes.cosmos2_5vae.Cosmos25VAEConfig
dataclass
¶Cosmos25VAEConfig(arch_config: Cosmos25VAEArchConfig = Cosmos25VAEArchConfig(), load_encoder: bool = True, load_decoder: bool = True, tile_sample_min_height: int = 256, tile_sample_min_width: int = 256, tile_sample_min_num_frames: int = 16, tile_sample_stride_height: int = 192, tile_sample_stride_width: int = 192, tile_sample_stride_num_frames: int = 12, blend_num_frames: int = 0, use_tiling: bool = False, use_temporal_tiling: bool = False, use_parallel_tiling: bool = False, use_temporal_scaling_frames: bool = True, use_feature_cache: bool = True)
fastvideo.configs.models.vaes.flux2vae
¶
Classes¶
fastvideo.configs.models.vaes.flux2vae.Flux2VAEArchConfig
dataclass
¶Flux2VAEArchConfig(stacked_params_mapping: list[tuple[str, str, str]] = list(), scaling_factor: float = 0.13025, temporal_compression_ratio: int = 1, spatial_compression_ratio: int = 8, in_channels: int = 3, out_channels: int = 3, down_block_types: tuple[str, ...] = ('DownEncoderBlock2D', 'DownEncoderBlock2D', 'DownEncoderBlock2D', 'AttnDownEncoderBlock2D'), up_block_types: tuple[str, ...] = ('AttnUpDecoderBlock2D', 'UpDecoderBlock2D', 'UpDecoderBlock2D', 'UpDecoderBlock2D'), block_out_channels: tuple[int, ...] = (128, 256, 512, 512), layers_per_block: int = 2, act_fn: str = 'silu', latent_channels: int = 16, norm_num_groups: int = 32, sample_size: int = 512, force_upcast: bool = False, use_quant_conv: bool = True, use_post_quant_conv: bool = True, mid_block_add_attention: bool = True, batch_norm_eps: float = 1e-05, batch_norm_momentum: float = 0.1, patch_size: tuple[int, int] = (1, 1))
Bases: VAEArchConfig
Architecture configuration for Flux2 VAE model.
fastvideo.configs.models.vaes.flux2vae.Flux2VAEConfig
dataclass
¶Flux2VAEConfig(arch_config: Flux2VAEArchConfig = Flux2VAEArchConfig(), load_encoder: bool = True, load_decoder: bool = True, tile_sample_min_height: int = 256, tile_sample_min_width: int = 256, tile_sample_min_num_frames: int = 16, tile_sample_stride_height: int = 192, tile_sample_stride_width: int = 192, tile_sample_stride_num_frames: int = 12, blend_num_frames: int = 0, use_tiling: bool = False, use_temporal_tiling: bool = False, use_parallel_tiling: bool = False, use_temporal_scaling_frames: bool = True)
fastvideo.configs.models.vaes.gamecraftvae
¶
GameCraft VAE config - matches official config.json from Hunyuan-GameCraft-1.0.
Classes¶
fastvideo.configs.models.vaes.gamecraftvae.GameCraftVAEArchConfig
dataclass
¶GameCraftVAEArchConfig(stacked_params_mapping: list[tuple[str, str, str]] = list(), scaling_factor: float = 0.476986, temporal_compression_ratio: int = 4, spatial_compression_ratio: int = 8, in_channels: int = 3, out_channels: int = 3, latent_channels: int = 16, down_block_types: tuple[str, ...] = ('DownEncoderBlockCausal3D', 'DownEncoderBlockCausal3D', 'DownEncoderBlockCausal3D', 'DownEncoderBlockCausal3D'), up_block_types: tuple[str, ...] = ('UpDecoderBlockCausal3D', 'UpDecoderBlockCausal3D', 'UpDecoderBlockCausal3D', 'UpDecoderBlockCausal3D'), block_out_channels: tuple[int, ...] = (128, 256, 512, 512), layers_per_block: int = 2, act_fn: str = 'silu', norm_num_groups: int = 32, time_compression_ratio: int = 4, mid_block_add_attention: bool = True, mid_block_causal_attn: bool = True, sample_size: int = 256, sample_tsize: int = 64)
Bases: VAEArchConfig
Architecture config matching official AutoencoderKLCausal3D config.json.
fastvideo.configs.models.vaes.gamecraftvae.GameCraftVAEConfig
dataclass
¶GameCraftVAEConfig(arch_config: VAEArchConfig = GameCraftVAEArchConfig(), load_encoder: bool = True, load_decoder: bool = True, tile_sample_min_height: int = 256, tile_sample_min_width: int = 256, tile_sample_min_num_frames: int = 16, tile_sample_stride_height: int = 192, tile_sample_stride_width: int = 192, tile_sample_stride_num_frames: int = 12, blend_num_frames: int = 0, use_tiling: bool = True, use_temporal_tiling: bool = True, use_parallel_tiling: bool = True, use_temporal_scaling_frames: bool = True)
fastvideo.configs.models.vaes.gen3cvae
¶
Classes¶
fastvideo.configs.models.vaes.gen3cvae.Gen3CVAEConfig
dataclass
¶Gen3CVAEConfig(arch_config: CosmosVAEArchConfig = CosmosVAEArchConfig(), load_encoder: bool = True, load_decoder: bool = True, tile_sample_min_height: int = 256, tile_sample_min_width: int = 256, tile_sample_min_num_frames: int = 16, tile_sample_stride_height: int = 192, tile_sample_stride_width: int = 192, tile_sample_stride_num_frames: int = 12, blend_num_frames: int = 0, use_tiling: bool = False, use_temporal_tiling: bool = False, use_parallel_tiling: bool = False, use_temporal_scaling_frames: bool = True, use_feature_cache: bool = True)
Bases: CosmosVAEConfig
GEN3C VAE config placeholder.
GEN3C uses tokenizer-backed VAE loading logic at runtime, but we keep a model-specific config class so pipeline/model configs stay model-scoped.
fastvideo.configs.models.vaes.oobleck
¶
Config for the Stable Audio Open 1.0 "Oobleck" VAE.
Mirrors the per-channel vae/config.json shipped in
stabilityai/stable-audio-open-1.0 1:1 (see
fastvideo/models/vaes/oobleck.py::OobleckVAE.from_pretrained, which
constructs the VAE from these fields). Inherits the FastVideo VAEConfig
base so the standard load_encoder / load_decoder flags + tiling
knobs apply.
Naming: the VAE architecture is officially "Oobleck" (per Stability
AI's stable-audio-tools) — the surrounding model family is "Stable
Audio Open 1.0". This config is named after the architecture
(OobleckVAEConfig) since the same VAE is shared across Stable Audio
checkpoints; downstream pipelines reference it by its arch name, not
by a host-pipeline name.
Classes¶
fastvideo.configs.models.vaes.oobleck.OobleckVAEArchConfig
dataclass
¶OobleckVAEArchConfig(stacked_params_mapping: list[tuple[str, str, str]] = list(), scaling_factor: float | Tensor = 0, temporal_compression_ratio: int = 4, spatial_compression_ratio: int = 8, architectures: list[str] = (lambda: ['AutoencoderOobleck'])(), encoder_hidden_size: int = 128, downsampling_ratios: list[int] = (lambda: [2, 4, 4, 8, 8])(), channel_multiples: list[int] = (lambda: [1, 2, 4, 8, 16])(), decoder_channels: int = 128, decoder_input_channels: int = 64, audio_channels: int = 2, sampling_rate: int = 44100)
Bases: VAEArchConfig
Stable Audio Open 1.0 VAE architecture constants.
fastvideo.configs.models.vaes.oobleck.OobleckVAEConfig
dataclass
¶OobleckVAEConfig(arch_config: VAEArchConfig = OobleckVAEArchConfig(), load_encoder: bool = True, load_decoder: bool = True, tile_sample_min_height: int = 256, tile_sample_min_width: int = 256, tile_sample_min_num_frames: int = 16, tile_sample_stride_height: int = 192, tile_sample_stride_width: int = 192, tile_sample_stride_num_frames: int = 12, blend_num_frames: int = 0, use_tiling: bool = False, use_temporal_tiling: bool = False, use_parallel_tiling: bool = False, use_temporal_scaling_frames: bool = True, pretrained_path: str = 'stabilityai/stable-audio-open-1.0', pretrained_subfolder: str = 'vae', pretrained_dtype: str = 'float16')
Bases: VAEConfig
FastVideo VAE config wrapping the Oobleck arch.
Audio VAEs don't use the temporal/spatial tiling defaults that the base VAEConfig is shaped for (those exist for video VAEs); they are retained but irrelevant for audio.
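For audio, the relevant compression figure comes from downsampling_ratios rather than from the tiling fields. The sketch below derives the total hop length and the latent frame rate from the OobleckVAEArchConfig defaults, assuming the encoder applies the listed ratios in sequence so that the overall hop is their product. The variable names are illustrative.

```python
import math

# Defaults copied from the OobleckVAEArchConfig signature.
DOWNSAMPLING_RATIOS = [2, 4, 4, 8, 8]
SAMPLING_RATE = 44100
DECODER_INPUT_CHANNELS = 64

# Assumption: successive strided stages multiply, so one latent frame
# covers prod(ratios) audio samples per channel.
hop_length = math.prod(DOWNSAMPLING_RATIOS)

# Latent frames produced per second of audio.
latent_rate_hz = SAMPLING_RATE / hop_length

# Whole latent frames that fit in a 10-second clip.
clip_seconds = 10
full_latent_frames = (clip_seconds * SAMPLING_RATE) // hop_length

print(f"hop_length={hop_length} samples")
print(f"latent_rate={latent_rate_hz:.3f} Hz")
print(f"{clip_seconds}s clip -> {full_latent_frames} full latent frames "
      f"x {DECODER_INPUT_CHANNELS} channels")
```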