encoders
Classes
fastvideo.configs.models.encoders.Mistral3TextConfig
dataclass
Mistral3TextConfig(arch_config: TextEncoderArchConfig = Mistral3TextArchConfig(), prefix: str = 'mistral3', quant_config: QuantizationConfig | None = None, lora_config: Any | None = None, is_chat_model: bool = True, treat_empty_as_dot: bool = False)
Bases: TextEncoderConfig
Top-level config for the Mistral3 full Flux2 text encoder.
fastvideo.configs.models.encoders.Qwen3TextConfig
dataclass
Qwen3TextConfig(arch_config: TextEncoderArchConfig = Qwen3TextArchConfig(), prefix: str = 'qwen3', quant_config: QuantizationConfig | None = None, lora_config: Any | None = None, is_chat_model: bool = True, treat_empty_as_dot: bool = False)
Bases: TextEncoderConfig
Top-level config for Qwen3 text encoder.
fastvideo.configs.models.encoders.Reason1ArchConfig
dataclass
Reason1ArchConfig(stacked_params_mapping: list[tuple[str, str, str]] = list(), architectures: list[str] = (lambda: ['Qwen2_5_VLForConditionalGeneration'])(), _supported_attention_backends: tuple[AttentionBackendEnum, ...] = (FLASH_ATTN, TORCH_SDPA), output_hidden_states: bool = True, use_return_dict: bool = True, vocab_size: int = 152064, hidden_size: int = 3584, num_hidden_layers: int = 28, num_attention_heads: int = 28, pad_token_id: int = 151643, eos_token_id: int = 151645, text_len: int = 512, hidden_state_skip_layer: int = 0, decoder_start_token_id: int = 0, output_past: bool = True, scalable_attention: bool = True, tie_word_embeddings: bool = False, tokenizer_kwargs: dict[str, Any] = dict(), _fsdp_shard_conditions: list = (lambda: [])(), require_processor: bool = False, model_type: str = 'qwen2_5_vl', num_key_value_heads: int = 4, intermediate_size: int = 18944, bos_token_id: int = 151643, image_token_id: int = 151655, video_token_id: int = 151656, vision_token_id: int = 151654, vision_start_token_id: int = 151652, vision_end_token_id: int = 151653, vision_config: dict[str, Any] | None = None, rope_theta: float = 1000000.0, rope_scaling: dict[str, Any] | None = (lambda: {'type': 'mrope', 'mrope_section': [16, 24, 24]})(), max_position_embeddings: int = 128000, max_window_layers: int = 28, embedding_concat_strategy: str = 'mean_pooling', n_layers_per_group: int = 5, num_embedding_padding_tokens: int = 512, attention_dropout: float = 0.0, hidden_act: str = 'silu', initializer_range: float = 0.02, rms_norm_eps: float = 1e-06, use_sliding_window: bool = False, sliding_window: int = 32768, use_cache: bool = False, torch_dtype: str = 'bfloat16', _attn_implementation: str = 'flash_attention_2')
Bases: TextEncoderArchConfig
Architecture settings (defaults match Qwen2.5-VL-7B-Instruct).
fastvideo.configs.models.encoders.Reason1Config
dataclass
Reason1Config(arch_config: Reason1ArchConfig = Reason1ArchConfig(), prefix: str = '', quant_config: QuantizationConfig | None = None, lora_config: Any | None = None, is_chat_model: bool = False, treat_empty_as_dot: bool = False, tokenizer_type: str = 'Qwen/Qwen2.5-VL-7B-Instruct')
Bases: TextEncoderConfig
Reason1 text encoder config.
fastvideo.configs.models.encoders.SiglipVisionConfig
dataclass
SiglipVisionConfig(arch_config: ImageEncoderArchConfig = SiglipVisionArchConfig(), prefix: str = 'siglip', quant_config: QuantizationConfig | None = None, lora_config: Any | None = None, num_hidden_layers_override: int | None = None, require_post_norm: bool | None = None, enable_scale: bool = True, is_causal: bool = False)
Bases: ImageEncoderConfig
Configuration for SigLIP vision encoder.
fastvideo.configs.models.encoders.T5LargeConfig
dataclass
T5LargeConfig(arch_config: TextEncoderArchConfig = T5LargeArchConfig(), prefix: str = 't5', quant_config: QuantizationConfig | None = None, lora_config: Any | None = None, is_chat_model: bool = False, treat_empty_as_dot: bool = False)
Bases: TextEncoderConfig
Top-level config for the T5 Large text encoder.
Modules
fastvideo.configs.models.encoders.mistral3
Mistral3 text encoder configuration for full Flux2.
Classes
fastvideo.configs.models.encoders.mistral3.Mistral3TextArchConfig
dataclass
Mistral3TextArchConfig(stacked_params_mapping: list[tuple[str, str, str]] = list(), architectures: list[str] = (lambda: ['Mistral3ForConditionalGeneration'])(), _supported_attention_backends: tuple[AttentionBackendEnum, ...] = (FLASH_ATTN, TORCH_SDPA), output_hidden_states: bool = True, use_return_dict: bool = True, vocab_size: int = 0, hidden_size: int = 5120, num_hidden_layers: int = 40, num_attention_heads: int = 0, pad_token_id: int = 0, eos_token_id: int = 0, text_len: int = 512, hidden_state_skip_layer: int = 0, decoder_start_token_id: int = 0, output_past: bool = True, scalable_attention: bool = True, tie_word_embeddings: bool = False, tokenizer_kwargs: dict[str, Any] = dict(), _fsdp_shard_conditions: list = (lambda: [])(), require_processor: bool = True)
Bases: TextEncoderArchConfig
Architecture config for the Mistral3 text encoder used by full Flux2.
fastvideo.configs.models.encoders.mistral3.Mistral3TextConfig
dataclass
Mistral3TextConfig(arch_config: TextEncoderArchConfig = Mistral3TextArchConfig(), prefix: str = 'mistral3', quant_config: QuantizationConfig | None = None, lora_config: Any | None = None, is_chat_model: bool = True, treat_empty_as_dot: bool = False)
Bases: TextEncoderConfig
Top-level config for the Mistral3 full Flux2 text encoder.
fastvideo.configs.models.encoders.qwen3
Qwen3 text encoder configuration for FastVideo diffusion models (e.g. Flux2 Klein).
Classes
fastvideo.configs.models.encoders.qwen3.Qwen3TextArchConfig
dataclass
Qwen3TextArchConfig(stacked_params_mapping: list[tuple[str, str, str | int]] = (lambda: [('.qkv_proj', '.q_proj', 'q'), ('.qkv_proj', '.k_proj', 'k'), ('.qkv_proj', '.v_proj', 'v'), ('.gate_up_proj', '.gate_proj', 0), ('.gate_up_proj', '.up_proj', 1)])(), architectures: list[str] = (lambda: [])(), _supported_attention_backends: tuple[AttentionBackendEnum, ...] = (FLASH_ATTN, TORCH_SDPA), output_hidden_states: bool = True, use_return_dict: bool = True, vocab_size: int = 151936, hidden_size: int = 2560, num_hidden_layers: int = 36, num_attention_heads: int = 32, pad_token_id: int = 151643, eos_token_id: int = 151645, text_len: int = 512, hidden_state_skip_layer: int = 0, decoder_start_token_id: int = 0, output_past: bool = True, scalable_attention: bool = True, tie_word_embeddings: bool = True, tokenizer_kwargs: dict[str, Any] = dict(), _fsdp_shard_conditions: list = (lambda: [_is_transformer_layer, _is_embeddings, _is_final_norm])(), require_processor: bool = False, intermediate_size: int = 9728, num_key_value_heads: int = 8, hidden_act: str = 'silu', max_position_embeddings: int = 40960, initializer_range: float = 0.02, rms_norm_eps: float = 1e-06, use_cache: bool = True, bos_token_id: int = 151643, rope_theta: float = 1000000.0, rope_scaling: dict | None = None, attention_bias: bool = False, attention_dropout: float = 0.0, mlp_bias: bool = False, head_dim: int = 128)
Bases: TextEncoderArchConfig
Architecture config for Qwen3 text encoder.
Qwen3 is similar to LLaMA but with QK-Norm (RMSNorm on Q and K before attention). Used by Flux2 Klein.
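To make the QK-Norm remark concrete, here is a minimal sketch in plain Python. It applies RMSNorm to one query vector and one key vector before computing their scaled dot-product logit. The helper names `rms_norm` and `qk_norm_score` are hypothetical, the learnable RMSNorm scale is fixed to ones, and `eps` reuses the `rms_norm_eps` default shown in the signature above. It illustrates the technique only and is not FastVideo's implementation.

```python
import math


def rms_norm(x, weight=None, eps=1e-6):
    """RMSNorm over a single head vector (the head_dim axis)."""
    if weight is None:
        weight = [1.0] * len(x)  # assumption: identity scale for illustration
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


def qk_norm_score(q, k, eps=1e-6):
    """Scaled dot-product logit for one (query, key) pair with QK-Norm."""
    q_n = rms_norm(q, eps=eps)
    k_n = rms_norm(k, eps=eps)
    head_dim = len(q)
    return sum(a * b for a, b in zip(q_n, k_n)) / math.sqrt(head_dim)


q = [3.0, 4.0, 0.0, 0.0]
k = [30.0, 40.0, 0.0, 0.0]  # same direction as q, ten times the magnitude

score = qk_norm_score(q, k)
# Rescaling q by 100 leaves the logit essentially unchanged.
score_rescaled = qk_norm_score([v * 100.0 for v in q], k)
```

Because both vectors are normalized first, the logit depends on their direction rather than their magnitude, which keeps attention logits in a bounded range.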
fastvideo.configs.models.encoders.qwen3.Qwen3TextConfig
dataclass
Qwen3TextConfig(arch_config: TextEncoderArchConfig = Qwen3TextArchConfig(), prefix: str = 'qwen3', quant_config: QuantizationConfig | None = None, lora_config: Any | None = None, is_chat_model: bool = True, treat_empty_as_dot: bool = False)
Bases: TextEncoderConfig
Top-level config for Qwen3 text encoder.
fastvideo.configs.models.encoders.reason1
Config for Reason1 (Qwen2.5-VL) text encoder.
Classes
fastvideo.configs.models.encoders.reason1.Reason1ArchConfig
dataclass
Reason1ArchConfig(stacked_params_mapping: list[tuple[str, str, str]] = list(), architectures: list[str] = (lambda: ['Qwen2_5_VLForConditionalGeneration'])(), _supported_attention_backends: tuple[AttentionBackendEnum, ...] = (FLASH_ATTN, TORCH_SDPA), output_hidden_states: bool = True, use_return_dict: bool = True, vocab_size: int = 152064, hidden_size: int = 3584, num_hidden_layers: int = 28, num_attention_heads: int = 28, pad_token_id: int = 151643, eos_token_id: int = 151645, text_len: int = 512, hidden_state_skip_layer: int = 0, decoder_start_token_id: int = 0, output_past: bool = True, scalable_attention: bool = True, tie_word_embeddings: bool = False, tokenizer_kwargs: dict[str, Any] = dict(), _fsdp_shard_conditions: list = (lambda: [])(), require_processor: bool = False, model_type: str = 'qwen2_5_vl', num_key_value_heads: int = 4, intermediate_size: int = 18944, bos_token_id: int = 151643, image_token_id: int = 151655, video_token_id: int = 151656, vision_token_id: int = 151654, vision_start_token_id: int = 151652, vision_end_token_id: int = 151653, vision_config: dict[str, Any] | None = None, rope_theta: float = 1000000.0, rope_scaling: dict[str, Any] | None = (lambda: {'type': 'mrope', 'mrope_section': [16, 24, 24]})(), max_position_embeddings: int = 128000, max_window_layers: int = 28, embedding_concat_strategy: str = 'mean_pooling', n_layers_per_group: int = 5, num_embedding_padding_tokens: int = 512, attention_dropout: float = 0.0, hidden_act: str = 'silu', initializer_range: float = 0.02, rms_norm_eps: float = 1e-06, use_sliding_window: bool = False, sliding_window: int = 32768, use_cache: bool = False, torch_dtype: str = 'bfloat16', _attn_implementation: str = 'flash_attention_2')
Bases: TextEncoderArchConfig
Architecture settings (defaults match Qwen2.5-VL-7B-Instruct).
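A few derived quantities follow from these defaults and can help when checking a checkpoint against the config. The sketch below assumes the per-head width equals `hidden_size / num_attention_heads`, which the signature does not state explicitly, and that `mrope_section` splits the rotary frequency pairs of one head into temporal, height, and width parts, as in the Qwen2-VL family.

```python
# Values copied from the Reason1ArchConfig defaults listed above.
hidden_size = 3584
num_attention_heads = 28
num_key_value_heads = 4
mrope_section = [16, 24, 24]

# Assumption: head width is derived from hidden_size and the head count.
head_dim = hidden_size // num_attention_heads

# Grouped-query attention: how many query heads share one key/value head.
queries_per_kv_head = num_attention_heads // num_key_value_heads

# Rotary embeddings act on pairs of channels, so a head has head_dim / 2 pairs.
rotary_pairs = head_dim // 2

print(head_dim, queries_per_kv_head, rotary_pairs, sum(mrope_section))
```

Under these assumptions the three `mrope_section` entries add up to exactly the number of rotary pairs per head, which is a quick sanity check when overriding either value.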
fastvideo.configs.models.encoders.reason1.Reason1Config
dataclass
Reason1Config(arch_config: Reason1ArchConfig = Reason1ArchConfig(), prefix: str = '', quant_config: QuantizationConfig | None = None, lora_config: Any | None = None, is_chat_model: bool = False, treat_empty_as_dot: bool = False, tokenizer_type: str = 'Qwen/Qwen2.5-VL-7B-Instruct')
Bases: TextEncoderConfig
Reason1 text encoder config.
fastvideo.configs.models.encoders.siglip
SigLIP vision encoder configuration for FastVideo.
Classes
fastvideo.configs.models.encoders.siglip.SiglipVisionArchConfig
dataclass
SiglipVisionArchConfig(stacked_params_mapping: list = (lambda: [('qkv_proj', 'q_proj', 'q'), ('qkv_proj', 'k_proj', 'k'), ('qkv_proj', 'v_proj', 'v')])(), architectures: list[str] = (lambda: ['SiglipVisionModel'])(), _supported_attention_backends: tuple[AttentionBackendEnum, ...] = (FLASH_ATTN, TORCH_SDPA), output_hidden_states: bool = False, use_return_dict: bool = True, attention_dropout: float = 0.0, dtype: str | None = None, hidden_act: str = 'gelu_pytorch_tanh', hidden_size: int = 1152, image_size: int = 384, intermediate_size: int = 4304, layer_norm_eps: float = 1e-06, model_type: str = 'siglip_vision_model', num_attention_heads: int = 16, num_channels: int = 3, num_hidden_layers: int = 27, patch_size: int = 14)
Bases: ImageEncoderArchConfig
Architecture configuration for SigLIP vision encoder.
Fields match the config.json from HuggingFace SigLIP checkpoints.
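As a quick worked example of what these defaults imply, the sketch below computes the patch grid and per-head width. It assumes non-overlapping patches with any remainder pixels dropped (the usual behavior of a strided patch embedding without padding) and a head width of `hidden_size / num_attention_heads`; neither is stated on this page.

```python
# Values copied from the SiglipVisionArchConfig defaults listed above.
image_size = 384
patch_size = 14
hidden_size = 1152
num_attention_heads = 16

# Assumption: patches do not overlap and leftover border pixels are dropped.
patches_per_side = image_size // patch_size
num_patches = patches_per_side ** 2

# Assumption: head width is derived from hidden_size and the head count.
head_dim = hidden_size // num_attention_heads

print(patches_per_side, num_patches, head_dim)
```

With these values the encoder would see a 27 x 27 grid, i.e. 729 patch tokens per image.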
fastvideo.configs.models.encoders.siglip.SiglipVisionConfig
dataclass
SiglipVisionConfig(arch_config: ImageEncoderArchConfig = SiglipVisionArchConfig(), prefix: str = 'siglip', quant_config: QuantizationConfig | None = None, lora_config: Any | None = None, num_hidden_layers_override: int | None = None, require_post_norm: bool | None = None, enable_scale: bool = True, is_causal: bool = False)
Bases: ImageEncoderConfig
Configuration for SigLIP vision encoder.
fastvideo.configs.models.encoders.stable_audio_conditioner
Config for the Stable Audio Open 1.0 multi-conditioner.
The conditioner bundles three sub-conditioners — a T5 text encoder
(prompt) and two NumberConditioners (seconds_start / seconds_total)
— into the (cross_attn_cond, cross_attn_mask, global_embed) triple the
DiT consumes. The architecture is fully specified by the official
stable_audio_tools MultiConditioner config; the constants here
mirror that.
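The bundling described above can be pictured with a toy sketch. The layout used here is an assumption, not taken from this page: the cross-attention sequence is the prompt tokens followed by one token per number conditioner, and the global embedding is the two number embeddings joined along the feature axis. The helper `bundle_conditioning` and the tiny dimensions are hypothetical; the real module operates on tensors.

```python
def bundle_conditioning(prompt_tokens, prompt_mask, seconds_start_emb, seconds_total_emb):
    """Toy bundling of three sub-conditioner outputs (hypothetical helper).

    prompt_tokens: list of per-token embedding vectors from the text encoder
    prompt_mask:   list of 0/1 flags, one per prompt token
    seconds_*_emb: one embedding vector per number conditioner
    """
    # Assumed layout: prompt tokens, then one extra token per number conditioner.
    cross_attn_cond = prompt_tokens + [seconds_start_emb, seconds_total_emb]
    cross_attn_mask = prompt_mask + [1, 1]
    # Assumed layout: number embeddings concatenated along the feature axis.
    global_embed = seconds_start_emb + seconds_total_emb
    return cross_attn_cond, cross_attn_mask, global_embed


dim = 4  # toy embedding width, not the real model width
prompt_tokens = [[0.1] * dim, [0.2] * dim, [0.0] * dim]  # third token is padding
prompt_mask = [1, 1, 0]
start_emb = [0.5] * dim
total_emb = [0.9] * dim

cond, mask, glob = bundle_conditioning(prompt_tokens, prompt_mask, start_emb, total_emb)
```

In this toy layout the cross-attention inputs grow by two positions, while the global embedding is twice the width of a single number embedding.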
fastvideo.configs.models.encoders.t5
Classes
fastvideo.configs.models.encoders.t5.T5LargeArchConfig
dataclass
T5LargeArchConfig(stacked_params_mapping: list[tuple[str, str, str]] = (lambda: [('.qkv_proj', '.q', 'q'), ('.qkv_proj', '.k', 'k'), ('.qkv_proj', '.v', 'v')])(), architectures: list[str] = (lambda: [])(), _supported_attention_backends: tuple[AttentionBackendEnum, ...] = (FLASH_ATTN, TORCH_SDPA), output_hidden_states: bool = False, use_return_dict: bool = True, vocab_size: int = 32128, hidden_size: int = 0, num_hidden_layers: int = 0, num_attention_heads: int = 0, pad_token_id: int = 0, eos_token_id: int = 1, text_len: int = 512, hidden_state_skip_layer: int = 0, decoder_start_token_id: int = 0, output_past: bool = True, scalable_attention: bool = True, tie_word_embeddings: bool = False, tokenizer_kwargs: dict[str, Any] = dict(), _fsdp_shard_conditions: list = (lambda: [_is_transformer_layer, _is_embeddings, _is_final_layernorm])(), require_processor: bool = False, d_model: int = 1024, d_kv: int = 128, d_ff: int = 65536, num_layers: int = 24, num_decoder_layers: int | None = 24, num_heads: int = 128, relative_attention_num_buckets: int = 32, relative_attention_max_distance: int = 128, dropout_rate: float = 0.1, layer_norm_epsilon: float = 1e-06, initializer_factor: float = 1.0, feed_forward_proj: str = 'relu', dense_act_fn: str = '', is_gated_act: bool = False, is_encoder_decoder: bool = True, use_cache: bool = True, classifier_dropout: float = 0.0, dtype: str | None = None, gradient_checkpointing: bool = False, n_positions: int = 512, task_specific_params: dict | None = None)
Bases: T5ArchConfig
Architecture config for the T5 Large text encoder.
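The `stacked_params_mapping` default in the signature above pairs a fused parameter name with a checkpoint weight name and a shard id. How FastVideo's loader consumes it is not shown on this page; the sketch below follows a common pattern for loaders that fuse separate Q, K, and V projections into one weight. The helper `resolve_param` is hypothetical, and the checkpoint names follow Hugging Face T5 naming.

```python
# Copied from the T5LargeArchConfig default above:
# (fused_param_name, checkpoint_weight_name, shard_id)
stacked_params_mapping = [
    (".qkv_proj", ".q", "q"),
    (".qkv_proj", ".k", "k"),
    (".qkv_proj", ".v", "v"),
]


def resolve_param(checkpoint_name):
    """Return (model_param_name, shard_id) for a checkpoint weight name."""
    for fused_name, weight_name, shard_id in stacked_params_mapping:
        # Match a full dotted component to avoid accidental substring hits.
        needle = weight_name + "."
        if needle in checkpoint_name:
            fused = checkpoint_name.replace(needle, fused_name + ".", 1)
            return fused, shard_id
    # Weights that are not part of a fused projection keep their name.
    return checkpoint_name, None


prefix = "encoder.block.0.layer.0.SelfAttention"
k_result = resolve_param(prefix + ".k.weight")
o_result = resolve_param(prefix + ".o.weight")
```

A loader built this way would then copy the checkpoint tensor into the slice of the fused weight selected by the shard id.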
fastvideo.configs.models.encoders.t5.T5LargeConfig
dataclass
T5LargeConfig(arch_config: TextEncoderArchConfig = T5LargeArchConfig(), prefix: str = 't5', quant_config: QuantizationConfig | None = None, lora_config: Any | None = None, is_chat_model: bool = False, treat_empty_as_dot: bool = False)
Bases: TextEncoderConfig
Top-level config for the T5 Large text encoder.
fastvideo.configs.models.encoders.t5gemma
Config for the T5-Gemma encoder used by daVinci-MagiHuman.
The reference pipeline uses transformers.models.t5gemma.T5GemmaEncoderModel
on google/t5gemma-9b-9b-ul2. That is a gated Google repository, so the
encoder weights are not bundled inside GAIR/daVinci-MagiHuman; they are
loaded from the T5-Gemma HF repo directly.
Encoder shape (verified from google/t5gemma-9b-9b-ul2/config.json): layers=42, hidden=3584, heads=16, kv_heads=8, head_dim=256, intermediate=14336, rope_theta=10000.0, max_pos=8192, layer_types alternate sliding_attention / full_attention.
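The encoder shape above implies a few derived sizes, sketched below. One assumption is labeled in the code: the alternation is taken to start with sliding_attention, following the order in which the two types are listed; the page does not say which layer comes first.

```python
# Values copied from the encoder shape listed above.
num_layers = 42
hidden_size = 3584
num_heads = 16
num_kv_heads = 8
head_dim = 256

# Assumption: even-indexed layers use sliding_attention, odd-indexed use full_attention.
layer_types = [
    "sliding_attention" if i % 2 == 0 else "full_attention"
    for i in range(num_layers)
]

q_proj_width = num_heads * head_dim              # total width of the query projection
kv_proj_width = num_kv_heads * head_dim          # total width of each key/value projection
queries_per_kv_head = num_heads // num_kv_heads  # grouped-query sharing factor

print(q_proj_width, kv_proj_width, queries_per_kv_head)
```

Note that `num_heads * head_dim` does not equal `hidden_size` here, so `head_dim` has to be read from the config rather than derived from the hidden size.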