gen3c

Classes

fastvideo.configs.pipelines.gen3c.Gen3CConfig dataclass

Gen3CConfig(model_path: str = '', pipeline_config_path: str | None = None, embedded_cfg_scale: int = 6, flow_shift: float = 1.0, flow_shift_sr: float | None = None, disable_autocast: bool = False, is_causal: bool = False, dit_config: DiTConfig = Gen3CVideoConfig(), dit_precision: str = 'bf16', upsampler_config: UpsamplerConfig = UpsamplerConfig(), upsampler_precision: str = 'fp32', vae_config: VAEConfig = Gen3CVAEConfig(), vae_precision: str = 'bf16', vae_tiling: bool = True, vae_sp: bool = True, image_encoder_config: EncoderConfig = EncoderConfig(), image_encoder_precision: str = 'fp32', text_encoder_configs: tuple[EncoderConfig, ...] = (lambda: (_Gen3CT5LargeConfig(),))(), text_encoder_precisions: tuple[str, ...] = (lambda: ('bf16',))(), preprocess_text_funcs: tuple[Callable[[str], str], ...] = (lambda: (preprocess_text,))(), postprocess_text_funcs: tuple[Callable[[BaseEncoderOutput], Tensor], ...] = (lambda: (t5_large_postprocess_text,))(), dmd_denoising_steps: list[int] | None = None, ti2v_task: bool = False, boundary_ratio: float | None = None, conditioning_strategy: str = 'frame_replace', min_num_conditional_frames: int = 1, max_num_conditional_frames: int = 2, sigma_conditional: float = 0.001, sigma_data: float = 0.5, state_ch: int = 16, state_t: int = 16, text_encoder_class: str = 'T5', frame_buffer_max: int = 2, noise_aug_strength: float = 0.0, filter_points_threshold: float = 0.05, use_moge_depth: bool = True, moge_model_name: str = 'Ruicheng/moge-vitl', offload_moge_after_depth: bool = True, default_trajectory_type: str = 'left', default_movement_distance: float = 0.3, default_camera_rotation: str = 'center_facing', video_resolution: tuple[int, int] = (704, 1280), num_frames: int = 121, fps: int = 24, cfg_behavior: str = 'legacy', default_negative_prompt: str = 'The video captures a series of frames showing ugly scenes, static with no motion, motion blur, over-saturation, shaky footage, low resolution, grainy texture, pixelated images, poorly lit areas, underexposed and overexposed scenes, poor color balance, washed out colors, choppy sequences, jerky movements, low frame rate, artifacting, color banding, unnatural transitions, outdated special effects, fake elements, unconvincing visuals, poorly edited content, jump cuts, visual noise, and flickering. Overall, the video is of poor quality.', autoregressive_chunk_frames: int = 121, autoregressive_overlap_frames: int = 1)

Bases: PipelineConfig

Configuration for GEN3C Video Generation Pipeline.

GEN3C extends Cosmos with a 3D cache for camera-controlled video generation.

Key parameters:

- frame_buffer_max: Number of 3D cache buffers (default: 2)
- noise_aug_strength: Strength of noise augmentation per buffer
- filter_points_threshold: Threshold for filtering unreliable depth points
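As a rough illustration of how a per-buffer noise augmentation controlled by `noise_aug_strength` could behave, the sketch below perturbs each cached buffer with Gaussian noise. The helper name `augment_buffers` is hypothetical and not part of the fastvideo API; it only demonstrates the parameter's role.

```python
import torch


def augment_buffers(buffers: list[torch.Tensor],
                    noise_aug_strength: float = 0.0) -> list[torch.Tensor]:
    """Hypothetical sketch: perturb each 3D-cache buffer with Gaussian noise.

    With the default strength of 0.0, buffers pass through unchanged.
    """
    return [b + noise_aug_strength * torch.randn_like(b) for b in buffers]


# frame_buffer_max = 2 -> two cached buffers in this sketch
buffers = [torch.zeros(3, 4), torch.zeros(3, 4)]
out = augment_buffers(buffers, noise_aug_strength=0.0)
```

With the default strength of 0.0 the augmentation is an identity; increasing it trades cache fidelity for robustness to imperfect depth.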

fastvideo.configs.pipelines.gen3c.Gen3CInferenceConfig dataclass

Gen3CInferenceConfig(model_path: str = '', pipeline_config_path: str | None = None, embedded_cfg_scale: int = 6, flow_shift: float = 1.0, flow_shift_sr: float | None = None, disable_autocast: bool = False, is_causal: bool = False, dit_config: DiTConfig = Gen3CVideoConfig(), dit_precision: str = 'bf16', upsampler_config: UpsamplerConfig = UpsamplerConfig(), upsampler_precision: str = 'fp32', vae_config: VAEConfig = Gen3CVAEConfig(), vae_precision: str = 'bf16', vae_tiling: bool = True, vae_sp: bool = True, image_encoder_config: EncoderConfig = EncoderConfig(), image_encoder_precision: str = 'fp32', text_encoder_configs: tuple[EncoderConfig, ...] = (lambda: (_Gen3CT5LargeConfig(),))(), text_encoder_precisions: tuple[str, ...] = (lambda: ('bf16',))(), preprocess_text_funcs: tuple[Callable[[str], str], ...] = (lambda: (preprocess_text,))(), postprocess_text_funcs: tuple[Callable[[BaseEncoderOutput], Tensor], ...] = (lambda: (t5_large_postprocess_text,))(), dmd_denoising_steps: list[int] | None = None, ti2v_task: bool = False, boundary_ratio: float | None = None, conditioning_strategy: str = 'frame_replace', min_num_conditional_frames: int = 1, max_num_conditional_frames: int = 2, sigma_conditional: float = 0.001, sigma_data: float = 0.5, state_ch: int = 16, state_t: int = 16, text_encoder_class: str = 'T5', frame_buffer_max: int = 2, noise_aug_strength: float = 0.0, filter_points_threshold: float = 0.05, use_moge_depth: bool = True, moge_model_name: str = 'Ruicheng/moge-vitl', offload_moge_after_depth: bool = True, default_trajectory_type: str = 'left', default_movement_distance: float = 0.3, default_camera_rotation: str = 'center_facing', video_resolution: tuple[int, int] = (704, 1280), num_frames: int = 121, fps: int = 24, cfg_behavior: str = 'legacy', default_negative_prompt: str = 'The video captures a series of frames showing ugly scenes, static with no motion, motion blur, over-saturation, shaky footage, low resolution, grainy texture, pixelated images, poorly lit areas, underexposed and overexposed scenes, poor color balance, washed out colors, choppy sequences, jerky movements, low frame rate, artifacting, color banding, unnatural transitions, outdated special effects, fake elements, unconvincing visuals, poorly edited content, jump cuts, visual noise, and flickering. Overall, the video is of poor quality.', autoregressive_chunk_frames: int = 121, autoregressive_overlap_frames: int = 1, batch_size: int = 1, gradient_checkpointing: bool = False, guidance_scale: float = 1.0, num_inference_steps: int = 35)

Bases: Gen3CConfig

Configuration for GEN3C inference with optimized defaults.
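For reference, overriding the inference-specific defaults looks like the following sketch (assuming fastvideo is installed; the field names come from the signature above, and the `model_path` value is a hypothetical placeholder):

```python
from fastvideo.configs.pipelines.gen3c import Gen3CInferenceConfig

# Keep the GEN3C pipeline defaults, but adjust sampler settings.
config = Gen3CInferenceConfig(
    model_path="path/to/gen3c-checkpoint",  # hypothetical placeholder
    guidance_scale=1.0,       # inference-only field added by this subclass
    num_inference_steps=35,   # inference-only field added by this subclass
    num_frames=121,           # inherited from Gen3CConfig
)
```

Fields not passed explicitly fall back to the defaults inherited from `Gen3CConfig`.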

Functions

fastvideo.configs.pipelines.gen3c.t5_large_postprocess_text

t5_large_postprocess_text(outputs: BaseEncoderOutput) -> Tensor

Postprocess T5 Large text encoder outputs for GEN3C pipeline.

Return raw last_hidden_state without truncation/padding.

Source code in fastvideo/configs/pipelines/gen3c.py
def t5_large_postprocess_text(outputs: BaseEncoderOutput) -> torch.Tensor:
    """Postprocess T5 Large text encoder outputs for GEN3C pipeline.

    Return raw last_hidden_state without truncation/padding.
    """
    hidden_state = outputs.last_hidden_state

    if hidden_state is None:
        raise ValueError("T5 Large outputs missing last_hidden_state")

    # Replace any NaNs produced by the encoder with zeros.
    nan_mask = torch.isnan(hidden_state)
    if nan_mask.any():
        hidden_state = hidden_state.masked_fill(nan_mask, 0.0)

    # Zero out embeddings beyond each sequence's actual length (vectorized)
    if outputs.attention_mask is not None:
        lengths = outputs.attention_mask.sum(dim=1)
        max_len = hidden_state.shape[1]
        mask = (torch.arange(max_len, device=hidden_state.device)[None, :]
                >= lengths[:, None])
        hidden_state[mask] = 0.0

    return hidden_state
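A small check of the masking behavior, using `types.SimpleNamespace` as a stand-in for `BaseEncoderOutput` (an assumption for illustration; the real pipeline passes actual encoder outputs):

```python
import torch
from types import SimpleNamespace

# Stand-in for BaseEncoderOutput: batch of 2 sequences, max length 5, dim 4.
outputs = SimpleNamespace(
    last_hidden_state=torch.ones(2, 5, 4),
    attention_mask=torch.tensor([[1, 1, 1, 0, 0],
                                 [1, 1, 1, 1, 1]]),
)

# Same vectorized masking as in t5_large_postprocess_text above.
hidden = outputs.last_hidden_state
lengths = outputs.attention_mask.sum(dim=1)  # valid lengths: [3, 5]
mask = torch.arange(hidden.shape[1])[None, :] >= lengths[:, None]
hidden[mask] = 0.0  # positions past each sequence's length are zeroed
```

The first sequence keeps embeddings only for its three valid tokens, while the fully attended second sequence is untouched.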