hunyuangamecraft

Pipeline configuration for HunyuanGameCraft.

HunyuanGameCraft extends HunyuanVideo with:

1. CameraNet for camera/action conditioning (Plücker coordinates)
2. Mask-based conditioning for autoregressive generation
3. 33 input channels (16 latent + 16 gt_latent + 1 mask)
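The 33-channel input layout can be sketched with a toy concatenation. The shapes below are illustrative placeholders, not the model's actual latent dimensions; only the channel split (16 + 16 + 1) comes from the description above:

```python
import numpy as np

# Illustrative latent shapes: (batch, channels, frames, height, width).
b, t, h, w = 1, 9, 30, 52
latent = np.zeros((b, 16, t, h, w), dtype=np.float32)     # noisy latent
gt_latent = np.zeros((b, 16, t, h, w), dtype=np.float32)  # ground-truth/history latent
mask = np.zeros((b, 1, t, h, w), dtype=np.float32)        # conditioning mask

# The DiT receives the three tensors stacked along the channel axis.
dit_input = np.concatenate([latent, gt_latent, mask], axis=1)
print(dit_input.shape[1])  # → 33 (16 + 16 + 1)
```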

Text encoders are the same as HunyuanVideo:

- LLaVA-LLaMA-3-8B for primary text encoding (4096 dim)
- CLIP ViT-L/14 for secondary pooled embeddings (768 dim)

Classes

fastvideo.configs.pipelines.hunyuangamecraft.HunyuanGameCraftPipelineConfig dataclass

```python
HunyuanGameCraftPipelineConfig(
    model_path: str = '',
    pipeline_config_path: str | None = None,
    embedded_cfg_scale: float = 6.0,
    flow_shift: int = 5,
    flow_shift_sr: float | None = None,
    disable_autocast: bool = False,
    is_causal: bool = False,
    dit_config: DiTConfig = HunyuanGameCraftConfig(),
    dit_precision: str = 'bf16',
    upsampler_config: UpsamplerConfig = UpsamplerConfig(),
    upsampler_precision: str = 'fp32',
    vae_config: VAEConfig = GameCraftVAEConfig(),
    vae_precision: str = 'fp16',
    vae_tiling: bool = True,
    vae_sp: bool = True,
    image_encoder_config: EncoderConfig = EncoderConfig(),
    image_encoder_precision: str = 'fp32',
    text_encoder_configs: tuple[EncoderConfig, ...] = (lambda: (LlamaConfig(), CLIPTextConfig()))(),
    text_encoder_precisions: tuple[str, ...] = (lambda: ('fp16', 'fp16'))(),
    preprocess_text_funcs: tuple[Callable[[str], str], ...] = (lambda: (llama_preprocess_text, clip_preprocess_text))(),
    postprocess_text_funcs: tuple[Callable[[BaseEncoderOutput], Tensor], ...] = (lambda: (llama_postprocess_text, clip_postprocess_text))(),
    dmd_denoising_steps: list[int] | None = None,
    ti2v_task: bool = False,
    boundary_ratio: float | None = None,
)
```

Bases: PipelineConfig

Configuration for HunyuanGameCraft pipeline.

Inherits text encoding from HunyuanVideo but uses:

- GameCraft DiT with CameraNet
- Same VAE (HunyuanVAE)
- Same text encoders (LLaMA + CLIP)
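As a configuration sketch, defaults from the dataclass signature can be overridden at construction time. The field names come directly from the signature; the `model_path` value is a hypothetical placeholder, and this assumes the `fastvideo` package is installed:

```python
from fastvideo.configs.pipelines.hunyuangamecraft import (
    HunyuanGameCraftPipelineConfig,
)

# Hypothetical checkpoint path; substitute your actual weights directory.
config = HunyuanGameCraftPipelineConfig(
    model_path="checkpoints/hunyuan-gamecraft",
    embedded_cfg_scale=6.0,  # default embedded CFG scale
    flow_shift=5,            # default flow-matching shift
    vae_tiling=True,         # tiled VAE decoding (default)
)
```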