methods¶
Classes¶
fastvideo.train.methods.TrainingMethod¶
Bases: Module, ABC
Base training method (algorithm layer).
Subclasses own their role models (student, teacher, critic, …) as
plain attributes and manage optimizers directly — no RoleManager
or RoleHandle.
The constructor receives role_models (a dict[str, ModelBase])
and a cfg object. It calls init_preprocessors on the student
and builds self.role_modules for FSDP wrapping.
A single shared CUDA RNG generator (cuda_generator) is
created in on_train_start. All torch.randn /
torch.randint calls in methods and models must use this
generator instead of relying on global RNG state.
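The intended usage pattern can be sketched as follows (a minimal illustration, assuming the generator is created once up front; the real code builds it on the CUDA device in on_train_start):

```python
import torch

# Hypothetical sketch: one shared generator, seeded once.
device = "cuda" if torch.cuda.is_available() else "cpu"
cuda_generator = torch.Generator(device=device).manual_seed(42)

# Every random draw routes through the shared generator,
# never the global RNG state:
noise = torch.randn(2, 4, device=device, generator=cuda_generator)
timesteps = torch.randint(0, 1000, (2,), device=device,
                          generator=cuda_generator)
```

Passing the generator explicitly keeps sampling reproducible and checkpointable, since the generator's state can be saved and restored independently of the global RNG.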
Source code in fastvideo/train/methods/base.py
Functions¶
fastvideo.train.methods.TrainingMethod.checkpoint_state¶
Return DCP-ready checkpoint state for all trainable roles.
Keys follow the convention:
roles.<role>.<module>, optimizers.<role>,
schedulers.<role>, random_state.*.
EMA state is managed by the EMACallback and is
checkpointed through the callback state mechanism.
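The key convention can be illustrated with a hypothetical state dict (the values here are placeholders; the real entries are DCP-compatible state dicts):

```python
# Hypothetical sketch of the checkpoint key layout described above;
# values are placeholders, not real state dicts.
state = {
    "roles.student.transformer": {},    # module state for the student role
    "optimizers.student": {},           # optimizer state for the student
    "schedulers.student": {},           # LR-scheduler state for the student
    "random_state.cuda_generator": {},  # shared RNG generator state
}

# Every key starts with one of the four documented prefixes:
prefixes = {"roles", "optimizers", "schedulers", "random_state"}
assert all(key.split(".")[0] in prefixes for key in state)
```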
Source code in fastvideo/train/methods/base.py
fastvideo.train.methods.TrainingMethod.get_grad_clip_targets¶
Return modules whose gradients should be clipped.
Override in subclasses to add modules or include them conditionally (e.g. the critic always, or the student only under certain conditions). Default: the student transformer.
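A hypothetical subclass sketch (class and attribute names are illustrative, not the real API) of how such an override might look:

```python
import torch.nn as nn

# Illustrative stand-in for a TrainingMethod subclass that also
# trains a critic and clips its gradients conditionally.
class TwoRoleMethod:
    def __init__(self, clip_critic: bool):
        self.student_transformer = nn.Linear(8, 8)
        self.critic_transformer = nn.Linear(8, 8)
        self.clip_critic = clip_critic

    def get_grad_clip_targets(self):
        targets = [self.student_transformer]  # the documented default
        if self.clip_critic:
            targets.append(self.critic_transformer)
        return targets
```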
Source code in fastvideo/train/methods/base.py
fastvideo.train.methods.TrainingMethod.seed_optimizer_state_for_resume¶
Seed optimizer state so DCP can load saved state.
A fresh optimizer has empty state: exp_avg, exp_avg_sq, and step are only created on the first optimizer.step(). DCP needs matching entries to load into; without them, the saved optimizer state is silently dropped.
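One common way to materialize that state is a single step with all-zero gradients (a sketch of the general technique, not the actual seeding logic in base.py):

```python
import torch

model = torch.nn.Linear(4, 4)
# weight_decay=0 so a zero-gradient step leaves the weights untouched.
opt = torch.optim.AdamW(model.parameters(), weight_decay=0.0)
assert len(opt.state) == 0  # fresh optimizer: no per-parameter state yet

# A zero-gradient step creates exp_avg / exp_avg_sq / step entries
# without changing any parameter values:
for p in model.parameters():
    p.grad = torch.zeros_like(p)
opt.step()
opt.zero_grad(set_to_none=True)
assert all("exp_avg" in opt.state[p] for p in model.parameters())
```

After this, the optimizer's state_dict has the same entry structure as a checkpointed one, so a DCP load has somewhere to write into.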
Source code in fastvideo/train/methods/base.py
Modules¶
fastvideo.train.methods.base¶
Classes¶
fastvideo.train.methods.base.TrainingMethod¶
Bases: Module, ABC
Base training method (algorithm layer).
Subclasses own their role models (student, teacher, critic, …) as
plain attributes and manage optimizers directly — no RoleManager
or RoleHandle.
The constructor receives role_models (a dict[str, ModelBase])
and a cfg object. It calls init_preprocessors on the student
and builds self.role_modules for FSDP wrapping.
A single shared CUDA RNG generator (cuda_generator) is
created in on_train_start. All torch.randn /
torch.randint calls in methods and models must use this
generator instead of relying on global RNG state.
Source code in fastvideo/train/methods/base.py
Functions¶
fastvideo.train.methods.base.TrainingMethod.checkpoint_state¶
Return DCP-ready checkpoint state for all trainable roles.
Keys follow the convention:
roles.<role>.<module>, optimizers.<role>,
schedulers.<role>, random_state.*.
EMA state is managed by the EMACallback and is
checkpointed through the callback state mechanism.
Source code in fastvideo/train/methods/base.py
fastvideo.train.methods.base.TrainingMethod.get_grad_clip_targets¶
Return modules whose gradients should be clipped.
Override in subclasses to add modules or include them conditionally (e.g. the critic always, or the student only under certain conditions). Default: the student transformer.
Source code in fastvideo/train/methods/base.py
fastvideo.train.methods.base.TrainingMethod.seed_optimizer_state_for_resume¶
Seed optimizer state so DCP can load saved state.
A fresh optimizer has empty state: exp_avg, exp_avg_sq, and step are only created on the first optimizer.step(). DCP needs matching entries to load into; without them, the saved optimizer state is silently dropped.
Source code in fastvideo/train/methods/base.py
Functions¶
fastvideo.train.methods.distribution_matching¶
Classes¶
fastvideo.train.methods.distribution_matching.DMD2Method¶
Bases: TrainingMethod
DMD2 distillation algorithm (method layer).
Owns role model instances directly:
- self.student — trainable student ModelBase
- self.teacher — frozen teacher ModelBase
- self.critic — trainable critic ModelBase
Source code in fastvideo/train/methods/distribution_matching/dmd2.py
fastvideo.train.methods.distribution_matching.SelfForcingMethod¶
Bases: DMD2Method
Self-Forcing DMD2 (distribution matching) method.
Requires a causal student implementing CausalModelBase.
Source code in fastvideo/train/methods/distribution_matching/self_forcing.py
Modules¶
fastvideo.train.methods.distribution_matching.dmd2¶
DMD2 distillation method (algorithm layer).
Classes¶
fastvideo.train.methods.distribution_matching.dmd2.DMD2Method¶
Bases: TrainingMethod
DMD2 distillation algorithm (method layer).
Owns role model instances directly:
- self.student — trainable student ModelBase
- self.teacher — frozen teacher ModelBase
- self.critic — trainable critic ModelBase
Source code in fastvideo/train/methods/distribution_matching/dmd2.py
Functions¶
fastvideo.train.methods.distribution_matching.self_forcing¶
Self-Forcing distillation method (algorithm layer).
Classes¶
fastvideo.train.methods.distribution_matching.self_forcing.SelfForcingMethod¶
Bases: DMD2Method
Self-Forcing DMD2 (distribution matching) method.
Requires a causal student implementing CausalModelBase.
Source code in fastvideo/train/methods/distribution_matching/self_forcing.py
Functions¶
fastvideo.train.methods.fine_tuning¶
Classes¶
fastvideo.train.methods.fine_tuning.DiffusionForcingSFTMethod¶
Bases: TrainingMethod
Diffusion-forcing SFT (DFSFT): train only student
with inhomogeneous timesteps.
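What "inhomogeneous timesteps" means here can be sketched as follows (shapes are illustrative assumptions, not the real tensor layout): each latent frame draws its own timestep, instead of one timestep shared by the whole sample.

```python
import torch

batch_size, num_frames = 2, 21
gen = torch.Generator().manual_seed(0)

# Homogeneous (plain SFT): one timestep per sample.
homogeneous = torch.randint(0, 1000, (batch_size, 1), generator=gen)

# Inhomogeneous (diffusion forcing): one timestep per frame.
per_frame = torch.randint(0, 1000, (batch_size, num_frames), generator=gen)
```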
Source code in fastvideo/train/methods/fine_tuning/dfsft.py
fastvideo.train.methods.fine_tuning.FineTuneMethod¶
Bases: TrainingMethod
Supervised finetuning: only student participates.
Source code in fastvideo/train/methods/fine_tuning/finetune.py
Modules¶
fastvideo.train.methods.fine_tuning.dfsft¶
Diffusion-forcing SFT method (DFSFT; algorithm layer).
Classes¶
fastvideo.train.methods.fine_tuning.dfsft.DiffusionForcingSFTMethod¶
Bases: TrainingMethod
Diffusion-forcing SFT (DFSFT): train only student
with inhomogeneous timesteps.
Source code in fastvideo/train/methods/fine_tuning/dfsft.py
Functions¶
fastvideo.train.methods.fine_tuning.finetune¶
Supervised finetuning method (algorithm layer).
Classes¶
fastvideo.train.methods.fine_tuning.finetune.FineTuneMethod¶
Bases: TrainingMethod
Supervised finetuning: only student participates.
Source code in fastvideo/train/methods/fine_tuning/finetune.py
Functions¶
fastvideo.train.methods.knowledge_distillation¶
Classes¶
fastvideo.train.methods.knowledge_distillation.KDCausalMethod¶
Bases: KDMethod
KD for causal Wan: per-frame block-quantized timestep sampling.
Identical to KDMethod except single_train_step samples
a per-frame denoising step index (block-quantized to groups of
num_frames_per_block frames) instead of one index per batch.
This matches the legacy ODEInitTrainingPipeline training scheme
required by causal / streaming student models.
Additional YAML field under method:

    num_frames_per_block: 3  # frames sharing the same noise level
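The block-quantized sampling can be sketched like this (frame and step counts are illustrative assumptions): every group of num_frames_per_block consecutive frames shares one denoising step index.

```python
import torch

num_frames, num_frames_per_block, num_steps = 21, 3, 4
gen = torch.Generator().manual_seed(0)

# One step index per block, then repeated across the block's frames,
# so frames within a block share the same noise level.
num_blocks = num_frames // num_frames_per_block
block_idx = torch.randint(0, num_steps, (num_blocks,), generator=gen)
per_frame_idx = block_idx.repeat_interleave(num_frames_per_block)
```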
Source code in fastvideo/train/methods/knowledge_distillation/kd.py
fastvideo.train.methods.knowledge_distillation.KDMethod¶
Bases: TrainingMethod
Knowledge Distillation training method.
Trains the student with MSE loss on teacher ODE trajectories cached
to method_config.teacher_path_cache.
Roles:
- student (required, trainable): the model being distilled.
- teacher (optional, non-trainable): used to generate the cache on first run; freed from GPU memory afterwards.
If the cache is incomplete and no teacher is configured, an error is raised at the start of training.
Source code in fastvideo/train/methods/knowledge_distillation/kd.py
Modules¶
fastvideo.train.methods.knowledge_distillation.kd¶
Knowledge Distillation method for ODE-init training.
Trains a student model with MSE loss to reproduce a teacher model's
multi-step ODE denoising trajectories. The resulting checkpoint
(exported via dcp_to_diffusers) serves as the ode_init weight
initialization for downstream Self-Forcing training.
Teacher path generation is cached to disk so it only runs once. Interrupted generation resumes from the last completed sample.
Typical YAML:

    models:
      student:
        _target_: fastvideo.train.models.wan.WanModel
        init_from: Wan-AI/Wan2.1-T2V-1.3B-Diffusers
        trainable: true
      teacher:  # omit once cache is complete
        _target_: fastvideo.train.models.wan.WanModel
        init_from: Wan-AI/Wan2.1-T2V-14B-Diffusers
        trainable: false
        disable_custom_init_weights: true
    method:
      _target_: fastvideo.train.methods.knowledge_distillation.kd.KDMethod
      teacher_path_cache: /data/kd_cache/wan14b_4step
      t_list: [999, 937, 833, 624, 0]  # integer timesteps
      student_sample_steps: 4
      teacher_guidance_scale: 1.0
Classes¶
fastvideo.train.methods.knowledge_distillation.kd.KDCausalMethod¶
Bases: KDMethod
KD for causal Wan: per-frame block-quantized timestep sampling.
Identical to KDMethod except single_train_step samples
a per-frame denoising step index (block-quantized to groups of
num_frames_per_block frames) instead of one index per batch.
This matches the legacy ODEInitTrainingPipeline training scheme
required by causal / streaming student models.
Additional YAML field under method:

    num_frames_per_block: 3  # frames sharing the same noise level
Source code in fastvideo/train/methods/knowledge_distillation/kd.py
fastvideo.train.methods.knowledge_distillation.kd.KDMethod¶
Bases: TrainingMethod
Knowledge Distillation training method.
Trains the student with MSE loss on teacher ODE trajectories cached
to method_config.teacher_path_cache.
Roles:
- student (required, trainable): the model being distilled.
- teacher (optional, non-trainable): used to generate the cache on first run; freed from GPU memory afterwards.
If the cache is incomplete and no teacher is configured, an error is raised at the start of training.