eval
¶
Classes¶
fastvideo.eval.BaseMetric
¶
Abstract base class for all eval metrics.
Two execution shapes:
- Per-sample (is_set_metric=False, default): implement :meth:compute. The Evaluator calls it once per input sample and returns one :class:MetricResult per sample.
- Set-vs-set (is_set_metric=True): implement :meth:accumulate (called once per sample to buffer features) and :meth:finalize (called once after all samples to compute the corpus-level result). Use :meth:reset to clear buffers and :meth:merge_from to fold multi-GPU per-worker state together.
Optionally override :meth:setup to eagerly load models. Metrics
that chunk along the time dim for memory hardcode their own chunk
size in __init__ (see optical_flow for the canonical
example). Eval always processes one video per
:meth:Evaluator.evaluate call; compute / accumulate
receive a single sample, not a batch.
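The two execution shapes can be illustrated without importing fastvideo. The sketch below is a stand-in only: `SketchMetric`, `MeanBrightness`, `CorpusMean`, and `run` are hypothetical names, samples hold plain lists of floats rather than (T, C, H, W) tensors, and scores are bare floats rather than MetricResult objects.

```python
from __future__ import annotations


class SketchMetric:
    """Hypothetical stand-in for BaseMetric; not the fastvideo class."""

    is_set_metric = False

    def compute(self, sample: dict) -> float:
        raise NotImplementedError

    def accumulate(self, sample: dict) -> None:
        raise NotImplementedError

    def finalize(self) -> float:
        raise NotImplementedError


class MeanBrightness(SketchMetric):
    """Per-sample shape: one score per sample, no state kept."""

    def compute(self, sample: dict) -> float:
        frames = sample["video"]  # here: a flat list of floats in [0, 1]
        return sum(frames) / len(frames)


class CorpusMean(SketchMetric):
    """Set-vs-set shape: buffer per sample, score once at the end."""

    is_set_metric = True

    def __init__(self) -> None:
        self._buffer: list[float] = []

    def accumulate(self, sample: dict) -> None:
        self._buffer.extend(sample["video"])

    def finalize(self) -> float:
        return sum(self._buffer) / len(self._buffer)

    def reset(self) -> None:
        self._buffer.clear()

    def merge_from(self, other: "CorpusMean") -> None:
        # Fold another worker's buffered state into this one.
        self._buffer.extend(other._buffer)


def run(metric: SketchMetric, samples: list[dict]):
    """Drive a metric in the order the text describes."""
    if not metric.is_set_metric:
        return [metric.compute(s) for s in samples]
    for s in samples:
        metric.accumulate(s)
    return metric.finalize()


samples = [{"video": [0.0, 1.0]}, {"video": [0.25, 0.25, 0.25, 0.25]}]
per_sample = run(MeanBrightness(), samples)  # one value per sample
corpus = run(CorpusMean(), samples)          # one value for the whole set
```

The per-sample metric returns a list with one entry per input, while the set metric returns a single value. That difference is why the two shapes need different hooks.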
Source code in fastvideo/eval/metrics/base.py
Functions¶
fastvideo.eval.BaseMetric.compute
¶
compute(sample: dict) -> MetricResult
Per-sample metrics: compute the score for one sample.
sample["video"] is (T, C, H, W) float in [0, 1].
sample["reference"] (if used) has the same shape. Return
self._skip(sample, reason) for missing inputs.
Source code in fastvideo/eval/metrics/base.py
fastvideo.eval.BaseMetric.finalize
¶
finalize() -> MetricResult
fastvideo.eval.BaseMetric.merge_from
¶
merge_from(other: BaseMetric) -> None
fastvideo.eval.BaseMetric.reset
¶
fastvideo.eval.BaseMetric.setup
¶
Eagerly load models. Called once by :class:EvalWorker.
Default is a no-op; metrics with no eager state (pixel math, closed-form ops) inherit this. Override only if your metric needs to load weights.
Source code in fastvideo/eval/metrics/base.py
fastvideo.eval.BaseMetric.to
¶
to(device: str | device) -> BaseMetric
fastvideo.eval.EvalResults
¶
EvalResults(samples: list[dict[str, MetricResult]] | None = None, corpus: dict[str, MetricResult] | None = None)
Bases: list
Return type for :meth:Evaluator.evaluate with samples=....
Behaves like a list[dict[str, MetricResult]] — one dict per
input sample, in input order — so existing iteration and indexing
keeps working. The corpus attribute carries set-metric results
(FAD, IS, …) that are properties of the whole input set, not of
any individual sample. Empty dict when no set metric ran.
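As a rough picture of the list-plus-corpus behaviour described above, here is a minimal stand-in. `SketchEvalResults` is a hypothetical class written for this example, and the scores are placeholder numbers, not measured values; the real class stores MetricResult objects.

```python
class SketchEvalResults(list):
    """Hypothetical stand-in: a list of per-sample dicts plus a corpus dict."""

    def __init__(self, samples=None, corpus=None):
        super().__init__(samples or [])
        # Set-metric results live beside the list, not inside any row.
        self.corpus = dict(corpus or {})


results = SketchEvalResults(
    samples=[{"common.lpips": 0.12}, {"common.lpips": 0.30}],  # placeholders
    corpus={"common.fvd": 41.5},                               # placeholder
)

# Ordinary list operations keep working.
n = len(results)
first = results[0]["common.lpips"]
per_sample_scores = [row["common.lpips"] for row in results]

# With no set metric, corpus is an empty dict.
no_set_metric = SketchEvalResults(samples=[{"common.lpips": 0.2}])
```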
Source code in fastvideo/eval/types.py
fastvideo.eval.Evaluator
¶
Evaluator(metrics: list[str] | str = 'all', device: str = 'cuda:0', num_gpus: int = 1, compile: bool = False, *, loader_threads: int = 1, prefetch_factor: int = 2, pre_upload: bool = True, skip_missing_deps: bool = False)
Pre-initialized scorer for repeated evaluation.
Parameters¶
metrics : list[str] | str
Metric names, group prefixes ("vbench"), or "all".
device : str
Single-GPU device (e.g. "cuda:0"). Ignored when num_gpus > 1.
num_gpus : int
Number of GPU replicas. Each gets its own :class:EvalWorker.
compile : bool
Apply :func:torch.compile to each metric's _model.
loader_threads : int
Background decode threads in the :class:VideoPool. Default 1
(hide decode behind compute). Bump for I/O-heavy benchmark sets
where one loader can't keep up with the workers.
prefetch_factor : int
pool max_size = prefetch_factor * num_workers. Default 2 —
one sample being consumed, one prefetched per worker.
pre_upload : bool
When True (default), the worker performs a single
host→device upload of video / reference per sample
before the metric loop, and every metric reads from that
shared GPU-resident tensor. Without it, each metric pays its
own .to(self.device) — N transfers of the same clip for N
metrics, which dominates at high resolution. Set False for
training-time eval, where keeping a clip resident on GPU
across the metric loop would fight the training step for VRAM.
skip_missing_deps : bool
When True, drop explicit metric names whose optional deps
aren't importable instead of raising, and emit a one-line warning
per skipped metric. Default False: an explicit name with a
missing dep raises :class:ImportError at construction time.
Group selectors ("vbench", "all") always skip metrics with
missing deps without raising, regardless of this flag.
Source code in fastvideo/eval/evaluator.py
Functions¶
fastvideo.eval.Evaluator.evaluate
¶
evaluate(samples: Iterable[dict] | None = None, *, metrics: list[str] | None = None, **kwargs) -> dict[str, MetricResult] | EvalResults
Score one sample (kwargs form) or many samples (list form).
Both forms go through the same :class:VideoPool pipeline;
video / reference paths are decoded asynchronously.
Parameters¶
samples :
Iterable of sample dicts. Omit and pass kwargs for a
single-sample call.
metrics :
Subset of this Evaluator's registered metrics to actually
run on this batch. None (default) runs all registered.
Lets a single long-lived Evaluator score different (gen,
ref) corpora with different metric subsets across multiple
evaluate() calls — e.g. LPIPS on a paired corpus, FVD
on an unequal-cardinality corpus — without burning model
loads. Set-metric accumulators are reset only for the
metrics included in metrics, so state for other set
metrics is preserved across calls.
Single sample::
ev.evaluate(video=tensor, text_prompt="...", fps=24.0)
Many samples::
ev.evaluate(samples=[{"video": ..., "reference": ...}, ...])
Many samples with a metric filter::
ev.evaluate(samples=lpips_samples, metrics=["common.lpips"])
ev.evaluate(samples=fvd_samples, metrics=["common.fvd"])
Returns¶
dict[str, MetricResult] for the single-sample form;
:class:EvalResults (list-of-dict subclass with .corpus) for
the list form.
Source code in fastvideo/eval/evaluator.py
fastvideo.eval.Evaluator.release_cuda_memory
¶
fastvideo.eval.Evaluator.reload
¶
fastvideo.eval.Evaluator.shutdown
¶
fastvideo.eval.Evaluator.unload
¶
fastvideo.eval.MetricResult
dataclass
¶
Standard result container returned by all metrics.
score is None when the metric was skipped (e.g. missing
required input). Check details["skipped"] for the reason.
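A caller aggregating results therefore has to filter out skipped entries first. The sketch below assumes only the two fields named above (`score` and `details`); `SketchResult` and `mean_of_scored` are hypothetical names, and the real dataclass may carry more fields.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SketchResult:
    """Hypothetical stand-in carrying only the two fields named above."""

    score: Optional[float]
    details: dict = field(default_factory=dict)


def mean_of_scored(results):
    """Average the scores that exist; collect skip reasons for the rest."""
    scored = [r.score for r in results if r.score is not None]
    reasons = [r.details.get("skipped") for r in results if r.score is None]
    mean = sum(scored) / len(scored) if scored else None
    return mean, reasons


results = [
    SketchResult(score=0.8),
    SketchResult(score=None, details={"skipped": "missing reference"}),
    SketchResult(score=0.4),
]
mean, reasons = mean_of_scored(results)
```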
fastvideo.eval.Video
dataclass
¶
Video(source: Any, fps: float | None = None, frames: Any = None, audio: Any = None, audio_sr: int | None = None)
Path-backed media handle. The :class:VideoPool populates
frames (and optionally audio) before the metric loop sees
the sample.
Functions¶
fastvideo.eval.as_video
¶
Coerce path/tensor/Video → :class:Video for the pool to decode.
Path strings and :class:pathlib.Path become Video(source=str(x));
the pool then calls :func:load_video on first use. Tensors become
Video(source=None, frames=x) — the pool sees .frames already
populated and forwards untouched. :class:Video instances pass through.
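The three coercion branches can be sketched as follows. `SketchVideo` copies the field list from the Video signature above, but it and `sketch_as_video` are stand-ins for illustration. The sketch also assumes that any input other than a path or a handle is an already-decoded frame tensor, which may be looser than the real type check.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Any, Optional


@dataclass
class SketchVideo:
    """Stand-in with the same fields as the Video signature above."""

    source: Any
    fps: Optional[float] = None
    frames: Any = None
    audio: Any = None
    audio_sr: Optional[int] = None


def sketch_as_video(x: Any) -> SketchVideo:
    if isinstance(x, SketchVideo):
        return x  # already a handle: pass through unchanged
    if isinstance(x, (str, Path)):
        return SketchVideo(source=str(x))  # decoding is deferred to the pool
    # Assumption: anything else is a pre-decoded frame tensor.
    return SketchVideo(source=None, frames=x)


from_path = sketch_as_video(Path("clip.mp4"))
from_frames = sketch_as_video([[0.0, 0.5], [1.0, 0.5]])  # list stands in for a tensor
passthrough = sketch_as_video(from_path)
```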
Source code in fastvideo/eval/io/inputs.py
fastvideo.eval.ensure_checkpoint
¶
Resolve a model checkpoint path, downloading on miss.
See module docstring for the full source contract. name is used only as the local cache filename for URL sources; ignored otherwise.
Source code in fastvideo/eval/models.py
fastvideo.eval.evaluate
¶
evaluate(generated: Tensor | str | Path, reference: Tensor | str | Path | None = None, metrics: list[str] | str = 'all', device: str = 'cuda', **kwargs) -> dict[str, MetricResult] | list[dict[str, MetricResult]]
One-shot evaluation. For repeated use, prefer :func:create_evaluator.
Parameters¶
generated : Tensor | str | Path
Generated video. Either a pre-loaded (T, C, H, W) tensor or a
path to an mp4/avi/etc. — paths are decoded by the worker.
reference : Tensor | str | Path | None
Reference video (same accepted shapes as generated).
metrics : list[str] | str
Metric names, or "all".
device : str
PyTorch device string.
Source code in fastvideo/eval/api.py
fastvideo.eval.get_cache_dir
¶
get_cache_dir() -> Path
Eval cache root.
Layout::
get_cache_dir() / models /    ← URL-fetched checkpoints (LAION head, AMT, GRiT, …)
get_cache_dir() / torch /     ← redirected ``TORCH_HOME`` (DINO etc.)
get_cache_dir() / clip /      ← passed as ``download_root`` to ``clip.load(...)`` callsites
~/.cache/huggingface/hub /    ← left at HF's default; widely shared with other ML projects
Override priority: FASTVIDEO_EVAL_CACHE > ${FASTVIDEO_CACHE_ROOT}/eval.
Metric authors writing new code: when wrapping a third-party loader
that has its own cache convention (CLIP's download_root, pyiqa's
cache_dir, etc.), pass str(get_cache_dir() / "<library>") so
users get a single FASTVIDEO_EVAL_CACHE knob to redirect them all.
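The override priority and the per-library sub-directory convention might look like this in code. This is a sketch, not the library's implementation: `sketch_cache_dir` and `library_cache` are hypothetical names, and the `fallback` root used when neither variable is set is a made-up placeholder, because the text above does not state the real default.

```python
import os
from pathlib import Path


def sketch_cache_dir(env=None, fallback="~/.cache/fastvideo"):
    """Resolve the eval cache root from the two variables named above.

    `fallback` is a placeholder for the case where neither variable is
    set; the real default is not given in the text.
    """
    env = os.environ if env is None else env
    explicit = env.get("FASTVIDEO_EVAL_CACHE")
    if explicit:
        # Highest priority: used as the eval cache root directly.
        return Path(explicit).expanduser()
    root = env.get("FASTVIDEO_CACHE_ROOT") or fallback
    return Path(root).expanduser() / "eval"


def library_cache(library, env=None):
    """Sub-directory to pass to a third-party loader's cache argument."""
    return str(sketch_cache_dir(env) / library)


both = sketch_cache_dir(
    {"FASTVIDEO_EVAL_CACHE": "/data/eval", "FASTVIDEO_CACHE_ROOT": "/data/fv"}
)
root_only = sketch_cache_dir({"FASTVIDEO_CACHE_ROOT": "/data/fv"})
clip_dir = library_cache("clip", {"FASTVIDEO_EVAL_CACHE": "/data/eval"})
```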
Source code in fastvideo/eval/models.py
fastvideo.eval.get_metric
¶
get_metric(name: str, **kwargs: Any) -> BaseMetric
Instantiate a registered metric by name.
Checks that optional dependencies are installed before instantiation and gives a clear install hint pointing at the right extra group.
Source code in fastvideo/eval/registry.py
fastvideo.eval.list_metrics
¶
fastvideo.eval.register
¶
register(name: str)
Decorator to register a metric class.
Usage::
    @register("ssim")
    class SSIMMetric(BaseMetric):
        ...
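For readers unfamiliar with the pattern, a name-keyed class registry of this kind is commonly built as shown below. This is a generic sketch, not the code in fastvideo/eval/registry.py: `_REGISTRY`, `sketch_register`, `sketch_get_metric`, and `sketch_list_metrics` are hypothetical names, and the optional-dependency check that get_metric performs is left out.

```python
_REGISTRY = {}  # maps metric name to metric class


def sketch_register(name):
    """Return a class decorator that files the class under `name`."""

    def decorator(cls):
        _REGISTRY[name] = cls
        return cls  # the class itself is returned unchanged

    return decorator


def sketch_get_metric(name, **kwargs):
    """Instantiate a registered class by name, forwarding kwargs."""
    if name not in _REGISTRY:
        raise KeyError(f"unknown metric {name!r}; known: {sorted(_REGISTRY)}")
    return _REGISTRY[name](**kwargs)


def sketch_list_metrics():
    return sorted(_REGISTRY)


@sketch_register("ssim")
class SketchSSIM:
    def __init__(self, window=11):  # `window` is an illustrative kwarg
        self.window = window


metric = sketch_get_metric("ssim", window=7)
names = sketch_list_metrics()
```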
fastvideo.eval.samples_from
¶
samples_from(*, video: PathSpec | None = None, reference: PathSpec | None = None, audio: PathSpec | None = None, reference_audio: PathSpec | None = None, text_prompt: str | None = None, text_prompts: str | Path | list[str] | None = None, fps: float | None = None, auxiliary_info: dict | list[dict] | None = None, extras: dict | list[dict] | None = None, extract_audio: bool | str | Path = False, extract_workers: int = 4) -> list[dict]
Build a samples list from path-style inputs.
Parameters¶
video, reference, audio, reference_audio :
File path, directory of files (sorted by name), or any iterable
of paths. Pass whichever modalities apply to the metrics you
plan to run — they attach to sample["video"] /
sample["reference"] / sample["audio"] /
sample["reference_audio"] respectively. Video paths are
wrapped in :class:Video so :class:VideoPool decodes them
lazily in parallel; audio paths stay as strings (audio metrics
each load with their own resample / preprocess).
text_prompt :
A single prompt string broadcast onto every sample.
text_prompts :
A list of strings (one per sample), or a path to a .jsonl /
.json file containing per-sample prompts.
fps :
Scalar fps broadcast onto every sample.
auxiliary_info :
Single dict (broadcast) or list of dicts (zipped) for
sample["auxiliary_info"] — vbench structured-prompt
metrics read this.
extras :
Catch-all per-sample attachments. Use for metric-specific keys
the dedicated kwargs don't cover (scenario, view,
actions, calibration, reference_take2, ...). Pass
a single dict to broadcast or a list-of-dicts to zip; the keys
merge into each sample dict.
extract_audio :
If truthy, auto-extract audio from each video /
reference source into .wav files via PyAV and attach
the paths under sample["audio"] / sample["reference_audio"].
Pass a path for a persistent cache, True for a tempdir.
Skipped silently for videos with no audio stream; ignored
wherever audio / reference_audio is already explicit.
extract_workers :
Parallel workers for extract_audio.
Returns¶
list[dict]
Canonical samples shape — hand directly to
:meth:Evaluator.evaluate.
Cardinality and shape¶
Let N = len(generated inputs) (the agreed length of whichever
of video / audio you passed). References are attached
1:1 onto the first N samples; any surplus references (when |ref| > N)
become standalone role-tagged samples at the end of the list, so
set metrics like FVD see the full reference corpus while per-sample
paired metrics like LPIPS only run on the first N pairs.
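The cardinality rule can be sketched as below. `sketch_pair` is a hypothetical helper, and the `"role"` key with value `"reference"` is an invented tag: the text only says the surplus samples are "role-tagged". The case of fewer references than generated inputs is not described above, so this sketch rejects it.

```python
def sketch_pair(videos, references=None):
    """Pair references 1:1 with videos; append surplus references at the end."""
    videos = list(videos)
    references = list(references or [])
    n = len(videos)
    if references and len(references) < n:
        # Not covered by the text; rejected here rather than guessed at.
        raise ValueError("fewer references than generated inputs")

    samples = []
    for i, video in enumerate(videos):
        sample = {"video": video}
        if references:
            sample["reference"] = references[i]
        samples.append(sample)

    # Surplus references become standalone samples. The "role" key is a
    # hypothetical tag name chosen for this sketch.
    for ref in references[n:]:
        samples.append({"reference": ref, "role": "reference"})
    return samples


samples = sketch_pair(
    ["g0.mp4", "g1.mp4"],
    ["r0.mp4", "r1.mp4", "r2.mp4"],
)
paired = [s for s in samples if "video" in s]
surplus = [s for s in samples if "video" not in s]
```

A paired metric would iterate over `paired` only, while a set metric would read the references from all three samples.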
Notes¶
"Missing" keys are simply absent from the sample dict. Metrics
handle them per their own contract (sample.get(...) for
optional, sample[...] raises for required, or
:meth:BaseMetric._skip for opt-in skip behavior). One fat samples
list with many keys can serve many metrics — each reads its subset.
Source code in fastvideo/eval/io/inputs.py
Modules¶
fastvideo.eval.api
¶
Classes¶
Functions¶
fastvideo.eval.api.evaluate
¶
evaluate(generated: Tensor | str | Path, reference: Tensor | str | Path | None = None, metrics: list[str] | str = 'all', device: str = 'cuda', **kwargs) -> dict[str, MetricResult] | list[dict[str, MetricResult]]
One-shot evaluation. For repeated use, prefer :func:create_evaluator.
Parameters¶
generated : Tensor | str | Path
Generated video. Either a pre-loaded (T, C, H, W) tensor or a
path to an mp4/avi/etc. — paths are decoded by the worker.
reference : Tensor | str | Path | None
Reference video (same accepted shapes as generated).
metrics : list[str] | str
Metric names, or "all".
device : str
PyTorch device string.
Source code in fastvideo/eval/api.py
fastvideo.eval.datasets
¶
Prompt-corpus datasets for end-to-end benchmark evaluation.
Public API mirrors :mod:fastvideo.eval (metrics side):
from fastvideo.eval.datasets import (
PromptDataset, Sample,
register_dataset, get_dataset, list_datasets,
)
A dataset is an iterable of plain dicts (one per sample). Built-in
datasets self-register at import time. To add one, drop a module into
this package that subclasses :class:PromptDataset and decorates with
@register_dataset("name") — auto-discovery picks it up.
Classes¶
fastvideo.eval.datasets.PromptDataset
¶
fastvideo.eval.datasets.Sample
¶
Bases: TypedDict
Documented schema for a row yielded by :class:PromptDataset.
Only prompt is required. Extra keys beyond these are forwarded to
the runner's eval-kwargs builder verbatim, so action-conditioned or
audio-bearing benchmarks can add their own fields without changing
the base class.
fastvideo.eval.datasets.VBenchPromptDataset
¶
Bases: PromptDataset
VBench prompts filtered by evaluation dimension.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| dimensions | list[str] \| str | List of dimension names, or | 'all' |
| full_info_path | str \| Path \| None | Optional override for | None |
A prompt that belongs to several requested dimensions is yielded once;
its dimensions list carries all matches so the scorer can route.
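The "yield once, carry all matches" rule might be implemented along these lines. This is a sketch under assumptions: `sketch_filter` is a hypothetical function, the input keys `prompt_en` and `dimension` are assumed from the description of the full-info file, and the two entries are invented examples rather than rows from the real corpus.

```python
def sketch_filter(entries, dimensions="all"):
    """Keep entries matching any requested dimension, one row per prompt."""
    if dimensions == "all":
        wanted = None
    elif isinstance(dimensions, str):
        wanted = {dimensions}
    else:
        wanted = set(dimensions)

    rows = []
    for entry in entries:
        dims = entry["dimension"]
        matches = list(dims) if wanted is None else [d for d in dims if d in wanted]
        if matches:
            # One row per prompt, however many dimensions matched.
            rows.append({"prompt": entry["prompt_en"], "dimensions": matches})
    return rows


entries = [  # invented examples
    {"prompt_en": "a red car", "dimension": ["color", "overall_consistency"]},
    {"prompt_en": "a person walking", "dimension": ["subject_consistency"]},
]
two_dims = sketch_filter(entries, ["color", "overall_consistency"])
everything = sketch_filter(entries)
single = sketch_filter(entries, "subject_consistency")
```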
Source code in fastvideo/eval/datasets/vbench.py
Functions¶
fastvideo.eval.datasets.get_dataset
¶
Instantiate a registered dataset by name.
Source code in fastvideo/eval/datasets/registry.py
fastvideo.eval.datasets.list_datasets
¶
fastvideo.eval.datasets.register_dataset
¶
register_dataset(name: str)
Decorator to register a prompt-dataset class.
Usage::
    @register_dataset("vbench")
    class VBenchPromptDataset(PromptDataset):
        ...
Source code in fastvideo/eval/datasets/registry.py
Modules¶
fastvideo.eval.datasets.base
¶
Prompt-corpus datasets.
A :class:PromptDataset is an iterable of sample dicts describing the
prompts and conditions for a benchmark. Each sample is a plain dict —
no dataclass, no schema enforcement — that flows directly into both
generation (VideoGenerator.generate_video(**sample)) and scoring
(Evaluator.evaluate(**eval_kwargs)). The runner picks well-known
keys (prompt, n_samples, dimensions, auxiliary_info,
...) and passes the rest through.
This matches the surrounding FastVideo style:
- :class:fastvideo.dataset.validation_dataset.ValidationDataset yields dicts.
- :meth:fastvideo.VideoGenerator.generate_video consumes **kwargs.
- :meth:fastvideo.eval.Evaluator.evaluate consumes **kwargs.
To add a new benchmark:
- Subclass :class:PromptDataset, populate self._rows with dicts in __init__.
- Decorate with @register_dataset("my_bench").
Convention for auxiliary_info: a flat dict of metric-keyed values
(e.g. {"color": "red"}). Benchmarks with nested aux schemas (VBench's
{dim: {key: val}}) flatten at load time so every consumer sees the
same shape.
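A load-time flattening step of the kind described could look like this. `sketch_flatten_aux` is a hypothetical helper. How the real loader resolves a key that appears under two dimensions is not stated, so the last-one-wins behaviour here is an assumption.

```python
def sketch_flatten_aux(nested):
    """Collapse {dim: {key: val}} into a flat {key: val} dict."""
    flat = {}
    for per_dimension in nested.values():
        # Assumption: on a key collision the later dimension wins.
        flat.update(per_dimension)
    return flat


nested = {"color": {"color": "red"}}
flat = sketch_flatten_aux(nested)

multi = sketch_flatten_aux(
    {"color": {"color": "red"}, "spatial": {"relation": "left of"}}  # invented
)
```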
Classes¶
fastvideo.eval.datasets.base.PromptDataset
¶
fastvideo.eval.datasets.base.Sample
¶
Bases: TypedDict
Documented schema for a row yielded by :class:PromptDataset.
Only prompt is required. Extra keys beyond these are forwarded to
the runner's eval-kwargs builder verbatim, so action-conditioned or
audio-bearing benchmarks can add their own fields without changing
the base class.
fastvideo.eval.datasets.physics_iq
¶
Physics-IQ benchmark prompt corpus.
Yields one sample dict per take-1 scenario, paired with its take-2
reference and both takes' real motion masks. Each row drops straight
into :meth:fastvideo.eval.Evaluator.evaluate for the physics_iq
metric:
{
    "prompt": <description>,
    "reference": "<take-1 mp4>",
    "reference_take2": "<take-2 mp4>",
    "reference_mask": "<take-1 mask mp4>",
    "reference_take2_mask": "<take-2 mask mp4>",
    "scenario": <scenario_id>,
    "view": <camera view>,
    "auxiliary_info": { ... metadata ... },
}
Self-contained dataset: the manifest CSV is vendored under
fastvideo/eval/metrics/physics_iq/_vendored/descriptions.csv;
per-scenario videos/masks/switch-frames auto-fetch on first use from the public
DeepMind bucket into ${FASTVIDEO_EVAL_CACHE}/datasets/physics_iq/.
Pass auto_download=False (or dataset_root= pointing at a
pre-downloaded copy) to opt out of network fetches.
Classes¶
fastvideo.eval.datasets.physics_iq.PhysicsIQPromptDataset
¶
PhysicsIQPromptDataset(dataset_root: str | Path | None = None, *, fps: int = _DEFAULT_FPS, limit: int | None = None, generated_dir: str | Path | None = None, auto_download: bool = True)
Bases: PromptDataset
Physics-IQ benchmark prompt corpus.
Self-contained: get_dataset("physics_iq") works with no kwargs.
The manifest CSV is vendored next to the metric, and per-scenario
assets auto-fetch on first miss from the public bucket into
${FASTVIDEO_EVAL_CACHE}/datasets/physics_iq/.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| dataset_root | str \| Path \| None | path to a pre-downloaded copy of the Physics-IQ release. Defaults to | None |
| fps | int | target frame rate. The release ships at 30 FPS; other rates transcode once on first access into | _DEFAULT_FPS |
| limit | int \| None | optional truncation for quick smoke runs. Apply this kwarg (not a post-construction slice) so we only fetch the assets for the scenarios actually requested. | None |
| generated_dir | str \| Path \| None | optional directory of pre-generated videos; attaches each manifest row's expected output path to the sample dict under | None |
| auto_download | bool | when True (the default), missing testing videos, masks, and switch frames are fetched from the public bucket into | True |
Source code in fastvideo/eval/datasets/physics_iq.py
fastvideo.eval.datasets.physics_iq.PhysicsIQScenario
dataclass
¶
PhysicsIQScenario(scenario_id: str, view: str, scenario_name: str, take1_video_path: str, take2_video_path: str, switch_frame_path: str, caption: str, expected_gen_filename: str, generated_video_path: str | None = None, take1_mask_path: str | None = None, take2_mask_path: str | None = None)
One row of the Physics-IQ manifest, fully resolved on disk.
Functions¶
fastvideo.eval.datasets.registry
¶
Registry for prompt-corpus datasets, mirroring :mod:fastvideo.eval.registry.
Functions¶
fastvideo.eval.datasets.registry.get_dataset
¶
Instantiate a registered dataset by name.
Source code in fastvideo/eval/datasets/registry.py
fastvideo.eval.datasets.registry.list_datasets
¶
fastvideo.eval.datasets.vbench
¶
VBench prompt corpus.
Single source of truth: upstream's VBench_full_info.json (946 entries,
each with prompt_en, a dimension list, optional auxiliary_info
keyed by dimension).
Classes¶
fastvideo.eval.datasets.vbench.VBenchPromptDataset
¶
Bases: PromptDataset
VBench prompts filtered by evaluation dimension.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| dimensions | list[str] \| str | List of dimension names, or | 'all' |
| full_info_path | str \| Path \| None | Optional override for | None |
A prompt that belongs to several requested dimensions is yielded once;
its dimensions list carries all matches so the scorer can route.
Source code in fastvideo/eval/datasets/vbench.py
Functions¶
fastvideo.eval.evaluator
¶
User-facing scorer.
Layering (mirrors FastVideo's VideoGenerator → Worker pattern, but in-process)::
Evaluator ← user-facing
└── EvalWorker × N ← single-GPU; owns metric replicas
└── VideoPool ← async path-→-tensor prefetch (per evaluate call)
The constructor builds one :class:EvalWorker per GPU and loads every
metric on every worker eagerly. :meth:evaluate is the single entry
point: pass kwargs for one sample, or pass a list of sample dicts to
fan-out across GPU replicas with pipelined decoding — same method,
return type follows the input shape.
Classes¶
fastvideo.eval.evaluator.Evaluator
¶
Evaluator(metrics: list[str] | str = 'all', device: str = 'cuda:0', num_gpus: int = 1, compile: bool = False, *, loader_threads: int = 1, prefetch_factor: int = 2, pre_upload: bool = True, skip_missing_deps: bool = False)
Pre-initialized scorer for repeated evaluation.
Parameters¶
metrics : list[str] | str
Metric names, group prefixes ("vbench"), or "all".
device : str
Single-GPU device (e.g. "cuda:0"). Ignored when num_gpus > 1.
num_gpus : int
Number of GPU replicas. Each gets its own :class:EvalWorker.
compile : bool
Apply :func:torch.compile to each metric's _model.
loader_threads : int
Background decode threads in the :class:VideoPool. Default 1
(hide decode behind compute). Bump for I/O-heavy benchmark sets
where one loader can't keep up with the workers.
prefetch_factor : int
pool max_size = prefetch_factor * num_workers. Default 2 —
one sample being consumed, one prefetched per worker.
pre_upload : bool
When True (default), the worker performs a single
host→device upload of video / reference per sample
before the metric loop, and every metric reads from that
shared GPU-resident tensor. Without it, each metric pays its
own .to(self.device) — N transfers of the same clip for N
metrics, which dominates at high resolution. Set False for
training-time eval, where keeping a clip resident on GPU
across the metric loop would fight the training step for VRAM.
skip_missing_deps : bool
When True, drop explicit metric names whose optional deps
aren't importable instead of raising, and emit a one-line warning
per skipped metric. Default False: an explicit name with a
missing dep raises :class:ImportError at construction time.
Group selectors ("vbench", "all") always skip metrics with
missing deps without raising, regardless of this flag.
Source code in fastvideo/eval/evaluator.py
Functions¶
fastvideo.eval.evaluator.Evaluator.evaluate
¶
evaluate(samples: Iterable[dict] | None = None, *, metrics: list[str] | None = None, **kwargs) -> dict[str, MetricResult] | EvalResults
Score one sample (kwargs form) or many samples (list form).
Both forms go through the same :class:VideoPool pipeline;
video / reference paths are decoded asynchronously.
Parameters¶
samples :
Iterable of sample dicts. Omit and pass kwargs for a
single-sample call.
metrics :
Subset of this Evaluator's registered metrics to actually
run on this batch. None (default) runs all registered.
Lets a single long-lived Evaluator score different (gen,
ref) corpora with different metric subsets across multiple
evaluate() calls — e.g. LPIPS on a paired corpus, FVD
on an unequal-cardinality corpus — without burning model
loads. Set-metric accumulators are reset only for the
metrics included in metrics, so state for other set
metrics is preserved across calls.
Single sample::
ev.evaluate(video=tensor, text_prompt="...", fps=24.0)
Many samples::
ev.evaluate(samples=[{"video": ..., "reference": ...}, ...])
Many samples with a metric filter::
ev.evaluate(samples=lpips_samples, metrics=["common.lpips"])
ev.evaluate(samples=fvd_samples, metrics=["common.fvd"])
Returns¶
dict[str, MetricResult] for the single-sample form;
:class:EvalResults (list-of-dict subclass with .corpus) for
the list form.
Source code in fastvideo/eval/evaluator.py
fastvideo.eval.evaluator.Evaluator.release_cuda_memory
¶
fastvideo.eval.evaluator.Evaluator.reload
¶
fastvideo.eval.evaluator.Evaluator.shutdown
¶
fastvideo.eval.evaluator.Evaluator.unload
¶
Functions¶
fastvideo.eval.io
¶
Classes¶
fastvideo.eval.io.NoAudioStreamError
¶
Bases: ValueError
Raised when a video file has no audio stream to extract.
Functions¶
fastvideo.eval.io.as_video
¶
Coerce path/tensor/Video → :class:Video for the pool to decode.
Path strings and :class:pathlib.Path become Video(source=str(x));
the pool then calls :func:load_video on first use. Tensors become
Video(source=None, frames=x) — the pool sees .frames already
populated and forwards untouched. :class:Video instances pass through.
Source code in fastvideo/eval/io/inputs.py
fastvideo.eval.io.build_eval_kwargs
¶
Build evaluator kwargs from a sample row + a video on disk.
Loads the video as (T,C,H,W) and adds the leading batch dim.
Forwards prompt (as scalar text_prompt) and
auxiliary_info (as scalar dict) when present on the row —
matches the one-sample-per-call contract that the evaluator and
every metric assume.
Source code in fastvideo/eval/io/paths.py
fastvideo.eval.io.default_filename
¶
fastvideo.eval.io.extract_audio_track
¶
extract_audio_track(video_path: str | Path, *, output_dir: str | Path, sample_rate: int | None = None, codec: str = 'pcm_s16le') -> Path
Pull the first audio stream from video_path to a .wav in output_dir.
Idempotent — if output_dir / {stem}.wav already exists, the
existing file is returned without re-decoding. This makes the
helper safe to call repeatedly from parallel workers on overlapping
inputs (the cache hit is the fast path).
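The idempotent "return the existing file" behaviour follows a common cache-by-filename pattern, sketched here without PyAV. `cached_wav` and `fake_extract` are hypothetical names; the real helper decodes audio, whereas this stand-in writes placeholder bytes. The sketch does not attempt to handle two workers writing the same file at the same moment.

```python
import tempfile
from pathlib import Path


def cached_wav(video_path, output_dir, produce):
    """Return output_dir/{stem}.wav, calling `produce` only on a cache miss."""
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)  # created if it doesn't exist
    target = out_dir / f"{Path(video_path).stem}.wav"
    if not target.exists():
        produce(target)  # slow path: extract once
    return target.resolve()


calls = []


def fake_extract(target):
    # Stand-in for the real decode: just writes a placeholder file.
    calls.append(target.name)
    target.write_bytes(b"RIFF")


with tempfile.TemporaryDirectory() as tmp:
    first = cached_wav("clips/take1.mp4", tmp, fake_extract)
    second = cached_wav("clips/take1.mp4", tmp, fake_extract)
    same_path = first == second
    is_absolute = first.is_absolute()
```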
Parameters¶
video_path :
Source video file (anything PyAV can open: mp4, mkv, webm, …).
output_dir :
Directory the .wav lands in. Created if it doesn't exist.
sample_rate :
If set, resample to this rate. Default (None) preserves the
source rate — most audio metrics resample again internally to
their own target rate, so adding a resample step here would just
be wasted work.
codec :
PCM codec for the output. pcm_s16le (default) gives 16-bit
little-endian PCM, the most broadly compatible WAV format.
Returns¶
Path
Absolute path to the extracted .wav.
Raises¶
NoAudioStreamError
If video_path has no audio stream.
Source code in fastvideo/eval/io/audio.py
fastvideo.eval.io.extract_frames
¶
extract_frames(video: Tensor, n_frames: int | None = None) -> Tensor
Uniformly sample n_frames from a (T, C, H, W) video tensor.
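One common way to pick uniformly spaced frames is to spread indices evenly between the first and last frame, as in the sketch below. Whether extract_frames includes both endpoints, and what it does when n_frames is None or exceeds T, is not stated here; the choices in `uniform_indices` (a hypothetical helper) are assumptions.

```python
def uniform_indices(total, n_frames=None):
    """Indices of n_frames evenly spaced frames out of `total`."""
    if n_frames is None or n_frames >= total:
        return list(range(total))  # assumption: keep every frame
    if n_frames < 1:
        raise ValueError("n_frames must be at least 1")
    if n_frames == 1:
        return [0]
    step = (total - 1) / (n_frames - 1)  # first and last frame included
    return [round(i * step) for i in range(n_frames)]


picked = uniform_indices(10, 4)
all_frames = uniform_indices(5)

# Applying the indices to a sequence of frames (strings stand in for tensors).
video = [f"frame{i}" for i in range(10)]
subset = [video[i] for i in picked]
```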
Source code in fastvideo/eval/io/video.py
fastvideo.eval.io.glob_videos
¶
Find every generated video for row, sorted by trailing -<idx>.
Source code in fastvideo/eval/io/paths.py
fastvideo.eval.io.load_video
¶
Load a video as a (T, C, H, W) float32 tensor in [0, 1].
Supported source types:
- str / Path: path to a .mp4/.avi/.gif file, or a directory of frame images (sorted alphabetically).
- torch.Tensor: returned as-is after shape validation.
- list[PIL.Image]: stacked into a tensor.
Source code in fastvideo/eval/io/video.py
fastvideo.eval.io.samples_from
¶
samples_from(*, video: PathSpec | None = None, reference: PathSpec | None = None, audio: PathSpec | None = None, reference_audio: PathSpec | None = None, text_prompt: str | None = None, text_prompts: str | Path | list[str] | None = None, fps: float | None = None, auxiliary_info: dict | list[dict] | None = None, extras: dict | list[dict] | None = None, extract_audio: bool | str | Path = False, extract_workers: int = 4) -> list[dict]
Build a samples list from path-style inputs.
Parameters¶
video, reference, audio, reference_audio :
File path, directory of files (sorted by name), or any iterable
of paths. Pass whichever modalities apply to the metrics you
plan to run — they attach to sample["video"] /
sample["reference"] / sample["audio"] /
sample["reference_audio"] respectively. Video paths are
wrapped in :class:Video so :class:VideoPool decodes them
lazily in parallel; audio paths stay as strings (audio metrics
each load with their own resample / preprocess).
text_prompt :
A single prompt string broadcast onto every sample.
text_prompts :
A list of strings (one per sample), or a path to a .jsonl /
.json file containing per-sample prompts.
fps :
Scalar fps broadcast onto every sample.
auxiliary_info :
Single dict (broadcast) or list of dicts (zipped) for
sample["auxiliary_info"] — vbench structured-prompt
metrics read this.
extras :
Catch-all per-sample attachments. Use for metric-specific keys
the dedicated kwargs don't cover (scenario, view,
actions, calibration, reference_take2, ...). Pass
a single dict to broadcast or a list-of-dicts to zip; the keys
merge into each sample dict.
extract_audio :
If truthy, auto-extract audio from each video /
reference source into .wav files via PyAV and attach
the paths under sample["audio"] / sample["reference_audio"].
Pass a path for a persistent cache, True for a tempdir.
Skipped silently for videos with no audio stream; ignored
wherever audio / reference_audio is already explicit.
extract_workers :
Parallel workers for extract_audio.
Returns¶
list[dict]
Canonical samples shape — hand directly to
:meth:Evaluator.evaluate.
Cardinality and shape¶
Let N = len(generated inputs) (the agreed length of whichever
of video / audio you passed). References are attached
1:1 onto the first N samples; any surplus references (when |ref| > N)
become standalone role-tagged samples at the end of the list, so
set metrics like FVD see the full reference corpus while per-sample
paired metrics like LPIPS only run on the first N pairs.
Notes¶
"Missing" keys are simply absent from the sample dict. Metrics
handle them per their own contract (sample.get(...) for
optional, sample[...] raises for required, or
:meth:BaseMetric._skip for opt-in skip behavior). One fat samples
list with many keys can serve many metrics — each reads its subset.
Source code in fastvideo/eval/io/inputs.py
fastvideo.eval.io.sanitize_prompt
¶
Modules¶
fastvideo.eval.io.audio
¶
Audio-extraction helper for video files.
Audio metrics (audio.clap_score, audio.frechet_distance, …) read
file paths under sample["audio"] and do their own per-metric
preprocessing (CLAP at 48 kHz, PaSST at 32 kHz, whisper at 16 kHz mono,
…). When the source video carries an audio track (V2A / T2A-V model
outputs), :func:samples_from(extract_audio=True) calls
:func:extract_audio_track to pull a .wav next to it once per video.
Why a separate utility instead of in-pool decode: every audio metric wants a different sample rate / channel count / format, so the pool can't usefully pre-load audio into a single canonical tensor the way it pre-loads video frames. Paths-in / paths-out is the lingua franca for audio in this codebase.
Classes¶
fastvideo.eval.io.audio.NoAudioStreamError
¶
Bases: ValueError
Raised when a video file has no audio stream to extract.
Functions¶
fastvideo.eval.io.audio.extract_audio_track
¶
extract_audio_track(video_path: str | Path, *, output_dir: str | Path, sample_rate: int | None = None, codec: str = 'pcm_s16le') -> Path
Pull the first audio stream from video_path to a .wav in output_dir.
Idempotent — if output_dir / {stem}.wav already exists, the
existing file is returned without re-decoding. This makes the
helper safe to call repeatedly from parallel workers on overlapping
inputs (the cache hit is the fast path).
Parameters¶
video_path :
Source video file (anything PyAV can open: mp4, mkv, webm, …).
output_dir :
Directory the .wav lands in. Created if it doesn't exist.
sample_rate :
If set, resample to this rate. Default (None) preserves the
source rate — most audio metrics resample again internally to
their own target rate, so adding a resample step here would just
be wasted work.
codec :
PCM codec for the output. pcm_s16le (default) gives 16-bit
little-endian PCM, the most broadly compatible WAV format.
Returns¶
Path
Absolute path to the extracted .wav.
Raises¶
NoAudioStreamError If video_path has no audio stream.
Source code in fastvideo/eval/io/audio.py
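The idempotent cache-hit behavior described above can be sketched without PyAV. This is a minimal illustration, not the library's implementation: `extract_once` and the `decode` callback are hypothetical names standing in for the real decode step.

```python
from pathlib import Path
from typing import Callable


def extract_once(video_path, output_dir, decode: Callable[[Path, Path], None]) -> Path:
    """Return output_dir/{stem}.wav, calling `decode` only on a cache miss.

    `decode(src, dst)` is a hypothetical stand-in for the real audio decode.
    """
    video_path = Path(video_path)
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)  # created if it doesn't exist
    wav_path = (output_dir / f"{video_path.stem}.wav").resolve()
    if not wav_path.exists():
        decode(video_path, wav_path)  # cache miss: decode once
    return wav_path  # cache hit: existing file returned without re-decoding
```

Calling the helper twice with the same inputs triggers a single decode; the second call only checks for the existing file.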
fastvideo.eval.io.inputs
¶
Input-shape helpers: paths → samples list.
The :class:Evaluator and every metric consume the same internal
representation — a list[dict] with one dict per sample. Building
that list by hand is the largest source of ceremony in user scripts.
:func:samples_from is a pure function that turns path-style inputs
(generated videos / audio, optional paired references, optional
per-sample prompts / fps / metadata) into the canonical samples list,
ready to hand to :meth:Evaluator.evaluate.
Design rules:
- No Evaluator state, no metric introspection, no I/O beyond reading the prompts file when given. Just dict assembly.
- "Extra" keys on a sample are free — metrics each read what they need and ignore the rest. So one fat samples list naturally serves many metrics in one Evaluator.
- No "primary modality" axis. Each modality is a named kwarg (video=, audio=); pass whichever you have. Both is fine.
- No mode axis. The shape of the output is determined by cardinality: when |gen| == |ref| the samples list is pair-zipped; when |gen| < |ref| the unmatched references become role-tagged set samples for corpus-shaped metrics (FVD / FAD) without disturbing per-sample paired metrics (LPIPS / PSNR / SSIM / gt_optical_flow).
Classes¶
Functions¶
fastvideo.eval.io.inputs.as_video
¶Coerce path/tensor/Video → :class:Video for the pool to decode.
Path strings and :class:pathlib.Path become Video(source=str(x));
the pool then calls :func:load_video on first use. Tensors become
Video(source=None, frames=x) — the pool sees .frames already
populated and forwards untouched. :class:Video instances pass through.
Source code in fastvideo/eval/io/inputs.py
fastvideo.eval.io.inputs.samples_from
¶samples_from(*, video: PathSpec | None = None, reference: PathSpec | None = None, audio: PathSpec | None = None, reference_audio: PathSpec | None = None, text_prompt: str | None = None, text_prompts: str | Path | list[str] | None = None, fps: float | None = None, auxiliary_info: dict | list[dict] | None = None, extras: dict | list[dict] | None = None, extract_audio: bool | str | Path = False, extract_workers: int = 4) -> list[dict]
Build a samples list from path-style inputs.
Parameters¶
video, reference, audio, reference_audio :
File path, directory of files (sorted by name), or any iterable
of paths. Pass whichever modalities apply to the metrics you
plan to run — they attach to sample["video"] /
sample["reference"] / sample["audio"] /
sample["reference_audio"] respectively. Video paths are
wrapped in :class:Video so :class:VideoPool decodes them
lazily in parallel; audio paths stay as strings (audio metrics
each load with their own resample / preprocess).
text_prompt :
A single prompt string broadcast onto every sample.
text_prompts :
A list of strings (one per sample), or a path to a .jsonl /
.json file containing per-sample prompts.
fps :
Scalar fps broadcast onto every sample.
auxiliary_info :
Single dict (broadcast) or list of dicts (zipped) for
sample["auxiliary_info"] — vbench structured-prompt
metrics read this.
extras :
Catch-all per-sample attachments. Use for metric-specific keys
the dedicated kwargs don't cover (scenario, view,
actions, calibration, reference_take2, ...). Pass
a single dict to broadcast or a list-of-dicts to zip; the keys
merge into each sample dict.
extract_audio :
If truthy, auto-extract audio from each video /
reference source into .wav files via PyAV and attach
the paths under sample["audio"] / sample["reference_audio"].
Pass a path for a persistent cache, True for a tempdir.
Skipped silently for videos with no audio stream; ignored
wherever audio / reference_audio is already explicit.
extract_workers :
Parallel workers for extract_audio.
Returns¶
list[dict]
Canonical samples shape — hand directly to
:meth:Evaluator.evaluate.
Cardinality and shape¶
Let N = len(generated inputs) (the agreed length of whichever
of video / audio you passed). References are attached
1:1 onto the first N samples; any extras (when |ref| > N)
become standalone role-tagged samples at the end of the list, so
set metrics like FVD see the full reference corpus while per-sample
paired metrics like LPIPS only run on the first N pairs.
Notes¶
"Missing" keys are simply absent from the sample dict. Metrics
handle them per their own contract (sample.get(...) for
optional, sample[...] raises for required, or
:meth:BaseMetric._skip for opt-in skip behavior). One fat samples
list with many keys can serve many metrics — each reads its subset.
Source code in fastvideo/eval/io/inputs.py
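The cardinality rule can be illustrated with plain lists. This is a sketch under assumptions: `zip_by_cardinality` is a hypothetical helper, and storing an unmatched reference under the `"video"` key with `role="reference"` is a guess at the role-tagged shape, not something confirmed here.

```python
def zip_by_cardinality(gen: list, ref: list) -> list:
    """Pair references 1:1 onto generated items; role-tag any extras."""
    samples = []
    for i, g in enumerate(gen):
        sample = {"video": g}
        if i < len(ref):
            # References attach 1:1 onto the first N samples.
            sample["reference"] = ref[i]
        # Assumption: with fewer references than generated items, the
        # key is simply absent (this case is not described in the text).
        samples.append(sample)
    for r in ref[len(gen):]:
        # Assumption: unmatched references become standalone samples
        # tagged with role="reference" at the end of the list.
        samples.append({"video": r, "role": "reference"})
    return samples
```

With two generated clips and three references, the result has three samples: two pairs followed by one role-tagged reference.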
fastvideo.eval.io.paths
¶
Filesystem helpers shared by eval scripts.
Provides the prompt-sanitization, default filename convention, and
(row, video_path) → eval-kwargs builder. Free functions, not a
class — :class:fastvideo.eval.Evaluator is the only stateful object
in the eval surface; loops live in user scripts.
Functions¶
fastvideo.eval.io.paths.build_eval_kwargs
¶Build evaluator kwargs from a sample row + a video on disk.
Loads the video as (T,C,H,W) and adds the leading batch dim.
Forwards prompt (as scalar text_prompt) and
auxiliary_info (as scalar dict) when present on the row —
matches the one-sample-per-call contract that the evaluator and
every metric assume.
Source code in fastvideo/eval/io/paths.py
fastvideo.eval.io.paths.default_filename
¶ fastvideo.eval.io.paths.glob_videos
¶Find every generated video for row, sorted by trailing -<idx>.
Source code in fastvideo/eval/io/paths.py
fastvideo.eval.io.paths.sanitize_prompt
¶
fastvideo.eval.io.video
¶
Functions¶
fastvideo.eval.io.video.extract_frames
¶extract_frames(video: Tensor, n_frames: int | None = None) -> Tensor
Uniformly sample n_frames from a (T, C, H, W) video tensor.
Source code in fastvideo/eval/io/video.py
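One common way to pick uniformly spaced frames is to spread indices evenly between the first and last frame. The sketch below shows that idea on plain integers; the exact index rule used by extract_frames is not documented here, and both `uniform_indices` and the treatment of `n_frames=None` (keep every frame) are assumptions.

```python
from typing import Optional


def uniform_indices(total: int, n_frames: Optional[int] = None) -> list:
    """Evenly spaced frame indices in [0, total - 1], endpoints included."""
    if n_frames is not None and n_frames <= 0:
        raise ValueError("n_frames must be positive")
    if n_frames is None or n_frames >= total:
        return list(range(total))  # assumption: None keeps all frames
    if n_frames == 1:
        return [0]
    # Integer arithmetic keeps the first and last index exact.
    return [(i * (total - 1)) // (n_frames - 1) for i in range(n_frames)]
```

The resulting list could then index the time dimension of a (T, C, H, W) tensor.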
fastvideo.eval.io.video.load_video
¶Load a video as a (T, C, H, W) float32 tensor in [0, 1].
Supported source types:
- str / Path – path to a .mp4 / .avi / .gif file, or a directory of frame images (sorted alphabetically).
- torch.Tensor – returned as-is after shape validation.
- list[PIL.Image] – stacked into a tensor.
Source code in fastvideo/eval/io/video.py
fastvideo.eval.metrics
¶
Auto-discover and register all built-in metrics (recursive).
Walks the metrics package tree and imports any leaf-package's
metric module so @register decorators fire. Path components
starting with _ are skipped (used for shared helpers like
optical_flow/_shared.py and vbench/_grit_helper.py).
Modules¶
fastvideo.eval.metrics.audio
¶
Modules¶
fastvideo.eval.metrics.audio.audiobox_aesthetics
¶ fastvideo.eval.metrics.audio.audiobox_aesthetics.metric
¶AudioBox Aesthetics (CE, CU, PC, PQ).
Thin wrapper around Meta's audiobox_aesthetics predictor. Returns
four per-clip dimensions (CE — Content Enjoyment, CU — Content
Usefulness, PC — Production Complexity, PQ — Production Quality);
score exposes PQ, the dimension V2A papers typically report on.
The remaining three are surfaced under details.
The earlier Verse-Bench combined score (CE + CU + PQ + (11 − PC)) / 4
is non-standard and is deliberately not used.
fastvideo.eval.metrics.audio.audiobox_aesthetics.metric.AudioBoxAestheticsMetric
¶
Bases: BaseMetric
AudioBox Aesthetics: PQ as the primary score, CE/CU/PC/PQ in details.
Source code in fastvideo/eval/metrics/audio/audiobox_aesthetics/metric.py
fastvideo.eval.metrics.audio.clap_score
¶ fastvideo.eval.metrics.audio.clap_score.metric
¶CLAP Score (CS) — text-audio cosine similarity.
Uses HuggingFace transformers.ClapModel with the
laion/clap-htsat-fused checkpoint, the closest HF mirror of the
630k-audioset-fusion-best.pt weights that
hkchengrex/av-benchmark and the V2A literature use. The fully
byte-exact comparison still requires the laion_clap pip package
(which pins numpy<2 and forces a fastvideo-wide downgrade — so
we don't depend on it). Numbers from this metric are in the same
ballpark as av-benchmark's LAION_CLAP.
fastvideo.eval.metrics.audio.clap_score.metric.ClapScoreMetric
¶ClapScoreMetric(model_name: str = DEFAULT_CLAP_REPO)
Bases: BaseMetric
CLAP score — text-audio cosine similarity via HF CLAP.
Source code in fastvideo/eval/metrics/audio/clap_score/metric.py
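The score itself is a cosine similarity between a text embedding and an audio embedding. As a minimal sketch on plain Python lists (the real metric works on CLAP model outputs; `cosine_similarity` is a name chosen here for illustration):

```python
import math


def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

Values lie in [-1, 1]; embeddings pointing in the same direction score 1.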
fastvideo.eval.metrics.audio.desync
¶ fastvideo.eval.metrics.audio.desync.metric
¶Synchformer DeSync — average audio-video desynchronization in seconds.
Per-sample. Ports hkchengrex/av-benchmark's DeSync against the
24-01-04T16-39-21 Synchformer checkpoint (the one MMAudio reports
against). Takes the argmax over the 21-class grid in [-2, +2] seconds for
the first 14 and last 14 segments separately; lower is better, 0 = ideal.
Synchformer's transformer carries a fixed positional embedding sized for
~14 segments per modality, so keep input clips in the 3-10 s range —
shorter clips raise in _segment_video and longer clips ignore the
middle.
fastvideo.eval.metrics.audio.desync.metric.DeSyncMetric
¶DeSyncMetric(src_fps: float | None = None)
Bases: BaseMetric
Per-sample Synchformer DeSync magnitude in seconds (lower is better).
Source code in fastvideo/eval/metrics/audio/desync/metric.py
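To make the 21-class grid concrete, the sketch below maps a class index to an offset in seconds, assuming the classes are evenly spaced over [-2, +2] (a 0.2 s step). Both helper names are hypothetical, and how the first-window and last-window predictions are combined into the final score is not shown.

```python
def class_to_offset(class_idx: int, n_classes: int = 21, max_offset: float = 2.0) -> float:
    """Map a class index to seconds, assuming an evenly spaced grid."""
    if not 0 <= class_idx < n_classes:
        raise ValueError("class index out of range")
    return -max_offset + 2.0 * max_offset * class_idx / (n_classes - 1)


def predicted_offset(logits) -> float:
    """Argmax over the class logits, converted to an offset in seconds."""
    best = max(range(len(logits)), key=lambda i: logits[i])
    return class_to_offset(best, n_classes=len(logits))
```

Under this assumption the middle class (index 10) corresponds to zero desynchronization.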
fastvideo.eval.metrics.audio.frechet_distance
¶ fastvideo.eval.metrics.audio.frechet_distance.metric
¶Frechet Audio Distance over PaSST embeddings (FD_PaSST).
Corpus-vs-corpus Fréchet distance between Gaussian moments of two PaSST
768-d embedding sets. Ports av_bench.metrics.fad.compute_fd 1:1.
References are supplied per-sample via reference_audio /
role="reference", or once from a .pt cache at
$FASTVIDEO_FAD_REF_FEATURES. Skips when either side has fewer than
two finite embeddings.
fastvideo.eval.metrics.audio.frechet_distance.metric.FrechetAudioDistanceMetric
¶
Bases: BaseMetric
Corpus-vs-corpus Frechet Audio Distance with PaSST embeddings.
Source code in fastvideo/eval/metrics/audio/frechet_distance/metric.py
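For reference, the Fréchet distance between two Gaussians fitted to the generated and reference embedding sets has a standard closed form. The symbols are notation introduced here, not names from the codebase: $\mu_g, \Sigma_g$ and $\mu_r, \Sigma_r$ are the mean and covariance of the generated and reference embeddings.

```latex
d^2 = \lVert \mu_g - \mu_r \rVert_2^2
      + \operatorname{Tr}\!\left( \Sigma_g + \Sigma_r - 2\,(\Sigma_g \Sigma_r)^{1/2} \right)
```

Estimating a covariance needs at least two samples per side, which is consistent with the skip rule above. Whether the reported score applies any further scaling is determined by the ported compute_fd, not by this formula.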
fastvideo.eval.metrics.audio.imagebind_score
¶ fastvideo.eval.metrics.audio.imagebind_score.metric
¶ImageBind audio↔video cosine similarity (IB-Score).
Per-sample. Reads sample["video_path"] (or sample["video"].source
for a :class:Video wrapper) and sample["audio"]; ImageBind decodes
its own clips so the path is required, not the pool-decoded tensor.
fastvideo.eval.metrics.audio.imagebind_score.metric.ImageBindScoreMetric
¶
Bases: BaseMetric
ImageBind audio↔video cosine similarity, per-sample.
Source code in fastvideo/eval/metrics/audio/imagebind_score/metric.py
fastvideo.eval.metrics.audio.kl_divergence
¶ fastvideo.eval.metrics.audio.kl_divergence.metric
¶PaSST KL divergence KL(gt || pred) on AudioSet-527 logits.
Ports av_bench.metrics.kl.compute_kl 1:1. The primary score is
the softmax variant (reported as "MKL" / "KL_PaSST" by V2A papers);
the sigmoid variant is exposed in details["kl_sigmoid"].
fastvideo.eval.metrics.audio.kl_divergence.metric.KLDivergenceMetric
¶
Bases: BaseMetric
PaSST KL divergence KL(gt || pred) on AudioSet-527 logits.
Per-sample. Requires sample["audio"] (generated) and
sample["reference_audio"] (ground truth).
Source code in fastvideo/eval/metrics/audio/kl_divergence/metric.py
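The softmax variant can be sketched in a few lines: turn both logit vectors into distributions and sum p * log(p / q), with the ground truth as p. This is an illustration on plain lists; the function names and the `eps` guard are assumptions, not the ported compute_kl.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(gt_logits, pred_logits, eps: float = 1e-12) -> float:
    """KL(gt || pred) between the softmax distributions of two logit vectors."""
    p = softmax(gt_logits)
    q = softmax(pred_logits)
    # eps (an assumption of this sketch) guards against log(0).
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))
```

The divergence is zero when the two distributions match and is not symmetric in its arguments.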
fastvideo.eval.metrics.audio.wer
¶ fastvideo.eval.metrics.audio.wer.metric
¶Word Error Rate for generated audio.
Per-sample. Default backend is Whisper-base; glm_asr (vendored under
third_party/eval/glmasr/) and sensevoice (FunASR, install-on-demand)
are selectable via asr_backend=. Text is normalized per the
MagiHuman convention and CJK inputs are scored at the character level.
fastvideo.eval.metrics.audio.wer.metric.WERMetric
¶WERMetric(asr_backend: str = 'whisper', model_name: str | None = None, instruction: str = 'Please transcribe this audio into text')
Bases: BaseMetric
WER metric with selectable ASR backend.
Sample fields: audio, reference_text (or text_prompt),
optional language. instruction is passed to GLM-ASR.
Source code in fastvideo/eval/metrics/audio/wer/metric.py
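Word error rate is the edit distance between the reference and hypothesis token sequences divided by the reference length. The sketch below shows that computation with a word-level and a character-level mode; lowercasing and whitespace splitting are simplifications standing in for the MagiHuman normalization, which is not reproduced here, and the function names are hypothetical.

```python
def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def word_error_rate(reference: str, hypothesis: str, char_level: bool = False) -> float:
    """Edit distance over reference length; char_level=True scores characters."""
    if char_level:
        ref = list(reference.replace(" ", ""))
        hyp = list(hypothesis.replace(" ", ""))
    else:
        ref = reference.lower().split()
        hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference is empty")
    return edit_distance(ref, hyp) / len(ref)
```

Passing `char_level=True` mirrors the character-level scoring described for CJK inputs.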
fastvideo.eval.metrics.base
¶
Classes¶
fastvideo.eval.metrics.base.BaseMetric
¶Abstract base class for all eval metrics.
Two execution shapes:
- Per-sample (is_set_metric=False, default): implement :meth:compute. The Evaluator calls it once per input sample and returns one :class:MetricResult per sample.
- Set-vs-set (is_set_metric=True): implement :meth:accumulate (called once per sample to buffer features) and :meth:finalize (called once after all samples to compute the corpus-level result). Use :meth:reset to clear buffers and :meth:merge_from to fold multi-GPU per-worker state together.
Optionally override :meth:setup to eagerly load models. Metrics
that chunk along the time dim for memory hardcode their own chunk
size in __init__ (see optical_flow for the canonical
example). Eval always processes one video per
:meth:Evaluator.evaluate call; compute / accumulate
receive a single sample, not a batch.
Source code in fastvideo/eval/metrics/base.py
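The set-vs-set lifecycle can be sketched with a toy class that does not depend on fastvideo. Everything here is illustrative: `ToyMeanMetric` is hypothetical, it does not subclass the real BaseMetric, and it returns plain dicts where a real metric would return a MetricResult.

```python
class ToyMeanMetric:
    """Toy set-vs-set metric: corpus mean of one number per sample."""

    is_set_metric = True

    def __init__(self):
        self._values = []

    def accumulate(self, sample: dict) -> None:
        # Called once per sample; only buffers, no score yet.
        self._values.append(float(sample["value"]))

    def finalize(self) -> dict:
        # Called once after all samples; computes the corpus-level result.
        if not self._values:
            return {"score": None, "details": {"skipped": "no samples"}}
        mean = sum(self._values) / len(self._values)
        return {"score": mean, "details": {"n": len(self._values)}}

    def reset(self) -> None:
        self._values.clear()

    def merge_from(self, other: "ToyMeanMetric") -> None:
        # Fold another worker's buffered state into this one.
        self._values.extend(other._values)
```

Two per-worker instances can each accumulate their share of the samples; merging one into the other before calling finalize yields the same result as a single-worker run.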
fastvideo.eval.metrics.base.BaseMetric.compute
¶compute(sample: dict) -> MetricResult
Per-sample metrics: compute the score for one sample.
sample["video"] is (T, C, H, W) float in [0, 1].
sample["reference"] (if used) has the same shape. Return
self._skip(sample, reason) for missing inputs.
Source code in fastvideo/eval/metrics/base.py
fastvideo.eval.metrics.base.BaseMetric.finalize
¶finalize() -> MetricResult
fastvideo.eval.metrics.base.BaseMetric.merge_from
¶merge_from(other: BaseMetric) -> None
fastvideo.eval.metrics.base.BaseMetric.reset
¶ fastvideo.eval.metrics.base.BaseMetric.setup
¶Eagerly load models. Called once by :class:EvalWorker.
Default is a no-op; metrics with no eager state (pixel math, closed-form ops) inherit this. Override only if your metric needs to load weights.
Source code in fastvideo/eval/metrics/base.py
fastvideo.eval.metrics.base.BaseMetric.to
¶to(device: str | device) -> BaseMetric
fastvideo.eval.metrics.common
¶
Modules¶
fastvideo.eval.metrics.common.fvd
¶ fastvideo.eval.metrics.common.fvd.extractors
¶Pluggable video feature extractors for FVD.
Three extractors, sharing the _BaseExtractor contract:
- i3d: Kinetics-400 I3D (TorchScript, flateon/FVD-I3D-torchscript). The standard FVD feature space used in the literature.
- clip: CLIP ViT-B/32 per-frame embeddings, mean-pooled over time. Captures semantic / content quality.
- videomae: VideoMAE-base last-hidden-state, mean-pooled over patch tokens. Captures structural / motion quality.
The contract is intentionally narrow: each extractor takes a
(B, T, C, H, W) float tensor in [0, 1] and returns (B, D) numpy
features. Preprocessing (resize, normalize, layout) is the extractor's job;
its callers should not care.
fastvideo.eval.metrics.common.fvd.extractors.load_extractor
¶load_extractor(name: str, device: device) -> _BaseExtractor
Instantiate the named extractor on device. Raises ValueError on unknown names.
Source code in fastvideo/eval/metrics/common/fvd/extractors.py
fastvideo.eval.metrics.common.fvd.metric
¶common.fvd — Fréchet Video Distance.
Measures distributional similarity between generated and reference videos using a pluggable feature backbone (I3D / CLIP / VideoMAE). Lower score → generated videos are closer to the real video distribution.
FVD is a dataset-level metric — it compares the distribution of a collection of videos rather than scoring each video individually. For statistically reliable results at least 256 videos are recommended; the standard protocol uses 2048.
This metric follows the set-vs-set protocol (is_set_metric=True):
- :meth:accumulate is called once per video to buffer features. Routes
on sample["role"] — "reference" → real buffer, anything else →
generated buffer. This is the same convention
:class:audio.frechet_distance.FrechetAudioDistanceMetric uses.
- :meth:finalize is called once after all videos to compute FVD.
- :meth:reset clears both buffers between evaluation runs.
To score two corpora, build a samples list with
:func:fastvideo.eval.io.inputs.samples_from and hand it to
:meth:Evaluator.evaluate — the path-friendly form is::
from fastvideo.eval import create_evaluator, samples_from
ev = create_evaluator(metrics=["common.fvd"], device="cuda:0")
result = ev.evaluate(samples=samples_from(
video="gen/", reference="ref/",
)).corpus["common.fvd"]
Extractors¶
- i3d (default): Kinetics-400 I3D, the standard FVD feature space used in the literature.
- clip: CLIP ViT-B/32 per-frame embeds, mean-pooled over time. Captures semantic / content quality.
- videomae: VideoMAE-base last-hidden-state, mean-pooled over patches. Captures structural / motion quality.
CLIP and VideoMAE are research-grade and not directly comparable to FVD
scores from the literature; use i3d for paper comparisons.
Reference features¶
Two ways to supply reference features:
- Stream: pass reference= to :func:samples_from. Equal cardinality emits paired samples (sample["video"] + sample["reference"]); unequal cardinality role-tags the unmatched references. accumulate routes both shapes correctly. If cache_mode="read_write" (default) and no cache exists, the extracted features are persisted to cache_path on :meth:finalize.
- Cache hit: a prior run wrote cache_path; :meth:setup loads it automatically and streamed references are unnecessary.
Streamed references always win over the cache when both are present.
Cache resolution order (first match wins):
1. cache_path= constructor kwarg, if set.
2. $FASTVIDEO_FVD_REF_FEATURES, if set.
3. ${FASTVIDEO_EVAL_CACHE}/fvd/real_features_{extractor}.pt (default).
The env-var override mirrors audio.frechet_distance's
FASTVIDEO_FAD_REF_FEATURES pattern.
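The resolution order can be written as a small pure function. This is a sketch, not the metric's code: `resolve_ref_cache` is a hypothetical name, and the eval cache root is passed in explicitly rather than derived from the environment.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_ref_cache(cache_path: Optional[str],
                      extractor: str,
                      eval_cache_root: str,
                      env: Mapping[str, str] = os.environ) -> Path:
    """First match wins: constructor kwarg, then env var, then the default."""
    if cache_path:
        return Path(cache_path)                      # 1. explicit kwarg
    override = env.get("FASTVIDEO_FVD_REF_FEATURES")
    if override:
        return Path(override)                        # 2. env-var override
    # 3. default location under the eval cache root
    return Path(eval_cache_root) / "fvd" / f"real_features_{extractor}.pt"
```

Passing `env` as a plain dict makes the precedence easy to check in isolation.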
fastvideo.eval.metrics.common.fvd.metric.FVDMetric
¶FVDMetric(extractor: str = 'i3d', cache_path: str | None = None, cache_mode: str = 'read_write', chunk_size: int = 32)
Bases: BaseMetric
Fréchet Video Distance (FVD) over a pluggable feature backbone.
Set-vs-set metric (is_set_metric=True). The Evaluator calls
:meth:accumulate once per video and :meth:finalize once after
all videos have been processed.
For meaningful scores, evaluate over ≥ 256 videos (2048 is the standard protocol used in the literature).
Parameters¶
extractor : str
Feature backbone name. One of "i3d" (default), "clip",
"videomae". See module docstring for guidance.
cache_path : str, optional
Where extracted reference features are cached. Resolution order:
constructor kwarg → $FASTVIDEO_FVD_REF_FEATURES env-var →
${FASTVIDEO_EVAL_CACHE}/fvd/real_features_{extractor}.pt
(default, resolved at :meth:setup time so the env-var can be set
after import).
cache_mode : str
"read_write" (default) — read at setup, write at finalize on
cache miss. "read" — read at setup, never write. "off" —
ignore the cache entirely. Auto-write happens only when the cache
was missing AND reference features streamed in this run.
chunk_size : int
Videos per forward pass. Reduce if the GPU runs out of memory.
Source code in fastvideo/eval/metrics/common/fvd/metric.py
fastvideo.eval.metrics.common.fvd.metric.FVDMetric.accumulate
¶accumulate(sample: dict) -> None
Extract features from one video and route by sample["role"].
Routing rules:
- role="reference" → real buffer; the reference key is ignored.
- Otherwise → generated buffer. If sample["reference"] is also present, its features also go into the real buffer; this lets a paired samples list (the shape per-sample metrics like LPIPS need) feed FVD in the same pass without a second decode.
Accepts either a raw tensor or a populated :class:Video instance
under either key.
Source code in fastvideo/eval/metrics/common/fvd/metric.py
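The routing rules amount to a short branch on `sample["role"]`. The sketch below uses plain lists as buffers and a caller-supplied `extract` function in place of the feature backbone; `route` is a hypothetical name, and reading a role-tagged reference from the `"video"` key is an assumption.

```python
def route(sample: dict, extract, gen_buf: list, real_buf: list) -> None:
    """Append extracted features to the generated and/or real buffer."""
    if sample.get("role") == "reference":
        # Role-tagged reference: real buffer only; "reference" key ignored.
        real_buf.append(extract(sample["video"]))
        return
    gen_buf.append(extract(sample["video"]))
    if "reference" in sample:
        # Paired sample: the reference also feeds the real buffer.
        real_buf.append(extract(sample["reference"]))
```

A mixed list of paired and role-tagged samples therefore fills both buffers in a single pass.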
fastvideo.eval.metrics.common.fvd.metric.FVDMetric.finalize
¶finalize() -> MetricResult
Compute FVD from buffered generated features vs. reference features.
Reference source priority: streamed (_real_buf) > disk cache
(_cached_real). On a fresh cache + streamed refs, write the
accumulated real features back to disk if cache_mode="read_write".
Source code in fastvideo/eval/metrics/common/fvd/metric.py
fastvideo.eval.metrics.common.fvd.metric.FVDMetric.merge_from
¶merge_from(other: BaseMetric) -> None
Fold another worker's accumulated features into this one (multi-GPU).
Both gen and ref buffers concatenate. The disk cache is the same
across workers, so we only adopt other's _cached_real if our
own slot is empty (defensive — should always already match).
Source code in fastvideo/eval/metrics/common/fvd/metric.py
fastvideo.eval.metrics.common.fvd.metric.FVDMetric.reset
¶
fastvideo.eval.metrics.optical_flow
¶
Modules¶
fastvideo.eval.metrics.optical_flow.gt_optical_flow
¶ fastvideo.eval.metrics.optical_flow.gt_optical_flow.metric
¶Compare optical flow extracted from a generated video against optical flow extracted from a ground-truth reference video.
Both flows are produced by the same ptlflow model (default
dpflow/things). The resulting per-pixel / per-frame / temporal
metric set is identical to synthetic_optical_flow — only the way the
reference flow is constructed differs.
fastvideo.eval.metrics.optical_flow.gt_optical_flow.metric.GtOpticalFlowMetric
¶GtOpticalFlowMetric(model_name: str = 'dpflow', ckpt: str = 'things', min_mag: float = 0.5, max_mag_pct: float = 80.0, grid_size: int = 8)
Bases: BaseMetric
Per-pixel / per-frame / temporal flow comparison vs. a reference video.
The headline score is pixel_epe_mean_mean (lower is better);
every other scalar lives in details so downstream consumers can
pick whichever one they care about.
Source code in fastvideo/eval/metrics/optical_flow/gt_optical_flow/metric.py
fastvideo.eval.metrics.optical_flow.synthetic_optical_flow
¶ fastvideo.eval.metrics.optical_flow.synthetic_optical_flow.metric
¶Compare optical flow extracted from a generated video against optical flow synthesized analytically from per-frame actions.
The reference flow is not observed from a ground-truth video — it's
predicted from the action stream via a third-person camera-kinematics
model (Longuet-Higgins linearization + off-pivot translation correction;
no depth). Observed flow comes from the same ptlflow model used by
gt_optical_flow, and the two are compared with the identical metric
set, so scores are directly comparable across the two metrics.
Required sample keys¶
video
(B, T, C, H, W) float in [0, 1].
actions
dict (or list-of-dicts of length B) with two np.ndarray keys:
* ``keyboard`` of shape ``(T, 6)`` — ``[W, S, A, D, turn_left, turn_right]``
* ``mouse`` of shape ``(T, 2)`` — ``[pitch, yaw]``
calibration
Either a path to a ThirdPersonCalibration JSON file, or a dict
of fitted parameters. May also be set once at construction time via
calibration_path= and reused across samples.
Optional sample keys¶
mouse_pitch_sign
+1 (default) or -1 if the dataset's mouse-pitch sign is
flipped (mhuo's data carries this in metadata).
fastvideo.eval.metrics.optical_flow.synthetic_optical_flow.metric.SyntheticOpticalFlowMetric
¶SyntheticOpticalFlowMetric(model_name: str = 'dpflow', ckpt: str = 'things', calibration_path: str | Path | None = None, min_mag: float = 0.5, max_mag_pct: float = 80.0, grid_size: int = 8)
Bases: BaseMetric
Action-driven synthetic flow vs. video-extracted observed flow.
Pass calibration_path at construction to bind the calibration
once across all samples; otherwise supply sample["calibration"]
per call. Missing actions or calibration produce a skipped result
(score=None) rather than raising.
Source code in fastvideo/eval/metrics/optical_flow/synthetic_optical_flow/metric.py
fastvideo.eval.metrics.physics_iq
¶
Modules¶
fastvideo.eval.metrics.physics_iq.utils
¶ fastvideo.eval.metrics.physics_iq.utils.prepare_pair
¶prepare_pair(sample: dict[str, Any], *, prep_kwargs: dict[str, Any] | None = None) -> PreparedPhysicsIQPair
Resolve a sample into a prepared (gen, ref) pair.
Caches the result on sample['_physics_iq_pair'] so other physics_iq
sub-metrics on the same sample reuse it instead of re-decoding.
Source code in fastvideo/eval/metrics/physics_iq/utils.py
fastvideo.eval.metrics.vbench
¶
VBench metrics. Bootstraps upstream submodule on sys.path and installs runtime compat shims for modern torch/transformers/numpy/timm.
The upstream vbench source lives as a git submodule at
fastvideo/third_party/eval/vbench (pinned to a specific
Vchitect/VBench SHA). We do not pip-install it — we only need its
Python modules importable. Its runtime deps (clip, transformers, etc.)
are already in FastVideo's main env.
Compat with modern dependency versions is achieved at import time, in this file, instead of via on-disk patches to upstream files. Each shim below corresponds to a specific drift between vbench's pinned-2023 deps and FastVideo's current pins. Adding a new shim is preferable to editing the submodule.
Modules¶
fastvideo.eval.metrics.vbench.aesthetic_quality
¶ fastvideo.eval.metrics.vbench.appearance_style
¶ fastvideo.eval.metrics.vbench.background_consistency
¶ fastvideo.eval.metrics.vbench.dynamic_degree
¶ fastvideo.eval.metrics.vbench.dynamic_degree.metric
¶VBench Dynamic Degree — RAFT optical flow motion detection.
For each consecutive frame pair, computes optical flow via RAFT and takes the mean of the top 5% flow magnitudes. If enough pairs exceed an adaptive threshold, the video is classified as dynamic (1.0) vs static (0.0).
fastvideo.eval.metrics.vbench.dynamic_degree.metric.DynamicDegreeMetric
¶ fastvideo.eval.metrics.vbench.human_action
¶ fastvideo.eval.metrics.vbench.imaging_quality
¶ fastvideo.eval.metrics.vbench.motion_smoothness
¶ fastvideo.eval.metrics.vbench.motion_smoothness.metric
¶VBench Motion Smoothness — AMT-S frame interpolation quality.
Takes every-other frame, uses AMT-S to interpolate the missing middle frames, then compares interpolated vs actual frames. Score = (255 - mean_pixel_diff) / 255. Higher = smoother motion.
fastvideo.eval.metrics.vbench.multiple_objects
¶ fastvideo.eval.metrics.vbench.object_class
¶ fastvideo.eval.metrics.vbench.overall_consistency
¶ fastvideo.eval.metrics.vbench.scene
¶ fastvideo.eval.metrics.vbench.scene.metric
¶VLM-based scene matching using AVoCaDO (Qwen2.5-Omni).
Replaces VBench's Tag2Text-based scene metric with a modern VLM caption. The algorithm follows VBench:
1. Caption the video.
2. Check if all scene keywords appear in the caption.
3. Score = 1.0 if all match, 0.0 otherwise.
Unlike VBench (which captions each frame separately with Tag2Text), AVoCaDO captions the entire video in one pass with rich natural language, making the keyword check more robust.
fastvideo.eval.metrics.vbench.spatial_relationship
¶ fastvideo.eval.metrics.vbench.subject_consistency
¶ fastvideo.eval.metrics.vbench.temporal_flickering
¶ fastvideo.eval.metrics.vbench.temporal_style
¶ fastvideo.eval.metrics.vbench.temporal_style.metric
¶VBench Temporal Style — ViCLIP text-video alignment (style focus).
Identical logic to overall_consistency — same ViCLIP cosine similarity. The difference is semantic: overall_consistency measures general prompt alignment while temporal_style measures style consistency over time. VBench uses different prompts for each from its metadata JSON.
fastvideo.eval.metrics.videoscore2
¶
Modules¶
fastvideo.eval.metrics.videoscore2.metric
¶VideoScore2 — VLM-based video quality scoring.
Uses a Qwen2.5-VL model fine-tuned to score generated videos on three
dimensions: visual quality, text-to-video alignment, and physical
consistency. Scores are extracted from token logits using upstream's
ll_based_soft_score_normed weighting (1-5 scale).
Reference: TIGER-AI-Lab/VideoScore2 (vs2_inference.py).
fastvideo.eval.metrics.videoscore2.metric.VideoScore2Metric
¶VideoScore2Metric(model_name: str = 'TIGER-Lab/VideoScore2', infer_fps: float = 2.0, max_tokens: int = 1024, temperature: float = 0.7, do_sample: bool = True)
Bases: BaseMetric
VideoScore2: VLM-based video quality scoring (3 dimensions).
Requires sample["text_prompt"] for text-to-video alignment.
Supports batched generation for GPU efficiency.
Source code in fastvideo/eval/metrics/videoscore2/metric.py
fastvideo.eval.models
¶
Model checkpoint resolution and caching for eval metrics.
Single public function — :func:ensure_checkpoint — that hands a metric
a local path to its weights, downloading on miss. Three source kinds:
- an existing local path → returned as-is;
- an HTTP(S) URL → downloaded to
- an HTTP(S) URL → downloaded to get_cache_dir() / name via huggingface_hub.http_get (resume + retries built in);
- an HF repo id ("org/repo") → :func:huggingface_hub.hf_hub_download if filename is given, else :func:huggingface_hub.snapshot_download. name is ignored in HF mode; HF manages its own cache key under ~/.cache/huggingface/hub.
All download paths are filelock-safe (cooperative across threads,
processes, and SLURM ranks) via :func:fastvideo.utils.get_lock.
Functions¶
fastvideo.eval.models.ensure_checkpoint
¶
Resolve a model checkpoint path, downloading on miss.
See module docstring for the full source contract. name is used only as the local cache filename for URL sources; ignored otherwise.
Source code in fastvideo/eval/models.py
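The three source kinds can be told apart with simple string and filesystem checks. This sketch only classifies a source and downloads nothing; `classify_source`, its check order, and its error behavior are assumptions rather than the behavior of ensure_checkpoint.

```python
import os


def classify_source(source: str) -> str:
    """Return "local", "url", or "hf_repo" for a checkpoint source string."""
    if os.path.exists(source):
        return "local"      # existing local path: returned as-is
    if source.startswith(("http://", "https://")):
        return "url"        # fetched into the eval cache
    parts = source.split("/")
    if len(parts) == 2 and all(parts):
        return "hf_repo"    # "org/repo" style Hugging Face repo id
    raise ValueError(f"unrecognized checkpoint source: {source!r}")
```

Checking for an existing path first means a local directory that happens to look like "org/repo" is treated as local.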
fastvideo.eval.models.get_cache_dir
¶
get_cache_dir() -> Path
Eval cache root.
Layout::
get_cache_dir() / models / ← URL-fetched checkpoints (LAION head,
AMT, GRiT, …)
get_cache_dir() / torch / ← redirected ``TORCH_HOME`` (DINO etc.)
get_cache_dir() / clip / ← passed as ``download_root`` to
``clip.load(...)`` callsites
~/.cache/huggingface/hub / ← left at HF's default; widely shared
with other ML projects
Override priority: FASTVIDEO_EVAL_CACHE > ${FASTVIDEO_CACHE_ROOT}/eval.
Metric authors writing new code: when wrapping a third-party loader
that has its own cache convention (CLIP's download_root, pyiqa's
cache_dir, etc.), pass str(get_cache_dir() / "<library>") so
users get a single FASTVIDEO_EVAL_CACHE knob to redirect them all.
Source code in fastvideo/eval/models.py
fastvideo.eval.pool
¶
Async path-→-tensor prefetcher for the Evaluator.
Hides video-decode latency behind metric compute by running a small
thread pool of decoders that fill a bounded queue. One :class:VideoPool
is owned by the Evaluator per evaluate(samples=...) call; workers
consume via :meth:VideoPool.get. Decode order is non-deterministic;
each yielded item carries its original input index so consumers can
write back into a result list in input order.
Pool sizing: max_size = prefetch_factor * num_workers.
Classes¶
fastvideo.eval.pool.VideoPool
¶
Bounded prefetch queue feeding decoded samples to consumers.
Use as a context manager so loader threads are always cleaned up::
with VideoPool(samples, loader_threads=1, max_size=4) as pool:
while True:
item = pool.get()
if item is None:
break
idx, decoded = item
results[idx] = worker.evaluate(**decoded)
Source code in fastvideo/eval/pool.py
Functions¶
fastvideo.eval.pool.VideoPool.get
¶Pop the next decoded (idx, sample).
Returns None when all input samples have been consumed.
Polls in 0.1 s slices so extra consumer threads (when
len(samples) < num_workers) wake up periodically to
re-check _consumed and exit cleanly — without the poll,
a blocking _ready_q.get(timeout=None) deadlocks because
the loaders are already done. Re-raises any exception caught
in a loader thread on the consumer's stack so callers don't
hang on a dead loader. Thread-safe: multiple consumer threads
may share one pool.
Source code in fastvideo/eval/pool.py
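The polling behavior of get can be illustrated with a stripped-down pool built on the standard library. This is a sketch of the idea only: `TinyPool` is hypothetical, uses a single loader thread, and omits the loader-exception re-raise that the real pool performs.

```python
import queue
import threading


class TinyPool:
    """Bounded prefetch queue that yields (idx, loaded_item) pairs."""

    def __init__(self, items, load, max_size: int = 4):
        self._items = list(items)
        self._load = load
        self._ready_q = queue.Queue(maxsize=max_size)
        self._lock = threading.Lock()
        self._consumed = 0
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self) -> None:
        for idx, item in enumerate(self._items):
            # put() blocks while the queue is full, bounding memory use.
            self._ready_q.put((idx, self._load(item)))

    def get(self):
        while True:
            with self._lock:
                if self._consumed >= len(self._items):
                    return None  # every input has been handed out
            try:
                # Short timeout so idle consumers re-check the counter
                # instead of blocking forever on an empty queue.
                item = self._ready_q.get(timeout=0.1)
            except queue.Empty:
                continue
            with self._lock:
                self._consumed += 1
            return item
```

Because each item carries its input index, a consumer can write results back in input order even if several consumers share the pool.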
fastvideo.eval.registry
¶
Classes¶
Functions¶
fastvideo.eval.registry.get_metric
¶
get_metric(name: str, **kwargs: Any) -> BaseMetric
Instantiate a registered metric by name.
Checks that optional dependencies are installed before instantiation and gives a clear install hint pointing at the right extra group.
Source code in fastvideo/eval/registry.py
fastvideo.eval.registry.list_metrics
fastvideo.eval.registry.missing_dependencies
Returns the module names declared by metric_name that cannot be
imported in this environment. Returns [] if all dependencies are
satisfied or the metric is unknown.
Used by group-style resolution to decide which metrics to silently
skip (vs. naming a metric explicitly, where the missing dep should
surface as :class:ImportError).
Source code in fastvideo/eval/registry.py
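One way such a check could work is to probe each declared module with `importlib.util.find_spec`, which locates a top-level module without importing it. The dependency table and the function name below are hypothetical; how the registry actually declares and checks dependencies is not shown in this reference.

```python
import importlib.util

# Hypothetical dependency table: metric name -> top-level module names.
_DECLARED_DEPS = {
    "toy_metric": ["json", "surely_not_installed_module_xyz"],
}


def missing_dependencies_sketch(metric_name):
    """Return declared module names that cannot be found in this environment."""
    declared = _DECLARED_DEPS.get(metric_name, [])  # unknown metric -> []
    return [m for m in declared if importlib.util.find_spec(m) is None]
```

A group-style caller could skip any metric for which this list is non-empty, while an explicit request would turn the same list into an `ImportError` message.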
fastvideo.eval.registry.register
register(name: str)
Decorator to register a metric class.
Usage::
    @register("ssim")
    class SSIMMetric(BaseMetric):
        ...
fastvideo.eval.registry.resolve_group
If name is a group prefix (e.g. "vbench"), return all matching
metric names. Returns None if name is not a group.
Source code in fastvideo/eval/registry.py
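Group resolution by prefix might look like the following sketch. It assumes that grouped metrics are registered under dotted names of the form `"<group>.<metric>"`; that naming scheme, the function name, and the metric names in the example are all hypothetical.

```python
def resolve_group_sketch(name, registered_names):
    """Return the names under the group prefix, or None if name is not a group."""
    prefix = name + "."  # assumption: groups use "<group>.<metric>" names
    matches = sorted(n for n in registered_names if n.startswith(prefix))
    return matches or None


# Hypothetical registered names, for illustration only.
names = ["ssim", "vbench.metric_a", "vbench.metric_b"]
group = resolve_group_sketch("vbench", names)
not_a_group = resolve_group_sketch("ssim", names)
```

Returning `None` instead of an empty list lets the caller distinguish "not a group, treat as a single metric name" from a group lookup.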
fastvideo.eval.types
Classes
fastvideo.eval.types.EvalResults
EvalResults(samples: list[dict[str, MetricResult]] | None = None, corpus: dict[str, MetricResult] | None = None)
Bases: list
Return type for :meth:Evaluator.evaluate with samples=....
Behaves like a list[dict[str, MetricResult]] — one dict per
input sample, in input order — so existing iteration and indexing
keeps working. The corpus attribute carries set-metric results
(FAD, IS, …) that are properties of the whole input set, not of
any individual sample. Empty dict when no set metric ran.
Source code in fastvideo/eval/types.py
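The list-plus-attribute shape described above can be sketched with a small `list` subclass. `EvalResultsSketch` is a hypothetical stand-in that mirrors the documented constructor signature, with plain floats in place of `MetricResult` objects.

```python
class EvalResultsSketch(list):
    """Hypothetical stand-in: per-sample dicts as list items, plus .corpus."""

    def __init__(self, samples=None, corpus=None):
        super().__init__(samples or [])
        # Empty dict when no set metric ran.
        self.corpus = corpus or {}


results = EvalResultsSketch(
    samples=[{"ssim": 0.91}, {"ssim": 0.87}],  # one dict per input sample
    corpus={"fvd": 123.4},                     # set-level results
)
```

Code written against a plain list (`len(results)`, `results[0]`, `for r in results`) keeps working, and set-level scores are read from `results.corpus`.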
fastvideo.eval.types.MetricResult
dataclass
Standard result container returned by all metrics.
score is None when the metric was skipped (e.g. missing
required input). Check details["skipped"] for the reason.
fastvideo.eval.worker
EvalWorker: a single-GPU bag of metric replicas.
One EvalWorker per GPU. Holds an instance of every requested metric on
its own device, scores one sample at a time. Stateless across calls — the
worker never sees more than one evaluate(...) invocation at a time
(the parent :class:Evaluator ensures this by handing the worker to one
thread at a time).
This mirrors FastVideo's Worker layer under :class:VideoGenerator,
but in-process (threads, not processes) — eval metrics are independent
and need no NCCL / TP / SP / distributed init, so process isolation is
unnecessary overhead.
Classes
fastvideo.eval.worker.EvalWorker
EvalWorker(metric_names: list[str], device: str, *, compile: bool = False, pre_upload: bool = True, skip_missing_deps: bool = False)
Owns metric replicas on one device. Single-GPU, single-sample.
Per-sample metrics (is_set_metric=False) return one
:class:MetricResult per evaluate call. Set metrics
(is_set_metric=True) accumulate state on the worker's own
instance and contribute nothing to the per-sample return — the
Evaluator finalizes them after the pool drains.
Source code in fastvideo/eval/worker.py
Attributes
fastvideo.eval.worker.EvalWorker.metric_names
property
Names of the metrics this worker owns, in load order.
Functions
fastvideo.eval.worker.EvalWorker.evaluate
evaluate(*, metrics: list[str] | None = None, **kwargs) -> dict[str, MetricResult]
Score one already-decoded sample.
Inputs (video / reference paths) are decoded and
normalized upstream by the :class:VideoPool. The worker
handles the optional pre-upload to its device and dispatches to
each metric's compute or accumulate.
Samples tagged role="reference" are corpus context for set
metrics only — per-sample metrics skip them so a mixed
Evaluator (e.g. FVD + LPIPS + vbench) doesn't produce spurious
per-sample scores on the reference half of the corpus.
metrics (when not None) further restricts dispatch to
that subset of registered names — used by the per-call
Evaluator.evaluate(metrics=...) filter.
Source code in fastvideo/eval/worker.py
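The dispatch rules above (set metrics accumulate, per-sample metrics skip reference samples, and `metrics` filters by name) can be sketched as follows. The toy metric classes and the `dispatch` function are hypothetical; device pre-upload and `MetricResult` wrapping are omitted.

```python
class ToyPerSample:
    """Hypothetical per-sample metric: scores each sample independently."""
    is_set_metric = False

    def compute(self, sample):
        return float(len(sample["video"]))


class ToySet:
    """Hypothetical set metric: buffers samples for a later finalize step."""
    is_set_metric = True

    def __init__(self):
        self.buffer = []

    def accumulate(self, sample):
        self.buffer.append(sample["video"])


def dispatch(owned, sample, metrics=None):
    """Route one decoded sample to each owned metric."""
    out = {}
    for name, metric in owned.items():
        if metrics is not None and name not in metrics:
            continue  # per-call filter
        if metric.is_set_metric:
            metric.accumulate(sample)  # contributes nothing per sample
            continue
        if sample.get("role") == "reference":
            continue  # reference samples are corpus context only
        out[name] = metric.compute(sample)
    return out
```

In this sketch a reference sample still reaches `ToySet.accumulate`, so the set metric sees both halves of the corpus while the per-sample output stays empty for it.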
fastvideo.eval.worker.EvalWorker.release_cuda_memory
fastvideo.eval.worker.EvalWorker.reload
fastvideo.eval.worker.EvalWorker.reset_set_metrics
fastvideo.eval.worker.EvalWorker.set_metrics
set_metrics() -> dict[str, BaseMetric]
fastvideo.eval.worker.EvalWorker.unload