common
¶
Modules¶
fastvideo.eval.metrics.common.fvd
¶
Modules¶
fastvideo.eval.metrics.common.fvd.extractors
¶
Pluggable video feature extractors for FVD.
Three extractors, sharing the _BaseExtractor contract:
i3d— Kinetics-400 I3D (TorchScript,flateon/FVD-I3D-torchscript). The standard FVD feature space used in the literature.clip— CLIP ViT-B/32 per-frame embeddings, mean-pooled over time. Captures semantic / content quality.videomae— VideoMAE-base last-hidden-state, mean-pooled over patch tokens. Captures structural / motion quality.
The contract is intentionally narrow: each extractor takes a
(B, T, C, H, W) float tensor in [0, 1] and returns (B, D) numpy
features. Preprocessing (resize, normalize, layout) is the extractor's job;
its callers should not care.
Functions¶
fastvideo.eval.metrics.common.fvd.extractors.load_extractor
¶load_extractor(name: str, device: device) -> _BaseExtractor
Instantiate the named extractor on device. Raises ValueError on unknown names.
Source code in fastvideo/eval/metrics/common/fvd/extractors.py
fastvideo.eval.metrics.common.fvd.metric
¶
common.fvd — Fréchet Video Distance.
Measures distributional similarity between generated and reference videos using a pluggable feature backbone (I3D / CLIP / VideoMAE). Lower score → generated videos are closer to the real video distribution.
FVD is a dataset-level metric — it compares the distribution of a collection of videos rather than scoring each video individually. For statistically reliable results at least 256 videos are recommended; the standard protocol uses 2048.
This metric follows the set-vs-set protocol (is_set_metric=True):
- :meth:accumulate is called once per video to buffer features. Routes
on sample["role"] — "reference" → real buffer, anything else →
generated buffer. This is the same convention
:class:audio.frechet_distance.FrechetAudioDistanceMetric uses.
- :meth:finalize is called once after all videos to compute FVD.
- :meth:reset clears both buffers between evaluation runs.
To score two corpora, build a samples list with
:func:fastvideo.eval.io.inputs.samples_from and hand it to
:meth:Evaluator.evaluate — the path-friendly form is::
from fastvideo.eval import create_evaluator, samples_from
ev = create_evaluator(metrics=["common.fvd"], device="cuda:0")
result = ev.evaluate(samples=samples_from(
video="gen/", reference="ref/",
)).corpus["common.fvd"]
Extractors¶
i3d(default) — Kinetics-400 I3D, the standard FVD feature space used in the literature.clip— CLIP ViT-B/32 per-frame embeds, mean-pooled over time. Captures semantic / content quality.videomae— VideoMAE-base last-hidden-state, mean-pooled over patches. Captures structural / motion quality.
CLIP and VideoMAE are research-grade and not directly comparable to FVD
scores from the literature; use i3d for paper comparisons.
Reference features¶
Two ways to supply reference features:
- Stream: pass
reference=to :func:samples_from. Equal cardinality emits paired samples (sample["video"]+sample["reference"]); unequal cardinality role-tags the unmatched references.accumulateroutes both shapes correctly. Ifcache_mode="read_write"(default) and no cache exists, the extracted features are persisted to cache_path on :meth:finalize. - Cache hit: a prior run wrote
cache_path; :meth:setuploads it automatically and streamed references are unnecessary.
Streamed references always win over the cache when both are present.
Cache resolution order (first match wins):
1. cache_path= constructor kwarg, if set.
2. $FASTVIDEO_FVD_REF_FEATURES, if set.
3. ${FASTVIDEO_EVAL_CACHE}/fvd/real_features_{extractor}.pt (default).
The env-var override mirrors audio.frechet_distance's
FASTVIDEO_FAD_REF_FEATURES pattern.
Classes¶
fastvideo.eval.metrics.common.fvd.metric.FVDMetric
¶FVDMetric(extractor: str = 'i3d', cache_path: str | None = None, cache_mode: str = 'read_write', chunk_size: int = 32)
Bases: BaseMetric
Fréchet Video Distance (FVD) over a pluggable feature backbone.
Set-vs-set metric (is_set_metric=True). The Evaluator calls
:meth:accumulate once per video and :meth:finalize once after
all videos have been processed.
For meaningful scores, evaluate over ≥ 256 videos (2048 is the standard protocol used in the literature).
Parameters¶
extractor : str
Feature backbone name. One of "i3d" (default), "clip",
"videomae". See module docstring for guidance.
cache_path : str, optional
Where extracted reference features are cached. Resolution order:
constructor kwarg → $FASTVIDEO_FVD_REF_FEATURES env-var →
${FASTVIDEO_EVAL_CACHE}/fvd/real_features_{extractor}.pt
(default, resolved at :meth:setup time so the env-var can be set
after import).
cache_mode : str
"read_write" (default) — read at setup, write at finalize on
cache miss. "read" — read at setup, never write. "off" —
ignore the cache entirely. Auto-write happens only when the cache
was missing AND reference features streamed in this run.
chunk_size : int
Videos per forward pass. Reduce if GPU runs OOM.
Source code in fastvideo/eval/metrics/common/fvd/metric.py
fastvideo.eval.metrics.common.fvd.metric.FVDMetric.accumulate
¶accumulate(sample: dict) -> None
Extract features from one video and route by sample["role"].
Routing rules:
role="reference"→ real buffer; the reference key is ignored.- Otherwise → generated buffer. If
sample["reference"]is also present, its features also go into the real buffer — this lets a paired samples list (the shape per-sample metrics like LPIPS need) feed FVD in the same pass without a second decode.
Accepts either a raw tensor or a populated :class:Video instance
under either key.
Source code in fastvideo/eval/metrics/common/fvd/metric.py
fastvideo.eval.metrics.common.fvd.metric.FVDMetric.finalize
¶finalize() -> MetricResult
Compute FVD from buffered generated features vs. reference features.
Reference source priority: streamed (_real_buf) > disk cache
(_cached_real). On a fresh cache + streamed refs, write the
accumulated real features back to disk if cache_mode="read_write".
Source code in fastvideo/eval/metrics/common/fvd/metric.py
fastvideo.eval.metrics.common.fvd.metric.FVDMetric.merge_from
¶merge_from(other: BaseMetric) -> None
Fold another worker's accumulated features into this one (multi-GPU).
Both gen and ref buffers concatenate. The disk cache is the same
across workers, so we only adopt other's _cached_real if our
own slot is empty (defensive — should always already match).
Source code in fastvideo/eval/metrics/common/fvd/metric.py
fastvideo.eval.metrics.common.fvd.metric.FVDMetric.reset
¶