metrics
¶
Auto-discover and register all built-in metrics (recursive).
Walks the metrics package tree and imports any leaf-package's
metric module so @register decorators fire. Path components
starting with _ are skipped (used for shared helpers like
optical_flow/_shared.py and vbench/_grit_helper.py).
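The underscore rule can be illustrated with a small filter. This is a minimal sketch, not the package's actual discovery code: `is_discoverable` and its `root` default are hypothetical names, and the real walker also imports the modules it keeps, which this sketch does not do.

```python
def is_discoverable(module_name: str, root: str = "fastvideo.eval.metrics") -> bool:
    """Return True if no path component below `root` starts with an underscore."""
    if not module_name.startswith(root + "."):
        return False
    relative = module_name[len(root) + 1:]
    return not any(part.startswith("_") for part in relative.split("."))


candidates = [
    "fastvideo.eval.metrics.audio.wer.metric",
    "fastvideo.eval.metrics.optical_flow._shared",
    "fastvideo.eval.metrics.vbench._grit_helper",
]
# Only the first candidate survives; the two helper modules are skipped.
kept = [name for name in candidates if is_discoverable(name)]
print(kept)
```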
Modules¶
fastvideo.eval.metrics.audio
¶
Modules¶
fastvideo.eval.metrics.audio.audiobox_aesthetics
¶
Modules¶
fastvideo.eval.metrics.audio.audiobox_aesthetics.metric
¶AudioBox Aesthetics (CE, CU, PC, PQ).
Thin wrapper around Meta's audiobox_aesthetics predictor. Returns
four per-clip dimensions (CE — Content Enjoyment, CU — Content
Usefulness, PC — Production Complexity, PQ — Production Quality);
score exposes PQ, the dimension V2A papers typically report on.
The remaining three are surfaced under details.
The earlier Verse-Bench combined score (CE + CU + PQ + (11 − PC)) / 4
is non-standard and is deliberately not used.
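As a rough illustration of how the four dimensions map onto a result, the sketch below packages a prediction with PQ as the headline value. The dict layout and the function names are hypothetical, not the metric's real return type; the combined formula is included only to show what is deliberately not reported.

```python
def package_scores(pred: dict) -> dict:
    """Expose PQ as the headline score and keep the other three in details."""
    details = {key: pred[key] for key in ("CE", "CU", "PC")}
    return {"score": pred["PQ"], "details": details}


def verse_bench_combined(pred: dict) -> float:
    """The combined score quoted above, shown for comparison only."""
    return (pred["CE"] + pred["CU"] + pred["PQ"] + (11 - pred["PC"])) / 4


# Made-up example values, not real model output.
example = {"CE": 6.0, "CU": 7.0, "PC": 3.0, "PQ": 8.0}
result = package_scores(example)
combined = verse_bench_combined(example)
```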
fastvideo.eval.metrics.audio.audiobox_aesthetics.metric.AudioBoxAestheticsMetric
¶
Bases: BaseMetric
AudioBox Aesthetics: PQ as the primary score, CE/CU/PC/PQ in details.
Source code in fastvideo/eval/metrics/audio/audiobox_aesthetics/metric.py
fastvideo.eval.metrics.audio.clap_score
¶
Modules¶
fastvideo.eval.metrics.audio.clap_score.metric
¶CLAP Score (CS) — text-audio cosine similarity.
Uses HuggingFace transformers.ClapModel with the
laion/clap-htsat-fused checkpoint, the closest HF mirror of the
630k-audioset-fusion-best.pt weights that
hkchengrex/av-benchmark and the V2A literature use. A byte-exact
comparison still requires the laion_clap pip package, which pins
numpy<2 and would force a fastvideo-wide downgrade, so we don't
depend on it. Numbers from this metric are in the same
ballpark as av-benchmark's LAION_CLAP.
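The score itself is a plain cosine similarity between a text embedding and an audio embedding. The sketch below shows only that final step on toy vectors; producing the embeddings with ClapModel is not shown, and `cosine_similarity` is an illustrative helper rather than a function from this module.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy embeddings standing in for CLAP text and audio features.
text_embedding = [0.6, 0.8, 0.0]
audio_embedding = [0.6, 0.8, 0.0]
clap_score = cosine_similarity(text_embedding, audio_embedding)
```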
fastvideo.eval.metrics.audio.clap_score.metric.ClapScoreMetric
¶ClapScoreMetric(model_name: str = DEFAULT_CLAP_REPO)
Bases: BaseMetric
CLAP score — text-audio cosine similarity via HF CLAP.
Source code in fastvideo/eval/metrics/audio/clap_score/metric.py
fastvideo.eval.metrics.audio.desync
¶
Modules¶
fastvideo.eval.metrics.audio.desync.metric
¶Synchformer DeSync — average audio-video desynchronization in seconds.
Per-sample. Ports hkchengrex/av-benchmark's DeSync against the
24-01-04T16-39-21 Synchformer checkpoint (the one MMAudio reports
against). Takes the argmax over the 21-class grid in [-2, +2] seconds for
the first 14 and last 14 segments separately; lower is better, 0 = ideal.
Synchformer's transformer carries a fixed positional embedding sized for
~14 segments per modality, so keep input clips in the 3-10 s range:
shorter clips raise an error in _segment_video, and longer clips have
their middle segments ignored.
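The mapping from class logits to an offset in seconds can be sketched as follows. Two things here are assumptions rather than statements about the ported code: that the 21 classes are evenly spaced over [-2, +2] seconds, and that the two windows are combined by averaging their absolute offsets. All names are illustrative.

```python
NUM_CLASSES = 21
MAX_OFFSET_S = 2.0


def offset_grid():
    """Evenly spaced offsets from -2 s to +2 s (assumed spacing)."""
    step = 2 * MAX_OFFSET_S / (NUM_CLASSES - 1)
    return [-MAX_OFFSET_S + i * step for i in range(NUM_CLASSES)]


def predicted_offset(logits):
    """Offset (seconds) of the highest-scoring class."""
    if len(logits) != NUM_CLASSES:
        raise ValueError(f"expected {NUM_CLASSES} logits, got {len(logits)}")
    best = max(range(NUM_CLASSES), key=lambda i: logits[i])
    return offset_grid()[best]


def desync(first_window_logits, last_window_logits):
    """Mean absolute offset over the two windows (assumed combination rule)."""
    first = abs(predicted_offset(first_window_logits))
    last = abs(predicted_offset(last_window_logits))
    return (first + last) / 2


def one_hot(index):
    return [1.0 if i == index else 0.0 for i in range(NUM_CLASSES)]


in_sync = desync(one_hot(10), one_hot(10))      # both windows predict 0 s
half_off = desync(one_hot(10), one_hot(20))     # 0 s and +2 s
```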
fastvideo.eval.metrics.audio.desync.metric.DeSyncMetric
¶DeSyncMetric(src_fps: float | None = None)
Bases: BaseMetric
Per-sample Synchformer DeSync magnitude in seconds (lower is better).
Source code in fastvideo/eval/metrics/audio/desync/metric.py
fastvideo.eval.metrics.audio.frechet_distance
¶
Modules¶
fastvideo.eval.metrics.audio.frechet_distance.metric
¶Frechet Audio Distance over PaSST embeddings (FD_PaSST).
Corpus-vs-corpus Fréchet distance between Gaussian moments of two PaSST
768-d embedding sets. Ports av_bench.metrics.fad.compute_fd 1:1.
References are supplied per-sample via reference_audio /
role="reference", or once from a .pt cache at
$FASTVIDEO_FAD_REF_FEATURES. Skips when either side has fewer than
two finite embeddings.
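For two Gaussians the Fréchet distance is ||mu_g - mu_r||^2 + Tr(S_g + S_r - 2 (S_g S_r)^(1/2)). The sketch below is the one-dimensional special case, where the trace term collapses to (sd_g - sd_r)^2; it is meant to show the moment comparison and the skip rule, not to reproduce the 768-d computation. Using the sample (n - 1) variance is an assumption, and `frechet_distance_1d` is a hypothetical name.

```python
import math
import statistics


def frechet_distance_1d(generated, reference):
    """Fréchet distance between 1-D Gaussian fits; None if either side is too small."""
    gen = [x for x in generated if math.isfinite(x)]
    ref = [x for x in reference if math.isfinite(x)]
    if len(gen) < 2 or len(ref) < 2:
        return None  # mirrors the "fewer than two finite embeddings" skip
    mu_g, mu_r = statistics.fmean(gen), statistics.fmean(ref)
    sd_g, sd_r = statistics.stdev(gen), statistics.stdev(ref)
    return (mu_g - mu_r) ** 2 + (sd_g - sd_r) ** 2


same = frechet_distance_1d([0.0, 2.0], [0.0, 2.0])
shifted = frechet_distance_1d([0.0, 2.0], [1.0, 3.0])
skipped = frechet_distance_1d([1.0, float("nan")], [0.0, 2.0])
```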
fastvideo.eval.metrics.audio.frechet_distance.metric.FrechetAudioDistanceMetric
¶
Bases: BaseMetric
Corpus-vs-corpus Frechet Audio Distance with PaSST embeddings.
Source code in fastvideo/eval/metrics/audio/frechet_distance/metric.py
fastvideo.eval.metrics.audio.imagebind_score
¶
Modules¶
fastvideo.eval.metrics.audio.imagebind_score.metric
¶ImageBind audio↔video cosine similarity (IB-Score).
Per-sample. Reads sample["video_path"] (or sample["video"].source
for a :class:Video wrapper) and sample["audio"]; ImageBind decodes
its own clips so the path is required, not the pool-decoded tensor.
fastvideo.eval.metrics.audio.imagebind_score.metric.ImageBindScoreMetric
¶
Bases: BaseMetric
ImageBind audio↔video cosine similarity, per-sample.
Source code in fastvideo/eval/metrics/audio/imagebind_score/metric.py
fastvideo.eval.metrics.audio.kl_divergence
¶
Modules¶
fastvideo.eval.metrics.audio.kl_divergence.metric
¶PaSST KL divergence KL(gt || pred) on AudioSet-527 logits.
Ports av_bench.metrics.kl.compute_kl 1:1. The primary score is
the softmax variant (reported as "MKL" / "KL_PaSST" by V2A papers);
the sigmoid variant is exposed in details["kl_sigmoid"].
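The softmax variant can be sketched in a few lines. This is an illustration of KL(gt || pred) on logits, not the ported compute_kl: the `eps` guard and the function names are assumptions, and the sigmoid variant is not shown.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_softmax(gt_logits, pred_logits, eps=1e-12):
    """KL(gt || pred) between the softmax distributions of two logit vectors."""
    p = softmax(gt_logits)
    q = softmax(pred_logits)
    return sum(pi * (math.log(pi + eps) - math.log(qi + eps)) for pi, qi in zip(p, q))


identical = kl_softmax([2.0, 0.0, -1.0], [2.0, 0.0, -1.0])
different = kl_softmax([2.0, 0.0, 0.0], [0.0, 2.0, 0.0])
```

Note the direction: the ground-truth distribution is the first argument, so swapping the arguments generally gives a different value.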
fastvideo.eval.metrics.audio.kl_divergence.metric.KLDivergenceMetric
¶
Bases: BaseMetric
PaSST KL divergence KL(gt || pred) on AudioSet-527 logits.
Per-sample. Requires sample["audio"] (generated) and
sample["reference_audio"] (ground truth).
Source code in fastvideo/eval/metrics/audio/kl_divergence/metric.py
fastvideo.eval.metrics.audio.wer
¶
Modules¶
fastvideo.eval.metrics.audio.wer.metric
¶Word Error Rate for generated audio.
Per-sample. Default backend is Whisper-base; glm_asr (vendored under
third_party/eval/glmasr/) and sensevoice (FunASR, install-on-
demand) are selectable via asr_backend=. Text is normalized per the
MagiHuman convention and CJK inputs are scored at the character level.
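A minimal sketch of the scoring step, assuming the transcript is already available. The normalization here (lowercase, strip punctuation) is a simplified stand-in and not the MagiHuman convention, the CJK detection range is an assumption, and all function names are hypothetical.

```python
import re

_CJK = re.compile(r"[\u4e00-\u9fff\u3040-\u30ff\uac00-\ud7af]")


def tokenize(text):
    """Lowercase, drop punctuation, then split by word (or by character for CJK)."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    if _CJK.search(text):
        return [ch for ch in text if not ch.isspace()]
    return text.split()


def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = cur
    return prev[-1]


def wer(reference, hypothesis):
    """Edit distance divided by the number of reference tokens."""
    ref, hyp = tokenize(reference), tokenize(hypothesis)
    if not ref:
        raise ValueError("reference is empty after normalization")
    return edit_distance(ref, hyp) / len(ref)


exact = wer("The cat sat.", "the cat sat")
one_sub = wer("the cat sat", "the dog sat")
cjk = wer("你好世界", "你好世")
```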
fastvideo.eval.metrics.audio.wer.metric.WERMetric
¶WERMetric(asr_backend: str = 'whisper', model_name: str | None = None, instruction: str = 'Please transcribe this audio into text')
Bases: BaseMetric
WER metric with selectable ASR backend.
Sample fields: audio, reference_text (or text_prompt),
optional language. instruction is passed to GLM-ASR.
Source code in fastvideo/eval/metrics/audio/wer/metric.py
fastvideo.eval.metrics.base
¶
Classes¶
fastvideo.eval.metrics.base.BaseMetric
¶
Abstract base class for all eval metrics.
Two execution shapes:
- Per-sample (is_set_metric=False, default): implement :meth:compute. The Evaluator calls it once per input sample and returns one :class:MetricResult per sample.
- Set-vs-set (is_set_metric=True): implement :meth:accumulate (called once per sample to buffer features) and :meth:finalize (called once after all samples to compute the corpus-level result). Use :meth:reset to clear buffers and :meth:merge_from to fold multi-GPU per-worker state together.
Optionally override :meth:setup to eagerly load models. Metrics
that chunk along the time dim for memory hardcode their own chunk
size in __init__ (see optical_flow for the canonical
example). Eval always processes one video per
:meth:Evaluator.evaluate call; compute / accumulate
receive a single sample, not a batch.
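The set-vs-set lifecycle can be sketched with a toy class. `ToySetMetric` is a stand-in written for this example and does not subclass or reproduce BaseMetric; the plain dict it returns is a placeholder for MetricResult, whose fields are not described here.

```python
class ToySetMetric:
    """Stand-in set-vs-set metric: the corpus score is the mean of buffered values."""

    is_set_metric = True

    def __init__(self):
        self._values = []

    def accumulate(self, sample: dict) -> None:
        # Called once per sample: buffer a feature, compute nothing yet.
        self._values.append(float(sample["value"]))

    def finalize(self) -> dict:
        # Called once after all samples: produce the corpus-level result.
        if not self._values:
            return {"score": None}
        return {"score": sum(self._values) / len(self._values)}

    def reset(self) -> None:
        self._values.clear()

    def merge_from(self, other: "ToySetMetric") -> None:
        # Fold another worker's buffer into this one (multi-GPU case).
        self._values.extend(other._values)


worker_a, worker_b = ToySetMetric(), ToySetMetric()
for value in (1.0, 2.0):
    worker_a.accumulate({"value": value})
worker_b.accumulate({"value": 6.0})
worker_a.merge_from(worker_b)
merged = worker_a.finalize()
```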
Source code in fastvideo/eval/metrics/base.py
Functions¶
fastvideo.eval.metrics.base.BaseMetric.compute
¶compute(sample: dict) -> MetricResult
Per-sample metrics: compute the score for one sample.
sample["video"] is (T, C, H, W) float in [0, 1].
sample["reference"] (if used) has the same shape. Return
self._skip(sample, reason) for missing inputs.
Source code in fastvideo/eval/metrics/base.py
fastvideo.eval.metrics.base.BaseMetric.finalize
¶finalize() -> MetricResult
fastvideo.eval.metrics.base.BaseMetric.merge_from
¶merge_from(other: BaseMetric) -> None
fastvideo.eval.metrics.base.BaseMetric.reset
¶
fastvideo.eval.metrics.base.BaseMetric.setup
¶Eagerly load models. Called once by :class:EvalWorker.
Default is a no-op; metrics with no eager state (pixel math, closed-form ops) inherit this. Override only if your metric needs to load weights.
Source code in fastvideo/eval/metrics/base.py
fastvideo.eval.metrics.base.BaseMetric.to
¶to(device: str | device) -> BaseMetric
fastvideo.eval.metrics.common
¶
Modules¶
fastvideo.eval.metrics.common.fvd
¶
Modules¶
fastvideo.eval.metrics.common.fvd.extractors
¶Pluggable video feature extractors for FVD.
Three extractors, sharing the _BaseExtractor contract:
- i3d: Kinetics-400 I3D (TorchScript, flateon/FVD-I3D-torchscript). The standard FVD feature space used in the literature.
- clip: CLIP ViT-B/32 per-frame embeddings, mean-pooled over time. Captures semantic / content quality.
- videomae: VideoMAE-base last-hidden-state, mean-pooled over patch tokens. Captures structural / motion quality.
The contract is intentionally narrow: each extractor takes a
(B, T, C, H, W) float tensor in [0, 1] and returns (B, D) numpy
features. Preprocessing (resize, normalize, layout) is the extractor's job;
its callers should not care.
fastvideo.eval.metrics.common.fvd.extractors.load_extractor
¶load_extractor(name: str, device: device) -> _BaseExtractor
Instantiate the named extractor on device. Raises ValueError on unknown names.
Source code in fastvideo/eval/metrics/common/fvd/extractors.py
fastvideo.eval.metrics.common.fvd.metric
¶common.fvd — Fréchet Video Distance.
Measures distributional similarity between generated and reference videos using a pluggable feature backbone (I3D / CLIP / VideoMAE). Lower score → generated videos are closer to the real video distribution.
FVD is a dataset-level metric — it compares the distribution of a collection of videos rather than scoring each video individually. For statistically reliable results at least 256 videos are recommended; the standard protocol uses 2048.
This metric follows the set-vs-set protocol (is_set_metric=True):
- :meth:accumulate is called once per video to buffer features. Routes
on sample["role"] — "reference" → real buffer, anything else →
generated buffer. This is the same convention
:class:audio.frechet_distance.FrechetAudioDistanceMetric uses.
- :meth:finalize is called once after all videos to compute FVD.
- :meth:reset clears both buffers between evaluation runs.
To score two corpora, build a samples list with
:func:fastvideo.eval.io.inputs.samples_from and hand it to
:meth:Evaluator.evaluate — the path-friendly form is::
    from fastvideo.eval import create_evaluator, samples_from

    ev = create_evaluator(metrics=["common.fvd"], device="cuda:0")
    result = ev.evaluate(samples=samples_from(
        video="gen/", reference="ref/",
    )).corpus["common.fvd"]
Extractors¶
- i3d (default): Kinetics-400 I3D, the standard FVD feature space used in the literature.
- clip: CLIP ViT-B/32 per-frame embeds, mean-pooled over time. Captures semantic / content quality.
- videomae: VideoMAE-base last-hidden-state, mean-pooled over patches. Captures structural / motion quality.
CLIP and VideoMAE are research-grade and not directly comparable to FVD
scores from the literature; use i3d for paper comparisons.
Reference features¶
Two ways to supply reference features:
- Stream: pass reference= to :func:samples_from. Equal cardinality emits paired samples (sample["video"] + sample["reference"]); unequal cardinality role-tags the unmatched references. accumulate routes both shapes correctly. If cache_mode="read_write" (default) and no cache exists, the extracted features are persisted to cache_path on :meth:finalize.
- Cache hit: a prior run wrote cache_path; :meth:setup loads it automatically and streamed references are unnecessary.
Streamed references always win over the cache when both are present.
Cache resolution order (first match wins):
1. cache_path= constructor kwarg, if set.
2. $FASTVIDEO_FVD_REF_FEATURES, if set.
3. ${FASTVIDEO_EVAL_CACHE}/fvd/real_features_{extractor}.pt (default).
The env-var override mirrors audio.frechet_distance's
FASTVIDEO_FAD_REF_FEATURES pattern.
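The resolution order reads naturally as a chain of early returns. The sketch below is an assumption-laden illustration: `resolve_cache_path` is a hypothetical helper, the environment is passed in as a mapping so the example is deterministic, and what happens when FASTVIDEO_EVAL_CACHE is unset is not described here, so the sketch simply raises KeyError.

```python
import os
import posixpath


def resolve_cache_path(cache_path, extractor, env=None):
    """Return the reference-feature cache path; first match wins."""
    env = os.environ if env is None else env
    if cache_path:                                    # 1. constructor kwarg
        return cache_path
    override = env.get("FASTVIDEO_FVD_REF_FEATURES")
    if override:                                      # 2. env-var override
        return override
    root = env["FASTVIDEO_EVAL_CACHE"]                # 3. default location
    return posixpath.join(root, "fvd", f"real_features_{extractor}.pt")


from_kwarg = resolve_cache_path("mine.pt", "i3d", {"FASTVIDEO_FVD_REF_FEATURES": "/x.pt"})
from_env = resolve_cache_path(None, "i3d", {"FASTVIDEO_FVD_REF_FEATURES": "/x.pt"})
from_default = resolve_cache_path(None, "clip", {"FASTVIDEO_EVAL_CACHE": "/cache"})
```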
fastvideo.eval.metrics.common.fvd.metric.FVDMetric
¶FVDMetric(extractor: str = 'i3d', cache_path: str | None = None, cache_mode: str = 'read_write', chunk_size: int = 32)
Bases: BaseMetric
Fréchet Video Distance (FVD) over a pluggable feature backbone.
Set-vs-set metric (is_set_metric=True). The Evaluator calls
:meth:accumulate once per video and :meth:finalize once after
all videos have been processed.
For meaningful scores, evaluate over ≥ 256 videos (2048 is the standard protocol used in the literature).
Parameters¶
extractor : str
Feature backbone name. One of "i3d" (default), "clip",
"videomae". See module docstring for guidance.
cache_path : str, optional
Where extracted reference features are cached. Resolution order:
constructor kwarg → $FASTVIDEO_FVD_REF_FEATURES env-var →
${FASTVIDEO_EVAL_CACHE}/fvd/real_features_{extractor}.pt
(default, resolved at :meth:setup time so the env-var can be set
after import).
cache_mode : str
"read_write" (default) — read at setup, write at finalize on
cache miss. "read" — read at setup, never write. "off" —
ignore the cache entirely. Auto-write happens only when the cache
was missing AND reference features streamed in this run.
chunk_size : int
Videos per forward pass. Reduce if the GPU runs out of memory.
Source code in fastvideo/eval/metrics/common/fvd/metric.py
fastvideo.eval.metrics.common.fvd.metric.FVDMetric.accumulate
¶accumulate(sample: dict) -> None
Extract features from one video and route by sample["role"].
Routing rules:
- role="reference" → real buffer; the reference key is ignored.
- Otherwise → generated buffer. If sample["reference"] is also present, its features also go into the real buffer. This lets a paired samples list (the shape per-sample metrics like LPIPS need) feed FVD in the same pass without a second decode.
Accepts either a raw tensor or a populated :class:Video instance
under either key.
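The routing rules can be sketched as a pure function. In this illustration the strings stand in for extracted features, `route` is a hypothetical helper, and the real method appends to internal buffers instead of returning a dict.

```python
def route(sample: dict) -> dict:
    """Return which buffer each entry of `sample` would land in."""
    routed = {"generated": [], "real": []}
    if sample.get("role") == "reference":
        # Reference-tagged sample: only "video" is used, "reference" is ignored.
        routed["real"].append(sample["video"])
        return routed
    routed["generated"].append(sample["video"])
    if sample.get("reference") is not None:
        # Paired sample: the reference feeds the real buffer in the same pass.
        routed["real"].append(sample["reference"])
    return routed


tagged = route({"video": "ref_0", "role": "reference", "reference": "ignored"})
paired = route({"video": "gen_0", "reference": "ref_0"})
unpaired = route({"video": "gen_1"})
```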
Source code in fastvideo/eval/metrics/common/fvd/metric.py
fastvideo.eval.metrics.common.fvd.metric.FVDMetric.finalize
¶finalize() -> MetricResult
Compute FVD from buffered generated features vs. reference features.
Reference source priority: streamed (_real_buf) > disk cache
(_cached_real). On a fresh cache + streamed refs, write the
accumulated real features back to disk if cache_mode="read_write".
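The source priority and the write-back condition can be summarized in two small predicates. This is a sketch under stated assumptions: both function names are hypothetical, lists stand in for feature tensors, and the arguments only mirror the roles of _real_buf and _cached_real.

```python
def pick_reference(real_buf, cached_real):
    """Streamed references win over the disk cache when both are present."""
    if real_buf:
        return real_buf, "streamed"
    if cached_real:
        return cached_real, "cache"
    return None, "missing"


def should_write_cache(cache_mode, cache_was_loaded, real_buf):
    """Write only in read_write mode, on a cache miss, with streamed references."""
    return cache_mode == "read_write" and not cache_was_loaded and bool(real_buf)


both = pick_reference(["streamed_feat"], ["cached_feat"])
only_cache = pick_reference([], ["cached_feat"])
fresh_run = should_write_cache("read_write", False, ["streamed_feat"])
read_only = should_write_cache("read", False, ["streamed_feat"])
```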
Source code in fastvideo/eval/metrics/common/fvd/metric.py
fastvideo.eval.metrics.common.fvd.metric.FVDMetric.merge_from
¶merge_from(other: BaseMetric) -> None
Fold another worker's accumulated features into this one (multi-GPU).
Both gen and ref buffers concatenate. The disk cache is the same
across workers, so we only adopt other's _cached_real if our
own slot is empty (defensive — should always already match).
Source code in fastvideo/eval/metrics/common/fvd/metric.py
fastvideo.eval.metrics.common.fvd.metric.FVDMetric.reset
¶
fastvideo.eval.metrics.optical_flow
¶
Modules¶
fastvideo.eval.metrics.optical_flow.gt_optical_flow
¶
Modules¶
fastvideo.eval.metrics.optical_flow.gt_optical_flow.metric
¶Compare optical flow extracted from a generated video against optical flow extracted from a ground-truth reference video.
Both flows are produced by the same ptlflow model (default
dpflow/things). The resulting per-pixel / per-frame / temporal
metric set is identical to synthetic_optical_flow — only the way the
reference flow is constructed differs.
fastvideo.eval.metrics.optical_flow.gt_optical_flow.metric.GtOpticalFlowMetric
¶GtOpticalFlowMetric(model_name: str = 'dpflow', ckpt: str = 'things', min_mag: float = 0.5, max_mag_pct: float = 80.0, grid_size: int = 8)
Bases: BaseMetric
Per-pixel / per-frame / temporal flow comparison vs. a reference video.
The headline score is pixel_epe_mean_mean (lower is better);
every other scalar lives in details so downstream consumers can
pick whichever one they care about.
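End-point error (EPE) is the Euclidean distance between two flow vectors at the same pixel. The sketch below reads pixel_epe_mean_mean as "mean over pixels, then mean over frames", which is an assumption based on the name; the min_mag, max_mag_pct and grid_size parameters are not modeled, and the function names are illustrative.

```python
import math


def pixel_epe(pred_flow, ref_flow):
    """Per-pixel end-point error for one frame; flows are lists of (u, v) pairs."""
    return [
        math.hypot(pu - ru, pv - rv)
        for (pu, pv), (ru, rv) in zip(pred_flow, ref_flow)
    ]


def epe_mean_mean(pred_frames, ref_frames):
    """Average EPE over pixels within each frame, then over frames."""
    per_frame = []
    for pred_flow, ref_flow in zip(pred_frames, ref_frames):
        errors = pixel_epe(pred_flow, ref_flow)
        per_frame.append(sum(errors) / len(errors))
    return sum(per_frame) / len(per_frame)


# Two frames of two pixels each; only one pixel in the first frame is off.
pred = [[(3.0, 4.0), (0.0, 0.0)], [(1.0, 1.0), (2.0, 2.0)]]
ref = [[(0.0, 0.0), (0.0, 0.0)], [(1.0, 1.0), (2.0, 2.0)]]
score = epe_mean_mean(pred, ref)
```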
Source code in fastvideo/eval/metrics/optical_flow/gt_optical_flow/metric.py
fastvideo.eval.metrics.optical_flow.synthetic_optical_flow
¶
Modules¶
fastvideo.eval.metrics.optical_flow.synthetic_optical_flow.metric
¶Compare optical flow extracted from a generated video against optical flow synthesized analytically from per-frame actions.
The reference flow is not observed from a ground-truth video — it's
predicted from the action stream via a third-person camera-kinematics
model (Longuet-Higgins linearization + off-pivot translation correction;
no depth). Observed flow comes from the same ptlflow model used by
gt_optical_flow, and the two are compared with the identical metric
set, so scores are directly comparable across the two metrics.
Required sample keys¶
video
(B, T, C, H, W) float in [0, 1].
actions
dict (or list-of-dicts of length B) with two np.ndarray keys:
- keyboard of shape (T, 6): [W, S, A, D, turn_left, turn_right]
- mouse of shape (T, 2): [pitch, yaw]
calibration
Either a path to a ThirdPersonCalibration JSON file, or a dict
of fitted parameters. May also be set once at construction time via
calibration_path= and reused across samples.
Optional sample keys¶
mouse_pitch_sign
+1 (default) or -1 if the dataset's mouse-pitch sign is
flipped (mhuo's data carries this in metadata).
fastvideo.eval.metrics.optical_flow.synthetic_optical_flow.metric.SyntheticOpticalFlowMetric
¶SyntheticOpticalFlowMetric(model_name: str = 'dpflow', ckpt: str = 'things', calibration_path: str | Path | None = None, min_mag: float = 0.5, max_mag_pct: float = 80.0, grid_size: int = 8)
Bases: BaseMetric
Action-driven synthetic flow vs. video-extracted observed flow.
Pass calibration_path at construction to bind the calibration
once across all samples; otherwise supply sample["calibration"]
per call. Missing actions or calibration produce a skipped result
(score=None) rather than raising.
Source code in fastvideo/eval/metrics/optical_flow/synthetic_optical_flow/metric.py
fastvideo.eval.metrics.physics_iq
¶
Modules¶
fastvideo.eval.metrics.physics_iq.utils
¶
Functions¶
fastvideo.eval.metrics.physics_iq.utils.prepare_pair
¶prepare_pair(sample: dict[str, Any], *, prep_kwargs: dict[str, Any] | None = None) -> PreparedPhysicsIQPair
Resolve a sample into a prepared (gen, ref) pair.
Caches the result on sample['_physics_iq_pair'] so other physics_iq
sub-metrics on the same sample reuse it instead of re-decoding.
Source code in fastvideo/eval/metrics/physics_iq/utils.py
fastvideo.eval.metrics.vbench
¶
VBench metrics. Bootstraps the upstream submodule onto sys.path and installs runtime compat shims for modern torch/transformers/numpy/timm.
The upstream vbench source lives as a git submodule at
fastvideo/third_party/eval/vbench (pinned to a specific
Vchitect/VBench SHA). We do not pip-install it — we only need its
Python modules importable. Its runtime deps (clip, transformers, etc.)
are already in FastVideo's main env.
Compat with modern dependency versions is achieved at import time, in this file, instead of via on-disk patches to upstream files. Each shim below corresponds to a specific drift between vbench's pinned-2023 deps and FastVideo's current pins. Adding a new shim is preferable to editing the submodule.
Modules¶
fastvideo.eval.metrics.vbench.aesthetic_quality
¶
fastvideo.eval.metrics.vbench.appearance_style
¶
fastvideo.eval.metrics.vbench.background_consistency
¶
fastvideo.eval.metrics.vbench.dynamic_degree
¶
Modules¶
fastvideo.eval.metrics.vbench.dynamic_degree.metric
¶VBench Dynamic Degree — RAFT optical flow motion detection.
For each consecutive frame pair, computes optical flow via RAFT and takes the mean of the top 5% flow magnitudes. If enough pairs exceed an adaptive threshold, the video is classified as dynamic (1.0) vs static (0.0).
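The per-pair statistic and the final classification can be sketched as below. The RAFT flow computation is not shown, and the adaptive threshold is not reproduced: `threshold` and `min_pairs` are hypothetical parameters supplied by the caller, and the function names are illustrative.

```python
def top_percent_mean(magnitudes, percent=5):
    """Mean of the largest `percent`% of flow magnitudes (at least one value)."""
    k = max(1, len(magnitudes) * percent // 100)
    top = sorted(magnitudes, reverse=True)[:k]
    return sum(top) / len(top)


def dynamic_degree(per_pair_magnitudes, threshold, min_pairs):
    """1.0 if enough frame pairs exceed the threshold, otherwise 0.0."""
    moving = sum(
        1 for magnitudes in per_pair_magnitudes
        if top_percent_mean(magnitudes) > threshold
    )
    return 1.0 if moving >= min_pairs else 0.0


# Three frame pairs: one nearly static, two with strong motion.
pairs = [[0.1] * 20, [9.0] * 20, [9.0] * 20]
label = dynamic_degree(pairs, threshold=1.0, min_pairs=2)
```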
fastvideo.eval.metrics.vbench.dynamic_degree.metric.DynamicDegreeMetric
¶
fastvideo.eval.metrics.vbench.human_action
¶
fastvideo.eval.metrics.vbench.imaging_quality
¶
fastvideo.eval.metrics.vbench.motion_smoothness
¶
Modules¶
fastvideo.eval.metrics.vbench.motion_smoothness.metric
¶VBench Motion Smoothness — AMT-S frame interpolation quality.
Takes every other frame, uses AMT-S to interpolate the missing middle frames, then compares the interpolated frames against the actual ones. Score = (255 - mean_pixel_diff) / 255. Higher = smoother motion.
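The frame split and the score formula can be sketched on toy data. The interpolator here is a naive midpoint average used purely as a placeholder for AMT-S, frames are flat lists of 0-255 pixel values, and treating mean_pixel_diff as a mean absolute difference is an assumption.

```python
def split_frames(frames):
    """Keep even-indexed frames; odd-indexed frames become the held-out targets."""
    return frames[::2], frames[1::2]


def midpoint_interpolate(kept):
    """Placeholder for AMT-S: average each pair of neighbouring kept frames."""
    return [
        [(a + b) / 2 for a, b in zip(left, right)]
        for left, right in zip(kept, kept[1:])
    ]


def smoothness_score(interpolated, actual):
    """Score = (255 - mean absolute pixel difference) / 255."""
    diffs = [
        abs(p - q)
        for frame_i, frame_a in zip(interpolated, actual)
        for p, q in zip(frame_i, frame_a)
    ]
    return (255 - sum(diffs) / len(diffs)) / 255


# Brightness ramps linearly, so the midpoint guess matches the held-out frame.
frames = [[0, 0], [10, 10], [20, 20]]
kept, held_out = split_frames(frames)
score = smoothness_score(midpoint_interpolate(kept), held_out)
```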
fastvideo.eval.metrics.vbench.multiple_objects
¶
fastvideo.eval.metrics.vbench.object_class
¶
fastvideo.eval.metrics.vbench.overall_consistency
¶
fastvideo.eval.metrics.vbench.scene
¶
Modules¶
fastvideo.eval.metrics.vbench.scene.metric
¶VLM-based scene matching using AVoCaDO (Qwen2.5-Omni).
Replaces VBench's Tag2Text-based scene metric with a modern VLM caption. The algorithm follows VBench:
1. Caption the video.
2. Check if all scene keywords appear in the caption.
3. Score = 1.0 if all match, 0.0 otherwise.
Unlike VBench (which captions each frame separately with Tag2Text), AVoCaDO captions the entire video in one pass with rich natural language, making the keyword check more robust.
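The keyword check reduces to an all-or-nothing test on the caption. The sketch below assumes case-insensitive substring matching, which may differ from the metric's actual matching rule; captioning with AVoCaDO is not shown and `scene_score` is a hypothetical name.

```python
def scene_score(caption: str, keywords: list) -> float:
    """1.0 if every keyword occurs in the caption, otherwise 0.0."""
    text = caption.lower()
    return 1.0 if all(keyword.lower() in text for keyword in keywords) else 0.0


caption = "A quiet Lighthouse stands on a rocky cliff above the sea."
full_match = scene_score(caption, ["lighthouse", "cliff"])
partial_match = scene_score(caption, ["lighthouse", "desert"])
```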
fastvideo.eval.metrics.vbench.spatial_relationship
¶
fastvideo.eval.metrics.vbench.subject_consistency
¶
fastvideo.eval.metrics.vbench.temporal_flickering
¶
fastvideo.eval.metrics.vbench.temporal_style
¶
Modules¶
fastvideo.eval.metrics.vbench.temporal_style.metric
¶VBench Temporal Style — ViCLIP text-video alignment (style focus).
Identical logic to overall_consistency — same ViCLIP cosine similarity. The difference is semantic: overall_consistency measures general prompt alignment while temporal_style measures style consistency over time. VBench uses different prompts for each from its metadata JSON.
fastvideo.eval.metrics.videoscore2
¶
Modules¶
fastvideo.eval.metrics.videoscore2.metric
¶
VideoScore2 — VLM-based video quality scoring.
Uses a Qwen2.5-VL model fine-tuned to score generated videos on three
dimensions: visual quality, text-to-video alignment, and physical
consistency. Scores are derived from token logits using upstream's
ll_based_soft_score_normed weighting (1-5 scale).
Reference: TIGER-AI-Lab/VideoScore2 (vs2_inference.py).
Classes¶
fastvideo.eval.metrics.videoscore2.metric.VideoScore2Metric
¶VideoScore2Metric(model_name: str = 'TIGER-Lab/VideoScore2', infer_fps: float = 2.0, max_tokens: int = 1024, temperature: float = 0.7, do_sample: bool = True)
Bases: BaseMetric
VideoScore2: VLM-based video quality scoring (3 dimensions).
Requires sample["text_prompt"] for text-to-video alignment.
Supports batched generation for GPU efficiency.