metrics

Auto-discover and register all built-in metrics (recursive).

Walks the metrics package tree and imports any leaf-package's metric module so @register decorators fire. Path components starting with _ are skipped (used for shared helpers like optical_flow/_shared.py and vbench/_grit_helper.py).

Modules

fastvideo.eval.metrics.audio

Modules

fastvideo.eval.metrics.audio.audiobox_aesthetics
Modules
fastvideo.eval.metrics.audio.audiobox_aesthetics.metric

AudioBox Aesthetics (CE, CU, PC, PQ).

Thin wrapper around Meta's audiobox_aesthetics predictor. Returns four per-clip dimensions (CE — Content Enjoyment, CU — Content Usefulness, PC — Production Complexity, PQ — Production Quality); score exposes PQ, the dimension V2A papers typically report on. The remaining three are surfaced under details.

The earlier Verse-Bench combined score (CE + CU + PQ + (11 − PC)) / 4 is non-standard and is deliberately not used.
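As a worked illustration of the two conventions, the sketch below computes the Verse-Bench-style combined score next to the PQ-only value this metric reports. The function name and the four input values are hypothetical; only the formula comes from the text above.

```python
def verse_bench_combined(ce: float, cu: float, pc: float, pq: float) -> float:
    """Non-standard combined score: PC is inverted as (11 - PC) before averaging."""
    return (ce + cu + pq + (11.0 - pc)) / 4.0


# Hypothetical per-clip predictions, not real model output.
dims = {"CE": 6.0, "CU": 7.0, "PC": 3.0, "PQ": 8.0}

score = dims["PQ"]  # the value this metric exposes as the primary score
combined = verse_bench_combined(dims["CE"], dims["CU"], dims["PC"], dims["PQ"])
print(score, combined)
```

With these inputs the combined value is (6 + 7 + 8 + 8) / 4 = 7.25, while the reported score stays at PQ = 8.0.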

Classes
fastvideo.eval.metrics.audio.audiobox_aesthetics.metric.AudioBoxAestheticsMetric
AudioBoxAestheticsMetric()

Bases: BaseMetric

AudioBox Aesthetics: PQ as the primary score, CE/CU/PC/PQ in details.

Source code in fastvideo/eval/metrics/audio/audiobox_aesthetics/metric.py
def __init__(self) -> None:
    super().__init__()
    self._predictor: Any = None
Functions
fastvideo.eval.metrics.audio.clap_score
Modules
fastvideo.eval.metrics.audio.clap_score.metric

CLAP Score (CS) — text-audio cosine similarity.

Uses HuggingFace transformers.ClapModel with the laion/clap-htsat-fused checkpoint, the closest HF mirror of the 630k-audioset-fusion-best.pt weights that hkchengrex/av-benchmark and the V2A literature use. A byte-exact comparison would still require the laion_clap pip package, which pins numpy<2 and forces a fastvideo-wide downgrade, so we do not depend on it. Numbers from this metric are in the same ballpark as av-benchmark's LAION_CLAP.
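The scoring step itself is a plain cosine similarity between one text embedding and one audio embedding. A minimal sketch, assuming the embeddings have already been produced by the model; the helper name and the toy vectors are illustrative and not part of this module:

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy 3-d stand-ins for a text embedding and an audio embedding.
text_emb = [1.0, 0.0, 1.0]
audio_emb = [1.0, 1.0, 0.0]
print(cosine_similarity(text_emb, audio_emb))  # approximately 0.5
```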

Classes
fastvideo.eval.metrics.audio.clap_score.metric.ClapScoreMetric
ClapScoreMetric(model_name: str = DEFAULT_CLAP_REPO)

Bases: BaseMetric

CLAP score — text-audio cosine similarity via HF CLAP.

Source code in fastvideo/eval/metrics/audio/clap_score/metric.py
def __init__(self, model_name: str = DEFAULT_CLAP_REPO) -> None:
    super().__init__()
    self._model_name = model_name
    self._model: Any = None
    self._processor: Any = None
Functions
fastvideo.eval.metrics.audio.desync
Modules
fastvideo.eval.metrics.audio.desync.metric

Synchformer DeSync — average audio-video desynchronization in seconds.

Per-sample. Ports hkchengrex/av-benchmark's DeSync against the 24-01-04T16-39-21 Synchformer checkpoint (the one MMAudio reports against). The score takes the argmax over the 21-class grid in [-2, +2] seconds for the first 14 and last 14 segments separately; lower is better, and 0 is ideal.

Synchformer's transformer carries a fixed positional embedding sized for ~14 segments per modality, so keep input clips in the 3-10 s range: shorter clips raise an error in _segment_video, and longer clips have their middle section ignored.
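The decoding step can be sketched as follows. This is an illustration under two assumptions that the text above does not state: the 21 classes are evenly spaced over [-2, +2] s (a 0.2 s step), and the two window predictions are combined by averaging their absolute offsets. All function names are hypothetical.

```python
def offset_grid(num_classes=21, max_offset=2.0):
    """Evenly spaced offsets in seconds (assumed spacing)."""
    step = 2.0 * max_offset / (num_classes - 1)
    return [-max_offset + i * step for i in range(num_classes)]


def predicted_offset(logits, grid):
    """Offset whose class has the highest logit."""
    if len(logits) != len(grid):
        raise ValueError("expected one logit per grid class")
    best = max(range(len(logits)), key=logits.__getitem__)
    return grid[best]


def desync(first_window_logits, last_window_logits):
    """Mean absolute offset over the first-14 and last-14 segment windows."""
    grid = offset_grid()
    offsets = [
        predicted_offset(first_window_logits, grid),
        predicted_offset(last_window_logits, grid),
    ]
    return sum(abs(o) for o in offsets) / len(offsets)


# Toy logits: the first window peaks at 0 s, the last window at about +0.4 s.
first = [0.0] * 21
first[10] = 5.0
last = [0.0] * 21
last[12] = 5.0
print(desync(first, last))  # approximately 0.2
```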

Classes
fastvideo.eval.metrics.audio.desync.metric.DeSyncMetric
DeSyncMetric(src_fps: float | None = None)

Bases: BaseMetric

Per-sample Synchformer DeSync magnitude in seconds (lower is better).

Source code in fastvideo/eval/metrics/audio/desync/metric.py
def __init__(self, src_fps: float | None = None) -> None:
    super().__init__()
    # Defaults to 25 fps (Synchformer's target) when the pool's decode fps is unknown.
    self._src_fps = src_fps
    self._model: Any = None
    self._mel: Any = None
    self._grid: torch.Tensor | None = None
    self._resample_cache: dict[int, Any] = {}
Functions
fastvideo.eval.metrics.audio.frechet_distance
Modules
fastvideo.eval.metrics.audio.frechet_distance.metric

Frechet Audio Distance over PaSST embeddings (FD_PaSST).

Corpus-vs-corpus Fréchet distance between Gaussian moments of two PaSST 768-d embedding sets. Ports av_bench.metrics.fad.compute_fd 1:1.

References are supplied per-sample via reference_audio / role="reference", or once from a .pt cache at $FASTVIDEO_FAD_REF_FEATURES. Skips when either side has fewer than two finite embeddings.
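For orientation, the Fréchet distance between two Gaussians is ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1 S2)^(1/2)). The NumPy sketch below is an illustration of that formula, not the ported compute_fd; the function names are hypothetical, and the matrix square root is replaced by the equivalent sum of square-rooted eigenvalues of S1 S2.

```python
import numpy as np


def gaussian_moments(features):
    """Mean vector and covariance matrix of an (N, D) feature array, N >= 2."""
    mu = features.mean(axis=0)
    sigma = np.cov(features, rowvar=False)
    return mu, sigma


def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Squared Fréchet distance between N(mu1, sigma1) and N(mu2, sigma2)."""
    diff = mu1 - mu2
    # Tr((S1 S2)^(1/2)) equals the sum of the square roots of the eigenvalues
    # of S1 S2, which are real and non-negative for PSD inputs up to round-off.
    eigvals = np.linalg.eigvals(sigma1 @ sigma2)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    return float(diff @ diff + np.trace(sigma1) + np.trace(sigma2) - 2.0 * trace_sqrt)


rng = np.random.default_rng(0)
gen = rng.normal(size=(200, 4))   # toy stand-in for generated embeddings
mu_gen, sigma_gen = gaussian_moments(gen)

# Same covariance, mean shifted by 1 in each of 4 dimensions: distance is 4.
print(frechet_distance(mu_gen + 1.0, sigma_gen, mu_gen, sigma_gen))
```

The covariance estimate needs at least two samples per side, which is consistent with the skip condition described above.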

Classes
fastvideo.eval.metrics.audio.frechet_distance.metric.FrechetAudioDistanceMetric
FrechetAudioDistanceMetric()

Bases: BaseMetric

Corpus-vs-corpus Frechet Audio Distance with PaSST embeddings.

Source code in fastvideo/eval/metrics/audio/frechet_distance/metric.py
def __init__(self) -> None:
    super().__init__()
    self._model: Any = None
    self._gen_buf: list[np.ndarray] = []
    self._ref_buf: list[np.ndarray] = []
    self._cached_ref_path: str | None = os.environ.get(REF_FEATURES_ENV)
    self._cached_ref_mu: np.ndarray | None = None
    self._cached_ref_sigma: np.ndarray | None = None
    self._n_cached_ref: int = 0
Functions
fastvideo.eval.metrics.audio.imagebind_score
Modules
fastvideo.eval.metrics.audio.imagebind_score.metric

ImageBind audio↔video cosine similarity (IB-Score).

Per-sample. Reads sample["video_path"] (or sample["video"].source for a :class:Video wrapper) and sample["audio"]; ImageBind decodes its own clips so the path is required, not the pool-decoded tensor.

Classes
fastvideo.eval.metrics.audio.imagebind_score.metric.ImageBindScoreMetric
ImageBindScoreMetric()

Bases: BaseMetric

ImageBind audio↔video cosine similarity, per-sample.

Source code in fastvideo/eval/metrics/audio/imagebind_score/metric.py
def __init__(self) -> None:
    super().__init__()
    self._model: Any = None
Functions
fastvideo.eval.metrics.audio.kl_divergence
Modules
fastvideo.eval.metrics.audio.kl_divergence.metric

PaSST KL divergence KL(gt || pred) on AudioSet-527 logits.

Ports av_bench.metrics.kl.compute_kl 1:1. The primary score is the softmax variant (reported as "MKL" / "KL_PaSST" by V2A papers); the sigmoid variant is exposed in details["kl_sigmoid"].
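The softmax variant can be illustrated with a few lines of standard-library Python. This is a sketch of KL(gt || pred) over class probabilities, not the ported compute_kl; the epsilon guard and the function names are assumptions made for the example.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(gt_logits, pred_logits, eps=1e-12):
    """KL(gt || pred) = sum_i p_i * log(p_i / q_i), with p, q from softmax."""
    p = softmax(gt_logits)
    q = softmax(pred_logits)
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


# Toy 2-class logits: gt is uniform, pred puts 75% on the second class.
gt = [0.0, 0.0]
pred = [0.0, math.log(3.0)]
print(kl_divergence(gt, pred))  # approximately 0.5 * ln(4/3)
```

The divergence is zero when both sides produce the same distribution and grows as the generated audio's class distribution drifts from the ground truth.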

Classes
fastvideo.eval.metrics.audio.kl_divergence.metric.KLDivergenceMetric
KLDivergenceMetric()

Bases: BaseMetric

PaSST KL divergence KL(gt || pred) on AudioSet-527 logits.

Per-sample. Requires sample["audio"] (generated) and sample["reference_audio"] (ground truth).

Source code in fastvideo/eval/metrics/audio/kl_divergence/metric.py
def __init__(self) -> None:
    super().__init__()
    self._model: Any = None
Functions
fastvideo.eval.metrics.audio.wer
Modules
fastvideo.eval.metrics.audio.wer.metric

Word Error Rate for generated audio.

Per-sample. Default backend is Whisper-base; glm_asr (vendored under third_party/eval/glmasr/) and sensevoice (FunASR, install-on-demand) are selectable via asr_backend=. Text is normalized per the MagiHuman convention and CJK inputs are scored at the character level.
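The scoring step after transcription is an edit distance divided by the reference length. A minimal sketch, assuming the texts are already normalized (the MagiHuman normalization is not reproduced here) and using hypothetical function names:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = cur
    return prev[-1]


def error_rate(reference, hypothesis, char_level=False):
    """Word-level rate by default; character-level for CJK-style text."""
    if char_level:
        ref = list(reference.replace(" ", ""))
        hyp = list(hypothesis.replace(" ", ""))
    else:
        ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference text is empty")
    return edit_distance(ref, hyp) / len(ref)


# One substitution and one deletion against a six-word reference.
print(error_rate("the cat sat on the mat", "the cat sit on mat"))
# One wrong character out of four.
print(error_rate("你好世界", "你好世间", char_level=True))
```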

Classes
fastvideo.eval.metrics.audio.wer.metric.WERMetric
WERMetric(asr_backend: str = 'whisper', model_name: str | None = None, instruction: str = 'Please transcribe this audio into text')

Bases: BaseMetric

WER metric with selectable ASR backend.

Sample fields: audio, reference_text (or text_prompt), optional language. instruction is passed to GLM-ASR.

Source code in fastvideo/eval/metrics/audio/wer/metric.py
def __init__(
    self,
    asr_backend: str = "whisper",
    model_name: str | None = None,
    instruction: str = "Please transcribe this audio into text",
) -> None:
    super().__init__()
    if asr_backend not in ("whisper", "glm_asr", "sensevoice"):
        raise ValueError(f"Unknown ASR backend '{asr_backend}'. "
                         "Supported: ['whisper', 'glm_asr', 'sensevoice']")
    self._asr_backend = asr_backend
    self._model_name = model_name
    self._instruction = instruction
    self._model: Any = None
    self._processor: Any = None
Functions

fastvideo.eval.metrics.base

Classes

fastvideo.eval.metrics.base.BaseMetric
BaseMetric()

Abstract base class for all eval metrics.

Two execution shapes:

  • Per-sample (is_set_metric=False, default) — implement :meth:compute. The Evaluator calls it once per input sample and returns one :class:MetricResult per sample.
  • Set-vs-set (is_set_metric=True) — implement :meth:accumulate (called once per sample to buffer features) and :meth:finalize (called once after all samples to compute the corpus-level result). Use :meth:reset to clear buffers and :meth:merge_from to fold multi-GPU per-worker state together.

Optionally override :meth:setup to eagerly load models. Metrics that chunk along the time dim for memory hardcode their own chunk size in __init__ (see optical_flow for the canonical example). Eval always processes one video per :meth:Evaluator.evaluate call; compute / accumulate receive a single sample, not a batch.
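The two execution shapes can be illustrated with a self-contained toy. Everything below is a hypothetical stand-in written for this example: ToyResult mimics the role of MetricResult, the classes do not inherit from BaseMetric, and sample["video"] is a flat list of floats rather than a (T, C, H, W) tensor.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToyResult:
    """Hypothetical stand-in for MetricResult (name, score, details)."""
    name: str
    score: Optional[float]
    details: dict = field(default_factory=dict)


def mean(values):
    return sum(values) / len(values)


class ToyPerSampleMetric:
    """Per-sample shape: one result per compute() call."""
    name = "toy.brightness"
    is_set_metric = False

    def compute(self, sample: dict) -> ToyResult:
        if "video" not in sample:
            # Missing input yields a skipped result instead of raising.
            return ToyResult(self.name, None, {"skipped": "missing video"})
        return ToyResult(self.name, mean(sample["video"]))


class ToySetMetric:
    """Set-vs-set shape: buffer in accumulate(), score once in finalize()."""
    name = "toy.corpus_brightness"
    is_set_metric = True

    def __init__(self) -> None:
        self._buf: list = []

    def reset(self) -> None:
        self._buf = []

    def accumulate(self, sample: dict) -> None:
        self._buf.append(mean(sample["video"]))

    def merge_from(self, other: "ToySetMetric") -> None:
        self._buf.extend(other._buf)

    def finalize(self) -> ToyResult:
        if not self._buf:
            return ToyResult(self.name, None, {"skipped": "nothing accumulated"})
        return ToyResult(self.name, mean(self._buf), {"n": len(self._buf)})


# Two workers each see part of the corpus; worker 0 folds in worker 1's state.
samples = [{"video": [0.0, 1.0]}, {"video": [0.5, 0.5]}, {"video": [1.0, 1.0]}]
w0, w1 = ToySetMetric(), ToySetMetric()
for s in samples[:2]:
    w0.accumulate(s)
w1.accumulate(samples[2])
w0.merge_from(w1)
corpus = w0.finalize()
per_sample = [ToyPerSampleMetric().compute(s) for s in samples]
print(corpus.score, [r.score for r in per_sample])
```

The per-sample metric returns three results, one per input, while the set metric returns a single corpus-level result after the merge.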

Source code in fastvideo/eval/metrics/base.py
def __init__(self) -> None:
    self._device: torch.device = torch.device("cpu")
Functions
fastvideo.eval.metrics.base.BaseMetric.accumulate
accumulate(sample: dict) -> None

Buffer per-sample features for a corpus-level metric.

Source code in fastvideo/eval/metrics/base.py
def accumulate(self, sample: dict) -> None:
    """Buffer per-sample features for a corpus-level metric."""
    raise NotImplementedError(f"{type(self).__name__}.accumulate is not implemented")
fastvideo.eval.metrics.base.BaseMetric.compute
compute(sample: dict) -> MetricResult

Per-sample metrics: compute the score for one sample.

sample["video"] is (T, C, H, W) float in [0, 1]. sample["reference"] (if used) has the same shape. Return self._skip(sample, reason) for missing inputs.

Source code in fastvideo/eval/metrics/base.py
def compute(self, sample: dict) -> MetricResult:
    """Per-sample metrics: compute the score for one sample.

    ``sample["video"]`` is ``(T, C, H, W)`` float in ``[0, 1]``.
    ``sample["reference"]`` (if used) has the same shape. Return
    ``self._skip(sample, reason)`` for missing inputs.
    """
    raise NotImplementedError(f"{type(self).__name__}.compute is not implemented")
fastvideo.eval.metrics.base.BaseMetric.finalize
finalize() -> MetricResult

Compute the corpus-level result from buffered state.

Source code in fastvideo/eval/metrics/base.py
def finalize(self) -> MetricResult:
    """Compute the corpus-level result from buffered state."""
    raise NotImplementedError(f"{type(self).__name__}.finalize is not implemented")
fastvideo.eval.metrics.base.BaseMetric.merge_from
merge_from(other: BaseMetric) -> None

Multi-GPU: fold another worker's accumulator state into this one.

Source code in fastvideo/eval/metrics/base.py
def merge_from(self, other: BaseMetric) -> None:  # noqa: B027 - intentionally optional override
    """Multi-GPU: fold another worker's accumulator state into this one."""
fastvideo.eval.metrics.base.BaseMetric.reset
reset() -> None

Clear accumulator state at the start of each evaluate() call.

Source code in fastvideo/eval/metrics/base.py
def reset(self) -> None:  # noqa: B027 - intentionally optional override
    """Clear accumulator state at the start of each evaluate() call."""
fastvideo.eval.metrics.base.BaseMetric.setup
setup() -> None

Eagerly load models. Called once by :class:EvalWorker.

Default is a no-op; metrics with no eager state (pixel math, closed-form ops) inherit this. Override only if your metric needs to load weights.

Source code in fastvideo/eval/metrics/base.py
def setup(self) -> None:  # noqa: B027 - intentionally optional override
    """Eagerly load models. Called once by :class:`EvalWorker`.

    Default is a no-op; metrics with no eager state (pixel math,
    closed-form ops) inherit this. Override only if your metric
    needs to load weights.
    """
fastvideo.eval.metrics.base.BaseMetric.to
to(device: str | device) -> BaseMetric

Move metric (and its internal models) to device.

Source code in fastvideo/eval/metrics/base.py
def to(self, device: str | torch.device) -> BaseMetric:
    """Move metric (and its internal models) to *device*."""
    self._device = torch.device(device)
    return self

fastvideo.eval.metrics.common

Modules

fastvideo.eval.metrics.common.fvd
Modules
fastvideo.eval.metrics.common.fvd.extractors

Pluggable video feature extractors for FVD.

Three extractors, sharing the _BaseExtractor contract:

  • i3d — Kinetics-400 I3D (TorchScript, flateon/FVD-I3D-torchscript). The standard FVD feature space used in the literature.
  • clip — CLIP ViT-B/32 per-frame embeddings, mean-pooled over time. Captures semantic / content quality.
  • videomae — VideoMAE-base last-hidden-state, mean-pooled over patch tokens. Captures structural / motion quality.

The contract is intentionally narrow: each extractor takes a (B, T, C, H, W) float tensor in [0, 1] and returns (B, D) numpy features. Preprocessing (resize, normalize, layout) is the extractor's job; its callers should not care.

Functions
fastvideo.eval.metrics.common.fvd.extractors.load_extractor
load_extractor(name: str, device: device) -> _BaseExtractor

Instantiate the named extractor on device. Raises ValueError on unknown names.

Source code in fastvideo/eval/metrics/common/fvd/extractors.py
def load_extractor(name: str, device: torch.device) -> _BaseExtractor:
    """Instantiate the named extractor on *device*. Raises ``ValueError`` on unknown names."""
    cls = _EXTRACTORS.get(name)
    if cls is None:
        raise ValueError(f"Unknown FVD extractor '{name}'. Available: {available_extractors()}")
    return cls(device)
fastvideo.eval.metrics.common.fvd.metric

common.fvd — Fréchet Video Distance.

Measures distributional similarity between generated and reference videos using a pluggable feature backbone (I3D / CLIP / VideoMAE). Lower score → generated videos are closer to the real video distribution.

FVD is a dataset-level metric — it compares the distribution of a collection of videos rather than scoring each video individually. For statistically reliable results at least 256 videos are recommended; the standard protocol uses 2048.

This metric follows the set-vs-set protocol (is_set_metric=True):

  • :meth:accumulate is called once per video to buffer features. It routes on sample["role"]: "reference" → real buffer, anything else → generated buffer. This is the same convention :class:audio.frechet_distance.FrechetAudioDistanceMetric uses.
  • :meth:finalize is called once after all videos to compute FVD.
  • :meth:reset clears both buffers between evaluation runs.

To score two corpora, build a samples list with :func:fastvideo.eval.io.inputs.samples_from and hand it to :meth:Evaluator.evaluate — the path-friendly form is::

from fastvideo.eval import create_evaluator, samples_from
ev = create_evaluator(metrics=["common.fvd"], device="cuda:0")
result = ev.evaluate(samples=samples_from(
    video="gen/", reference="ref/",
)).corpus["common.fvd"]
Extractors
  • i3d (default) — Kinetics-400 I3D, the standard FVD feature space used in the literature.
  • clip — CLIP ViT-B/32 per-frame embeds, mean-pooled over time. Captures semantic / content quality.
  • videomae — VideoMAE-base last-hidden-state, mean-pooled over patches. Captures structural / motion quality.

CLIP and VideoMAE are research-grade and not directly comparable to FVD scores from the literature; use i3d for paper comparisons.

Reference features

Two ways to supply reference features:

  1. Stream: pass reference= to :func:samples_from. Equal cardinality emits paired samples (sample["video"] + sample["reference"]); unequal cardinality role-tags the unmatched references. accumulate routes both shapes correctly. If cache_mode="read_write" (default) and no cache exists, the extracted features are persisted to cache_path on :meth:finalize.
  2. Cache hit: a prior run wrote cache_path; :meth:setup loads it automatically and streamed references are unnecessary.

Streamed references always win over the cache when both are present.

Cache resolution order (first match wins):

  1. cache_path= constructor kwarg, if set.
  2. $FASTVIDEO_FVD_REF_FEATURES, if set.
  3. ${FASTVIDEO_EVAL_CACHE}/fvd/real_features_{extractor}.pt (default).
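The resolution order can be sketched as a small function. This is an illustration only: resolve_cache_path is a hypothetical name, the env parameter exists to make the example testable, and the sketch requires FASTVIDEO_EVAL_CACHE to be set because the fallback used when it is unset is not described here.

```python
import os


def resolve_cache_path(cache_path, extractor, env=None):
    """Return the reference-feature cache path; first match wins."""
    env = os.environ if env is None else env
    # 1. Explicit constructor argument.
    if cache_path:
        return cache_path
    # 2. Environment override.
    override = env.get("FASTVIDEO_FVD_REF_FEATURES")
    if override:
        return override
    # 3. Default under the eval cache root (assumed to be set in this sketch).
    root = env["FASTVIDEO_EVAL_CACHE"]
    return os.path.join(root, "fvd", f"real_features_{extractor}.pt")


env = {"FASTVIDEO_EVAL_CACHE": "/data/cache"}
print(resolve_cache_path(None, "i3d", env))
print(resolve_cache_path(None, "clip", dict(env, FASTVIDEO_FVD_REF_FEATURES="/refs/fvd.pt")))
print(resolve_cache_path("/explicit.pt", "clip", env))
```

Reading the environment at call time rather than at import time matches the note that the default is resolved during setup, so the variables can be set after import.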

The env-var override mirrors audio.frechet_distance's FASTVIDEO_FAD_REF_FEATURES pattern.

Classes
fastvideo.eval.metrics.common.fvd.metric.FVDMetric
FVDMetric(extractor: str = 'i3d', cache_path: str | None = None, cache_mode: str = 'read_write', chunk_size: int = 32)

Bases: BaseMetric

Fréchet Video Distance (FVD) over a pluggable feature backbone.

Set-vs-set metric (is_set_metric=True). The Evaluator calls :meth:accumulate once per video and :meth:finalize once after all videos have been processed.

For meaningful scores, evaluate over ≥ 256 videos (2048 is the standard protocol used in the literature).

Parameters

extractor : str
    Feature backbone name. One of "i3d" (default), "clip", "videomae". See module docstring for guidance.
cache_path : str, optional
    Where extracted reference features are cached. Resolution order: constructor kwarg → $FASTVIDEO_FVD_REF_FEATURES env-var → ${FASTVIDEO_EVAL_CACHE}/fvd/real_features_{extractor}.pt (default, resolved at :meth:setup time so the env-var can be set after import).
cache_mode : str
    "read_write" (default): read at setup, write at finalize on cache miss. "read": read at setup, never write. "off": ignore the cache entirely. Auto-write happens only when the cache was missing AND reference features streamed in this run.
chunk_size : int
    Videos per forward pass. Reduce if the GPU runs out of memory.

Source code in fastvideo/eval/metrics/common/fvd/metric.py
def __init__(
    self,
    extractor: str = "i3d",
    cache_path: str | None = None,
    cache_mode: str = "read_write",
    chunk_size: int = 32,
) -> None:
    super().__init__()
    if extractor not in available_extractors():
        raise ValueError(f"Unknown FVD extractor '{extractor}'. "
                         f"Available: {available_extractors()}")
    if cache_mode not in {"off", "read", "read_write"}:
        raise ValueError(f"cache_mode must be one of 'off', 'read', 'read_write'; got {cache_mode!r}")
    self._extractor_name = extractor
    self._cache_path_arg = cache_path
    self._cache_mode = cache_mode
    self._chunk = chunk_size
    self._extractor: _BaseExtractor | None = None

    # Accumulated feature buffers — cleared by reset()
    self._gen_buf: list[np.ndarray] = []
    self._real_buf: list[np.ndarray] = []
    # Cache loaded at setup() time; survives reset() (the cached corpus
    # is part of the metric's configured state, not its run state).
    self._cached_real: np.ndarray | None = None
Functions
fastvideo.eval.metrics.common.fvd.metric.FVDMetric.accumulate
accumulate(sample: dict) -> None

Extract features from one video and route by sample["role"].

Routing rules:

  • role="reference" → real buffer; the reference key is ignored.
  • Otherwise → generated buffer. If sample["reference"] is also present, its features also go into the real buffer — this lets a paired samples list (the shape per-sample metrics like LPIPS need) feed FVD in the same pass without a second decode.

Accepts either a raw tensor or a populated :class:Video instance under either key.

Source code in fastvideo/eval/metrics/common/fvd/metric.py
def accumulate(self, sample: dict) -> None:
    """Extract features from one video and route by ``sample["role"]``.

    Routing rules:

    * ``role="reference"`` → real buffer; the reference key is ignored.
    * Otherwise → generated buffer.  If ``sample["reference"]`` is
      also present, its features also go into the real buffer — this
      lets a paired samples list (the shape per-sample metrics like
      LPIPS need) feed FVD in the same pass without a second decode.

    Accepts either a raw tensor or a populated :class:`Video` instance
    under either key.
    """
    if self._extractor is None:
        self.setup()
    assert self._extractor is not None  # for type narrowing

    if sample.get("role") == "reference":
        self._real_buf.append(self._extract(sample["video"]))
        return

    self._gen_buf.append(self._extract(sample["video"]))
    if "reference" in sample:
        self._real_buf.append(self._extract(sample["reference"]))
fastvideo.eval.metrics.common.fvd.metric.FVDMetric.finalize
finalize() -> MetricResult

Compute FVD from buffered generated features vs. reference features.

Reference source priority: streamed (_real_buf) > disk cache (_cached_real). On a fresh cache + streamed refs, write the accumulated real features back to disk if cache_mode="read_write".

Source code in fastvideo/eval/metrics/common/fvd/metric.py
def finalize(self) -> MetricResult:
    """Compute FVD from buffered generated features vs. reference features.

    Reference source priority: streamed (``_real_buf``) > disk cache
    (``_cached_real``).  On a fresh cache + streamed refs, write the
    accumulated real features back to disk if ``cache_mode="read_write"``.
    """
    if not self._gen_buf:
        return MetricResult(
            name=self.name,
            score=None,
            details={"skipped": "No generated videos accumulated before finalize()."},
        )
    all_gen = np.concatenate(self._gen_buf, axis=0)

    if self._real_buf:
        all_real = np.concatenate(self._real_buf, axis=0)
        ref_source = "streamed"
        # Persist to cache only when we built real features fresh and
        # the cache slot was empty — never silently overwrite an
        # existing cache.
        if self._cache_mode == "read_write" and self._cached_real is None:
            self._save_cache(all_real)
    elif self._cached_real is not None:
        all_real = self._cached_real
        ref_source = "cached"
    else:
        return MetricResult(
            name=self.name,
            score=None,
            details={
                "skipped": ("No reference features available. Pass role='reference' samples "
                            "(or a paired samples list with 'reference' key) to "
                            f"Evaluator.evaluate(), or pre-build the cache at: {self.cache_path}")
            },
        )

    n_gen = len(all_gen)
    n_real = len(all_real)

    if n_gen < _MIN_VIDEOS_WARN or n_real < _MIN_VIDEOS_WARN:
        warnings.warn(
            f"FVD computed with only {n_gen} generated and {n_real} real videos. "
            f"At least {_MIN_VIDEOS_WARN} recommended (standard protocol: 2048). "
            "Score may not be statistically reliable.",
            stacklevel=2,
        )

    mu_gen, sigma_gen = _gaussian_params(all_gen)
    mu_real, sigma_real = _gaussian_params(all_real)
    fvd = _frechet_distance(mu_gen, sigma_gen, mu_real, sigma_real)

    return MetricResult(
        name=self.name,
        score=fvd,
        details={
            "extractor": self._extractor_name,
            "n_generated": n_gen,
            "n_reference": n_real,
            "ref_source": ref_source,
        },
    )
fastvideo.eval.metrics.common.fvd.metric.FVDMetric.merge_from
merge_from(other: BaseMetric) -> None

Fold another worker's accumulated features into this one (multi-GPU).

Both gen and ref buffers concatenate. The disk cache is the same across workers, so we only adopt other's _cached_real if our own slot is empty (defensive — should always already match).

Source code in fastvideo/eval/metrics/common/fvd/metric.py
def merge_from(self, other: BaseMetric) -> None:
    """Fold another worker's accumulated features into this one (multi-GPU).

    Both gen and ref buffers concatenate.  The disk cache is the same
    across workers, so we only adopt *other*'s ``_cached_real`` if our
    own slot is empty (defensive — should always already match).
    """
    assert isinstance(other, FVDMetric)
    assert other._extractor_name == self._extractor_name, (
        f"merge_from extractor mismatch: {other._extractor_name!r} vs {self._extractor_name!r}")
    self._gen_buf.extend(other._gen_buf)
    self._real_buf.extend(other._real_buf)
    if self._cached_real is None and other._cached_real is not None:
        self._cached_real = other._cached_real
fastvideo.eval.metrics.common.fvd.metric.FVDMetric.reset
reset() -> None

Clear generated and streamed-reference buffers. Cache survives.

Source code in fastvideo/eval/metrics/common/fvd/metric.py
def reset(self) -> None:
    """Clear generated and streamed-reference buffers. Cache survives."""
    self._gen_buf = []
    self._real_buf = []
Functions
fastvideo.eval.metrics.common.ssim
Modules
fastvideo.eval.metrics.common.ssim.metric
Classes Functions

fastvideo.eval.metrics.optical_flow

Modules

fastvideo.eval.metrics.optical_flow.gt_optical_flow
Modules
fastvideo.eval.metrics.optical_flow.gt_optical_flow.metric

Compare optical flow extracted from a generated video against optical flow extracted from a ground-truth reference video.

Both flows are produced by the same ptlflow model (default dpflow/things). The resulting per-pixel / per-frame / temporal metric set is identical to synthetic_optical_flow — only the way the reference flow is constructed differs.

Classes
fastvideo.eval.metrics.optical_flow.gt_optical_flow.metric.GtOpticalFlowMetric
GtOpticalFlowMetric(model_name: str = 'dpflow', ckpt: str = 'things', min_mag: float = 0.5, max_mag_pct: float = 80.0, grid_size: int = 8)

Bases: BaseMetric

Per-pixel / per-frame / temporal flow comparison vs. a reference video.

The headline score is pixel_epe_mean_mean (lower is better); every other scalar lives in details so downstream consumers can pick whichever one they care about.

Source code in fastvideo/eval/metrics/optical_flow/gt_optical_flow/metric.py
def __init__(
    self,
    model_name: str = "dpflow",
    ckpt: str = "things",
    min_mag: float = 0.5,
    max_mag_pct: float = 80.0,
    grid_size: int = 8,
) -> None:
    super().__init__()
    self.model_name = model_name
    self.ckpt = ckpt
    self.min_mag = min_mag
    self.max_mag_pct = max_mag_pct
    self.grid_size = grid_size
    self._model = None
    # One frame-pair per DPFlow forward: the cost volume is ~4 GB
    # per pair at 1080p, so batching OOMs. Bump for low-res inputs.
    self._chunk_size = 1
Functions
fastvideo.eval.metrics.optical_flow.synthetic_optical_flow
Modules
fastvideo.eval.metrics.optical_flow.synthetic_optical_flow.metric

Compare optical flow extracted from a generated video against optical flow synthesized analytically from per-frame actions.

The reference flow is not observed from a ground-truth video — it's predicted from the action stream via a third-person camera-kinematics model (Longuet-Higgins linearization + off-pivot translation correction; no depth). Observed flow comes from the same ptlflow model used by gt_optical_flow, and the two are compared with the identical metric set, so scores are directly comparable across the two metrics.

Required sample keys

video : (B, T, C, H, W) float in [0, 1].

actions : dict (or list-of-dicts of length B) with two np.ndarray keys:

  • keyboard of shape (T, 6): [W, S, A, D, turn_left, turn_right]
  • mouse of shape (T, 2): [pitch, yaw]

calibration : Either a path to a ThirdPersonCalibration JSON file, or a dict of fitted parameters. May also be set once at construction time via calibration_path= and reused across samples.

Optional sample keys

mouse_pitch_sign : +1 (default), or -1 if the dataset's mouse-pitch sign is flipped (mhuo's data carries this in metadata).
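A sample with these keys might be assembled as in the sketch below. The helper names, the float32 dtype, the 0/1 key encoding, the yaw value, and the calibration filename are all assumptions for illustration; only the key names, column order, and array shapes come from the description above.

```python
import numpy as np

KEYBOARD_COLUMNS = ["W", "S", "A", "D", "turn_left", "turn_right"]
MOUSE_COLUMNS = ["pitch", "yaw"]


def make_actions(num_frames, held_keys=("W",), yaw_per_frame=0.0):
    """Build a per-frame action dict with the documented shapes."""
    keyboard = np.zeros((num_frames, len(KEYBOARD_COLUMNS)), dtype=np.float32)
    for key in held_keys:
        keyboard[:, KEYBOARD_COLUMNS.index(key)] = 1.0  # assumed 0/1 encoding
    mouse = np.zeros((num_frames, len(MOUSE_COLUMNS)), dtype=np.float32)
    mouse[:, MOUSE_COLUMNS.index("yaw")] = yaw_per_frame
    return {"keyboard": keyboard, "mouse": mouse}


def check_actions(actions, num_frames):
    """Raise ValueError if either array has an unexpected shape."""
    expected = {"keyboard": (num_frames, 6), "mouse": (num_frames, 2)}
    for key, shape in expected.items():
        if actions[key].shape != shape:
            raise ValueError(f"{key}: expected {shape}, got {actions[key].shape}")


T = 8
sample = {
    "video": np.zeros((1, T, 3, 64, 64), dtype=np.float32),  # (B, T, C, H, W)
    "actions": make_actions(T, held_keys=("W", "D"), yaw_per_frame=0.1),
    "calibration": "calibration.json",  # hypothetical path
    "mouse_pitch_sign": 1,
}
check_actions(sample["actions"], T)
print(sample["actions"]["keyboard"][0], sample["actions"]["mouse"][0])
```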

Classes
fastvideo.eval.metrics.optical_flow.synthetic_optical_flow.metric.SyntheticOpticalFlowMetric
SyntheticOpticalFlowMetric(model_name: str = 'dpflow', ckpt: str = 'things', calibration_path: str | Path | None = None, min_mag: float = 0.5, max_mag_pct: float = 80.0, grid_size: int = 8)

Bases: BaseMetric

Action-driven synthetic flow vs. video-extracted observed flow.

Pass calibration_path at construction to bind the calibration once across all samples; otherwise supply sample["calibration"] per call. Missing actions or calibration produce a skipped result (score=None) rather than raising.

Source code in fastvideo/eval/metrics/optical_flow/synthetic_optical_flow/metric.py
def __init__(
    self,
    model_name: str = "dpflow",
    ckpt: str = "things",
    calibration_path: str | Path | None = None,
    min_mag: float = 0.5,
    max_mag_pct: float = 80.0,
    grid_size: int = 8,
) -> None:
    super().__init__()
    self.model_name = model_name
    self.ckpt = ckpt
    self.min_mag = min_mag
    self.max_mag_pct = max_mag_pct
    self.grid_size = grid_size
    self._calibration: ThirdPersonCalibration | None = (_resolve_calibration(calibration_path)
                                                        if calibration_path else None)
    self._model = None
    # One frame-pair per DPFlow forward; see ``gt_optical_flow``.
    self._chunk_size = 1
Functions

fastvideo.eval.metrics.physics_iq

Modules

fastvideo.eval.metrics.physics_iq.utils
Functions
fastvideo.eval.metrics.physics_iq.utils.prepare_pair
prepare_pair(sample: dict[str, Any], *, prep_kwargs: dict[str, Any] | None = None) -> PreparedPhysicsIQPair

Resolve a sample into a prepared (gen, ref) pair.

Caches the result on sample['_physics_iq_pair'] so other physics_iq sub-metrics on the same sample reuse it instead of re-decoding.

Source code in fastvideo/eval/metrics/physics_iq/utils.py
def prepare_pair(
    sample: dict[str, Any],
    *,
    prep_kwargs: dict[str, Any] | None = None,
) -> PreparedPhysicsIQPair:
    """Resolve a sample into a prepared (gen, ref) pair.

    Caches the result on ``sample['_physics_iq_pair']`` so other physics_iq
    sub-metrics on the same sample reuse it instead of re-decoding.
    """
    prepared = sample.get("_physics_iq_pair")
    if prepared is not None:
        return prepared

    if "reference" not in sample:
        raise KeyError("Physics-IQ pair metrics require sample['reference'].")

    prepared = prepare_pair_inputs(
        sample["video"],
        sample["reference"],
        generated_mask=sample.get("video_mask"),
        reference_mask=sample.get("reference_mask"),
        **(prep_kwargs or {}),
    )
    sample["_physics_iq_pair"] = prepared
    return prepared

fastvideo.eval.metrics.vbench

VBench metrics. Bootstraps upstream submodule on sys.path and installs runtime compat shims for modern torch/transformers/numpy/timm.

The upstream vbench source lives as a git submodule at fastvideo/third_party/eval/vbench (pinned to a specific Vchitect/VBench SHA). We do not pip-install it — we only need its Python modules importable. Its runtime deps (clip, transformers, etc.) are already in FastVideo's main env.

Compat with modern dependency versions is achieved at import time, in this file, instead of via on-disk patches to upstream files. Each shim below corresponds to a specific drift between vbench's pinned-2023 deps and FastVideo's current pins. Adding a new shim is preferable to editing the submodule.

Modules

fastvideo.eval.metrics.vbench.aesthetic_quality
Modules
fastvideo.eval.metrics.vbench.aesthetic_quality.metric

VBench Aesthetic Quality — CLIP ViT-L/14 + LAION aesthetic predictor.

Encodes frames through CLIP, passes L2-normalized features through a linear aesthetic head (768 → 1), and averages scores / 10.

Classes Functions
fastvideo.eval.metrics.vbench.appearance_style
Modules
fastvideo.eval.metrics.vbench.appearance_style.metric

VBench Appearance Style — CLIP ViT-B/32 text-image alignment.

Per-frame cosine similarity between CLIP image features and a text prompt describing the expected style. Requires sample["text_prompt"].

Classes Functions
fastvideo.eval.metrics.vbench.background_consistency
Modules
fastvideo.eval.metrics.vbench.background_consistency.metric

VBench Background Consistency — CLIP ViT-B/32 temporal feature similarity.

Measures background stability via cosine similarity of CLIP features between consecutive frames and the first frame.
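One way to combine the two comparisons is sketched below. It assumes that, for each frame after the first, the similarity to the previous frame and the similarity to the first frame are clamped at zero and averaged with equal weight, and that the video score is the mean over frames; these aggregation details are assumptions, and the feature vectors are toy values rather than CLIP outputs.

```python
import math


def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def background_consistency(features):
    """Mean of per-frame (previous-frame, first-frame) similarity averages."""
    if len(features) < 2:
        raise ValueError("need at least two frames")
    first = features[0]
    sims = []
    for prev, cur in zip(features, features[1:]):
        sim_prev = max(0.0, cosine(prev, cur))    # consecutive-frame term
        sim_first = max(0.0, cosine(first, cur))  # first-frame anchor term
        sims.append((sim_prev + sim_first) / 2.0)
    return sum(sims) / len(sims)


# Frame 1 jumps away from frame 0; frame 2 then stays put.
frames = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
print(background_consistency(frames))
```

The first-frame term penalizes slow drift that consecutive-frame similarity alone would miss.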

Classes Functions
fastvideo.eval.metrics.vbench.color
Modules
fastvideo.eval.metrics.vbench.color.metric

VBench Color — GRiT dense captioning for color accuracy.

Detects the target object via GRiT and checks if the expected color keyword appears in the object's caption. Score = frames_with_correct_color / frames_with_object_detected.

Classes Functions
fastvideo.eval.metrics.vbench.dynamic_degree
Modules
fastvideo.eval.metrics.vbench.dynamic_degree.metric

VBench Dynamic Degree — RAFT optical flow motion detection.

For each consecutive frame pair, computes optical flow via RAFT and takes the mean of the top 5% flow magnitudes. If enough pairs exceed an adaptive threshold, the video is classified as dynamic (1.0); otherwise it is classified as static (0.0).
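The decision rule can be sketched without RAFT, assuming the per-pixel flow magnitudes for each frame pair are already computed. The helper names and the `magnitude_threshold` and `min_pairs` parameters are placeholders; how the adaptive threshold is derived is not shown here.

```python
def top_percent_mean(values, percent=5):
    """Mean of the largest `percent` percent of values (at least one value)."""
    k = max(1, len(values) * percent // 100)
    top = sorted(values, reverse=True)[:k]
    return sum(top) / len(top)


def dynamic_degree(pair_magnitudes, magnitude_threshold, min_pairs):
    """Return 1.0 (dynamic) if at least `min_pairs` frame pairs exceed
    the threshold, else 0.0 (static)."""
    moving = sum(
        top_percent_mean(mags) > magnitude_threshold for mags in pair_magnitudes
    )
    return 1.0 if moving >= min_pairs else 0.0


# One static pair followed by two pairs with strong motion.
pairs = [[0.0] * 20, [10.0] * 20, [10.0] * 20]
label = dynamic_degree(pairs, magnitude_threshold=6.0, min_pairs=2)
```

Averaging only the largest magnitudes lets a small moving object register as motion even when most of the frame is still.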

Classes
fastvideo.eval.metrics.vbench.dynamic_degree.metric.DynamicDegreeMetric
DynamicDegreeMetric()

Bases: BaseMetric

Source code in fastvideo/eval/metrics/vbench/dynamic_degree/metric.py
def __init__(self) -> None:
    super().__init__()
    self._model: Any = None
    self._chunk_size = 16
Functions
fastvideo.eval.metrics.vbench.human_action
Modules
fastvideo.eval.metrics.vbench.human_action.metric

VBench Human Action — UMT ViT-L/16 action classification (Kinetics-400).

Classifies human actions in 16-frame clips. The top-5 predictions with confidence >= 0.85 are compared against the ground-truth action label. Score = 1.0 if a match is found, 0.0 otherwise.
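The matching rule can be sketched on top of classifier output, assuming a mapping from class label to confidence is available. The function name and the exact-string label comparison are assumptions for the example.

```python
def human_action_score(class_probs, target, top_k=5, min_confidence=0.85):
    """class_probs: dict mapping action label -> confidence in [0, 1]."""
    # Keep the top-k predictions, then drop those below the confidence bar.
    ranked = sorted(class_probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    confident = {label for label, prob in ranked if prob >= min_confidence}
    return 1.0 if target in confident else 0.0


probs = {"running": 0.90, "jogging": 0.06, "walking": 0.04}
hit = human_action_score(probs, "running")
miss = human_action_score(probs, "jogging")  # in the top 5 but below 0.85
```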

Classes Functions
fastvideo.eval.metrics.vbench.imaging_quality
Modules
fastvideo.eval.metrics.vbench.imaging_quality.metric

VBench Imaging Quality — MUSIQ-based per-frame technical quality.

Uses MUSIQ (Multi-Scale Image Quality) from pyiqa. Frames are resized so the longer side is at most 512px. Score = mean(MUSIQ_scores) / 100.

Classes Functions
fastvideo.eval.metrics.vbench.motion_smoothness
Modules
fastvideo.eval.metrics.vbench.motion_smoothness.metric

VBench Motion Smoothness — AMT-S frame interpolation quality.

Takes every other frame, uses AMT-S to interpolate the missing middle frames, then compares the interpolated frames with the actual ones. Score = (255 - mean_pixel_diff) / 255. Higher = smoother motion.
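The procedure can be sketched with a stand-in interpolator. Here frames are flat lists of pixel values in [0, 255], and the `midpoint` function is a hypothetical placeholder for AMT-S (a plain average of the two neighbors), used only so the example runs.

```python
def mean_abs_diff(a, b):
    """Mean absolute per-pixel difference between two frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def midpoint(prev_frame, next_frame):
    """Placeholder interpolator: per-pixel average of the two neighbors."""
    return [(p + n) / 2 for p, n in zip(prev_frame, next_frame)]


def motion_smoothness(frames, interpolate=midpoint):
    kept = frames[::2]  # frames 0, 2, 4, ...
    diffs = []
    for i in range(len(kept) - 1):
        predicted = interpolate(kept[i], kept[i + 1])
        actual = frames[2 * i + 1]  # the dropped frame between the two
        diffs.append(mean_abs_diff(predicted, actual))
    if not diffs:
        raise ValueError("need at least three frames")
    return (255 - sum(diffs) / len(diffs)) / 255


smooth = motion_smoothness([[0, 0], [10, 10], [20, 20]])
jerky = motion_smoothness([[0], [255], [0]])
```

Motion that the interpolator can predict from its neighbors scores near 1.0; an abrupt jump in a dropped frame pulls the score toward 0.0.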

Classes
fastvideo.eval.metrics.vbench.motion_smoothness.metric.MotionSmoothnessMetric
MotionSmoothnessMetric()

Bases: BaseMetric

Source code in fastvideo/eval/metrics/vbench/motion_smoothness/metric.py
def __init__(self) -> None:
    super().__init__()
    self._model: Any = None
    self._embt: Any = None
    self._chunk_size = 8
Functions
fastvideo.eval.metrics.vbench.multiple_objects
Modules
fastvideo.eval.metrics.vbench.multiple_objects.metric

VBench Multiple Objects — GRiT detection for dual-object presence.

Checks if BOTH target objects are detected in each of 16 sampled frames. Score = matching_frames / total_frames.
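The per-frame check reduces to a subset test, assuming the detected class labels for each sampled frame are already available. The function name and input layout are assumptions for the example; passing a single target gives the single-object variant of the same ratio.

```python
def presence_score(frame_labels, targets):
    """frame_labels: one collection of detected class names per sampled frame.
    A frame matches only if every target class is present in it."""
    if not frame_labels:
        return 0.0
    required = set(targets)
    hits = sum(required <= set(labels) for labels in frame_labels)
    return hits / len(frame_labels)


frames = [{"dog", "cat"}, {"dog"}, {"cat", "dog", "car"}, set()]
both = presence_score(frames, ["dog", "cat"])
single = presence_score(frames, ["dog"])
```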

Classes Functions
fastvideo.eval.metrics.vbench.object_class
Modules
fastvideo.eval.metrics.vbench.object_class.metric

VBench Object Class — GRiT object detection for class matching.

Checks if a target object class is detected in each of 16 sampled frames. Score = matching_frames / total_frames.

Classes Functions
fastvideo.eval.metrics.vbench.overall_consistency
Modules
fastvideo.eval.metrics.vbench.overall_consistency.metric

VBench Overall Consistency — ViCLIP text-video alignment.

Encodes 8 sampled video frames via ViCLIP vision encoder and a text prompt via ViCLIP text encoder, then computes cosine similarity.

Classes Functions
fastvideo.eval.metrics.vbench.scene
Modules
fastvideo.eval.metrics.vbench.scene.metric

VLM-based scene matching using AVoCaDO (Qwen2.5-Omni).

Replaces VBench's Tag2Text-based scene metric with a modern VLM caption. The algorithm follows VBench:

1. Caption the video.
2. Check whether all scene keywords appear in the caption.
3. Score = 1.0 if all match, 0.0 otherwise.

Unlike VBench (which captions each frame separately with Tag2Text), AVoCaDO captions the entire video in one pass with rich natural language, making the keyword check more robust.
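The keyword check at the end of that procedure can be sketched independently of the captioning model. The function name and the case-insensitive whole-word matching are assumptions for the example; the caption string stands in for model output.

```python
import re


def scene_score(caption, keywords):
    """Return 1.0 only if every keyword occurs as a word in the caption."""
    words = set(re.findall(r"[a-z0-9]+", caption.lower()))
    return 1.0 if all(k.lower() in words for k in keywords) else 0.0


caption = "A quiet beach at sunset with gentle waves."
match = scene_score(caption, ["beach", "sunset"])
no_match = scene_score(caption, ["beach", "forest"])
```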

Classes
fastvideo.eval.metrics.vbench.scene.metric.SceneMetric
SceneMetric(model_path: str = 'AVoCaDO-Captioner/AVoCaDO')

Bases: BaseMetric

Source code in fastvideo/eval/metrics/vbench/scene/metric.py
def __init__(self, model_path: str = "AVoCaDO-Captioner/AVoCaDO") -> None:
    super().__init__()
    self._model: Any = None
    self._processor: Any = None
    self._model_path = model_path
Functions
fastvideo.eval.metrics.vbench.spatial_relationship
Modules
fastvideo.eval.metrics.vbench.spatial_relationship.metric

VBench Spatial Relationship — GRiT detection + bbox position scoring.

Detects two target objects via GRiT and checks if their bounding boxes satisfy the expected spatial relationship (left/right/above/below).
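A simplified version of the position test compares box centers. This is a sketch under stated assumptions rather than the metric's scoring: boxes are `(x1, y1, x2, y2)` in image coordinates with y increasing downward, the dominant axis of displacement decides the relation, and the function names are illustrative.

```python
def center(box):
    """Center point of an (x1, y1, x2, y2) bounding box."""
    x1, y1, x2, y2 = box
    return (x1 + x2) / 2, (y1 + y2) / 2


def relation(box_a, box_b):
    """Position of A relative to B: 'left', 'right', 'above' or 'below'."""
    ax, ay = center(box_a)
    bx, by = center(box_b)
    dx, dy = bx - ax, by - ay
    if abs(dx) >= abs(dy):  # horizontal displacement dominates
        return "left" if dx > 0 else "right"
    return "above" if dy > 0 else "below"  # smaller y is higher in the image


def satisfies(box_a, box_b, expected):
    return relation(box_a, box_b) == expected


side = relation((0, 0, 10, 10), (50, 0, 60, 10))
vertical = relation((0, 50, 10, 60), (0, 0, 10, 10))
```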

Classes Functions
fastvideo.eval.metrics.vbench.subject_consistency
Modules
fastvideo.eval.metrics.vbench.subject_consistency.metric

VBench Subject Consistency — DINO ViT-B/16 temporal feature similarity.

Measures how well the main subject maintains its appearance throughout the video by comparing each frame's DINO features, via cosine similarity, with those of the preceding frame and of the first frame.

Classes Functions
fastvideo.eval.metrics.vbench.temporal_flickering
Modules
fastvideo.eval.metrics.vbench.temporal_flickering.metric

VBench Temporal Flickering — measures frame-to-frame stability.

Score = (255 - mean_MAE) / 255, where MAE is computed between consecutive frames in uint8 [0, 255] space. Higher = less flickering.
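The formula can be checked on toy data. In this sketch frames are flat lists of uint8-range pixel values; the function name is an assumption for the example.

```python
def temporal_flickering(frames):
    """(255 - mean MAE between consecutive frames) / 255."""
    if len(frames) < 2:
        raise ValueError("need at least two frames")
    maes = [
        sum(abs(x - y) for x, y in zip(prev, cur)) / len(cur)
        for prev, cur in zip(frames, frames[1:])
    ]
    return (255 - sum(maes) / len(maes)) / 255


# MAEs are 0 and 51, so the mean is 25.5.
score = temporal_flickering([[0, 0], [0, 0], [51, 51]])
```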

Classes Functions
fastvideo.eval.metrics.vbench.temporal_style
Modules
fastvideo.eval.metrics.vbench.temporal_style.metric

VBench Temporal Style — ViCLIP text-video alignment (style focus).

Identical logic to overall_consistency — same ViCLIP cosine similarity. The difference is semantic: overall_consistency measures general prompt alignment while temporal_style measures style consistency over time. VBench uses different prompts for each from its metadata JSON.

Functions

fastvideo.eval.metrics.videoscore2

Modules

fastvideo.eval.metrics.videoscore2.metric

VideoScore2 — VLM-based video quality scoring.

Uses a Qwen2.5-VL model fine-tuned to score generated videos on three dimensions: visual quality, text-to-video alignment, and physical consistency. Scores are derived from token logits using upstream's ll_based_soft_score_normed weighting (1-5 scale).
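To show what a logit-based soft score on a 1-5 scale can look like, the sketch below computes a softmax-weighted expected rating over the five rating tokens. This is a generic technique chosen for illustration; upstream's `ll_based_soft_score_normed` may weight the logits differently, and the function name and input layout here are assumptions.

```python
import math


def expected_rating(logits_by_rating):
    """logits_by_rating: dict mapping rating (1..5) -> logit of its token.
    Returns the probability-weighted mean rating under a softmax."""
    peak = max(logits_by_rating.values())  # subtract the max for stability
    weights = {r: math.exp(l - peak) for r, l in logits_by_rating.items()}
    total = sum(weights.values())
    return sum(r * w for r, w in weights.items()) / total


uniform = expected_rating({1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0, 5: 0.0})
confident = expected_rating({1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0, 5: 100.0})
```

A soft score of this kind separates a video the model rates 4 with high certainty from one where it is split between 3 and 5, which a single decoded digit cannot do.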

Reference: TIGER-AI-Lab/VideoScore2 (vs2_inference.py).

Classes
fastvideo.eval.metrics.videoscore2.metric.VideoScore2Metric
VideoScore2Metric(model_name: str = 'TIGER-Lab/VideoScore2', infer_fps: float = 2.0, max_tokens: int = 1024, temperature: float = 0.7, do_sample: bool = True)

Bases: BaseMetric

VideoScore2: VLM-based video quality scoring (3 dimensions).

Requires sample["text_prompt"] for text-to-video alignment. Supports batched generation for GPU efficiency.

Source code in fastvideo/eval/metrics/videoscore2/metric.py
def __init__(
    self,
    model_name: str = "TIGER-Lab/VideoScore2",
    infer_fps: float = 2.0,
    max_tokens: int = 1024,
    temperature: float = 0.7,
    do_sample: bool = True,
) -> None:
    super().__init__()
    self._model_name = model_name
    self.infer_fps = infer_fps
    self.max_tokens = max_tokens
    self.temperature = temperature
    self.do_sample = do_sample
    self._model: Any = None
    self._processor: Any = None
    self._tokenizer: Any = None
Functions