Skip to content

eval

Classes

fastvideo.eval.BaseMetric

BaseMetric()

Abstract base class for all eval metrics.

Two execution shapes:

  • Per-sample (is_set_metric=False, default) — implement :meth:compute. The Evaluator calls it once per input sample and returns one :class:MetricResult per sample.
  • Set-vs-set (is_set_metric=True) — implement :meth:accumulate (called once per sample to buffer features) and :meth:finalize (called once after all samples to compute the corpus-level result). Use :meth:reset to clear buffers and :meth:merge_from to fold multi-GPU per-worker state together.

Optionally override :meth:setup to eagerly load models. Metrics that chunk along the time dim for memory hardcode their own chunk size in __init__ (see optical_flow for the canonical example). Eval always processes one video per :meth:Evaluator.evaluate call; compute / accumulate receive a single sample, not a batch.

Source code in fastvideo/eval/metrics/base.py
def __init__(self) -> None:
    self._device: torch.device = torch.device("cpu")

Functions

fastvideo.eval.BaseMetric.accumulate
accumulate(sample: dict) -> None

Buffer per-sample features for a corpus-level metric.

Source code in fastvideo/eval/metrics/base.py
def accumulate(self, sample: dict) -> None:
    """Buffer per-sample features for a corpus-level metric."""
    raise NotImplementedError(f"{type(self).__name__}.accumulate is not implemented")
fastvideo.eval.BaseMetric.compute
compute(sample: dict) -> MetricResult

Per-sample metrics: compute the score for one sample.

sample["video"] is (T, C, H, W) float in [0, 1]. sample["reference"] (if used) has the same shape. Return self._skip(sample, reason) for missing inputs.

Source code in fastvideo/eval/metrics/base.py
def compute(self, sample: dict) -> MetricResult:
    """Per-sample metrics: compute the score for one sample.

    ``sample["video"]`` is ``(T, C, H, W)`` float in ``[0, 1]``.
    ``sample["reference"]`` (if used) has the same shape. Return
    ``self._skip(sample, reason)`` for missing inputs.
    """
    raise NotImplementedError(f"{type(self).__name__}.compute is not implemented")
fastvideo.eval.BaseMetric.finalize
finalize() -> MetricResult

Compute the corpus-level result from buffered state.

Source code in fastvideo/eval/metrics/base.py
def finalize(self) -> MetricResult:
    """Compute the corpus-level result from buffered state."""
    raise NotImplementedError(f"{type(self).__name__}.finalize is not implemented")
fastvideo.eval.BaseMetric.merge_from
merge_from(other: BaseMetric) -> None

Multi-GPU: fold another worker's accumulator state into this one.

Source code in fastvideo/eval/metrics/base.py
def merge_from(self, other: BaseMetric) -> None:  # noqa: B027 - intentionally optional override
    """Multi-GPU: fold another worker's accumulator state into this one."""
fastvideo.eval.BaseMetric.reset
reset() -> None

Clear accumulator state at the start of each evaluate() call.

Source code in fastvideo/eval/metrics/base.py
def reset(self) -> None:  # noqa: B027 - intentionally optional override
    """Clear accumulator state at the start of each evaluate() call."""
fastvideo.eval.BaseMetric.setup
setup() -> None

Eagerly load models. Called once by :class:EvalWorker.

Default is a no-op; metrics with no eager state (pixel math, closed-form ops) inherit this. Override only if your metric needs to load weights.

Source code in fastvideo/eval/metrics/base.py
def setup(self) -> None:  # noqa: B027 - intentionally optional override
    """Eagerly load models. Called once by :class:`EvalWorker`.

    Default is a no-op; metrics with no eager state (pixel math,
    closed-form ops) inherit this. Override only if your metric
    needs to load weights.
    """
fastvideo.eval.BaseMetric.to
to(device: str | device) -> BaseMetric

Move metric (and its internal models) to device.

Source code in fastvideo/eval/metrics/base.py
def to(self, device: str | torch.device) -> BaseMetric:
    """Move metric (and its internal models) to *device*."""
    self._device = torch.device(device)
    return self

fastvideo.eval.EvalResults

EvalResults(samples: list[dict[str, MetricResult]] | None = None, corpus: dict[str, MetricResult] | None = None)

Bases: list

Return type for :meth:Evaluator.evaluate with samples=....

Behaves like a list[dict[str, MetricResult]] — one dict per input sample, in input order — so existing iteration and indexing keeps working. The corpus attribute carries set-metric results (FAD, IS, …) that are properties of the whole input set, not of any individual sample. Empty dict when no set metric ran.

Source code in fastvideo/eval/types.py
def __init__(
    self,
    samples: list[dict[str, MetricResult]] | None = None,
    corpus: dict[str, MetricResult] | None = None,
) -> None:
    super().__init__(samples or [])
    self.corpus: dict[str, MetricResult] = corpus or {}

fastvideo.eval.Evaluator

Evaluator(metrics: list[str] | str = 'all', device: str = 'cuda:0', num_gpus: int = 1, compile: bool = False, *, loader_threads: int = 1, prefetch_factor: int = 2, pre_upload: bool = True, skip_missing_deps: bool = False)

Pre-initialized scorer for repeated evaluation.

Parameters

metrics : list[str] | str Metric names, group prefixes ("vbench"), or "all". device : str Single-GPU device (e.g. "cuda:0"). Ignored when num_gpus > 1. num_gpus : int Number of GPU replicas. Each gets its own :class:EvalWorker. compile : bool Apply :func:torch.compile to each metric's _model. loader_threads : int Background decode threads in the :class:VideoPool. Default 1 (hide decode behind compute). Bump for I/O-heavy benchmark sets where one loader can't keep up with the workers. prefetch_factor : int pool max_size = prefetch_factor * num_workers. Default 2 — one sample being consumed, one prefetched per worker. pre_upload : bool When True (default), the worker performs a single host→device upload of video / reference per sample before the metric loop, and every metric reads from that shared GPU-resident tensor. Without it, each metric pays its own .to(self.device) — N transfers of the same clip for N metrics, which dominates at high resolution. Set False for training-time eval, where keeping a clip resident on GPU across the metric loop would fight the training step for VRAM. skip_missing_deps : bool When True, silently drop explicit metric names whose optional deps aren't importable (with a one-line warning per skipped metric). Default False — an explicit name with a missing dep raises :class:ImportError at construction time. Group selectors ("vbench", "all") always silent-skip regardless of this flag.

Source code in fastvideo/eval/evaluator.py
def __init__(
    self,
    metrics: list[str] | str = "all",
    device: str = "cuda:0",
    num_gpus: int = 1,
    compile: bool = False,
    *,
    loader_threads: int = 1,
    prefetch_factor: int = 2,
    pre_upload: bool = True,
    skip_missing_deps: bool = False,
) -> None:
    names = _resolve_metric_names(metrics, skip_missing_deps=skip_missing_deps)
    if num_gpus > 1:
        self._workers = [
            EvalWorker(names,
                       f"cuda:{i}",
                       compile=compile,
                       pre_upload=pre_upload,
                       skip_missing_deps=skip_missing_deps) for i in range(num_gpus)
        ]
    else:
        self._workers = [
            EvalWorker(names, device, compile=compile, pre_upload=pre_upload, skip_missing_deps=skip_missing_deps)
        ]
    self._loader_threads = max(1, loader_threads)
    self._prefetch_factor = max(1, prefetch_factor)

Functions

fastvideo.eval.Evaluator.evaluate
evaluate(samples: Iterable[dict] | None = None, *, metrics: list[str] | None = None, **kwargs) -> dict[str, MetricResult] | EvalResults

Score one sample (kwargs form) or many samples (list form).

Both forms go through the same :class:VideoPool pipeline; video / reference paths are decoded asynchronously.

Parameters

samples : Iterable of sample dicts. Omit and pass kwargs for a single-sample call. metrics : Subset of this Evaluator's registered metrics to actually run on this batch. None (default) runs all registered. Lets a single long-lived Evaluator score different (gen, ref) corpora with different metric subsets across multiple evaluate() calls — e.g. LPIPS on a paired corpus, FVD on an unequal-cardinality corpus — without burning model loads. Set-metric accumulators are reset only for the metrics included in metrics, so state for other set metrics is preserved across calls.

Single sample::

ev.evaluate(video=tensor, text_prompt="...", fps=24.0)

Many samples::

ev.evaluate(samples=[{"video": ..., "reference": ...}, ...])

Many samples with a metric filter::

ev.evaluate(samples=lpips_samples, metrics=["common.lpips"])
ev.evaluate(samples=fvd_samples,   metrics=["common.fvd"])
Returns

dict[str, MetricResult] for the single-sample form; :class:EvalResults (list-of-dict subclass with .corpus) for the list form.

Source code in fastvideo/eval/evaluator.py
def evaluate(
    self,
    samples: Iterable[dict] | None = None,
    *,
    metrics: list[str] | None = None,
    **kwargs,
) -> dict[str, MetricResult] | EvalResults:
    """Score one sample (kwargs form) or many samples (list form).

    Both forms go through the same :class:`VideoPool` pipeline;
    ``video`` / ``reference`` paths are decoded asynchronously.

    Parameters
    ----------
    samples :
        Iterable of sample dicts.  Omit and pass kwargs for a
        single-sample call.
    metrics :
        Subset of this Evaluator's registered metrics to actually
        run on this batch.  ``None`` (default) runs all registered.
        Lets a single long-lived Evaluator score different (gen,
        ref) corpora with different metric subsets across multiple
        ``evaluate()`` calls — e.g. LPIPS on a paired corpus, FVD
        on an unequal-cardinality corpus — without burning model
        loads.  Set-metric accumulators are reset only for the
        metrics included in *metrics*, so state for other set
        metrics is preserved across calls.

    Single sample::

        ev.evaluate(video=tensor, text_prompt="...", fps=24.0)

    Many samples::

        ev.evaluate(samples=[{"video": ..., "reference": ...}, ...])

    Many samples with a metric filter::

        ev.evaluate(samples=lpips_samples, metrics=["common.lpips"])
        ev.evaluate(samples=fvd_samples,   metrics=["common.fvd"])

    Returns
    -------
    dict[str, MetricResult] for the single-sample form;
    :class:`EvalResults` (list-of-dict subclass with ``.corpus``) for
    the list form.
    """
    if metrics is not None:
        unknown = [m for m in metrics if m not in self.metric_names]
        if unknown:
            raise ValueError(f"metrics filter contains names not registered on this Evaluator: "
                             f"{unknown}; registered: {self.metric_names}")

    single = samples is None
    sample_list: list[dict] = [kwargs] if samples is None else list(samples)
    if not sample_list:
        return EvalResults(samples=[], corpus={})

    if single:
        set_names = self._workers[0].set_metrics().keys()
        active_set = (set_names if metrics is None else (set_names & set(metrics)))
        if active_set:
            # Set metrics need a population. A single sample can't produce a
            # meaningful corpus result, and silently discarding it (return
            # ``per_sample[0]`` only) hides the no-op. Force the list form.
            raise ValueError("Set-vs-set metrics require samples=[...] with >=2 entries; "
                             "the kwargs form (single sample) cannot produce a corpus "
                             f"result. Active set metrics: {sorted(active_set)}")

    per_sample, corpus = self._run(sample_list, metric_filter=metrics)

    if single:
        return per_sample[0]
    return EvalResults(samples=per_sample, corpus=corpus)
fastvideo.eval.Evaluator.release_cuda_memory
release_cuda_memory() -> None

Free CUDA caches on every replica without dropping models.

Source code in fastvideo/eval/evaluator.py
def release_cuda_memory(self) -> None:
    """Free CUDA caches on every replica without dropping models."""
    for w in self._workers:
        w.release_cuda_memory()
fastvideo.eval.Evaluator.reload
reload() -> None

Rebuild metrics dropped by :meth:unload.

Source code in fastvideo/eval/evaluator.py
def reload(self) -> None:
    """Rebuild metrics dropped by :meth:`unload`."""
    for w in self._workers:
        w.reload()
fastvideo.eval.Evaluator.shutdown
shutdown() -> None

No-op; kept for API compatibility with older callers.

Source code in fastvideo/eval/evaluator.py
def shutdown(self) -> None:
    """No-op; kept for API compatibility with older callers."""
fastvideo.eval.Evaluator.unload
unload() -> None

Drop metric refs on every replica. Reverse with :meth:reload.

Source code in fastvideo/eval/evaluator.py
def unload(self) -> None:
    """Drop metric refs on every replica. Reverse with :meth:`reload`."""
    for w in self._workers:
        w.unload()

fastvideo.eval.MetricResult dataclass

MetricResult(name: str, score: float | None, details: dict[str, Any] = dict())

Standard result container returned by all metrics.

score is None when the metric was skipped (e.g. missing required input). Check details["skipped"] for the reason.

fastvideo.eval.Video dataclass

Video(source: Any, fps: float | None = None, frames: Any = None, audio: Any = None, audio_sr: int | None = None)

Path-backed media handle. The :class:VideoPool populates frames (and optionally audio) before the metric loop sees the sample.

Functions

fastvideo.eval.as_video

as_video(x: str | Path | Tensor | Video) -> Video

Coerce path/tensor/Video → :class:Video for the pool to decode.

Path strings and :class:pathlib.Path become Video(source=str(x)); the pool then calls :func:load_video on first use. Tensors become Video(source=None, frames=x) — the pool sees .frames already populated and forwards untouched. :class:Video instances pass through.

Source code in fastvideo/eval/io/inputs.py
def as_video(x: str | Path | torch.Tensor | Video) -> Video:
    """Coerce path/tensor/Video → :class:`Video` for the pool to decode.

    Path strings and :class:`pathlib.Path` become ``Video(source=str(x))``;
    the pool then calls :func:`load_video` on first use.  Tensors become
    ``Video(source=None, frames=x)`` — the pool sees ``.frames`` already
    populated and forwards untouched.  :class:`Video` instances pass through.
    """
    if isinstance(x, Video):
        return x
    if isinstance(x, str | Path):
        return Video(source=str(x))
    if isinstance(x, torch.Tensor):
        return Video(source=None, frames=x)
    raise TypeError(f"Cannot coerce {type(x).__name__} to Video")

fastvideo.eval.ensure_checkpoint

ensure_checkpoint(name: str, source: str, filename: str | None = None) -> str

Resolve a model checkpoint path, downloading on miss.

See module docstring for the full source contract. name is used only as the local cache filename for URL sources; ignored otherwise.

Source code in fastvideo/eval/models.py
def ensure_checkpoint(
    name: str,
    source: str,
    filename: str | None = None,
) -> str:
    """Resolve a model checkpoint path, downloading on miss.

    See module docstring for the full source contract. *name* is used
    only as the local cache filename for URL sources; ignored otherwise.
    """
    if os.path.exists(source):
        return source

    if source.startswith(("http://", "https://")):
        return _ensure_url(name, source)

    if "/" in source:
        return _ensure_hf(source, filename)

    raise ValueError(f"Cannot resolve checkpoint: source {source!r} is neither a "
                     "path, URL, nor HF repo id")

fastvideo.eval.evaluate

evaluate(generated: Tensor | str | Path, reference: Tensor | str | Path | None = None, metrics: list[str] | str = 'all', device: str = 'cuda', **kwargs) -> dict[str, MetricResult] | list[dict[str, MetricResult]]

One-shot evaluation. For repeated use, prefer :func:create_evaluator.

Parameters

generated : Tensor | str | Path Generated video. Either a pre-loaded (T, C, H, W) tensor or a path to an mp4/avi/etc. — paths are decoded by the worker. reference : Tensor | str | Path | None Reference video (same accepted shapes as generated). metrics : list[str] | str Metric names, or "all". device : str PyTorch device string.

Source code in fastvideo/eval/api.py
def evaluate(
    generated: torch.Tensor | str | Path,
    reference: torch.Tensor | str | Path | None = None,
    metrics: list[str] | str = "all",
    device: str = "cuda",
    **kwargs,
) -> dict[str, MetricResult] | list[dict[str, MetricResult]]:
    """One-shot evaluation. For repeated use, prefer :func:`create_evaluator`.

    Parameters
    ----------
    generated : Tensor | str | Path
        Generated video. Either a pre-loaded ``(T, C, H, W)`` tensor or a
        path to an mp4/avi/etc. — paths are decoded by the worker.
    reference : Tensor | str | Path | None
        Reference video (same accepted shapes as *generated*).
    metrics : list[str] | str
        Metric names, or ``"all"``.
    device : str
        PyTorch device string.
    """
    ev = create_evaluator(metrics=metrics, device=device)
    kw: dict = {"video": generated, **kwargs}
    if reference is not None:
        kw["reference"] = reference
    return ev.evaluate(**kw)

fastvideo.eval.get_cache_dir

get_cache_dir() -> Path

Eval cache root.

Layout::

get_cache_dir() / models /   ← URL-fetched checkpoints (LAION head,
                               AMT, GRiT, …)
get_cache_dir() / torch  /   ← redirected ``TORCH_HOME`` (DINO etc.)
get_cache_dir() / clip   /   ← passed as ``download_root`` to
                               ``clip.load(...)`` callsites
~/.cache/huggingface/hub /   ← left at HF's default; widely shared
                               with other ML projects

Override priority: FASTVIDEO_EVAL_CACHE > ${FASTVIDEO_CACHE_ROOT}/eval.

Metric authors writing new code: when wrapping a third-party loader that has its own cache convention (CLIP's download_root, pyiqa's cache_dir, etc.), pass str(get_cache_dir() / "<library>") so users get a single FASTVIDEO_EVAL_CACHE knob to redirect them all.

Source code in fastvideo/eval/models.py
def get_cache_dir() -> Path:
    """Eval cache root.

    Layout::

        get_cache_dir() / models /   ← URL-fetched checkpoints (LAION head,
                                       AMT, GRiT, …)
        get_cache_dir() / torch  /   ← redirected ``TORCH_HOME`` (DINO etc.)
        get_cache_dir() / clip   /   ← passed as ``download_root`` to
                                       ``clip.load(...)`` callsites
        ~/.cache/huggingface/hub /   ← left at HF's default; widely shared
                                       with other ML projects

    Override priority: ``FASTVIDEO_EVAL_CACHE`` > ``${FASTVIDEO_CACHE_ROOT}/eval``.

    Metric authors writing new code: when wrapping a third-party loader
    that has its own cache convention (CLIP's ``download_root``, pyiqa's
    ``cache_dir``, etc.), pass ``str(get_cache_dir() / "<library>")`` so
    users get a single ``FASTVIDEO_EVAL_CACHE`` knob to redirect them all.
    """
    return Path(os.environ.get(
        "FASTVIDEO_EVAL_CACHE",
        os.path.join(envs.FASTVIDEO_CACHE_ROOT, "eval"),
    ))

fastvideo.eval.get_metric

get_metric(name: str, **kwargs: Any) -> BaseMetric

Instantiate a registered metric by name.

Checks that optional dependencies are installed before instantiation and gives a clear install hint pointing at the right extra group.

Source code in fastvideo/eval/registry.py
def get_metric(name: str, **kwargs: Any) -> BaseMetric:
    """Instantiate a registered metric by name.

    Checks that optional dependencies are installed before instantiation
    and gives a clear install hint pointing at the right extra group.
    """
    cls = _REGISTRY.get(name)
    if cls is None:
        available = ", ".join(sorted(_REGISTRY.keys()))
        raise KeyError(f"Unknown metric '{name}'. Available: {available}")

    for dep in getattr(cls, "dependencies", []):
        if not importlib.util.find_spec(dep):
            raise ImportError(f"{cls.__name__} requires '{dep}'. "
                              f"Install with: {_install_hint(name, dep)}")

    return cls(**kwargs)

fastvideo.eval.list_metrics

list_metrics() -> list[str]

Return sorted list of all registered metric names.

Source code in fastvideo/eval/registry.py
def list_metrics() -> list[str]:
    """Return sorted list of all registered metric names."""
    return sorted(_REGISTRY.keys())

fastvideo.eval.register

register(name: str)

Decorator to register a metric class.

Usage::

@register("ssim")
class SSIMMetric(BaseMetric):
    ...
Source code in fastvideo/eval/registry.py
def register(name: str):
    """Decorator to register a metric class.

    Usage::

        @register("ssim")
        class SSIMMetric(BaseMetric):
            ...
    """

    def wrapper(cls):
        _REGISTRY[name] = cls
        return cls

    return wrapper

fastvideo.eval.samples_from

samples_from(*, video: PathSpec | None = None, reference: PathSpec | None = None, audio: PathSpec | None = None, reference_audio: PathSpec | None = None, text_prompt: str | None = None, text_prompts: str | Path | list[str] | None = None, fps: float | None = None, auxiliary_info: dict | list[dict] | None = None, extras: dict | list[dict] | None = None, extract_audio: bool | str | Path = False, extract_workers: int = 4) -> list[dict]

Build a samples list from path-style inputs.

Parameters

video, reference, audio, reference_audio : File path, directory of files (sorted by name), or any iterable of paths. Pass whichever modalities apply to the metrics you plan to run — they attach to sample["video"] / sample["reference"] / sample["audio"] / sample["reference_audio"] respectively. Video paths are wrapped in :class:Video so :class:VideoPool decodes them lazily in parallel; audio paths stay as strings (audio metrics each load with their own resample / preprocess). text_prompt : A single prompt string broadcast onto every sample. text_prompts : A list of strings (one per sample), or a path to a .jsonl / .json file containing per-sample prompts. fps : Scalar fps broadcast onto every sample. auxiliary_info : Single dict (broadcast) or list of dicts (zipped) for sample["auxiliary_info"] — vbench structured-prompt metrics read this. extras : Catch-all per-sample attachments. Use for metric-specific keys the dedicated kwargs don't cover (scenario, view, actions, calibration, reference_take2, ...). Pass a single dict to broadcast or a list-of-dicts to zip; the keys merge into each sample dict. extract_audio : If truthy, auto-extract audio from each video / reference source into .wav files via PyAV and attach the paths under sample["audio"] / sample["reference_audio"]. Pass a path for a persistent cache, True for a tempdir. Skipped silently for videos with no audio stream; ignored wherever audio / reference_audio is already explicit. extract_workers : Parallel workers for extract_audio.

Returns

list[dict] Canonical samples shape — hand directly to :meth:Evaluator.evaluate.

Cardinality and shape

Let N = len(generated inputs) (the agreed length of whichever of video / audio you passed). References are attached 1:1 onto the first N samples; any extras (when |ref| > N) become standalone role-tagged samples at the end of the list, so set metrics like FVD see the full reference corpus while per-sample paired metrics like LPIPS only run on the first N pairs.

Notes

"Missing" keys are simply absent from the sample dict. Metrics handle them per their own contract (sample.get(...) for optional, sample[...] raises for required, or :meth:BaseMetric._skip for opt-in skip behavior). One fat samples list with many keys can serve many metrics — each reads its subset.

Source code in fastvideo/eval/io/inputs.py
def samples_from(
    *,
    # Modality inputs — pass whichever you have, in any combination.
    video: PathSpec | None = None,
    reference: PathSpec | None = None,
    audio: PathSpec | None = None,
    reference_audio: PathSpec | None = None,
    # Per-sample attachments.  Scalars broadcast to every sample; lists
    # / jsonl paths zip per-sample.
    text_prompt: str | None = None,
    text_prompts: str | Path | list[str] | None = None,
    fps: float | None = None,
    auxiliary_info: dict | list[dict] | None = None,
    # Catch-all for exotic metric inputs (physics_iq scenarios,
    # synthetic_optical_flow actions, ...).  Single dict broadcasts;
    # list of dicts zips.  Keys merge into each sample dict.
    extras: dict | list[dict] | None = None,
    # Sugar: pull audio off the video sources via PyAV.  Pass a path
    # to use a persistent on-disk cache, ``True`` for a system tempdir.
    extract_audio: bool | str | Path = False,
    extract_workers: int = 4,
) -> list[dict]:
    """Build a samples list from path-style inputs.

    Parameters
    ----------
    video, reference, audio, reference_audio :
        File path, directory of files (sorted by name), or any iterable
        of paths.  Pass whichever modalities apply to the metrics you
        plan to run — they attach to ``sample["video"]`` /
        ``sample["reference"]`` / ``sample["audio"]`` /
        ``sample["reference_audio"]`` respectively.  Video paths are
        wrapped in :class:`Video` so :class:`VideoPool` decodes them
        lazily in parallel; audio paths stay as strings (audio metrics
        each load with their own resample / preprocess).
    text_prompt :
        A single prompt string broadcast onto every sample.
    text_prompts :
        A list of strings (one per sample), or a path to a ``.jsonl`` /
        ``.json`` file containing per-sample prompts.
    fps :
        Scalar fps broadcast onto every sample.
    auxiliary_info :
        Single dict (broadcast) or list of dicts (zipped) for
        ``sample["auxiliary_info"]`` — vbench structured-prompt
        metrics read this.
    extras :
        Catch-all per-sample attachments.  Use for metric-specific keys
        the dedicated kwargs don't cover (``scenario``, ``view``,
        ``actions``, ``calibration``, ``reference_take2``, ...).  Pass
        a single dict to broadcast or a list-of-dicts to zip; the keys
        merge into each sample dict.
    extract_audio :
        If truthy, auto-extract audio from each ``video`` /
        ``reference`` source into ``.wav`` files via PyAV and attach
        the paths under ``sample["audio"]`` / ``sample["reference_audio"]``.
        Pass a path for a persistent cache, ``True`` for a tempdir.
        Skipped silently for videos with no audio stream; ignored
        wherever ``audio`` / ``reference_audio`` is already explicit.
    extract_workers :
        Parallel workers for ``extract_audio``.

    Returns
    -------
    list[dict]
        Canonical samples shape — hand directly to
        :meth:`Evaluator.evaluate`.

    Cardinality and shape
    ---------------------
    Let ``N = len(generated inputs)`` (the agreed length of whichever
    of ``video`` / ``audio`` you passed).  References are attached
    1:1 onto the first N samples; any extras (when ``|ref| > N``)
    become standalone role-tagged samples at the end of the list, so
    set metrics like FVD see the full reference corpus while per-sample
    paired metrics like LPIPS only run on the first N pairs.

    Notes
    -----
    "Missing" keys are simply absent from the sample dict.  Metrics
    handle them per their own contract (``sample.get(...)`` for
    optional, ``sample[...]`` raises for required, or
    :meth:`BaseMetric._skip` for opt-in skip behavior).  One fat samples
    list with many keys can serve many metrics — each reads its subset.
    """
    gen_video = _expand(video, _VIDEO_EXTS) if video is not None else None
    gen_audio = _expand(audio, _AUDIO_EXTS) if audio is not None else None
    ref_video = _expand(reference, _VIDEO_EXTS) if reference is not None else None
    ref_audio = _expand(reference_audio, _AUDIO_EXTS) if reference_audio is not None else None

    if gen_video is None and gen_audio is None:
        raise ValueError("samples_from: pass at least one of video= or audio=")

    # All "generated" inputs must agree on length — they describe the
    # same N samples in different modalities.
    n_candidates = [len(x) for x in (gen_video, gen_audio) if x is not None]
    if len({*n_candidates}) != 1:
        raise ValueError(f"samples_from: generated inputs have inconsistent lengths {n_candidates}")
    n = n_candidates[0]

    prompts = _broadcast(text_prompts if text_prompts is not None else text_prompt, n, loader=_load_prompts)
    aux = _broadcast(auxiliary_info, n)
    extras_per_sample = _broadcast(extras, n)
    if (text_prompts is not None) and (text_prompt is not None):
        raise ValueError("Pass either text_prompt (broadcast) or text_prompts (per-sample), not both.")

    samples: list[dict[str, Any]] = []
    for i in range(n):
        s: dict[str, Any] = {}
        if gen_video is not None:
            s["video"] = as_video(gen_video[i])
        if gen_audio is not None:
            s["audio"] = str(gen_audio[i])
        if ref_video is not None and i < len(ref_video):
            s["reference"] = as_video(ref_video[i])
        if ref_audio is not None and i < len(ref_audio):
            s["reference_audio"] = str(ref_audio[i])
        if prompts is not None:
            s["text_prompt"] = prompts[i]
        if fps is not None:
            s["fps"] = fps
        if aux is not None:
            s["auxiliary_info"] = aux[i]
        if extras_per_sample is not None:
            s.update(extras_per_sample[i])
        samples.append(s)

    # Unmatched references → role-tagged set samples.  Per-sample paired
    # metrics skip these via the worker's role-skip rule; set metrics
    # (FVD, FAD) accumulate the features.
    if ref_video is not None and len(ref_video) > n:
        for v in ref_video[n:]:
            samples.append({"video": as_video(v), "role": "reference"})
    if ref_audio is not None and len(ref_audio) > n:
        for a in ref_audio[n:]:
            samples.append({"audio": str(a), "role": "reference"})

    if extract_audio:
        _attach_extracted_audio(samples, extract_audio, extract_workers)
    return samples

Modules

fastvideo.eval.api

Classes

Functions

fastvideo.eval.api.evaluate
evaluate(generated: Tensor | str | Path, reference: Tensor | str | Path | None = None, metrics: list[str] | str = 'all', device: str = 'cuda', **kwargs) -> dict[str, MetricResult] | list[dict[str, MetricResult]]

One-shot evaluation. For repeated use, prefer :func:create_evaluator.

Parameters

generated : Tensor | str | Path Generated video. Either a pre-loaded (T, C, H, W) tensor or a path to an mp4/avi/etc. — paths are decoded by the worker. reference : Tensor | str | Path | None Reference video (same accepted shapes as generated). metrics : list[str] | str Metric names, or "all". device : str PyTorch device string.

Source code in fastvideo/eval/api.py
def evaluate(
    generated: torch.Tensor | str | Path,
    reference: torch.Tensor | str | Path | None = None,
    metrics: list[str] | str = "all",
    device: str = "cuda",
    **kwargs,
) -> dict[str, MetricResult] | list[dict[str, MetricResult]]:
    """One-shot evaluation. For repeated use, prefer :func:`create_evaluator`.

    Parameters
    ----------
    generated : Tensor | str | Path
        Generated video. Either a pre-loaded ``(T, C, H, W)`` tensor or a
        path to an mp4/avi/etc. — paths are decoded by the worker.
    reference : Tensor | str | Path | None
        Reference video (same accepted shapes as *generated*).
    metrics : list[str] | str
        Metric names, or ``"all"``.
    device : str
        PyTorch device string.
    """
    ev = create_evaluator(metrics=metrics, device=device)
    kw: dict = {"video": generated, **kwargs}
    if reference is not None:
        kw["reference"] = reference
    return ev.evaluate(**kw)

fastvideo.eval.datasets

Prompt-corpus datasets for end-to-end benchmark evaluation.

Public API mirrors :mod:fastvideo.eval (metrics side):

from fastvideo.eval.datasets import (
    PromptDataset, Sample,
    register_dataset, get_dataset, list_datasets,
)

A dataset is an iterable of plain dicts (one per sample). Built-in datasets self-register at import time. To add one, drop a module into this package that subclasses :class:PromptDataset and decorates with @register_dataset("name") — auto-discovery picks it up.

Classes

fastvideo.eval.datasets.PromptDataset
PromptDataset()

Iterable corpus of sample dicts. Subclasses populate self._rows.

Source code in fastvideo/eval/datasets/base.py
def __init__(self) -> None:
    self._rows: list[dict] = []
Functions
fastvideo.eval.datasets.PromptDataset.by_dimension
by_dimension() -> dict[str, list[dict]]

Group samples by dimension. A multi-dim sample appears under each.

Source code in fastvideo/eval/datasets/base.py
def by_dimension(self) -> dict[str, list[dict]]:
    """Group samples by dimension. A multi-dim sample appears under each."""
    out: dict[str, list[dict]] = {}
    for s in self._rows:
        for d in s.get("dimensions", ()):
            out.setdefault(d, []).append(s)
    return out
fastvideo.eval.datasets.Sample

Bases: TypedDict

Documented schema for a row yielded by :class:PromptDataset.

Only prompt is required. Extra keys beyond these are forwarded to the runner's eval-kwargs builder verbatim, so action-conditioned or audio-bearing benchmarks can add their own fields without changing the base class.

fastvideo.eval.datasets.VBenchPromptDataset
VBenchPromptDataset(dimensions: list[str] | str = 'all', full_info_path: str | Path | None = None)

Bases: PromptDataset

VBench prompts filtered by evaluation dimension.

Parameters:

Name Type Description Default
dimensions list[str] | str

List of dimension names, or "all". Unknown dimensions raise ValueError.

'all'
full_info_path str | Path | None

Optional override for VBench_full_info.json; defaults to autodetection.

None

A prompt that belongs to several requested dimensions is yielded once; its dimensions list carries all matches so the scorer can route.

Source code in fastvideo/eval/datasets/vbench.py
def __init__(
    self,
    dimensions: list[str] | str = "all",
    full_info_path: str | Path | None = None,
) -> None:
    super().__init__()
    path = Path(full_info_path) if full_info_path else _locate_full_info()
    with path.open() as f:
        entries = json.load(f)

    all_dims = sorted({d for e in entries for d in e["dimension"]})
    if dimensions == "all":
        self.dimensions: list[str] = all_dims
    else:
        unknown = set(dimensions) - set(all_dims)
        if unknown:
            raise ValueError(f"Unknown VBench dimensions: {sorted(unknown)}. "
                             f"Available: {all_dims}")
        self.dimensions = list(dimensions)

    wanted = set(self.dimensions)
    for entry in entries:
        relevant = [d for d in entry["dimension"] if d in wanted]
        if not relevant:
            continue
        n = (TEMPORAL_FLICKERING_SAMPLES if "temporal_flickering" in relevant else DEFAULT_SAMPLES)

        # Strip the outer {dim_name: ...} wrapper from upstream's aux
        # schema so every metric reads its inputs from a flat dict.
        #
        # This unwraps exactly one level — the dimension key. Whatever
        # shape lives inside is the metric's contract:
        #
        #   color:                {"color": {"color": "red"}}
        #     → flat: {"color": "red"}                  (scalar)
        #
        #   object_class:         {"object_class": {"object": "person"}}
        #     → flat: {"object": "person"}              (scalar)
        #
        #   multiple_objects:     {"multiple_objects": {"object": "a and b"}}
        #     → flat: {"object": "a and b"}             (scalar)
        #
        #   spatial_relationship: {"spatial_relationship":
        #                            {"spatial_relationship":
        #                                {"object_a": ..., "object_b": ...,
        #                                 "relationship": ...}}}
        #     → flat: {"spatial_relationship": {object_a,object_b,relationship}}
        #
        # Note the spatial_relationship case keeps a nested inner dict
        # by design — upstream double-wraps it, the SpatialRelationship
        # metric reads ``aux["spatial_relationship"]`` expecting that
        # inner dict. Don't "simplify" the wrapping away.
        raw_aux = entry.get("auxiliary_info") or {}
        flat_aux: dict = {}
        for v in raw_aux.values():
            if isinstance(v, dict):
                flat_aux.update(v)

        self._rows.append({
            "prompt": entry["prompt_en"],
            "n_samples": n,
            "dimensions": relevant,
            "auxiliary_info": flat_aux,
        })
    self.full_info_path = path

Functions

fastvideo.eval.datasets.get_dataset
get_dataset(name: str, **kwargs: Any) -> BasePromptDataset

Instantiate a registered dataset by name.

Source code in fastvideo/eval/datasets/registry.py
def get_dataset(name: str, **kwargs: Any) -> BasePromptDataset:
    """Instantiate a registered dataset by name."""
    cls = _REGISTRY.get(name)
    if cls is None:
        available = ", ".join(sorted(_REGISTRY.keys()))
        raise KeyError(f"Unknown dataset '{name}'. Available: {available}")
    return cls(**kwargs)
fastvideo.eval.datasets.list_datasets
list_datasets() -> list[str]

Return sorted list of all registered dataset names.

Source code in fastvideo/eval/datasets/registry.py
def list_datasets() -> list[str]:
    """Return sorted list of all registered dataset names."""
    return sorted(_REGISTRY.keys())
fastvideo.eval.datasets.register_dataset
register_dataset(name: str)

Decorator to register a prompt-dataset class.

Usage::

@register_dataset("vbench")
class VBenchPromptDataset(BasePromptDataset):
    ...
Source code in fastvideo/eval/datasets/registry.py
def register_dataset(name: str):
    """Decorator to register a prompt-dataset class.

    Usage::

        @register_dataset("vbench")
        class VBenchPromptDataset(BasePromptDataset):
            ...
    """

    def wrapper(cls):
        cls.name = name
        _REGISTRY[name] = cls
        return cls

    return wrapper

Modules

fastvideo.eval.datasets.base

Prompt-corpus datasets.

A :class:PromptDataset is an iterable of sample dicts describing the prompts and conditions for a benchmark. Each sample is a plain dict — no dataclass, no schema enforcement — that flows directly into both generation (VideoGenerator.generate_video(**sample)) and scoring (Evaluator.evaluate(**eval_kwargs)). The runner picks well-known keys (prompt, n_samples, dimensions, auxiliary_info, ...) and passes the rest through.

This matches the surrounding FastVideo style:

  • :class:fastvideo.dataset.validation_dataset.ValidationDataset yields dicts.
  • :meth:fastvideo.VideoGenerator.generate_video consumes **kwargs.
  • :meth:fastvideo.eval.Evaluator.evaluate consumes **kwargs.

To add a new benchmark:

  1. Subclass :class:PromptDataset, populate self._rows with dicts in __init__.
  2. Decorate with @register_dataset("my_bench").

Convention for auxiliary_info: a flat dict of metric-keyed values (e.g. {"color": "red"}). Benchmarks with nested aux schemas (VBench's {dim: {key: val}}) flatten at load time so every consumer sees the same shape.

Classes
fastvideo.eval.datasets.base.PromptDataset
PromptDataset()

Iterable corpus of sample dicts. Subclasses populate self._rows.

Source code in fastvideo/eval/datasets/base.py
def __init__(self) -> None:
    self._rows: list[dict] = []
Functions
fastvideo.eval.datasets.base.PromptDataset.by_dimension
by_dimension() -> dict[str, list[dict]]

Group samples by dimension. A multi-dim sample appears under each.

Source code in fastvideo/eval/datasets/base.py
def by_dimension(self) -> dict[str, list[dict]]:
    """Group samples by dimension. A multi-dim sample appears under each."""
    out: dict[str, list[dict]] = {}
    for s in self._rows:
        for d in s.get("dimensions", ()):
            out.setdefault(d, []).append(s)
    return out
fastvideo.eval.datasets.base.Sample

Bases: TypedDict

Documented schema for a row yielded by :class:PromptDataset.

Only prompt is required. Extra keys beyond these are forwarded to the runner's eval-kwargs builder verbatim, so action-conditioned or audio-bearing benchmarks can add their own fields without changing the base class.

fastvideo.eval.datasets.physics_iq

Physics-IQ benchmark prompt corpus.

Yields one sample dict per take-1 scenario, paired with its take-2 reference and both takes' real motion masks. Each row drops straight into :meth:fastvideo.eval.Evaluator.evaluate for the physics_iq metric:

{
    "prompt": <description>,
    "reference":            "<take-1 mp4>",
    "reference_take2":      "<take-2 mp4>",
    "reference_mask":       "<take-1 mask mp4>",
    "reference_take2_mask": "<take-2 mask mp4>",
    "scenario": <scenario_id>,
    "view": <camera view>,
    "auxiliary_info": { ... metadata ... },
}

Self-contained dataset: the manifest CSV is vendored under fastvideo/eval/metrics/physics_iq/_vendored/descriptions.csv; per-scenario videos/masks/switch-frames auto-fetch on first use from the public DeepMind bucket into ${FASTVIDEO_EVAL_CACHE}/datasets/physics_iq/. Pass auto_download=False (or dataset_root= pointing at a pre-downloaded copy) to opt out of network fetches.

Classes
fastvideo.eval.datasets.physics_iq.PhysicsIQPromptDataset
PhysicsIQPromptDataset(dataset_root: str | Path | None = None, *, fps: int = _DEFAULT_FPS, limit: int | None = None, generated_dir: str | Path | None = None, auto_download: bool = True)

Bases: PromptDataset

Physics-IQ benchmark prompt corpus.

Self-contained: get_dataset("physics_iq") works with no kwargs. The manifest CSV is vendored next to the metric, and per-scenario assets auto-fetch on first miss from the public bucket into ${FASTVIDEO_EVAL_CACHE}/datasets/physics_iq/.

Parameters:

Name Type Description Default
dataset_root str | Path | None

path to a pre-downloaded copy of the Physics-IQ release. Defaults to ${FASTVIDEO_EVAL_CACHE}/datasets/physics_iq; override only if you already have a local mirror.

None
fps int

target frame rate. The release ships at 30 FPS; other rates transcode once on first access into <root>/.physics_iq_cache/.

_DEFAULT_FPS
limit int | None

optional truncation for quick smoke runs. Apply this kwarg (not a post-construction slice) so we only fetch the assets for the scenarios actually requested.

None
generated_dir str | Path | None

optional directory of pre-generated videos — attaches each manifest row's expected output path to the sample dict under auxiliary_info["generated_video_path"].

None
auto_download bool

when True (the default), missing testing videos, masks, and switch frames are fetched from the public bucket into dataset_root. Set False for air-gapped runs; the loader will then raise FileNotFoundError on miss.

True
Source code in fastvideo/eval/datasets/physics_iq.py
def __init__(
    self,
    dataset_root: str | Path | None = None,
    *,
    fps: int = _DEFAULT_FPS,
    limit: int | None = None,
    generated_dir: str | Path | None = None,
    auto_download: bool = True,
) -> None:
    super().__init__()

    repo_root = Path(dataset_root or _default_dataset_root()).expanduser().resolve()
    self.repo_root = repo_root
    self.dataset_dir = _resolve_dataset_dir(repo_root)
    self.descriptions_path = _resolve_descriptions_path(repo_root, self.dataset_dir)
    self.cache_dir = repo_root / ".physics_iq_cache"
    self.fps = fps
    self.auto_download = auto_download
    self.bucket_url = _bucket_url()

    scenarios = self._iter_scenarios(
        fps=fps,
        generated_dir=generated_dir,
        limit=limit,
    )
    self._rows = [_scenario_to_row(s) for s in scenarios]
fastvideo.eval.datasets.physics_iq.PhysicsIQScenario dataclass
PhysicsIQScenario(scenario_id: str, view: str, scenario_name: str, take1_video_path: str, take2_video_path: str, switch_frame_path: str, caption: str, expected_gen_filename: str, generated_video_path: str | None = None, take1_mask_path: str | None = None, take2_mask_path: str | None = None)

One row of the Physics-IQ manifest, fully resolved on disk.

Functions
fastvideo.eval.datasets.registry

Registry for prompt-corpus datasets, mirroring :mod:fastvideo.eval.registry.

Functions
fastvideo.eval.datasets.registry.get_dataset
get_dataset(name: str, **kwargs: Any) -> BasePromptDataset

Instantiate a registered dataset by name.

Source code in fastvideo/eval/datasets/registry.py
def get_dataset(name: str, **kwargs: Any) -> BasePromptDataset:
    """Instantiate a registered dataset by name."""
    cls = _REGISTRY.get(name)
    if cls is None:
        available = ", ".join(sorted(_REGISTRY.keys()))
        raise KeyError(f"Unknown dataset '{name}'. Available: {available}")
    return cls(**kwargs)
fastvideo.eval.datasets.registry.list_datasets
list_datasets() -> list[str]

Return sorted list of all registered dataset names.

Source code in fastvideo/eval/datasets/registry.py
def list_datasets() -> list[str]:
    """Return sorted list of all registered dataset names."""
    return sorted(_REGISTRY.keys())
fastvideo.eval.datasets.registry.register_dataset
register_dataset(name: str)

Decorator to register a prompt-dataset class.

Usage::

@register_dataset("vbench")
class VBenchPromptDataset(BasePromptDataset):
    ...
Source code in fastvideo/eval/datasets/registry.py
def register_dataset(name: str):
    """Decorator to register a prompt-dataset class.

    Usage::

        @register_dataset("vbench")
        class VBenchPromptDataset(BasePromptDataset):
            ...
    """

    def wrapper(cls):
        cls.name = name
        _REGISTRY[name] = cls
        return cls

    return wrapper
fastvideo.eval.datasets.vbench

VBench prompt corpus.

Single source of truth: upstream's VBench_full_info.json (946 entries, each with prompt_en, a dimension list, optional auxiliary_info keyed by dimension).

Classes
fastvideo.eval.datasets.vbench.VBenchPromptDataset
VBenchPromptDataset(dimensions: list[str] | str = 'all', full_info_path: str | Path | None = None)

Bases: PromptDataset

VBench prompts filtered by evaluation dimension.

Parameters:

Name Type Description Default
dimensions list[str] | str

List of dimension names, or "all". Unknown dimensions raise ValueError.

'all'
full_info_path str | Path | None

Optional override for VBench_full_info.json; defaults to autodetection.

None

A prompt that belongs to several requested dimensions is yielded once; its dimensions list carries all matches so the scorer can route.

Source code in fastvideo/eval/datasets/vbench.py
def __init__(
    self,
    dimensions: list[str] | str = "all",
    full_info_path: str | Path | None = None,
) -> None:
    super().__init__()
    path = Path(full_info_path) if full_info_path else _locate_full_info()
    with path.open() as f:
        entries = json.load(f)

    all_dims = sorted({d for e in entries for d in e["dimension"]})
    if dimensions == "all":
        self.dimensions: list[str] = all_dims
    else:
        unknown = set(dimensions) - set(all_dims)
        if unknown:
            raise ValueError(f"Unknown VBench dimensions: {sorted(unknown)}. "
                             f"Available: {all_dims}")
        self.dimensions = list(dimensions)

    wanted = set(self.dimensions)
    for entry in entries:
        relevant = [d for d in entry["dimension"] if d in wanted]
        if not relevant:
            continue
        n = (TEMPORAL_FLICKERING_SAMPLES if "temporal_flickering" in relevant else DEFAULT_SAMPLES)

        # Strip the outer {dim_name: ...} wrapper from upstream's aux
        # schema so every metric reads its inputs from a flat dict.
        #
        # This unwraps exactly one level — the dimension key. Whatever
        # shape lives inside is the metric's contract:
        #
        #   color:                {"color": {"color": "red"}}
        #     → flat: {"color": "red"}                  (scalar)
        #
        #   object_class:         {"object_class": {"object": "person"}}
        #     → flat: {"object": "person"}              (scalar)
        #
        #   multiple_objects:     {"multiple_objects": {"object": "a and b"}}
        #     → flat: {"object": "a and b"}             (scalar)
        #
        #   spatial_relationship: {"spatial_relationship":
        #                            {"spatial_relationship":
        #                                {"object_a": ..., "object_b": ...,
        #                                 "relationship": ...}}}
        #     → flat: {"spatial_relationship": {object_a,object_b,relationship}}
        #
        # Note the spatial_relationship case keeps a nested inner dict
        # by design — upstream double-wraps it, the SpatialRelationship
        # metric reads ``aux["spatial_relationship"]`` expecting that
        # inner dict. Don't "simplify" the wrapping away.
        raw_aux = entry.get("auxiliary_info") or {}
        flat_aux: dict = {}
        for v in raw_aux.values():
            if isinstance(v, dict):
                flat_aux.update(v)

        self._rows.append({
            "prompt": entry["prompt_en"],
            "n_samples": n,
            "dimensions": relevant,
            "auxiliary_info": flat_aux,
        })
    self.full_info_path = path
Functions

fastvideo.eval.evaluator

User-facing scorer.

Layering (mirrors FastVideo's VideoGenerator → Worker pattern, but in-process)::

Evaluator               ← user-facing
  └── EvalWorker × N    ← single-GPU; owns metric replicas
  └── VideoPool         ← async path-→-tensor prefetch (per evaluate call)

The constructor builds one :class:EvalWorker per GPU and loads every metric on every worker eagerly. :meth:evaluate is the single entry point: pass kwargs for one sample, or pass a list of sample dicts to fan-out across GPU replicas with pipelined decoding — same method, return type follows the input shape.

Classes

fastvideo.eval.evaluator.Evaluator
Evaluator(metrics: list[str] | str = 'all', device: str = 'cuda:0', num_gpus: int = 1, compile: bool = False, *, loader_threads: int = 1, prefetch_factor: int = 2, pre_upload: bool = True, skip_missing_deps: bool = False)

Pre-initialized scorer for repeated evaluation.

Parameters

metrics : list[str] | str Metric names, group prefixes ("vbench"), or "all". device : str Single-GPU device (e.g. "cuda:0"). Ignored when num_gpus > 1. num_gpus : int Number of GPU replicas. Each gets its own :class:EvalWorker. compile : bool Apply :func:torch.compile to each metric's _model. loader_threads : int Background decode threads in the :class:VideoPool. Default 1 (hide decode behind compute). Bump for I/O-heavy benchmark sets where one loader can't keep up with the workers. prefetch_factor : int pool max_size = prefetch_factor * num_workers. Default 2 — one sample being consumed, one prefetched per worker. pre_upload : bool When True (default), the worker performs a single host→device upload of video / reference per sample before the metric loop, and every metric reads from that shared GPU-resident tensor. Without it, each metric pays its own .to(self.device) — N transfers of the same clip for N metrics, which dominates at high resolution. Set False for training-time eval, where keeping a clip resident on GPU across the metric loop would fight the training step for VRAM. skip_missing_deps : bool When True, silently drop explicit metric names whose optional deps aren't importable (with a one-line warning per skipped metric). Default False — an explicit name with a missing dep raises :class:ImportError at construction time. Group selectors ("vbench", "all") always silent-skip regardless of this flag.

Source code in fastvideo/eval/evaluator.py
def __init__(
    self,
    metrics: list[str] | str = "all",
    device: str = "cuda:0",
    num_gpus: int = 1,
    compile: bool = False,
    *,
    loader_threads: int = 1,
    prefetch_factor: int = 2,
    pre_upload: bool = True,
    skip_missing_deps: bool = False,
) -> None:
    names = _resolve_metric_names(metrics, skip_missing_deps=skip_missing_deps)
    if num_gpus > 1:
        self._workers = [
            EvalWorker(names,
                       f"cuda:{i}",
                       compile=compile,
                       pre_upload=pre_upload,
                       skip_missing_deps=skip_missing_deps) for i in range(num_gpus)
        ]
    else:
        self._workers = [
            EvalWorker(names, device, compile=compile, pre_upload=pre_upload, skip_missing_deps=skip_missing_deps)
        ]
    self._loader_threads = max(1, loader_threads)
    self._prefetch_factor = max(1, prefetch_factor)
Functions
fastvideo.eval.evaluator.Evaluator.evaluate
evaluate(samples: Iterable[dict] | None = None, *, metrics: list[str] | None = None, **kwargs) -> dict[str, MetricResult] | EvalResults

Score one sample (kwargs form) or many samples (list form).

Both forms go through the same :class:VideoPool pipeline; video / reference paths are decoded asynchronously.

Parameters

samples : Iterable of sample dicts. Omit and pass kwargs for a single-sample call. metrics : Subset of this Evaluator's registered metrics to actually run on this batch. None (default) runs all registered. Lets a single long-lived Evaluator score different (gen, ref) corpora with different metric subsets across multiple evaluate() calls — e.g. LPIPS on a paired corpus, FVD on an unequal-cardinality corpus — without burning model loads. Set-metric accumulators are reset only for the metrics included in metrics, so state for other set metrics is preserved across calls.

Single sample::

ev.evaluate(video=tensor, text_prompt="...", fps=24.0)

Many samples::

ev.evaluate(samples=[{"video": ..., "reference": ...}, ...])

Many samples with a metric filter::

ev.evaluate(samples=lpips_samples, metrics=["common.lpips"])
ev.evaluate(samples=fvd_samples,   metrics=["common.fvd"])
Returns

dict[str, MetricResult] for the single-sample form; :class:EvalResults (list-of-dict subclass with .corpus) for the list form.

Source code in fastvideo/eval/evaluator.py
def evaluate(
    self,
    samples: Iterable[dict] | None = None,
    *,
    metrics: list[str] | None = None,
    **kwargs,
) -> dict[str, MetricResult] | EvalResults:
    """Score one sample (kwargs form) or many samples (list form).

    Both forms go through the same :class:`VideoPool` pipeline;
    ``video`` / ``reference`` paths are decoded asynchronously.

    Parameters
    ----------
    samples :
        Iterable of sample dicts.  Omit and pass kwargs for a
        single-sample call.
    metrics :
        Subset of this Evaluator's registered metrics to actually
        run on this batch.  ``None`` (default) runs all registered.
        Lets a single long-lived Evaluator score different (gen,
        ref) corpora with different metric subsets across multiple
        ``evaluate()`` calls — e.g. LPIPS on a paired corpus, FVD
        on an unequal-cardinality corpus — without burning model
        loads.  Set-metric accumulators are reset only for the
        metrics included in *metrics*, so state for other set
        metrics is preserved across calls.

    Single sample::

        ev.evaluate(video=tensor, text_prompt="...", fps=24.0)

    Many samples::

        ev.evaluate(samples=[{"video": ..., "reference": ...}, ...])

    Many samples with a metric filter::

        ev.evaluate(samples=lpips_samples, metrics=["common.lpips"])
        ev.evaluate(samples=fvd_samples,   metrics=["common.fvd"])

    Returns
    -------
    dict[str, MetricResult] for the single-sample form;
    :class:`EvalResults` (list-of-dict subclass with ``.corpus``) for
    the list form.
    """
    if metrics is not None:
        unknown = [m for m in metrics if m not in self.metric_names]
        if unknown:
            raise ValueError(f"metrics filter contains names not registered on this Evaluator: "
                             f"{unknown}; registered: {self.metric_names}")

    single = samples is None
    sample_list: list[dict] = [kwargs] if samples is None else list(samples)
    if not sample_list:
        return EvalResults(samples=[], corpus={})

    if single:
        set_names = self._workers[0].set_metrics().keys()
        active_set = (set_names if metrics is None else (set_names & set(metrics)))
        if active_set:
            # Set metrics need a population. A single sample can't produce a
            # meaningful corpus result, and silently discarding it (return
            # ``per_sample[0]`` only) hides the no-op. Force the list form.
            raise ValueError("Set-vs-set metrics require samples=[...] with >=2 entries; "
                             "the kwargs form (single sample) cannot produce a corpus "
                             f"result. Active set metrics: {sorted(active_set)}")

    per_sample, corpus = self._run(sample_list, metric_filter=metrics)

    if single:
        return per_sample[0]
    return EvalResults(samples=per_sample, corpus=corpus)
fastvideo.eval.evaluator.Evaluator.release_cuda_memory
release_cuda_memory() -> None

Free CUDA caches on every replica without dropping models.

Source code in fastvideo/eval/evaluator.py
def release_cuda_memory(self) -> None:
    """Free CUDA caches on every replica without dropping models."""
    for w in self._workers:
        w.release_cuda_memory()
fastvideo.eval.evaluator.Evaluator.reload
reload() -> None

Rebuild metrics dropped by :meth:unload.

Source code in fastvideo/eval/evaluator.py
def reload(self) -> None:
    """Rebuild metrics dropped by :meth:`unload`."""
    for w in self._workers:
        w.reload()
fastvideo.eval.evaluator.Evaluator.shutdown
shutdown() -> None

No-op; kept for API compatibility with older callers.

Source code in fastvideo/eval/evaluator.py
def shutdown(self) -> None:
    """No-op; kept for API compatibility with older callers."""
fastvideo.eval.evaluator.Evaluator.unload
unload() -> None

Drop metric refs on every replica. Reverse with :meth:reload.

Source code in fastvideo/eval/evaluator.py
def unload(self) -> None:
    """Drop metric refs on every replica. Reverse with :meth:`reload`."""
    for w in self._workers:
        w.unload()

Functions

fastvideo.eval.io

Classes

fastvideo.eval.io.NoAudioStreamError

Bases: ValueError

Raised when a video file has no audio stream to extract.

Functions

fastvideo.eval.io.as_video
as_video(x: str | Path | Tensor | Video) -> Video

Coerce path/tensor/Video → :class:Video for the pool to decode.

Path strings and :class:pathlib.Path become Video(source=str(x)); the pool then calls :func:load_video on first use. Tensors become Video(source=None, frames=x) — the pool sees .frames already populated and forwards untouched. :class:Video instances pass through.

Source code in fastvideo/eval/io/inputs.py
def as_video(x: str | Path | torch.Tensor | Video) -> Video:
    """Coerce path/tensor/Video → :class:`Video` for the pool to decode.

    Path strings and :class:`pathlib.Path` become ``Video(source=str(x))``;
    the pool then calls :func:`load_video` on first use.  Tensors become
    ``Video(source=None, frames=x)`` — the pool sees ``.frames`` already
    populated and forwards untouched.  :class:`Video` instances pass through.
    """
    if isinstance(x, Video):
        return x
    if isinstance(x, str | Path):
        return Video(source=str(x))
    if isinstance(x, torch.Tensor):
        return Video(source=None, frames=x)
    raise TypeError(f"Cannot coerce {type(x).__name__} to Video")
fastvideo.eval.io.build_eval_kwargs
build_eval_kwargs(row: dict, video_path: Path, *, fps: float = 24.0) -> dict[str, Any]

Build evaluator kwargs from a sample row + a video on disk.

Loads the video as (T,C,H,W) and adds the leading batch dim. Forwards prompt (as scalar text_prompt) and auxiliary_info (as scalar dict) when present on the row — matches the one-sample-per-call contract that the evaluator and every metric assume.

Source code in fastvideo/eval/io/paths.py
def build_eval_kwargs(row: dict, video_path: Path, *, fps: float = 24.0) -> dict[str, Any]:
    """Build evaluator kwargs from a sample row + a video on disk.

    Loads the video as ``(T,C,H,W)`` and adds the leading batch dim.
    Forwards ``prompt`` (as scalar ``text_prompt``) and
    ``auxiliary_info`` (as scalar dict) when present on the row —
    matches the one-sample-per-call contract that the evaluator and
    every metric assume.
    """
    from fastvideo.eval.io.video import load_video

    video = load_video(str(video_path))  # (T, C, H, W) in [0, 1]
    kwargs: dict[str, Any] = {
        "video": video.unsqueeze(0),  # (1, T, C, H, W)
        "fps": fps,
    }
    if "prompt" in row:
        kwargs["text_prompt"] = row["prompt"]
    aux = row.get("auxiliary_info")
    if aux:
        kwargs["auxiliary_info"] = aux
    return kwargs
fastvideo.eval.io.default_filename
default_filename(row: dict, idx: int, ext: str = '.mp4') -> str

<sanitized-prompt>-<idx>.mp4 — VBench-style.

Source code in fastvideo/eval/io/paths.py
def default_filename(row: dict, idx: int, ext: str = ".mp4") -> str:
    """``<sanitized-prompt>-<idx>.mp4`` — VBench-style."""
    return f"{sanitize_prompt(row['prompt'])}-{idx}{ext}"
fastvideo.eval.io.extract_audio_track
extract_audio_track(video_path: str | Path, *, output_dir: str | Path, sample_rate: int | None = None, codec: str = 'pcm_s16le') -> Path

Pull the first audio stream from video_path to a .wav in output_dir.

Idempotent — if output_dir / {stem}.wav already exists, the existing file is returned without re-decoding. This makes the helper safe to call repeatedly from parallel workers on overlapping inputs (the cache hit is the fast path).

Parameters

video_path : Source video file (anything PyAV can open: mp4, mkv, webm, …). output_dir : Directory the .wav lands in. Created if it doesn't exist. sample_rate : If set, resample to this rate. Default (None) preserves the source rate — most audio metrics resample again internally to their own target rate, so adding a resample step here would just be wasted work. codec : PCM codec for the output. pcm_s16le (default) gives 16-bit little-endian PCM, the most broadly compatible WAV format.

Returns

Path Absolute path to the extracted .wav.

Raises

NoAudioStreamError If video_path has no audio stream.

Source code in fastvideo/eval/io/audio.py
def extract_audio_track(
    video_path: str | Path,
    *,
    output_dir: str | Path,
    sample_rate: int | None = None,
    codec: str = "pcm_s16le",
) -> Path:
    """Pull the first audio stream from *video_path* to a ``.wav`` in *output_dir*.

    Idempotent — if ``output_dir / {stem}.wav`` already exists, the
    existing file is returned without re-decoding.  This makes the
    helper safe to call repeatedly from parallel workers on overlapping
    inputs (the cache hit is the fast path).

    Parameters
    ----------
    video_path :
        Source video file (anything PyAV can open: mp4, mkv, webm, …).
    output_dir :
        Directory the ``.wav`` lands in.  Created if it doesn't exist.
    sample_rate :
        If set, resample to this rate.  Default (``None``) preserves the
        source rate — most audio metrics resample again internally to
        their own target rate, so adding a resample step here would just
        be wasted work.
    codec :
        PCM codec for the output.  ``pcm_s16le`` (default) gives 16-bit
        little-endian PCM, the most broadly compatible WAV format.

    Returns
    -------
    Path
        Absolute path to the extracted ``.wav``.

    Raises
    ------
    NoAudioStreamError
        If *video_path* has no audio stream.
    """
    import av

    in_path = Path(video_path)
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = (out_dir / f"{in_path.stem}.wav").resolve()
    if out_path.exists():
        return out_path

    container = av.open(str(in_path))
    try:
        audio_streams = [s for s in container.streams if s.type == "audio"]
        if not audio_streams:
            raise NoAudioStreamError(f"No audio stream in {in_path}")
        in_stream = audio_streams[0]
        target_rate = sample_rate or in_stream.rate

        out = av.open(str(out_path), "w", format="wav")
        try:
            out_stream = out.add_stream(codec, rate=target_rate)
            # Match input layout (mono / stereo / surround) — audio metrics
            # that want mono will downmix themselves.
            if in_stream.layout is not None:
                out_stream.layout = in_stream.layout
            resampler = None
            if sample_rate is not None and sample_rate != in_stream.rate:
                resampler = av.AudioResampler(format=out_stream.format, layout=out_stream.layout, rate=target_rate)
            for frame in container.decode(in_stream):
                frames_out = resampler.resample(frame) if resampler is not None else [frame]
                for f in frames_out:
                    # PyAV needs pts cleared so the encoder reassigns.
                    f.pts = None
                    for packet in out_stream.encode(f):
                        out.mux(packet)
            for packet in out_stream.encode():  # flush
                out.mux(packet)
        finally:
            out.close()
    finally:
        container.close()
    return out_path
fastvideo.eval.io.extract_frames
extract_frames(video: Tensor, n_frames: int | None = None) -> Tensor

Uniformly sample n_frames from a (T, C, H, W) video tensor.

Source code in fastvideo/eval/io/video.py
def extract_frames(video: torch.Tensor, n_frames: int | None = None) -> torch.Tensor:
    """Uniformly sample *n_frames* from a ``(T, C, H, W)`` video tensor."""
    if n_frames is None or n_frames >= video.shape[0]:
        return video
    indices = torch.linspace(0, video.shape[0] - 1, n_frames).long()
    return video[indices]
fastvideo.eval.io.glob_videos
glob_videos(videos_dir: Path, row: dict, ext: str = '.mp4') -> list[Path]

Find every generated video for row, sorted by trailing -<idx>.

Source code in fastvideo/eval/io/paths.py
def glob_videos(videos_dir: Path, row: dict, ext: str = ".mp4") -> list[Path]:
    """Find every generated video for *row*, sorted by trailing ``-<idx>``."""
    pattern = f"{sanitize_prompt(row['prompt'])}-*{ext}"
    files = list(videos_dir.glob(pattern))

    def _idx(p: Path) -> int:
        try:
            return int(p.stem.rsplit("-", 1)[1])
        except (IndexError, ValueError):
            return -1

    return sorted(files, key=_idx)
fastvideo.eval.io.load_video
load_video(source: str | Tensor | list, **kwargs) -> Tensor

Load a video as a (T, C, H, W) float32 tensor in [0, 1].

Supported source types:

  • str / Path – path to .mp4 / .avi / .gif file, or a directory of frame images (sorted alphabetically).
  • torch.Tensor – returned as-is after shape validation.
  • list[PIL.Image] – stacked into a tensor.
Source code in fastvideo/eval/io/video.py
def load_video(source: str | torch.Tensor | list, **kwargs) -> torch.Tensor:
    """Load a video as a ``(T, C, H, W)`` float32 tensor in ``[0, 1]``.

    Supported *source* types:

    * **str / Path** – path to ``.mp4`` / ``.avi`` / ``.gif`` file, or a
      directory of frame images (sorted alphabetically).
    * **torch.Tensor** – returned as-is after shape validation.
    * **list[PIL.Image]** – stacked into a tensor.
    """
    if isinstance(source, torch.Tensor):
        if source.ndim != 4:
            raise ValueError(f"Expected video tensor with 4 dims (T,C,H,W), got {source.ndim}")
        return source.float()

    if isinstance(source, list):
        frames = [_pil_to_tensor(img) for img in source]
        return torch.stack(frames)

    path = Path(source)
    if path.is_dir():
        return _load_frame_dir(path)
    return _load_video_file(str(path))
fastvideo.eval.io.samples_from
samples_from(*, video: PathSpec | None = None, reference: PathSpec | None = None, audio: PathSpec | None = None, reference_audio: PathSpec | None = None, text_prompt: str | None = None, text_prompts: str | Path | list[str] | None = None, fps: float | None = None, auxiliary_info: dict | list[dict] | None = None, extras: dict | list[dict] | None = None, extract_audio: bool | str | Path = False, extract_workers: int = 4) -> list[dict]

Build a samples list from path-style inputs.

Parameters

video, reference, audio, reference_audio : File path, directory of files (sorted by name), or any iterable of paths. Pass whichever modalities apply to the metrics you plan to run — they attach to sample["video"] / sample["reference"] / sample["audio"] / sample["reference_audio"] respectively. Video paths are wrapped in :class:Video so :class:VideoPool decodes them lazily in parallel; audio paths stay as strings (audio metrics each load with their own resample / preprocess). text_prompt : A single prompt string broadcast onto every sample. text_prompts : A list of strings (one per sample), or a path to a .jsonl / .json file containing per-sample prompts. fps : Scalar fps broadcast onto every sample. auxiliary_info : Single dict (broadcast) or list of dicts (zipped) for sample["auxiliary_info"] — vbench structured-prompt metrics read this. extras : Catch-all per-sample attachments. Use for metric-specific keys the dedicated kwargs don't cover (scenario, view, actions, calibration, reference_take2, ...). Pass a single dict to broadcast or a list-of-dicts to zip; the keys merge into each sample dict. extract_audio : If truthy, auto-extract audio from each video / reference source into .wav files via PyAV and attach the paths under sample["audio"] / sample["reference_audio"]. Pass a path for a persistent cache, True for a tempdir. Skipped silently for videos with no audio stream; ignored wherever audio / reference_audio is already explicit. extract_workers : Parallel workers for extract_audio.

Returns

list[dict] Canonical samples shape — hand directly to :meth:Evaluator.evaluate.

Cardinality and shape

Let N = len(generated inputs) (the agreed length of whichever of video / audio you passed). References are attached 1:1 onto the first N samples; any extras (when |ref| > N) become standalone role-tagged samples at the end of the list, so set metrics like FVD see the full reference corpus while per-sample paired metrics like LPIPS only run on the first N pairs.

Notes

"Missing" keys are simply absent from the sample dict. Metrics handle them per their own contract (sample.get(...) for optional, sample[...] raises for required, or :meth:BaseMetric._skip for opt-in skip behavior). One fat samples list with many keys can serve many metrics — each reads its subset.

Source code in fastvideo/eval/io/inputs.py
def samples_from(
    *,
    # Modality inputs — pass whichever you have, in any combination.
    video: PathSpec | None = None,
    reference: PathSpec | None = None,
    audio: PathSpec | None = None,
    reference_audio: PathSpec | None = None,
    # Per-sample attachments.  Scalars broadcast to every sample; lists
    # / jsonl paths zip per-sample.
    text_prompt: str | None = None,
    text_prompts: str | Path | list[str] | None = None,
    fps: float | None = None,
    auxiliary_info: dict | list[dict] | None = None,
    # Catch-all for exotic metric inputs (physics_iq scenarios,
    # synthetic_optical_flow actions, ...).  Single dict broadcasts;
    # list of dicts zips.  Keys merge into each sample dict.
    extras: dict | list[dict] | None = None,
    # Sugar: pull audio off the video sources via PyAV.  Pass a path
    # to use a persistent on-disk cache, ``True`` for a system tempdir.
    extract_audio: bool | str | Path = False,
    extract_workers: int = 4,
) -> list[dict]:
    """Build a samples list from path-style inputs.

    Parameters
    ----------
    video, reference, audio, reference_audio :
        File path, directory of files (sorted by name), or any iterable
        of paths.  Pass whichever modalities apply to the metrics you
        plan to run — they attach to ``sample["video"]`` /
        ``sample["reference"]`` / ``sample["audio"]`` /
        ``sample["reference_audio"]`` respectively.  Video paths are
        wrapped in :class:`Video` so :class:`VideoPool` decodes them
        lazily in parallel; audio paths stay as strings (audio metrics
        each load with their own resample / preprocess).
    text_prompt :
        A single prompt string broadcast onto every sample.
    text_prompts :
        A list of strings (one per sample), or a path to a ``.jsonl`` /
        ``.json`` file containing per-sample prompts.
    fps :
        Scalar fps broadcast onto every sample.
    auxiliary_info :
        Single dict (broadcast) or list of dicts (zipped) for
        ``sample["auxiliary_info"]`` — vbench structured-prompt
        metrics read this.
    extras :
        Catch-all per-sample attachments.  Use for metric-specific keys
        the dedicated kwargs don't cover (``scenario``, ``view``,
        ``actions``, ``calibration``, ``reference_take2``, ...).  Pass
        a single dict to broadcast or a list-of-dicts to zip; the keys
        merge into each sample dict.
    extract_audio :
        If truthy, auto-extract audio from each ``video`` /
        ``reference`` source into ``.wav`` files via PyAV and attach
        the paths under ``sample["audio"]`` / ``sample["reference_audio"]``.
        Pass a path for a persistent cache, ``True`` for a tempdir.
        Skipped silently for videos with no audio stream; ignored
        wherever ``audio`` / ``reference_audio`` is already explicit.
    extract_workers :
        Parallel workers for ``extract_audio``.

    Returns
    -------
    list[dict]
        Canonical samples shape — hand directly to
        :meth:`Evaluator.evaluate`.

    Cardinality and shape
    ---------------------
    Let ``N = len(generated inputs)`` (the agreed length of whichever
    of ``video`` / ``audio`` you passed).  References are attached
    1:1 onto the first N samples; any extras (when ``|ref| > N``)
    become standalone role-tagged samples at the end of the list, so
    set metrics like FVD see the full reference corpus while per-sample
    paired metrics like LPIPS only run on the first N pairs.

    Notes
    -----
    "Missing" keys are simply absent from the sample dict.  Metrics
    handle them per their own contract (``sample.get(...)`` for
    optional, ``sample[...]`` raises for required, or
    :meth:`BaseMetric._skip` for opt-in skip behavior).  One fat samples
    list with many keys can serve many metrics — each reads its subset.
    """
    gen_video = _expand(video, _VIDEO_EXTS) if video is not None else None
    gen_audio = _expand(audio, _AUDIO_EXTS) if audio is not None else None
    ref_video = _expand(reference, _VIDEO_EXTS) if reference is not None else None
    ref_audio = _expand(reference_audio, _AUDIO_EXTS) if reference_audio is not None else None

    if gen_video is None and gen_audio is None:
        raise ValueError("samples_from: pass at least one of video= or audio=")

    # All "generated" inputs must agree on length — they describe the
    # same N samples in different modalities.
    n_candidates = [len(x) for x in (gen_video, gen_audio) if x is not None]
    if len({*n_candidates}) != 1:
        raise ValueError(f"samples_from: generated inputs have inconsistent lengths {n_candidates}")
    n = n_candidates[0]

    prompts = _broadcast(text_prompts if text_prompts is not None else text_prompt, n, loader=_load_prompts)
    aux = _broadcast(auxiliary_info, n)
    extras_per_sample = _broadcast(extras, n)
    if (text_prompts is not None) and (text_prompt is not None):
        raise ValueError("Pass either text_prompt (broadcast) or text_prompts (per-sample), not both.")

    samples: list[dict[str, Any]] = []
    for i in range(n):
        s: dict[str, Any] = {}
        if gen_video is not None:
            s["video"] = as_video(gen_video[i])
        if gen_audio is not None:
            s["audio"] = str(gen_audio[i])
        if ref_video is not None and i < len(ref_video):
            s["reference"] = as_video(ref_video[i])
        if ref_audio is not None and i < len(ref_audio):
            s["reference_audio"] = str(ref_audio[i])
        if prompts is not None:
            s["text_prompt"] = prompts[i]
        if fps is not None:
            s["fps"] = fps
        if aux is not None:
            s["auxiliary_info"] = aux[i]
        if extras_per_sample is not None:
            s.update(extras_per_sample[i])
        samples.append(s)

    # Unmatched references → role-tagged set samples.  Per-sample paired
    # metrics skip these via the worker's role-skip rule; set metrics
    # (FVD, FAD) accumulate the features.
    if ref_video is not None and len(ref_video) > n:
        for v in ref_video[n:]:
            samples.append({"video": as_video(v), "role": "reference"})
    if ref_audio is not None and len(ref_audio) > n:
        for a in ref_audio[n:]:
            samples.append({"audio": str(a), "role": "reference"})

    if extract_audio:
        _attach_extracted_audio(samples, extract_audio, extract_workers)
    return samples
fastvideo.eval.io.sanitize_prompt
sanitize_prompt(prompt: str, max_len: int = 100) -> str

Prompt → safe filename stem.

Source code in fastvideo/eval/io/paths.py
def sanitize_prompt(prompt: str, max_len: int = 100) -> str:
    """Prompt → safe filename stem."""
    s = _INVALID_CHARS.sub("", prompt[:max_len]).strip().strip(".")
    return re.sub(r"\s+", " ", s) or "output"

Modules

fastvideo.eval.io.audio

Audio-extraction helper for video files.

Audio metrics (audio.clap_score, audio.frechet_distance, …) read file paths under sample["audio"] and do their own per-metric preprocessing (CLAP at 48 kHz, PaSST at 32 kHz, whisper at 16 kHz mono, …). When the source video carries an audio track (V2A / T2A-V model outputs), :func:samples_from(extract_audio=True) calls :func:extract_audio_track to pull a .wav next to it once per video.

Why a separate utility instead of in-pool decode: every audio metric wants a different sample rate / channel count / format, so the pool can't usefully pre-load audio into a single canonical tensor the way it pre-loads video frames. Paths-in / paths-out is the lingua franca for audio in this codebase.

Classes
fastvideo.eval.io.audio.NoAudioStreamError

Bases: ValueError

Raised when a video file has no audio stream to extract.

Functions
fastvideo.eval.io.audio.extract_audio_track
extract_audio_track(video_path: str | Path, *, output_dir: str | Path, sample_rate: int | None = None, codec: str = 'pcm_s16le') -> Path

Pull the first audio stream from video_path to a .wav in output_dir.

Idempotent — if output_dir / {stem}.wav already exists, the existing file is returned without re-decoding. This makes the helper safe to call repeatedly from parallel workers on overlapping inputs (the cache hit is the fast path).

Parameters

video_path : Source video file (anything PyAV can open: mp4, mkv, webm, …). output_dir : Directory the .wav lands in. Created if it doesn't exist. sample_rate : If set, resample to this rate. Default (None) preserves the source rate — most audio metrics resample again internally to their own target rate, so adding a resample step here would just be wasted work. codec : PCM codec for the output. pcm_s16le (default) gives 16-bit little-endian PCM, the most broadly compatible WAV format.

Returns

Path Absolute path to the extracted .wav.

Raises

NoAudioStreamError If video_path has no audio stream.

Source code in fastvideo/eval/io/audio.py
def extract_audio_track(
    video_path: str | Path,
    *,
    output_dir: str | Path,
    sample_rate: int | None = None,
    codec: str = "pcm_s16le",
) -> Path:
    """Pull the first audio stream from *video_path* to a ``.wav`` in *output_dir*.

    Idempotent — if ``output_dir / {stem}.wav`` already exists, the
    existing file is returned without re-decoding.  This makes the
    helper safe to call repeatedly from parallel workers on overlapping
    inputs (the cache hit is the fast path).

    Parameters
    ----------
    video_path :
        Source video file (anything PyAV can open: mp4, mkv, webm, …).
    output_dir :
        Directory the ``.wav`` lands in.  Created if it doesn't exist.
    sample_rate :
        If set, resample to this rate.  Default (``None``) preserves the
        source rate — most audio metrics resample again internally to
        their own target rate, so adding a resample step here would just
        be wasted work.
    codec :
        PCM codec for the output.  ``pcm_s16le`` (default) gives 16-bit
        little-endian PCM, the most broadly compatible WAV format.

    Returns
    -------
    Path
        Absolute path to the extracted ``.wav``.

    Raises
    ------
    NoAudioStreamError
        If *video_path* has no audio stream.
    """
    import av

    in_path = Path(video_path)
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = (out_dir / f"{in_path.stem}.wav").resolve()
    if out_path.exists():
        return out_path

    container = av.open(str(in_path))
    try:
        audio_streams = [s for s in container.streams if s.type == "audio"]
        if not audio_streams:
            raise NoAudioStreamError(f"No audio stream in {in_path}")
        in_stream = audio_streams[0]
        target_rate = sample_rate or in_stream.rate

        out = av.open(str(out_path), "w", format="wav")
        try:
            out_stream = out.add_stream(codec, rate=target_rate)
            # Match input layout (mono / stereo / surround) — audio metrics
            # that want mono will downmix themselves.
            if in_stream.layout is not None:
                out_stream.layout = in_stream.layout
            resampler = None
            if sample_rate is not None and sample_rate != in_stream.rate:
                resampler = av.AudioResampler(format=out_stream.format, layout=out_stream.layout, rate=target_rate)
            for frame in container.decode(in_stream):
                frames_out = resampler.resample(frame) if resampler is not None else [frame]
                for f in frames_out:
                    # PyAV needs pts cleared so the encoder reassigns.
                    f.pts = None
                    for packet in out_stream.encode(f):
                        out.mux(packet)
            for packet in out_stream.encode():  # flush
                out.mux(packet)
        finally:
            out.close()
    finally:
        container.close()
    return out_path
fastvideo.eval.io.inputs

Input-shape helpers: paths → samples list.

The :class:Evaluator and every metric consume the same internal representation — a list[dict] with one dict per sample. Building that list by hand is the largest source of ceremony in user scripts. :func:samples_from is a pure function that turns path-style inputs (generated videos / audio, optional paired references, optional per-sample prompts / fps / metadata) into the canonical samples list, ready to hand to :meth:Evaluator.evaluate.

Design rules:

  • No Evaluator state, no metric introspection, no I/O beyond reading the prompts file when given. Just dict assembly.
  • "Extra" keys on a sample are free — metrics each read what they need and ignore the rest. So one fat samples list naturally serves many metrics in one Evaluator.
  • No "primary modality" axis. Each modality is a named kwarg (video=, audio=); pass whichever you have. Both is fine.
  • No mode axis. The shape of the output is determined by cardinality: when |gen| == |ref| the samples list is pair-zipped; when |gen| < |ref| the unmatched references become role-tagged set samples for corpus-shaped metrics (FVD / FAD) without disturbing per-sample paired metrics (LPIPS / PSNR / SSIM / gt_optical_flow).
Classes
Functions
fastvideo.eval.io.inputs.as_video
as_video(x: str | Path | Tensor | Video) -> Video

Coerce path/tensor/Video → :class:Video for the pool to decode.

Path strings and :class:pathlib.Path become Video(source=str(x)); the pool then calls :func:load_video on first use. Tensors become Video(source=None, frames=x) — the pool sees .frames already populated and forwards untouched. :class:Video instances pass through.

Source code in fastvideo/eval/io/inputs.py
def as_video(x: str | Path | torch.Tensor | Video) -> Video:
    """Coerce path/tensor/Video → :class:`Video` for the pool to decode.

    Path strings and :class:`pathlib.Path` become ``Video(source=str(x))``;
    the pool then calls :func:`load_video` on first use.  Tensors become
    ``Video(source=None, frames=x)`` — the pool sees ``.frames`` already
    populated and forwards untouched.  :class:`Video` instances pass through.
    """
    if isinstance(x, Video):
        return x
    if isinstance(x, str | Path):
        return Video(source=str(x))
    if isinstance(x, torch.Tensor):
        return Video(source=None, frames=x)
    raise TypeError(f"Cannot coerce {type(x).__name__} to Video")
fastvideo.eval.io.inputs.samples_from
samples_from(*, video: PathSpec | None = None, reference: PathSpec | None = None, audio: PathSpec | None = None, reference_audio: PathSpec | None = None, text_prompt: str | None = None, text_prompts: str | Path | list[str] | None = None, fps: float | None = None, auxiliary_info: dict | list[dict] | None = None, extras: dict | list[dict] | None = None, extract_audio: bool | str | Path = False, extract_workers: int = 4) -> list[dict]

Build a samples list from path-style inputs.

Parameters

video, reference, audio, reference_audio : File path, directory of files (sorted by name), or any iterable of paths. Pass whichever modalities apply to the metrics you plan to run — they attach to sample["video"] / sample["reference"] / sample["audio"] / sample["reference_audio"] respectively. Video paths are wrapped in :class:Video so :class:VideoPool decodes them lazily in parallel; audio paths stay as strings (audio metrics each load with their own resample / preprocess). text_prompt : A single prompt string broadcast onto every sample. text_prompts : A list of strings (one per sample), or a path to a .jsonl / .json file containing per-sample prompts. fps : Scalar fps broadcast onto every sample. auxiliary_info : Single dict (broadcast) or list of dicts (zipped) for sample["auxiliary_info"] — vbench structured-prompt metrics read this. extras : Catch-all per-sample attachments. Use for metric-specific keys the dedicated kwargs don't cover (scenario, view, actions, calibration, reference_take2, ...). Pass a single dict to broadcast or a list-of-dicts to zip; the keys merge into each sample dict. extract_audio : If truthy, auto-extract audio from each video / reference source into .wav files via PyAV and attach the paths under sample["audio"] / sample["reference_audio"]. Pass a path for a persistent cache, True for a tempdir. Skipped silently for videos with no audio stream; ignored wherever audio / reference_audio is already explicit. extract_workers : Parallel workers for extract_audio.

Returns

list[dict] Canonical samples shape — hand directly to :meth:Evaluator.evaluate.

Cardinality and shape

Let N = len(generated inputs) (the agreed length of whichever of video / audio you passed). References are attached 1:1 onto the first N samples; any extras (when |ref| > N) become standalone role-tagged samples at the end of the list, so set metrics like FVD see the full reference corpus while per-sample paired metrics like LPIPS only run on the first N pairs.

Notes

"Missing" keys are simply absent from the sample dict. Metrics handle them per their own contract (sample.get(...) for optional, sample[...] raises for required, or :meth:BaseMetric._skip for opt-in skip behavior). One fat samples list with many keys can serve many metrics — each reads its subset.

Source code in fastvideo/eval/io/inputs.py
def samples_from(
    *,
    # Modality inputs — pass whichever you have, in any combination.
    video: PathSpec | None = None,
    reference: PathSpec | None = None,
    audio: PathSpec | None = None,
    reference_audio: PathSpec | None = None,
    # Per-sample attachments.  Scalars broadcast to every sample; lists
    # / jsonl paths zip per-sample.
    text_prompt: str | None = None,
    text_prompts: str | Path | list[str] | None = None,
    fps: float | None = None,
    auxiliary_info: dict | list[dict] | None = None,
    # Catch-all for exotic metric inputs (physics_iq scenarios,
    # synthetic_optical_flow actions, ...).  Single dict broadcasts;
    # list of dicts zips.  Keys merge into each sample dict.
    extras: dict | list[dict] | None = None,
    # Sugar: pull audio off the video sources via PyAV.  Pass a path
    # to use a persistent on-disk cache, ``True`` for a system tempdir.
    extract_audio: bool | str | Path = False,
    extract_workers: int = 4,
) -> list[dict]:
    """Build a samples list from path-style inputs.

    Parameters
    ----------
    video, reference, audio, reference_audio :
        File path, directory of files (sorted by name), or any iterable
        of paths.  Pass whichever modalities apply to the metrics you
        plan to run — they attach to ``sample["video"]`` /
        ``sample["reference"]`` / ``sample["audio"]`` /
        ``sample["reference_audio"]`` respectively.  Video paths are
        wrapped in :class:`Video` so :class:`VideoPool` decodes them
        lazily in parallel; audio paths stay as strings (audio metrics
        each load with their own resample / preprocess).
    text_prompt :
        A single prompt string broadcast onto every sample.
    text_prompts :
        A list of strings (one per sample), or a path to a ``.jsonl`` /
        ``.json`` file containing per-sample prompts.
    fps :
        Scalar fps broadcast onto every sample.
    auxiliary_info :
        Single dict (broadcast) or list of dicts (zipped) for
        ``sample["auxiliary_info"]`` — vbench structured-prompt
        metrics read this.
    extras :
        Catch-all per-sample attachments.  Use for metric-specific keys
        the dedicated kwargs don't cover (``scenario``, ``view``,
        ``actions``, ``calibration``, ``reference_take2``, ...).  Pass
        a single dict to broadcast or a list-of-dicts to zip; the keys
        merge into each sample dict.
    extract_audio :
        If truthy, auto-extract audio from each ``video`` /
        ``reference`` source into ``.wav`` files via PyAV and attach
        the paths under ``sample["audio"]`` / ``sample["reference_audio"]``.
        Pass a path for a persistent cache, ``True`` for a tempdir.
        Skipped silently for videos with no audio stream; ignored
        wherever ``audio`` / ``reference_audio`` is already explicit.
    extract_workers :
        Parallel workers for ``extract_audio``.

    Returns
    -------
    list[dict]
        Canonical samples shape — hand directly to
        :meth:`Evaluator.evaluate`.

    Cardinality and shape
    ---------------------
    Let ``N = len(generated inputs)`` (the agreed length of whichever
    of ``video`` / ``audio`` you passed).  References are attached
    1:1 onto the first N samples; any extras (when ``|ref| > N``)
    become standalone role-tagged samples at the end of the list, so
    set metrics like FVD see the full reference corpus while per-sample
    paired metrics like LPIPS only run on the first N pairs.

    Notes
    -----
    "Missing" keys are simply absent from the sample dict.  Metrics
    handle them per their own contract (``sample.get(...)`` for
    optional, ``sample[...]`` raises for required, or
    :meth:`BaseMetric._skip` for opt-in skip behavior).  One fat samples
    list with many keys can serve many metrics — each reads its subset.
    """
    gen_video = _expand(video, _VIDEO_EXTS) if video is not None else None
    gen_audio = _expand(audio, _AUDIO_EXTS) if audio is not None else None
    ref_video = _expand(reference, _VIDEO_EXTS) if reference is not None else None
    ref_audio = _expand(reference_audio, _AUDIO_EXTS) if reference_audio is not None else None

    if gen_video is None and gen_audio is None:
        raise ValueError("samples_from: pass at least one of video= or audio=")

    # All "generated" inputs must agree on length — they describe the
    # same N samples in different modalities.
    n_candidates = [len(x) for x in (gen_video, gen_audio) if x is not None]
    if len({*n_candidates}) != 1:
        raise ValueError(f"samples_from: generated inputs have inconsistent lengths {n_candidates}")
    n = n_candidates[0]

    prompts = _broadcast(text_prompts if text_prompts is not None else text_prompt, n, loader=_load_prompts)
    aux = _broadcast(auxiliary_info, n)
    extras_per_sample = _broadcast(extras, n)
    if (text_prompts is not None) and (text_prompt is not None):
        raise ValueError("Pass either text_prompt (broadcast) or text_prompts (per-sample), not both.")

    samples: list[dict[str, Any]] = []
    for i in range(n):
        s: dict[str, Any] = {}
        if gen_video is not None:
            s["video"] = as_video(gen_video[i])
        if gen_audio is not None:
            s["audio"] = str(gen_audio[i])
        if ref_video is not None and i < len(ref_video):
            s["reference"] = as_video(ref_video[i])
        if ref_audio is not None and i < len(ref_audio):
            s["reference_audio"] = str(ref_audio[i])
        if prompts is not None:
            s["text_prompt"] = prompts[i]
        if fps is not None:
            s["fps"] = fps
        if aux is not None:
            s["auxiliary_info"] = aux[i]
        if extras_per_sample is not None:
            s.update(extras_per_sample[i])
        samples.append(s)

    # Unmatched references → role-tagged set samples.  Per-sample paired
    # metrics skip these via the worker's role-skip rule; set metrics
    # (FVD, FAD) accumulate the features.
    if ref_video is not None and len(ref_video) > n:
        for v in ref_video[n:]:
            samples.append({"video": as_video(v), "role": "reference"})
    if ref_audio is not None and len(ref_audio) > n:
        for a in ref_audio[n:]:
            samples.append({"audio": str(a), "role": "reference"})

    if extract_audio:
        _attach_extracted_audio(samples, extract_audio, extract_workers)
    return samples
fastvideo.eval.io.paths

Filesystem helpers shared by eval scripts.

Provides the prompt-sanitization, default filename convention, and (row, video_path) → eval-kwargs builder. Free functions, not a class — :class:fastvideo.eval.Evaluator is the only stateful object in the eval surface; loops live in user scripts.

Functions
fastvideo.eval.io.paths.build_eval_kwargs
build_eval_kwargs(row: dict, video_path: Path, *, fps: float = 24.0) -> dict[str, Any]

Build evaluator kwargs from a sample row + a video on disk.

Loads the video as (T,C,H,W) and adds the leading batch dim. Forwards prompt (as scalar text_prompt) and auxiliary_info (as scalar dict) when present on the row — matches the one-sample-per-call contract that the evaluator and every metric assume.

Source code in fastvideo/eval/io/paths.py
def build_eval_kwargs(row: dict, video_path: Path, *, fps: float = 24.0) -> dict[str, Any]:
    """Build evaluator kwargs from a sample row + a video on disk.

    Loads the video as ``(T,C,H,W)`` and adds the leading batch dim.
    Forwards ``prompt`` (as scalar ``text_prompt``) and
    ``auxiliary_info`` (as scalar dict) when present on the row —
    matches the one-sample-per-call contract that the evaluator and
    every metric assume.
    """
    from fastvideo.eval.io.video import load_video

    video = load_video(str(video_path))  # (T, C, H, W) in [0, 1]
    kwargs: dict[str, Any] = {
        "video": video.unsqueeze(0),  # (1, T, C, H, W)
        "fps": fps,
    }
    if "prompt" in row:
        kwargs["text_prompt"] = row["prompt"]
    aux = row.get("auxiliary_info")
    if aux:
        kwargs["auxiliary_info"] = aux
    return kwargs
fastvideo.eval.io.paths.default_filename
default_filename(row: dict, idx: int, ext: str = '.mp4') -> str

<sanitized-prompt>-<idx>.mp4 — VBench-style.

Source code in fastvideo/eval/io/paths.py
def default_filename(row: dict, idx: int, ext: str = ".mp4") -> str:
    """``<sanitized-prompt>-<idx>.mp4`` — VBench-style."""
    return f"{sanitize_prompt(row['prompt'])}-{idx}{ext}"
fastvideo.eval.io.paths.glob_videos
glob_videos(videos_dir: Path, row: dict, ext: str = '.mp4') -> list[Path]

Find every generated video for row, sorted by trailing -<idx>.

Source code in fastvideo/eval/io/paths.py
def glob_videos(videos_dir: Path, row: dict, ext: str = ".mp4") -> list[Path]:
    """Find every generated video for *row*, sorted by trailing ``-<idx>``."""
    pattern = f"{sanitize_prompt(row['prompt'])}-*{ext}"
    files = list(videos_dir.glob(pattern))

    def _idx(p: Path) -> int:
        try:
            return int(p.stem.rsplit("-", 1)[1])
        except (IndexError, ValueError):
            return -1

    return sorted(files, key=_idx)
fastvideo.eval.io.paths.sanitize_prompt
sanitize_prompt(prompt: str, max_len: int = 100) -> str

Prompt → safe filename stem.

Source code in fastvideo/eval/io/paths.py
def sanitize_prompt(prompt: str, max_len: int = 100) -> str:
    """Prompt → safe filename stem."""
    s = _INVALID_CHARS.sub("", prompt[:max_len]).strip().strip(".")
    return re.sub(r"\s+", " ", s) or "output"
fastvideo.eval.io.video
Functions
fastvideo.eval.io.video.extract_frames
extract_frames(video: Tensor, n_frames: int | None = None) -> Tensor

Uniformly sample n_frames from a (T, C, H, W) video tensor.

Source code in fastvideo/eval/io/video.py
def extract_frames(video: torch.Tensor, n_frames: int | None = None) -> torch.Tensor:
    """Uniformly sample *n_frames* from a ``(T, C, H, W)`` video tensor."""
    if n_frames is None or n_frames >= video.shape[0]:
        return video
    indices = torch.linspace(0, video.shape[0] - 1, n_frames).long()
    return video[indices]
fastvideo.eval.io.video.load_video
load_video(source: str | Tensor | list, **kwargs) -> Tensor

Load a video as a (T, C, H, W) float32 tensor in [0, 1].

Supported source types:

  • str / Path – path to .mp4 / .avi / .gif file, or a directory of frame images (sorted alphabetically).
  • torch.Tensor – returned as-is after shape validation.
  • list[PIL.Image] – stacked into a tensor.
Source code in fastvideo/eval/io/video.py
def load_video(source: str | torch.Tensor | list, **kwargs) -> torch.Tensor:
    """Load a video as a ``(T, C, H, W)`` float32 tensor in ``[0, 1]``.

    Supported *source* types:

    * **str / Path** – path to ``.mp4`` / ``.avi`` / ``.gif`` file, or a
      directory of frame images (sorted alphabetically).
    * **torch.Tensor** – returned as-is after shape validation.
    * **list[PIL.Image]** – stacked into a tensor.
    """
    if isinstance(source, torch.Tensor):
        if source.ndim != 4:
            raise ValueError(f"Expected video tensor with 4 dims (T,C,H,W), got {source.ndim}")
        return source.float()

    if isinstance(source, list):
        frames = [_pil_to_tensor(img) for img in source]
        return torch.stack(frames)

    path = Path(source)
    if path.is_dir():
        return _load_frame_dir(path)
    return _load_video_file(str(path))

fastvideo.eval.memory

Small GPU-memory helper used by :class:EvalWorker.

Functions

fastvideo.eval.memory.clear_cache
clear_cache() -> None

Free GPU cache + run garbage collection.

Source code in fastvideo/eval/memory.py
def clear_cache() -> None:
    """Free GPU cache + run garbage collection."""
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()

fastvideo.eval.metrics

Auto-discover and register all built-in metrics (recursive).

Walks the metrics package tree and imports any leaf-package's metric module so @register decorators fire. Path components starting with _ are skipped (used for shared helpers like optical_flow/_shared.py and vbench/_grit_helper.py).

Modules

fastvideo.eval.metrics.audio
Modules
fastvideo.eval.metrics.audio.audiobox_aesthetics
Modules
fastvideo.eval.metrics.audio.audiobox_aesthetics.metric

AudioBox Aesthetics (CE, CU, PC, PQ).

Thin wrapper around Meta's audiobox_aesthetics predictor. Returns four per-clip dimensions (CE — Content Enjoyment, CU — Content Usefulness, PC — Production Complexity, PQ — Production Quality); score exposes PQ, the dimension V2A papers typically report on. The remaining three are surfaced under details.

The earlier Verse-Bench combined score (CE + CU + PQ + (11 − PC)) / 4 is non-standard and is deliberately not used.

Classes
fastvideo.eval.metrics.audio.audiobox_aesthetics.metric.AudioBoxAestheticsMetric
AudioBoxAestheticsMetric()

Bases: BaseMetric

AudioBox Aesthetics: PQ as the primary score, CE/CU/PC/PQ in details.

Source code in fastvideo/eval/metrics/audio/audiobox_aesthetics/metric.py
def __init__(self) -> None:
    super().__init__()
    self._predictor: Any = None
Functions
fastvideo.eval.metrics.audio.clap_score
Modules
fastvideo.eval.metrics.audio.clap_score.metric

CLAP Score (CS) — text-audio cosine similarity.

Uses HuggingFace transformers.ClapModel with the laion/clap-htsat-fused checkpoint, the closest HF mirror of the 630k-audioset-fusion-best.pt weights that hkchengrex/av-benchmark and the V2A literature use. The fully byte-exact comparison still requires the laion_clap pip package (which pins numpy<2 and forces a fastvideo-wide downgrade — so we don't depend on it). Numbers from this metric are in the same ballpark as av-benchmark's LAION_CLAP.

Classes
fastvideo.eval.metrics.audio.clap_score.metric.ClapScoreMetric
ClapScoreMetric(model_name: str = DEFAULT_CLAP_REPO)

Bases: BaseMetric

CLAP score — text-audio cosine similarity via HF CLAP.

Source code in fastvideo/eval/metrics/audio/clap_score/metric.py
def __init__(self, model_name: str = DEFAULT_CLAP_REPO) -> None:
    super().__init__()
    self._model_name = model_name
    self._model: Any = None
    self._processor: Any = None
Functions
fastvideo.eval.metrics.audio.desync
Modules
fastvideo.eval.metrics.audio.desync.metric

Synchformer DeSync — average audio-video desynchronization in seconds.

Per-sample. Ports hkchengrex/av-benchmark's DeSync against the 24-01-04T16-39-21 Synchformer checkpoint (the one MMAudio reports against). argmax over the 21-class grid in [-2, +2] seconds for the first 14 and last 14 segments separately; lower is better, 0 = ideal.

Synchformer's transformer carries a fixed positional embedding sized for ~14 segments per modality, so keep input clips in the 3-10 s range — shorter clips raise in _segment_video and longer clips ignore the middle.

Classes
fastvideo.eval.metrics.audio.desync.metric.DeSyncMetric
DeSyncMetric(src_fps: float | None = None)

Bases: BaseMetric

Per-sample Synchformer DeSync magnitude in seconds (lower is better).

Source code in fastvideo/eval/metrics/audio/desync/metric.py
def __init__(self, src_fps: float | None = None) -> None:
    super().__init__()
    # Defaults to 25 fps (Synchformer's target) when the pool's decode fps is unknown.
    self._src_fps = src_fps
    self._model: Any = None
    self._mel: Any = None
    self._grid: torch.Tensor | None = None
    self._resample_cache: dict[int, Any] = {}
Functions
fastvideo.eval.metrics.audio.frechet_distance
Modules
fastvideo.eval.metrics.audio.frechet_distance.metric

Frechet Audio Distance over PaSST embeddings (FD_PaSST).

Corpus-vs-corpus Fréchet distance between Gaussian moments of two PaSST 768-d embedding sets. Ports av_bench.metrics.fad.compute_fd 1:1.

References are supplied per-sample via reference_audio / role="reference", or once from a .pt cache at $FASTVIDEO_FAD_REF_FEATURES. Skips when either side has fewer than two finite embeddings.

Classes
fastvideo.eval.metrics.audio.frechet_distance.metric.FrechetAudioDistanceMetric
FrechetAudioDistanceMetric()

Bases: BaseMetric

Corpus-vs-corpus Frechet Audio Distance with PaSST embeddings.

Source code in fastvideo/eval/metrics/audio/frechet_distance/metric.py
def __init__(self) -> None:
    super().__init__()
    self._model: Any = None
    self._gen_buf: list[np.ndarray] = []
    self._ref_buf: list[np.ndarray] = []
    self._cached_ref_path: str | None = os.environ.get(REF_FEATURES_ENV)
    self._cached_ref_mu: np.ndarray | None = None
    self._cached_ref_sigma: np.ndarray | None = None
    self._n_cached_ref: int = 0
Functions
fastvideo.eval.metrics.audio.imagebind_score
Modules
fastvideo.eval.metrics.audio.imagebind_score.metric

ImageBind audio↔video cosine similarity (IB-Score).

Per-sample. Reads sample["video_path"] (or sample["video"].source for a :class:Video wrapper) and sample["audio"]; ImageBind decodes its own clips so the path is required, not the pool-decoded tensor.

Classes
fastvideo.eval.metrics.audio.imagebind_score.metric.ImageBindScoreMetric
ImageBindScoreMetric()

Bases: BaseMetric

ImageBind audio↔video cosine similarity, per-sample.

Source code in fastvideo/eval/metrics/audio/imagebind_score/metric.py
def __init__(self) -> None:
    super().__init__()
    self._model: Any = None
Functions
fastvideo.eval.metrics.audio.kl_divergence
Modules
fastvideo.eval.metrics.audio.kl_divergence.metric

PaSST KL divergence KL(gt || pred) on AudioSet-527 logits.

Ports av_bench.metrics.kl.compute_kl 1:1. The primary score is the softmax variant (reported as "MKL" / "KL_PaSST" by V2A papers); the sigmoid variant is exposed in details["kl_sigmoid"].

Classes
fastvideo.eval.metrics.audio.kl_divergence.metric.KLDivergenceMetric
KLDivergenceMetric()

Bases: BaseMetric

PaSST KL divergence KL(gt || pred) on AudioSet-527 logits.

Per-sample. Requires sample["audio"] (generated) and sample["reference_audio"] (ground truth).

Source code in fastvideo/eval/metrics/audio/kl_divergence/metric.py
def __init__(self) -> None:
    super().__init__()
    self._model: Any = None
Functions
fastvideo.eval.metrics.audio.wer
Modules
fastvideo.eval.metrics.audio.wer.metric

Word Error Rate for generated audio.

Per-sample. Default backend is Whisper-base; glm_asr (vendored under third_party/eval/glmasr/) and sensevoice (FunASR, install-on- demand) are selectable via asr_backend=. Text is normalized per the MagiHuman convention and CJK inputs are scored at the character level.

Classes
fastvideo.eval.metrics.audio.wer.metric.WERMetric
WERMetric(asr_backend: str = 'whisper', model_name: str | None = None, instruction: str = 'Please transcribe this audio into text')

Bases: BaseMetric

WER metric with selectable ASR backend.

Sample fields: audio, reference_text (or text_prompt), optional language. instruction is passed to GLM-ASR.

Source code in fastvideo/eval/metrics/audio/wer/metric.py
def __init__(
    self,
    asr_backend: str = "whisper",
    model_name: str | None = None,
    instruction: str = "Please transcribe this audio into text",
) -> None:
    super().__init__()
    if asr_backend not in ("whisper", "glm_asr", "sensevoice"):
        raise ValueError(f"Unknown ASR backend '{asr_backend}'. "
                         "Supported: ['whisper', 'glm_asr', 'sensevoice']")
    self._asr_backend = asr_backend
    self._model_name = model_name
    self._instruction = instruction
    self._model: Any = None
    self._processor: Any = None
Functions
fastvideo.eval.metrics.base
Classes
fastvideo.eval.metrics.base.BaseMetric
BaseMetric()

Abstract base class for all eval metrics.

Two execution shapes:

  • Per-sample (is_set_metric=False, default) — implement :meth:compute. The Evaluator calls it once per input sample and returns one :class:MetricResult per sample.
  • Set-vs-set (is_set_metric=True) — implement :meth:accumulate (called once per sample to buffer features) and :meth:finalize (called once after all samples to compute the corpus-level result). Use :meth:reset to clear buffers and :meth:merge_from to fold multi-GPU per-worker state together.

Optionally override :meth:setup to eagerly load models. Metrics that chunk along the time dim for memory hardcode their own chunk size in __init__ (see optical_flow for the canonical example). Eval always processes one video per :meth:Evaluator.evaluate call; compute / accumulate receive a single sample, not a batch.

Source code in fastvideo/eval/metrics/base.py
def __init__(self) -> None:
    self._device: torch.device = torch.device("cpu")
Functions
fastvideo.eval.metrics.base.BaseMetric.accumulate
accumulate(sample: dict) -> None

Buffer per-sample features for a corpus-level metric.

Source code in fastvideo/eval/metrics/base.py
def accumulate(self, sample: dict) -> None:
    """Buffer per-sample features for a corpus-level metric."""
    raise NotImplementedError(f"{type(self).__name__}.accumulate is not implemented")
fastvideo.eval.metrics.base.BaseMetric.compute
compute(sample: dict) -> MetricResult

Per-sample metrics: compute the score for one sample.

sample["video"] is (T, C, H, W) float in [0, 1]. sample["reference"] (if used) has the same shape. Return self._skip(sample, reason) for missing inputs.

Source code in fastvideo/eval/metrics/base.py
def compute(self, sample: dict) -> MetricResult:
    """Per-sample metrics: compute the score for one sample.

    ``sample["video"]`` is ``(T, C, H, W)`` float in ``[0, 1]``.
    ``sample["reference"]`` (if used) has the same shape. Return
    ``self._skip(sample, reason)`` for missing inputs.
    """
    raise NotImplementedError(f"{type(self).__name__}.compute is not implemented")
fastvideo.eval.metrics.base.BaseMetric.finalize
finalize() -> MetricResult

Compute the corpus-level result from buffered state.

Source code in fastvideo/eval/metrics/base.py
def finalize(self) -> MetricResult:
    """Compute the corpus-level result from buffered state."""
    raise NotImplementedError(f"{type(self).__name__}.finalize is not implemented")
fastvideo.eval.metrics.base.BaseMetric.merge_from
merge_from(other: BaseMetric) -> None

Multi-GPU: fold another worker's accumulator state into this one.

Source code in fastvideo/eval/metrics/base.py
def merge_from(self, other: BaseMetric) -> None:  # noqa: B027 - intentionally optional override
    """Multi-GPU: fold another worker's accumulator state into this one."""
fastvideo.eval.metrics.base.BaseMetric.reset
reset() -> None

Clear accumulator state at the start of each evaluate() call.

Source code in fastvideo/eval/metrics/base.py
def reset(self) -> None:  # noqa: B027 - intentionally optional override
    """Clear accumulator state at the start of each evaluate() call."""
fastvideo.eval.metrics.base.BaseMetric.setup
setup() -> None

Eagerly load models. Called once by :class:EvalWorker.

Default is a no-op; metrics with no eager state (pixel math, closed-form ops) inherit this. Override only if your metric needs to load weights.

Source code in fastvideo/eval/metrics/base.py
def setup(self) -> None:  # noqa: B027 - intentionally optional override
    """Eagerly load models. Called once by :class:`EvalWorker`.

    Default is a no-op; metrics with no eager state (pixel math,
    closed-form ops) inherit this. Override only if your metric
    needs to load weights.
    """
fastvideo.eval.metrics.base.BaseMetric.to
to(device: str | device) -> BaseMetric

Move metric (and its internal models) to device.

Source code in fastvideo/eval/metrics/base.py
def to(self, device: str | torch.device) -> BaseMetric:
    """Move metric (and its internal models) to *device*."""
    self._device = torch.device(device)
    return self
fastvideo.eval.metrics.common
Modules
fastvideo.eval.metrics.common.fvd
Modules
fastvideo.eval.metrics.common.fvd.extractors

Pluggable video feature extractors for FVD.

Three extractors, sharing the _BaseExtractor contract:

  • i3d — Kinetics-400 I3D (TorchScript, flateon/FVD-I3D-torchscript). The standard FVD feature space used in the literature.
  • clip — CLIP ViT-B/32 per-frame embeddings, mean-pooled over time. Captures semantic / content quality.
  • videomae — VideoMAE-base last-hidden-state, mean-pooled over patch tokens. Captures structural / motion quality.

The contract is intentionally narrow: each extractor takes a (B, T, C, H, W) float tensor in [0, 1] and returns (B, D) numpy features. Preprocessing (resize, normalize, layout) is the extractor's job; its callers should not care.

Functions
fastvideo.eval.metrics.common.fvd.extractors.load_extractor
load_extractor(name: str, device: device) -> _BaseExtractor

Instantiate the named extractor on device. Raises ValueError on unknown names.

Source code in fastvideo/eval/metrics/common/fvd/extractors.py
def load_extractor(name: str, device: torch.device) -> _BaseExtractor:
    """Instantiate the named extractor on *device*. Raises ``ValueError`` on unknown names."""
    cls = _EXTRACTORS.get(name)
    if cls is None:
        raise ValueError(f"Unknown FVD extractor '{name}'. Available: {available_extractors()}")
    return cls(device)
fastvideo.eval.metrics.common.fvd.metric

common.fvd — Fréchet Video Distance.

Measures distributional similarity between generated and reference videos using a pluggable feature backbone (I3D / CLIP / VideoMAE). Lower score → generated videos are closer to the real video distribution.

FVD is a dataset-level metric — it compares the distribution of a collection of videos rather than scoring each video individually. For statistically reliable results at least 256 videos are recommended; the standard protocol uses 2048.

This metric follows the set-vs-set protocol (is_set_metric=True):

- :meth:accumulate is called once per video to buffer features. Routes on sample["role"]"reference" → real buffer, anything else → generated buffer. This is the same convention :class:audio.frechet_distance.FrechetAudioDistanceMetric uses. - :meth:finalize is called once after all videos to compute FVD. - :meth:reset clears both buffers between evaluation runs.

To score two corpora, build a samples list with :func:fastvideo.eval.io.inputs.samples_from and hand it to :meth:Evaluator.evaluate — the path-friendly form is::

from fastvideo.eval import create_evaluator, samples_from
ev = create_evaluator(metrics=["common.fvd"], device="cuda:0")
result = ev.evaluate(samples=samples_from(
    video="gen/", reference="ref/",
)).corpus["common.fvd"]
Extractors
  • i3d (default) — Kinetics-400 I3D, the standard FVD feature space used in the literature.
  • clip — CLIP ViT-B/32 per-frame embeds, mean-pooled over time. Captures semantic / content quality.
  • videomae — VideoMAE-base last-hidden-state, mean-pooled over patches. Captures structural / motion quality.

CLIP and VideoMAE are research-grade and not directly comparable to FVD scores from the literature; use i3d for paper comparisons.

Reference features

Two ways to supply reference features:

  1. Stream: pass reference= to :func:samples_from. Equal cardinality emits paired samples (sample["video"] + sample["reference"]); unequal cardinality role-tags the unmatched references. accumulate routes both shapes correctly. If cache_mode="read_write" (default) and no cache exists, the extracted features are persisted to cache_path on :meth:finalize.
  2. Cache hit: a prior run wrote cache_path; :meth:setup loads it automatically and streamed references are unnecessary.

Streamed references always win over the cache when both are present.

Cache resolution order (first match wins): 1. cache_path= constructor kwarg, if set. 2. $FASTVIDEO_FVD_REF_FEATURES, if set. 3. ${FASTVIDEO_EVAL_CACHE}/fvd/real_features_{extractor}.pt (default).

The env-var override mirrors audio.frechet_distance's FASTVIDEO_FAD_REF_FEATURES pattern.

Classes
fastvideo.eval.metrics.common.fvd.metric.FVDMetric
FVDMetric(extractor: str = 'i3d', cache_path: str | None = None, cache_mode: str = 'read_write', chunk_size: int = 32)

Bases: BaseMetric

Fréchet Video Distance (FVD) over a pluggable feature backbone.

Set-vs-set metric (is_set_metric=True). The Evaluator calls :meth:accumulate once per video and :meth:finalize once after all videos have been processed.

For meaningful scores, evaluate over ≥ 256 videos (2048 is the standard protocol used in the literature).

Parameters

extractor : str Feature backbone name. One of "i3d" (default), "clip", "videomae". See module docstring for guidance. cache_path : str, optional Where extracted reference features are cached. Resolution order: constructor kwarg → $FASTVIDEO_FVD_REF_FEATURES env-var → ${FASTVIDEO_EVAL_CACHE}/fvd/real_features_{extractor}.pt (default, resolved at :meth:setup time so the env-var can be set after import). cache_mode : str "read_write" (default) — read at setup, write at finalize on cache miss. "read" — read at setup, never write. "off" — ignore the cache entirely. Auto-write happens only when the cache was missing AND reference features streamed in this run. chunk_size : int Videos per forward pass. Reduce if GPU runs OOM.

Source code in fastvideo/eval/metrics/common/fvd/metric.py
def __init__(
    self,
    extractor: str = "i3d",
    cache_path: str | None = None,
    cache_mode: str = "read_write",
    chunk_size: int = 32,
) -> None:
    super().__init__()
    if extractor not in available_extractors():
        raise ValueError(f"Unknown FVD extractor '{extractor}'. "
                         f"Available: {available_extractors()}")
    if cache_mode not in {"off", "read", "read_write"}:
        raise ValueError(f"cache_mode must be one of 'off', 'read', 'read_write'; got {cache_mode!r}")
    self._extractor_name = extractor
    self._cache_path_arg = cache_path
    self._cache_mode = cache_mode
    self._chunk = chunk_size
    self._extractor: _BaseExtractor | None = None

    # Accumulated feature buffers — cleared by reset()
    self._gen_buf: list[np.ndarray] = []
    self._real_buf: list[np.ndarray] = []
    # Cache loaded at setup() time; survives reset() (the cached corpus
    # is part of the metric's configured state, not its run state).
    self._cached_real: np.ndarray | None = None
Functions
fastvideo.eval.metrics.common.fvd.metric.FVDMetric.accumulate
accumulate(sample: dict) -> None

Extract features from one video and route by sample["role"].

Routing rules:

  • role="reference" → real buffer; the reference key is ignored.
  • Otherwise → generated buffer. If sample["reference"] is also present, its features also go into the real buffer — this lets a paired samples list (the shape per-sample metrics like LPIPS need) feed FVD in the same pass without a second decode.

Accepts either a raw tensor or a populated :class:Video instance under either key.

Source code in fastvideo/eval/metrics/common/fvd/metric.py
def accumulate(self, sample: dict) -> None:
    """Extract features from one video and route by ``sample["role"]``.

    Routing rules:

    * ``role="reference"`` → real buffer; the reference key is ignored.
    * Otherwise → generated buffer.  If ``sample["reference"]`` is
      also present, its features also go into the real buffer — this
      lets a paired samples list (the shape per-sample metrics like
      LPIPS need) feed FVD in the same pass without a second decode.

    Accepts either a raw tensor or a populated :class:`Video` instance
    under either key.
    """
    if self._extractor is None:
        self.setup()
    assert self._extractor is not None  # for type narrowing

    if sample.get("role") == "reference":
        self._real_buf.append(self._extract(sample["video"]))
        return

    self._gen_buf.append(self._extract(sample["video"]))
    if "reference" in sample:
        self._real_buf.append(self._extract(sample["reference"]))
fastvideo.eval.metrics.common.fvd.metric.FVDMetric.finalize
finalize() -> MetricResult

Compute FVD from buffered generated features vs. reference features.

Reference source priority: streamed (_real_buf) > disk cache (_cached_real). On a fresh cache + streamed refs, write the accumulated real features back to disk if cache_mode="read_write".

Source code in fastvideo/eval/metrics/common/fvd/metric.py
def finalize(self) -> MetricResult:
    """Compute FVD from buffered generated features vs. reference features.

    Reference source priority: streamed (``_real_buf``) > disk cache
    (``_cached_real``).  On a fresh cache + streamed refs, write the
    accumulated real features back to disk if ``cache_mode="read_write"``.
    """
    if not self._gen_buf:
        return MetricResult(
            name=self.name,
            score=None,
            details={"skipped": "No generated videos accumulated before finalize()."},
        )
    all_gen = np.concatenate(self._gen_buf, axis=0)

    if self._real_buf:
        all_real = np.concatenate(self._real_buf, axis=0)
        ref_source = "streamed"
        # Persist to cache only when we built real features fresh and
        # the cache slot was empty — never silently overwrite an
        # existing cache.
        if self._cache_mode == "read_write" and self._cached_real is None:
            self._save_cache(all_real)
    elif self._cached_real is not None:
        all_real = self._cached_real
        ref_source = "cached"
    else:
        return MetricResult(
            name=self.name,
            score=None,
            details={
                "skipped": ("No reference features available. Pass role='reference' samples "
                            "(or a paired samples list with 'reference' key) to "
                            f"Evaluator.evaluate(), or pre-build the cache at: {self.cache_path}")
            },
        )

    n_gen = len(all_gen)
    n_real = len(all_real)

    if n_gen < _MIN_VIDEOS_WARN or n_real < _MIN_VIDEOS_WARN:
        warnings.warn(
            f"FVD computed with only {n_gen} generated and {n_real} real videos. "
            f"At least {_MIN_VIDEOS_WARN} recommended (standard protocol: 2048). "
            "Score may not be statistically reliable.",
            stacklevel=2,
        )

    mu_gen, sigma_gen = _gaussian_params(all_gen)
    mu_real, sigma_real = _gaussian_params(all_real)
    fvd = _frechet_distance(mu_gen, sigma_gen, mu_real, sigma_real)

    return MetricResult(
        name=self.name,
        score=fvd,
        details={
            "extractor": self._extractor_name,
            "n_generated": n_gen,
            "n_reference": n_real,
            "ref_source": ref_source,
        },
    )
fastvideo.eval.metrics.common.fvd.metric.FVDMetric.merge_from
merge_from(other: BaseMetric) -> None

Fold another worker's accumulated features into this one (multi-GPU).

Both gen and ref buffers concatenate. The disk cache is the same across workers, so we only adopt other's _cached_real if our own slot is empty (defensive — should always already match).

Source code in fastvideo/eval/metrics/common/fvd/metric.py
def merge_from(self, other: BaseMetric) -> None:
    """Fold another worker's accumulated features into this one (multi-GPU).

    Both gen and ref buffers concatenate.  The disk cache is the same
    across workers, so we only adopt *other*'s ``_cached_real`` if our
    own slot is empty (defensive — should always already match).
    """
    assert isinstance(other, FVDMetric)
    assert other._extractor_name == self._extractor_name, (
        f"merge_from extractor mismatch: {other._extractor_name!r} vs {self._extractor_name!r}")
    self._gen_buf.extend(other._gen_buf)
    self._real_buf.extend(other._real_buf)
    if self._cached_real is None and other._cached_real is not None:
        self._cached_real = other._cached_real
fastvideo.eval.metrics.common.fvd.metric.FVDMetric.reset
reset() -> None

Clear generated and streamed-reference buffers. Cache survives.

Source code in fastvideo/eval/metrics/common/fvd/metric.py
def reset(self) -> None:
    """Clear generated and streamed-reference buffers. Cache survives."""
    self._gen_buf = []
    self._real_buf = []
Functions
fastvideo.eval.metrics.common.ssim
Modules
fastvideo.eval.metrics.common.ssim.metric
Classes Functions
fastvideo.eval.metrics.optical_flow
Modules
fastvideo.eval.metrics.optical_flow.gt_optical_flow
Modules
fastvideo.eval.metrics.optical_flow.gt_optical_flow.metric

Compare optical flow extracted from a generated video against optical flow extracted from a ground-truth reference video.

Both flows are produced by the same ptlflow model (default dpflow/things). The resulting per-pixel / per-frame / temporal metric set is identical to synthetic_optical_flow — only the way the reference flow is constructed differs.

Classes
fastvideo.eval.metrics.optical_flow.gt_optical_flow.metric.GtOpticalFlowMetric
GtOpticalFlowMetric(model_name: str = 'dpflow', ckpt: str = 'things', min_mag: float = 0.5, max_mag_pct: float = 80.0, grid_size: int = 8)

Bases: BaseMetric

Per-pixel / per-frame / temporal flow comparison vs. a reference video.

The headline score is pixel_epe_mean_mean (lower is better); every other scalar lives in details so downstream consumers can pick whichever one they care about.

Source code in fastvideo/eval/metrics/optical_flow/gt_optical_flow/metric.py
def __init__(
    self,
    model_name: str = "dpflow",
    ckpt: str = "things",
    min_mag: float = 0.5,
    max_mag_pct: float = 80.0,
    grid_size: int = 8,
) -> None:
    super().__init__()
    self.model_name = model_name
    self.ckpt = ckpt
    self.min_mag = min_mag
    self.max_mag_pct = max_mag_pct
    self.grid_size = grid_size
    self._model = None
    # One frame-pair per DPFlow forward: the cost volume is ~4 GB
    # per pair at 1080p, so batching OOMs. Bump for low-res inputs.
    self._chunk_size = 1
Functions
fastvideo.eval.metrics.optical_flow.synthetic_optical_flow
Modules
fastvideo.eval.metrics.optical_flow.synthetic_optical_flow.metric

Compare optical flow extracted from a generated video against optical flow synthesized analytically from per-frame actions.

The reference flow is not observed from a ground-truth video — it's predicted from the action stream via a third-person camera-kinematics model (Longuet-Higgins linearization + off-pivot translation correction; no depth). Observed flow comes from the same ptlflow model used by gt_optical_flow, and the two are compared with the identical metric set, so scores are directly comparable across the two metrics.

Required sample keys

video (B, T, C, H, W) float in [0, 1]. actions dict (or list-of-dicts of length B) with two np.ndarray keys:

* ``keyboard`` of shape ``(T, 6)`` — ``[W, S, A, D, turn_left, turn_right]``
* ``mouse`` of shape ``(T, 2)`` — ``[pitch, yaw]``

calibration Either a path to a ThirdPersonCalibration JSON file, or a dict of fitted parameters. May also be set once at construction time via calibration_path= and reused across samples.

Optional sample keys

mouse_pitch_sign +1 (default) or -1 if the dataset's mouse-pitch sign is flipped (mhuo's data carries this in metadata).

Classes
fastvideo.eval.metrics.optical_flow.synthetic_optical_flow.metric.SyntheticOpticalFlowMetric
SyntheticOpticalFlowMetric(model_name: str = 'dpflow', ckpt: str = 'things', calibration_path: str | Path | None = None, min_mag: float = 0.5, max_mag_pct: float = 80.0, grid_size: int = 8)

Bases: BaseMetric

Action-driven synthetic flow vs. video-extracted observed flow.

Pass calibration_path at construction to bind the calibration once across all samples; otherwise supply sample["calibration"] per call. Missing actions or calibration produce a skipped result (score=None) rather than raising.

Source code in fastvideo/eval/metrics/optical_flow/synthetic_optical_flow/metric.py
def __init__(
    self,
    model_name: str = "dpflow",
    ckpt: str = "things",
    calibration_path: str | Path | None = None,
    min_mag: float = 0.5,
    max_mag_pct: float = 80.0,
    grid_size: int = 8,
) -> None:
    super().__init__()
    self.model_name = model_name
    self.ckpt = ckpt
    self.min_mag = min_mag
    self.max_mag_pct = max_mag_pct
    self.grid_size = grid_size
    self._calibration: ThirdPersonCalibration | None = (_resolve_calibration(calibration_path)
                                                        if calibration_path else None)
    self._model = None
    # One frame-pair per DPFlow forward; see ``gt_optical_flow``.
    self._chunk_size = 1
Functions
fastvideo.eval.metrics.physics_iq
Modules
fastvideo.eval.metrics.physics_iq.utils
Functions
fastvideo.eval.metrics.physics_iq.utils.prepare_pair
prepare_pair(sample: dict[str, Any], *, prep_kwargs: dict[str, Any] | None = None) -> PreparedPhysicsIQPair

Resolve a sample into a prepared (gen, ref) pair.

Caches the result on sample['_physics_iq_pair'] so other physics_iq sub-metrics on the same sample reuse it instead of re-decoding.

Source code in fastvideo/eval/metrics/physics_iq/utils.py
def prepare_pair(
    sample: dict[str, Any],
    *,
    prep_kwargs: dict[str, Any] | None = None,
) -> PreparedPhysicsIQPair:
    """Resolve a sample into a prepared (gen, ref) pair.

    Caches the result on ``sample['_physics_iq_pair']`` so other physics_iq
    sub-metrics on the same sample reuse it instead of re-decoding.
    """
    prepared = sample.get("_physics_iq_pair")
    if prepared is not None:
        return prepared

    if "reference" not in sample:
        raise KeyError("Physics-IQ pair metrics require sample['reference'].")

    prepared = prepare_pair_inputs(
        sample["video"],
        sample["reference"],
        generated_mask=sample.get("video_mask"),
        reference_mask=sample.get("reference_mask"),
        **(prep_kwargs or {}),
    )
    sample["_physics_iq_pair"] = prepared
    return prepared
fastvideo.eval.metrics.vbench

VBench metrics. Bootstraps upstream submodule on sys.path and installs runtime compat shims for modern torch/transformers/numpy/timm.

The upstream vbench source lives as a git submodule at fastvideo/third_party/eval/vbench (pinned to a specific Vchitect/VBench SHA). We do not pip-install it — we only need its Python modules importable. Its runtime deps (clip, transformers, etc.) are already in FastVideo's main env.

Compat with modern dependency versions is achieved at import time, in this file, instead of via on-disk patches to upstream files. Each shim below corresponds to a specific drift between vbench's pinned-2023 deps and FastVideo's current pins. Adding a new shim is preferable to editing the submodule.

Modules
fastvideo.eval.metrics.vbench.aesthetic_quality
Modules
fastvideo.eval.metrics.vbench.aesthetic_quality.metric

VBench Aesthetic Quality — CLIP ViT-L/14 + LAION aesthetic predictor.

Encodes frames through CLIP, passes L2-normalized features through a linear aesthetic head (768 → 1), and averages scores / 10.

Classes Functions
fastvideo.eval.metrics.vbench.appearance_style
Modules
fastvideo.eval.metrics.vbench.appearance_style.metric

VBench Appearance Style — CLIP ViT-B/32 text-image alignment.

Per-frame cosine similarity between CLIP image features and a text prompt describing the expected style. Requires sample["text_prompt"].

Classes Functions
fastvideo.eval.metrics.vbench.background_consistency
Modules
fastvideo.eval.metrics.vbench.background_consistency.metric

VBench Background Consistency — CLIP ViT-B/32 temporal feature similarity.

Measures background stability via cosine similarity of CLIP features between consecutive frames and the first frame.

Classes Functions
fastvideo.eval.metrics.vbench.color
Modules
fastvideo.eval.metrics.vbench.color.metric

VBench Color — GRiT dense captioning for color accuracy.

Detects the target object via GRiT and checks if the expected color keyword appears in the object's caption. Score = frames_with_correct_color / frames_with_object_detected.

Classes Functions
fastvideo.eval.metrics.vbench.dynamic_degree
Modules
fastvideo.eval.metrics.vbench.dynamic_degree.metric

VBench Dynamic Degree — RAFT optical flow motion detection.

For each consecutive frame pair, computes optical flow via RAFT and takes the mean of the top 5% flow magnitudes. If enough pairs exceed an adaptive threshold, the video is classified as dynamic (1.0) vs static (0.0).

Classes
fastvideo.eval.metrics.vbench.dynamic_degree.metric.DynamicDegreeMetric
DynamicDegreeMetric()

Bases: BaseMetric

Source code in fastvideo/eval/metrics/vbench/dynamic_degree/metric.py
def __init__(self) -> None:
    super().__init__()
    self._model: Any = None
    self._chunk_size = 16
Functions
fastvideo.eval.metrics.vbench.human_action
Modules
fastvideo.eval.metrics.vbench.human_action.metric

VBench Human Action — UMT ViT-L/16 action classification (Kinetics-400).

Classifies human actions in 16-frame clips. Top-5 predictions with confidence >= 0.85 are compared against the ground-truth action label. Score = 1.0 if match found, 0.0 otherwise.

Classes Functions
fastvideo.eval.metrics.vbench.imaging_quality
Modules
fastvideo.eval.metrics.vbench.imaging_quality.metric

VBench Imaging Quality — MUSIQ-based per-frame technical quality.

Uses MUSIQ (Multi-Scale Image Quality) from pyiqa. Frames are resized so the longer side is at most 512px. Score = mean(MUSIQ_scores) / 100.

Classes Functions
fastvideo.eval.metrics.vbench.motion_smoothness
Modules
fastvideo.eval.metrics.vbench.motion_smoothness.metric

VBench Motion Smoothness — AMT-S frame interpolation quality.

Takes every-other frame, uses AMT-S to interpolate the missing middle frames, then compares interpolated vs actual frames. Score = (255 - mean_pixel_diff) / 255. Higher = smoother motion.

Classes
fastvideo.eval.metrics.vbench.motion_smoothness.metric.MotionSmoothnessMetric
MotionSmoothnessMetric()

Bases: BaseMetric

Source code in fastvideo/eval/metrics/vbench/motion_smoothness/metric.py
def __init__(self) -> None:
    super().__init__()
    self._model: Any = None
    self._embt: Any = None
    self._chunk_size = 8
Functions
fastvideo.eval.metrics.vbench.multiple_objects
Modules
fastvideo.eval.metrics.vbench.multiple_objects.metric

VBench Multiple Objects — GRiT detection for dual-object presence.

Checks if BOTH target objects are detected in each of 16 sampled frames. Score = matching_frames / total_frames.

Classes Functions
fastvideo.eval.metrics.vbench.object_class
Modules
fastvideo.eval.metrics.vbench.object_class.metric

VBench Object Class — GRiT object detection for class matching.

Checks if a target object class is detected in each of 16 sampled frames. Score = matching_frames / total_frames.

Classes Functions
fastvideo.eval.metrics.vbench.overall_consistency
Modules
fastvideo.eval.metrics.vbench.overall_consistency.metric

VBench Overall Consistency — ViCLIP text-video alignment.

Encodes 8 sampled video frames via ViCLIP vision encoder and a text prompt via ViCLIP text encoder, then computes cosine similarity.

Classes Functions
fastvideo.eval.metrics.vbench.scene
Modules
fastvideo.eval.metrics.vbench.scene.metric

VLM-based scene matching using AVoCaDO (Qwen2.5-Omni).

Replaces VBench's Tag2Text-based scene metric with a modern VLM caption. The algorithm follows VBench: 1. Caption the video 2. Check if all scene keywords appear in the caption 3. Score = 1.0 if all match, 0.0 otherwise

Unlike VBench (which captions each frame separately with Tag2Text), AVoCaDO captions the entire video in one pass with rich natural language, making the keyword check more robust.

Classes
fastvideo.eval.metrics.vbench.scene.metric.SceneMetric
SceneMetric(model_path: str = 'AVoCaDO-Captioner/AVoCaDO')

Bases: BaseMetric

Source code in fastvideo/eval/metrics/vbench/scene/metric.py
def __init__(self, model_path: str = "AVoCaDO-Captioner/AVoCaDO") -> None:
    super().__init__()
    self._model: Any = None
    self._processor: Any = None
    self._model_path = model_path
Functions
fastvideo.eval.metrics.vbench.spatial_relationship
Modules
fastvideo.eval.metrics.vbench.spatial_relationship.metric

VBench Spatial Relationship — GRiT detection + bbox position scoring.

Detects two target objects via GRiT and checks if their bounding boxes satisfy the expected spatial relationship (left/right/above/below).

Classes Functions
fastvideo.eval.metrics.vbench.subject_consistency
Modules
fastvideo.eval.metrics.vbench.subject_consistency.metric

VBench Subject Consistency — DINO ViT-B/16 temporal feature similarity.

Measures how well the main subject maintains its appearance throughout the video via cosine similarity of DINO features between consecutive frames and the first frame.

Classes Functions
fastvideo.eval.metrics.vbench.temporal_flickering
Modules
fastvideo.eval.metrics.vbench.temporal_flickering.metric

VBench Temporal Flickering — measures frame-to-frame stability.

Score = (255 - mean_MAE) / 255, where MAE is computed between consecutive frames in uint8 [0, 255] space. Higher = less flickering.

Classes Functions
fastvideo.eval.metrics.vbench.temporal_style
Modules
fastvideo.eval.metrics.vbench.temporal_style.metric

VBench Temporal Style — ViCLIP text-video alignment (style focus).

Identical logic to overall_consistency — same ViCLIP cosine similarity. The difference is semantic: overall_consistency measures general prompt alignment while temporal_style measures style consistency over time. VBench uses different prompts for each from its metadata JSON.

Functions
fastvideo.eval.metrics.videoscore2
Modules
fastvideo.eval.metrics.videoscore2.metric

VideoScore2 — VLM-based video quality scoring.

Uses a Qwen2.5-VL model fine-tuned to score generated videos on three dimensions: visual quality, text-to-video alignment, and physical consistency. Scores are extracted from token logits as upstream's ll_based_soft_score_normed weighting (1-5 scale).

Reference: TIGER-AI-Lab/VideoScore2 (vs2_inference.py).

Classes
fastvideo.eval.metrics.videoscore2.metric.VideoScore2Metric
VideoScore2Metric(model_name: str = 'TIGER-Lab/VideoScore2', infer_fps: float = 2.0, max_tokens: int = 1024, temperature: float = 0.7, do_sample: bool = True)

Bases: BaseMetric

VideoScore2: VLM-based video quality scoring (3 dimensions).

Requires sample["text_prompt"] for text-to-video alignment. Supports batched generation for GPU efficiency.

Source code in fastvideo/eval/metrics/videoscore2/metric.py
def __init__(
    self,
    model_name: str = "TIGER-Lab/VideoScore2",
    infer_fps: float = 2.0,
    max_tokens: int = 1024,
    temperature: float = 0.7,
    do_sample: bool = True,
) -> None:
    super().__init__()
    self._model_name = model_name
    self.infer_fps = infer_fps
    self.max_tokens = max_tokens
    self.temperature = temperature
    self.do_sample = do_sample
    self._model: Any = None
    self._processor: Any = None
    self._tokenizer: Any = None
Functions

fastvideo.eval.models

Model checkpoint resolution and caching for eval metrics.

Single public function — :func:ensure_checkpoint — that hands a metric a local path to its weights, downloading on miss. Three source kinds:

  • an existing local path → returned as-is;
  • an HTTP(S) URL → downloaded to get_cache_dir() / name via huggingface_hub.http_get (resume + retries built in);
  • an HF repo id ("org/repo") → :func:huggingface_hub.hf_hub_download if filename is given, else :func:huggingface_hub.snapshot_download. name is ignored in HF mode — HF manages its own cache key under ~/.cache/huggingface/hub.

All download paths are filelock-safe (cooperative across threads, processes, and SLURM ranks) via :func:fastvideo.utils.get_lock.

Functions

fastvideo.eval.models.ensure_checkpoint
ensure_checkpoint(name: str, source: str, filename: str | None = None) -> str

Resolve a model checkpoint path, downloading on miss.

See module docstring for the full source contract. name is used only as the local cache filename for URL sources; ignored otherwise.

Source code in fastvideo/eval/models.py
def ensure_checkpoint(
    name: str,
    source: str,
    filename: str | None = None,
) -> str:
    """Resolve a model checkpoint path, downloading on miss.

    See module docstring for the full source contract. *name* is used
    only as the local cache filename for URL sources; ignored otherwise.
    """
    if os.path.exists(source):
        return source

    if source.startswith(("http://", "https://")):
        return _ensure_url(name, source)

    if "/" in source:
        return _ensure_hf(source, filename)

    raise ValueError(f"Cannot resolve checkpoint: source {source!r} is neither a "
                     "path, URL, nor HF repo id")
fastvideo.eval.models.get_cache_dir
get_cache_dir() -> Path

Eval cache root.

Layout::

get_cache_dir() / models /   ← URL-fetched checkpoints (LAION head,
                               AMT, GRiT, …)
get_cache_dir() / torch  /   ← redirected ``TORCH_HOME`` (DINO etc.)
get_cache_dir() / clip   /   ← passed as ``download_root`` to
                               ``clip.load(...)`` callsites
~/.cache/huggingface/hub /   ← left at HF's default; widely shared
                               with other ML projects

Override priority: FASTVIDEO_EVAL_CACHE > ${FASTVIDEO_CACHE_ROOT}/eval.

Metric authors writing new code: when wrapping a third-party loader that has its own cache convention (CLIP's download_root, pyiqa's cache_dir, etc.), pass str(get_cache_dir() / "<library>") so users get a single FASTVIDEO_EVAL_CACHE knob to redirect them all.

Source code in fastvideo/eval/models.py
def get_cache_dir() -> Path:
    """Eval cache root.

    Layout::

        get_cache_dir() / models /   ← URL-fetched checkpoints (LAION head,
                                       AMT, GRiT, …)
        get_cache_dir() / torch  /   ← redirected ``TORCH_HOME`` (DINO etc.)
        get_cache_dir() / clip   /   ← passed as ``download_root`` to
                                       ``clip.load(...)`` callsites
        ~/.cache/huggingface/hub /   ← left at HF's default; widely shared
                                       with other ML projects

    Override priority: ``FASTVIDEO_EVAL_CACHE`` > ``${FASTVIDEO_CACHE_ROOT}/eval``.

    Metric authors writing new code: when wrapping a third-party loader
    that has its own cache convention (CLIP's ``download_root``, pyiqa's
    ``cache_dir``, etc.), pass ``str(get_cache_dir() / "<library>")`` so
    users get a single ``FASTVIDEO_EVAL_CACHE`` knob to redirect them all.
    """
    return Path(os.environ.get(
        "FASTVIDEO_EVAL_CACHE",
        os.path.join(envs.FASTVIDEO_CACHE_ROOT, "eval"),
    ))

fastvideo.eval.pool

Async path-→-tensor prefetcher for the Evaluator.

Hides video-decode latency behind metric compute by running a small thread pool of decoders that fill a bounded queue. One :class:VideoPool is owned by the Evaluator per evaluate(samples=...) call; workers consume via :meth:VideoPool.get. Decode order is non-deterministic; each yielded item carries its original input index so consumers can write back into a result list in input order.

Pool sizing: max_size = prefetch_factor * num_workers.

Classes

fastvideo.eval.pool.VideoPool
VideoPool(samples: list[dict], *, loader_threads: int = 1, max_size: int = 4)

Bounded prefetch queue feeding decoded samples to consumers.

Use as a context manager so loader threads are always cleaned up::

with VideoPool(samples, loader_threads=1, max_size=4) as pool:
    while True:
        item = pool.get()
        if item is None:
            break
        idx, decoded = item
        results[idx] = worker.evaluate(**decoded)
Source code in fastvideo/eval/pool.py
def __init__(
    self,
    samples: list[dict],
    *,
    loader_threads: int = 1,
    max_size: int = 4,
) -> None:
    if loader_threads < 1:
        raise ValueError("loader_threads must be >= 1")
    self._samples = samples
    self._loader_threads_n = loader_threads
    self._max_size = max(max_size, 1)

    self._task_q: queue.Queue = queue.Queue()
    self._ready_q: queue.Queue = queue.Queue(maxsize=self._max_size)
    self._loaders: list[threading.Thread] = []
    self._stop = threading.Event()

    self._consumed = 0
    self._consume_lock = threading.Lock()
Functions
fastvideo.eval.pool.VideoPool.get
get() -> tuple[int, dict] | None

Pop the next decoded (idx, sample).

Returns None when all input samples have been consumed. Polls in 0.1 s slices so extra consumer threads (when len(samples) < num_workers) wake up periodically to re-check _consumed and exit cleanly — without the poll, a blocking _ready_q.get(timeout=None) deadlocks because the loaders are already done. Re-raises any exception caught in a loader thread on the consumer's stack so callers don't hang on a dead loader. Thread-safe: multiple consumer threads may share one pool.

Source code in fastvideo/eval/pool.py
def get(self) -> tuple[int, dict] | None:
    """Pop the next decoded ``(idx, sample)``.

    Returns ``None`` when all input samples have been consumed.
    Polls in 0.1 s slices so extra consumer threads (when
    ``len(samples) < num_workers``) wake up periodically to
    re-check ``_consumed`` and exit cleanly — without the poll,
    a blocking ``_ready_q.get(timeout=None)`` deadlocks because
    the loaders are already done. Re-raises any exception caught
    in a loader thread on the consumer's stack so callers don't
    hang on a dead loader. Thread-safe: multiple consumer threads
    may share one pool.
    """
    while True:
        with self._consume_lock:
            if self._consumed >= len(self._samples):
                return None
        try:
            item = self._ready_q.get(timeout=0.1)
        except queue.Empty:
            continue
        with self._consume_lock:
            self._consumed += 1
        idx, payload = item
        if isinstance(payload, _DecodeError):
            raise payload.exc
        return item

fastvideo.eval.registry

Classes

Functions

fastvideo.eval.registry.get_metric
get_metric(name: str, **kwargs: Any) -> BaseMetric

Instantiate a registered metric by name.

Checks that optional dependencies are installed before instantiation and gives a clear install hint pointing at the right extra group.

Source code in fastvideo/eval/registry.py
def get_metric(name: str, **kwargs: Any) -> BaseMetric:
    """Instantiate a registered metric by name.

    Checks that optional dependencies are installed before instantiation
    and gives a clear install hint pointing at the right extra group.
    """
    cls = _REGISTRY.get(name)
    if cls is None:
        available = ", ".join(sorted(_REGISTRY.keys()))
        raise KeyError(f"Unknown metric '{name}'. Available: {available}")

    for dep in getattr(cls, "dependencies", []):
        if not importlib.util.find_spec(dep):
            raise ImportError(f"{cls.__name__} requires '{dep}'. "
                              f"Install with: {_install_hint(name, dep)}")

    return cls(**kwargs)
fastvideo.eval.registry.list_metrics
list_metrics() -> list[str]

Return sorted list of all registered metric names.

Source code in fastvideo/eval/registry.py
def list_metrics() -> list[str]:
    """Return sorted list of all registered metric names."""
    return sorted(_REGISTRY.keys())
fastvideo.eval.registry.missing_dependencies
missing_dependencies(metric_name: str) -> list[str]

Importable module names declared by metric_name that are not actually importable in this environment. Returns [] if all deps are satisfied or the metric is unknown.

Used by group-style resolution to decide which metrics to silently skip (vs. naming a metric explicitly, where the missing dep should surface as :class:ImportError).

Source code in fastvideo/eval/registry.py
def missing_dependencies(metric_name: str) -> list[str]:
    """Importable module names declared by *metric_name* that are not
    actually importable in this environment. Returns ``[]`` if all deps
    are satisfied or the metric is unknown.

    Used by group-style resolution to decide which metrics to silently
    skip (vs. naming a metric explicitly, where the missing dep should
    surface as :class:`ImportError`).
    """
    cls = _REGISTRY.get(metric_name)
    if cls is None:
        return []
    return [d for d in getattr(cls, "dependencies", []) if not importlib.util.find_spec(d)]
fastvideo.eval.registry.register
register(name: str)

Decorator to register a metric class.

Usage::

@register("ssim")
class SSIMMetric(BaseMetric):
    ...
Source code in fastvideo/eval/registry.py
def register(name: str):
    """Decorator to register a metric class.

    Usage::

        @register("ssim")
        class SSIMMetric(BaseMetric):
            ...
    """

    def wrapper(cls):
        _REGISTRY[name] = cls
        return cls

    return wrapper
fastvideo.eval.registry.resolve_group
resolve_group(name: str) -> list[str] | None

If name is a group prefix (e.g. "vbench"), return all matching metric names. Returns None if name is not a group.

Source code in fastvideo/eval/registry.py
def resolve_group(name: str) -> list[str] | None:
    """If *name* is a group prefix (e.g. ``"vbench"``), return all matching
    metric names.  Returns ``None`` if *name* is not a group."""
    prefix = name + "."
    matches = sorted(k for k in _REGISTRY if k.startswith(prefix))
    return matches if matches else None

fastvideo.eval.types

Classes

fastvideo.eval.types.EvalResults
EvalResults(samples: list[dict[str, MetricResult]] | None = None, corpus: dict[str, MetricResult] | None = None)

Bases: list

Return type for :meth:Evaluator.evaluate with samples=....

Behaves like a list[dict[str, MetricResult]] — one dict per input sample, in input order — so existing iteration and indexing keeps working. The corpus attribute carries set-metric results (FAD, IS, …) that are properties of the whole input set, not of any individual sample. Empty dict when no set metric ran.

Source code in fastvideo/eval/types.py
def __init__(
    self,
    samples: list[dict[str, MetricResult]] | None = None,
    corpus: dict[str, MetricResult] | None = None,
) -> None:
    super().__init__(samples or [])
    self.corpus: dict[str, MetricResult] = corpus or {}
fastvideo.eval.types.MetricResult dataclass
MetricResult(name: str, score: float | None, details: dict[str, Any] = dict())

Standard result container returned by all metrics.

score is None when the metric was skipped (e.g. missing required input). Check details["skipped"] for the reason.

fastvideo.eval.types.Video dataclass
Video(source: Any, fps: float | None = None, frames: Any = None, audio: Any = None, audio_sr: int | None = None)

Path-backed media handle. The :class:VideoPool populates frames (and optionally audio) before the metric loop sees the sample.

fastvideo.eval.worker

EvalWorker: a single-GPU bag of metric replicas.

One EvalWorker per GPU. Holds an instance of every requested metric on its own device, scores one sample at a time. Stateless across calls — the worker never sees more than one evaluate(...) invocation at a time (the parent :class:Evaluator ensures this by handing the worker to one thread at a time).

This mirrors FastVideo's Worker layer under :class:VideoGenerator, but in-process (threads, not processes) — eval metrics are independent and need no NCCL / TP / SP / distributed init, so process isolation is unnecessary overhead.

Classes

fastvideo.eval.worker.EvalWorker
EvalWorker(metric_names: list[str], device: str, *, compile: bool = False, pre_upload: bool = True, skip_missing_deps: bool = False)

Owns metric replicas on one device. Single-GPU, single-sample.

Per-sample metrics (is_set_metric=False) return one :class:MetricResult per evaluate call. Set metrics (is_set_metric=True) accumulate state on the worker's own instance and contribute nothing to the per-sample return — the Evaluator finalizes them after the pool drains.

Source code in fastvideo/eval/worker.py
def __init__(self,
             metric_names: list[str],
             device: str,
             *,
             compile: bool = False,
             pre_upload: bool = True,
             skip_missing_deps: bool = False) -> None:
    self._names = list(metric_names)
    self._device = device
    self._compile = compile
    self._pre_upload = pre_upload
    self._skip_missing_deps = skip_missing_deps
    self._metrics: dict[str, BaseMetric] = {}
    self._unloaded = False
    self._load()
Attributes
fastvideo.eval.worker.EvalWorker.metric_names property
metric_names: list[str]

Names of the metrics this worker owns, in load order.

Functions
fastvideo.eval.worker.EvalWorker.evaluate
evaluate(*, metrics: list[str] | None = None, **kwargs) -> dict[str, MetricResult]

Score one already-decoded sample.

Inputs (video / reference paths) are decoded and normalized upstream by the :class:VideoPool. The worker handles the optional pre-upload to its device and dispatches to each metric's compute or accumulate.

Samples tagged role="reference" are corpus context for set metrics only — per-sample metrics skip them so a mixed Evaluator (e.g. FVD + LPIPS + vbench) doesn't produce spurious per-sample scores on the reference half of the corpus.

metrics (when not None) further restricts dispatch to that subset of registered names — used by the per-call Evaluator.evaluate(metrics=...) filter.

Source code in fastvideo/eval/worker.py
def evaluate(self, *, metrics: list[str] | None = None, **kwargs) -> dict[str, MetricResult]:
    """Score one already-decoded sample.

    Inputs (``video`` / ``reference`` paths) are decoded and
    normalized upstream by the :class:`VideoPool`. The worker
    handles the optional pre-upload to its device and dispatches to
    each metric's ``compute`` or ``accumulate``.

    Samples tagged ``role="reference"`` are corpus context for set
    metrics only — per-sample metrics skip them so a mixed
    Evaluator (e.g. FVD + LPIPS + vbench) doesn't produce spurious
    per-sample scores on the reference half of the corpus.

    ``metrics`` (when not ``None``) further restricts dispatch to
    that subset of registered names — used by the per-call
    ``Evaluator.evaluate(metrics=...)`` filter.
    """
    if self._unloaded:
        raise RuntimeError("EvalWorker was unloaded; call reload() before evaluating.")

    sample = dict(kwargs)
    # Unwrap ``Video`` → its ``.frames`` tensor for per-sample metric
    # consumers (PSNR/SSIM/LPIPS/optical_flow). The pool decodes Video
    # instances but leaves the wrapper in place so source-aware metrics
    # (audio.imagebind_score) can still read ``Video.source``; those
    # metrics should additionally pass ``sample["video_path"]`` if they
    # need the path. Per-sample tensor consumers get a tensor either way.
    for _k in ("video", "reference"):
        _v = sample.get(_k)
        if isinstance(_v, Video):
            sample[_k] = _v.frames
    if self._pre_upload and self._any_needs_gpu:
        sample["video"] = _to_device(sample.get("video"), self._device)
        if "reference" in sample:
            sample["reference"] = _to_device(sample["reference"], self._device)

    is_ref = sample.get("role") == "reference"
    filter_set: set[str] | None = set(metrics) if metrics is not None else None
    results: dict[str, MetricResult] = {}
    broken: list[str] = []
    for name, m in self._metrics.items():
        if filter_set is not None and name not in filter_set:
            continue
        try:
            if m.is_set_metric:
                m.accumulate(sample)
            elif not is_ref:
                results[name] = m.compute(sample)
        except (ImportError, ModuleNotFoundError, FileNotFoundError) as e:
            # In skip_missing_deps mode, compute/accumulate-time failures
            # whose root cause is a missing dependency (lazy import of a
            # lib like torchcodec) or a missing resource (model checkpoint
            # not on disk) drop the metric for the remainder of this
            # Evaluator instead of bringing the whole run down. Strict
            # mode re-raises. Programmer bugs, OOM, and other runtime
            # failures are intentionally NOT caught here — they surface.
            if not self._skip_missing_deps:
                raise
            import logging
            logging.getLogger(__name__).exception("eval: dropping %s after %s: %s", name, type(e).__name__, e)
            broken.append(name)
    for n in broken:
        self._metrics.pop(n, None)
    return results
fastvideo.eval.worker.EvalWorker.release_cuda_memory
release_cuda_memory() -> None

Free CUDA caches without dropping models.

Source code in fastvideo/eval/worker.py
def release_cuda_memory(self) -> None:
    """Free CUDA caches without dropping models."""
    clear_cache()
    if torch.cuda.is_available():
        with contextlib.suppress(Exception):
            torch.cuda.ipc_collect()
fastvideo.eval.worker.EvalWorker.reload
reload() -> None

Rebuild metrics dropped by :meth:unload.

Source code in fastvideo/eval/worker.py
def reload(self) -> None:
    """Rebuild metrics dropped by :meth:`unload`."""
    if self._unloaded:
        self._load()
fastvideo.eval.worker.EvalWorker.reset_set_metrics
reset_set_metrics() -> None

Clear accumulator state on every set metric.

Source code in fastvideo/eval/worker.py
def reset_set_metrics(self) -> None:
    """Clear accumulator state on every set metric."""
    for m in self._metrics.values():
        if m.is_set_metric:
            m.reset()
fastvideo.eval.worker.EvalWorker.set_metrics
set_metrics() -> dict[str, BaseMetric]

Return {name: instance} for set metrics on this worker.

Source code in fastvideo/eval/worker.py
def set_metrics(self) -> dict[str, BaseMetric]:
    """Return ``{name: instance}`` for set metrics on this worker."""
    return {n: m for n, m in self._metrics.items() if m.is_set_metric}
fastvideo.eval.worker.EvalWorker.unload
unload() -> None

Drop metric refs so models become GC-able. Reverse with reload().

Source code in fastvideo/eval/worker.py
def unload(self) -> None:
    """Drop metric refs so models become GC-able. Reverse with reload()."""
    self._metrics = {}
    self._unloaded = True
    self.release_cuda_memory()

Functions