Skip to content

eval ¶

Classes¶

fastvideo.eval.BaseMetric ¶

BaseMetric()

Abstract base class for all eval metrics.

Two execution shapes:

Per-sample (is_set_metric=False, default) — implement :meth:compute. The Evaluator calls it once per input sample and returns one :class:MetricResult per sample.

Set-vs-set (is_set_metric=True) — implement :meth:accumulate (called once per sample to buffer features) and :meth:finalize (called once after all samples to compute the corpus-level result). Use :meth:reset to clear buffers and :meth:merge_from to fold multi-GPU per-worker state together.

Optionally override :meth:setup to eagerly load models. Metrics that chunk along the time dim for memory hardcode their own chunk size in __init__ (see optical_flow for the canonical example). Eval always processes one video per :meth:Evaluator.evaluate call; compute / accumulate receive a single sample, not a batch.

Source code in fastvideo/eval/metrics/base.py

def __init__(self) -> None:
    self._device: torch.device = torch.device("cpu")

Methods:¶

fastvideo.eval.BaseMetric.accumulate ¶

accumulate(sample: dict) -> None

Buffer per-sample features for a corpus-level metric.

Source code in fastvideo/eval/metrics/base.py

def accumulate(self, sample: dict) -> None:
    """Buffer per-sample features for a corpus-level metric."""
    raise NotImplementedError(f"{type(self).__name__}.accumulate is not implemented")

fastvideo.eval.BaseMetric.compute ¶

compute(sample: dict) -> MetricResult

Per-sample metrics: compute the score for one sample.

sample["video"] is (T, C, H, W) float in [0, 1]. sample["reference"] (if used) has the same shape. Return self._skip(sample, reason) for missing inputs.

Source code in fastvideo/eval/metrics/base.py

def compute(self, sample: dict) -> MetricResult:
    """Per-sample metrics: compute the score for one sample.

    ``sample["video"]`` is ``(T, C, H, W)`` float in ``[0, 1]``.
    ``sample["reference"]`` (if used) has the same shape. Return
    ``self._skip(sample, reason)`` for missing inputs.
    """
    raise NotImplementedError(f"{type(self).__name__}.compute is not implemented")

fastvideo.eval.BaseMetric.finalize ¶

finalize() -> MetricResult

Compute the corpus-level result from buffered state.

Source code in fastvideo/eval/metrics/base.py

def finalize(self) -> MetricResult:
    """Compute the corpus-level result from buffered state."""
    raise NotImplementedError(f"{type(self).__name__}.finalize is not implemented")

fastvideo.eval.BaseMetric.merge_from ¶

merge_from(other: BaseMetric) -> None

Multi-GPU: fold another worker's accumulator state into this one.

Source code in fastvideo/eval/metrics/base.py

def merge_from(self, other: BaseMetric) -> None:  # noqa: B027 - intentionally optional override
    """Multi-GPU: fold another worker's accumulator state into this one."""

fastvideo.eval.BaseMetric.reset ¶

reset() -> None

Clear accumulator state at the start of each evaluate() call.

Source code in fastvideo/eval/metrics/base.py

def reset(self) -> None:  # noqa: B027 - intentionally optional override
    """Clear accumulator state at the start of each evaluate() call."""

fastvideo.eval.BaseMetric.setup ¶

setup() -> None

Eagerly load models. Called once by :class:EvalWorker.

Default is a no-op; metrics with no eager state (pixel math, closed-form ops) inherit this. Override only if your metric needs to load weights.

Source code in fastvideo/eval/metrics/base.py

def setup(self) -> None:  # noqa: B027 - intentionally optional override
    """Eagerly load models. Called once by :class:`EvalWorker`.

    Default is a no-op; metrics with no eager state (pixel math,
    closed-form ops) inherit this. Override only if your metric
    needs to load weights.
    """

fastvideo.eval.BaseMetric.to ¶

to(device: str | device) -> BaseMetric

Move metric (and its internal models) to device.

Source code in fastvideo/eval/metrics/base.py

def to(self, device: str | torch.device) -> BaseMetric:
    """Move metric (and its internal models) to *device*."""
    self._device = torch.device(device)
    return self

fastvideo.eval.EvalResults ¶

EvalResults(samples: list[dict[str, MetricResult]] | None = None, corpus: dict[str, MetricResult] | None = None)

Bases: list

Return type for :meth:Evaluator.evaluate with samples=....

Behaves like a list[dict[str, MetricResult]] — one dict per input sample, in input order — so existing iteration and indexing keeps working. The corpus attribute carries set-metric results (FAD, IS, …) that are properties of the whole input set, not of any individual sample. Empty dict when no set metric ran.

Source code in fastvideo/eval/types.py

def __init__(
    self,
    samples: list[dict[str, MetricResult]] | None = None,
    corpus: dict[str, MetricResult] | None = None,
) -> None:
    super().__init__(samples or [])
    self.corpus: dict[str, MetricResult] = corpus or {}

fastvideo.eval.Evaluator ¶

Evaluator(metrics: list[str] | str = 'all', device: str = 'cuda:0', num_gpus: int = 1, compile: bool = False, *, loader_threads: int = 1, prefetch_factor: int = 2, pre_upload: bool = True, skip_missing_deps: bool = False)

Pre-initialized scorer for repeated evaluation.

Parameters¶

metrics : list[str] | str Metric names, group prefixes ("vbench"), or "all". device : str Single-GPU device (e.g. "cuda:0"). Ignored when num_gpus > 1. num_gpus : int Number of GPU replicas. Each gets its own :class:EvalWorker. compile : bool Apply :func:torch.compile to each metric's _model. loader_threads : int Background decode threads in the :class:VideoPool. Default 1 (hide decode behind compute). Bump for I/O-heavy benchmark sets where one loader can't keep up with the workers. prefetch_factor : int pool max_size = prefetch_factor * num_workers. Default 2 — one sample being consumed, one prefetched per worker. pre_upload : bool When True (default), the worker performs a single host→device upload of video / reference per sample before the metric loop, and every metric reads from that shared GPU-resident tensor. Without it, each metric pays its own .to(self.device) — N transfers of the same clip for N metrics, which dominates at high resolution. Set False for training-time eval, where keeping a clip resident on GPU across the metric loop would fight the training step for VRAM. skip_missing_deps : bool When True, silently drop explicit metric names whose optional deps aren't importable (with a one-line warning per skipped metric). Default False — an explicit name with a missing dep raises :class:ImportError at construction time. Group selectors ("vbench", "all") always silent-skip regardless of this flag.

Source code in fastvideo/eval/evaluator.py

def __init__(
    self,
    metrics: list[str] | str = "all",
    device: str = "cuda:0",
    num_gpus: int = 1,
    compile: bool = False,
    *,
    loader_threads: int = 1,
    prefetch_factor: int = 2,
    pre_upload: bool = True,
    skip_missing_deps: bool = False,
) -> None:
    names = _resolve_metric_names(metrics, skip_missing_deps=skip_missing_deps)
    if num_gpus > 1:
        self._workers = [
            EvalWorker(names,
                       f"cuda:{i}",
                       compile=compile,
                       pre_upload=pre_upload,
                       skip_missing_deps=skip_missing_deps) for i in range(num_gpus)
        ]
    else:
        self._workers = [
            EvalWorker(names, device, compile=compile, pre_upload=pre_upload, skip_missing_deps=skip_missing_deps)
        ]
    self._loader_threads = max(1, loader_threads)
    self._prefetch_factor = max(1, prefetch_factor)

Methods:¶

fastvideo.eval.Evaluator.evaluate ¶

evaluate(samples: Iterable[dict] | None = None, *, metrics: list[str] | None = None, **kwargs) -> dict[str, MetricResult] | EvalResults

Score one sample (kwargs form) or many samples (list form).

Both forms go through the same :class:VideoPool pipeline; video / reference paths are decoded asynchronously.

Parameters¶

samples : Iterable of sample dicts. Omit and pass kwargs for a single-sample call. metrics : Subset of this Evaluator's registered metrics to actually run on this batch. None (default) runs all registered. Lets a single long-lived Evaluator score different (gen, ref) corpora with different metric subsets across multiple evaluate() calls — e.g. LPIPS on a paired corpus, FVD on an unequal-cardinality corpus — without burning model loads. Set-metric accumulators are reset only for the metrics included in metrics, so state for other set metrics is preserved across calls.

Single sample::

ev.evaluate(video=tensor, text_prompt="...", fps=24.0)

Many samples::

ev.evaluate(samples=[{"video": ..., "reference": ...}, ...])

Many samples with a metric filter::

ev.evaluate(samples=lpips_samples, metrics=["common.lpips"])
ev.evaluate(samples=fvd_samples,   metrics=["common.fvd"])

Returns¶

dict[str, MetricResult] for the single-sample form; :class:EvalResults (list-of-dict subclass with .corpus) for the list form.

Source code in fastvideo/eval/evaluator.py

def evaluate(
    self,
    samples: Iterable[dict] | None = None,
    *,
    metrics: list[str] | None = None,
    **kwargs,
) -> dict[str, MetricResult] | EvalResults:
    """Score one sample (kwargs form) or many samples (list form).

    Both forms go through the same :class:`VideoPool` pipeline;
    ``video`` / ``reference`` paths are decoded asynchronously.

    Parameters
    ----------
    samples :
        Iterable of sample dicts.  Omit and pass kwargs for a
        single-sample call.
    metrics :
        Subset of this Evaluator's registered metrics to actually
        run on this batch.  ``None`` (default) runs all registered.
        Lets a single long-lived Evaluator score different (gen,
        ref) corpora with different metric subsets across multiple
        ``evaluate()`` calls — e.g. LPIPS on a paired corpus, FVD
        on an unequal-cardinality corpus — without burning model
        loads.  Set-metric accumulators are reset only for the
        metrics included in *metrics*, so state for other set
        metrics is preserved across calls.

    Single sample::

        ev.evaluate(video=tensor, text_prompt="...", fps=24.0)

    Many samples::

        ev.evaluate(samples=[{"video": ..., "reference": ...}, ...])

    Many samples with a metric filter::

        ev.evaluate(samples=lpips_samples, metrics=["common.lpips"])
        ev.evaluate(samples=fvd_samples,   metrics=["common.fvd"])

    Returns
    -------
    dict[str, MetricResult] for the single-sample form;
    :class:`EvalResults` (list-of-dict subclass with ``.corpus``) for
    the list form.
    """
    if metrics is not None:
        unknown = [m for m in metrics if m not in self.metric_names]
        if unknown:
            raise ValueError(f"metrics filter contains names not registered on this Evaluator: "
                             f"{unknown}; registered: {self.metric_names}")

    single = samples is None
    sample_list: list[dict] = [kwargs] if samples is None else list(samples)
    if not sample_list:
        return EvalResults(samples=[], corpus={})

    if single:
        set_names = self._workers[0].set_metrics().keys()
        active_set = (set_names if metrics is None else (set_names & set(metrics)))
        if active_set:
            # Set metrics need a population. A single sample can't produce a
            # meaningful corpus result, and silently discarding it (return
            # ``per_sample[0]`` only) hides the no-op. Force the list form.
            raise ValueError("Set-vs-set metrics require samples=[...] with >=2 entries; "
                             "the kwargs form (single sample) cannot produce a corpus "
                             f"result. Active set metrics: {sorted(active_set)}")

    per_sample, corpus = self._run(sample_list, metric_filter=metrics)

    if single:
        return per_sample[0]
    return EvalResults(samples=per_sample, corpus=corpus)

fastvideo.eval.Evaluator.release_cuda_memory ¶

release_cuda_memory() -> None

Free CUDA caches on every replica without dropping models.

Source code in fastvideo/eval/evaluator.py

def release_cuda_memory(self) -> None:
    """Free CUDA caches on every replica without dropping models."""
    for w in self._workers:
        w.release_cuda_memory()

fastvideo.eval.Evaluator.reload ¶

reload() -> None

Rebuild metrics dropped by :meth:unload.

Source code in fastvideo/eval/evaluator.py

def reload(self) -> None:
    """Rebuild metrics dropped by :meth:`unload`."""
    for w in self._workers:
        w.reload()

fastvideo.eval.Evaluator.shutdown ¶

shutdown() -> None

No-op; kept for API compatibility with older callers.

Source code in fastvideo/eval/evaluator.py

def shutdown(self) -> None:
    """No-op; kept for API compatibility with older callers."""

fastvideo.eval.Evaluator.unload ¶

unload() -> None

Drop metric refs on every replica. Reverse with :meth:reload.

Source code in fastvideo/eval/evaluator.py

def unload(self) -> None:
    """Drop metric refs on every replica. Reverse with :meth:`reload`."""
    for w in self._workers:
        w.unload()

fastvideo.eval.MetricResult `dataclass` ¶

MetricResult(name: str, score: float | None, details: dict[str, Any] = dict())

Standard result container returned by all metrics.

score is None when the metric was skipped (e.g. missing required input). Check details["skipped"] for the reason.

fastvideo.eval.Video `dataclass` ¶

Video(source: Any, fps: float | None = None, frames: Any = None, audio: Any = None, audio_sr: int | None = None)

Path-backed media handle. The :class:VideoPool populates frames (and optionally audio) before the metric loop sees the sample.

Functions:¶

fastvideo.eval.as_video ¶

as_video(x: str | Path | Tensor | Video) -> Video

Coerce path/tensor/Video → :class:Video for the pool to decode.

Path strings and :class:pathlib.Path become Video(source=str(x)); the pool then calls :func:load_video on first use. Tensors become Video(source=None, frames=x) — the pool sees .frames already populated and forwards untouched. :class:Video instances pass through.

Source code in fastvideo/eval/io/inputs.py

def as_video(x: str | Path | torch.Tensor | Video) -> Video:
    """Coerce path/tensor/Video → :class:`Video` for the pool to decode.

    Path strings and :class:`pathlib.Path` become ``Video(source=str(x))``;
    the pool then calls :func:`load_video` on first use.  Tensors become
    ``Video(source=None, frames=x)`` — the pool sees ``.frames`` already
    populated and forwards untouched.  :class:`Video` instances pass through.
    """
    if isinstance(x, Video):
        return x
    if isinstance(x, str | Path):
        return Video(source=str(x))
    if isinstance(x, torch.Tensor):
        return Video(source=None, frames=x)
    raise TypeError(f"Cannot coerce {type(x).__name__} to Video")

fastvideo.eval.ensure_checkpoint ¶

ensure_checkpoint(name: str, source: str, filename: str | None = None) -> str

Resolve a model checkpoint path, downloading on miss.

See module docstring for the full source contract. name is used only as the local cache filename for URL sources; ignored otherwise.

Source code in fastvideo/eval/models.py

def ensure_checkpoint(
    name: str,
    source: str,
    filename: str | None = None,
) -> str:
    """Resolve a model checkpoint path, downloading on miss.

    See module docstring for the full source contract. *name* is used
    only as the local cache filename for URL sources; ignored otherwise.
    """
    if os.path.exists(source):
        return source

    if source.startswith(("http://", "https://")):
        return _ensure_url(name, source)

    if "/" in source:
        return _ensure_hf(source, filename)

    raise ValueError(f"Cannot resolve checkpoint: source {source!r} is neither a "
                     "path, URL, nor HF repo id")

fastvideo.eval.evaluate ¶

evaluate(generated: Tensor | str | Path, reference: Tensor | str | Path | None = None, metrics: list[str] | str = 'all', device: str = 'cuda', **kwargs) -> dict[str, MetricResult] | list[dict[str, MetricResult]]

One-shot evaluation. For repeated use, prefer :func:create_evaluator.

Parameters¶

generated : Tensor | str | Path Generated video. Either a pre-loaded (T, C, H, W) tensor or a path to an mp4/avi/etc. — paths are decoded by the worker. reference : Tensor | str | Path | None Reference video (same accepted shapes as generated). metrics : list[str] | str Metric names, or "all". device : str PyTorch device string.

Source code in fastvideo/eval/api.py

def evaluate(
    generated: torch.Tensor | str | Path,
    reference: torch.Tensor | str | Path | None = None,
    metrics: list[str] | str = "all",
    device: str = "cuda",
    **kwargs,
) -> dict[str, MetricResult] | list[dict[str, MetricResult]]:
    """One-shot evaluation. For repeated use, prefer :func:`create_evaluator`.

    Parameters
    ----------
    generated : Tensor | str | Path
        Generated video. Either a pre-loaded ``(T, C, H, W)`` tensor or a
        path to an mp4/avi/etc. — paths are decoded by the worker.
    reference : Tensor | str | Path | None
        Reference video (same accepted shapes as *generated*).
    metrics : list[str] | str
        Metric names, or ``"all"``.
    device : str
        PyTorch device string.
    """
    ev = create_evaluator(metrics=metrics, device=device)
    kw: dict = {"video": generated, **kwargs}
    if reference is not None:
        kw["reference"] = reference
    return ev.evaluate(**kw)

fastvideo.eval.get_cache_dir ¶

get_cache_dir() -> Path

Eval cache root.

Layout::

get_cache_dir() / models /   ← URL-fetched checkpoints (LAION head,
                               AMT, GRiT, …)
get_cache_dir() / torch  /   ← redirected ``TORCH_HOME`` (DINO etc.)
get_cache_dir() / clip   /   ← passed as ``download_root`` to
                               ``clip.load(...)`` callsites
~/.cache/huggingface/hub /   ← left at HF's default; widely shared
                               with other ML projects

Override priority: FASTVIDEO_EVAL_CACHE > ${FASTVIDEO_CACHE_ROOT}/eval.

Metric authors writing new code: when wrapping a third-party loader that has its own cache convention (CLIP's download_root, pyiqa's cache_dir, etc.), pass str(get_cache_dir() / "<library>") so users get a single FASTVIDEO_EVAL_CACHE knob to redirect them all.

Source code in fastvideo/eval/models.py

def get_cache_dir() -> Path:
    """Eval cache root.

    Layout::

        get_cache_dir() / models /   ← URL-fetched checkpoints (LAION head,
                                       AMT, GRiT, …)
        get_cache_dir() / torch  /   ← redirected ``TORCH_HOME`` (DINO etc.)
        get_cache_dir() / clip   /   ← passed as ``download_root`` to
                                       ``clip.load(...)`` callsites
        ~/.cache/huggingface/hub /   ← left at HF's default; widely shared
                                       with other ML projects

    Override priority: ``FASTVIDEO_EVAL_CACHE`` > ``${FASTVIDEO_CACHE_ROOT}/eval``.

    Metric authors writing new code: when wrapping a third-party loader
    that has its own cache convention (CLIP's ``download_root``, pyiqa's
    ``cache_dir``, etc.), pass ``str(get_cache_dir() / "<library>")`` so
    users get a single ``FASTVIDEO_EVAL_CACHE`` knob to redirect them all.
    """
    return Path(os.environ.get(
        "FASTVIDEO_EVAL_CACHE",
        os.path.join(envs.FASTVIDEO_CACHE_ROOT, "eval"),
    ))

fastvideo.eval.get_metric ¶

get_metric(name: str, **kwargs: Any) -> BaseMetric

Instantiate a registered metric by name.

Checks that optional dependencies are installed before instantiation and gives a clear install hint pointing at the right extra group.

Source code in fastvideo/eval/registry.py

def get_metric(name: str, **kwargs: Any) -> BaseMetric:
    """Instantiate a registered metric by name.

    Checks that optional dependencies are installed before instantiation
    and gives a clear install hint pointing at the right extra group.
    """
    cls = _REGISTRY.get(name)
    if cls is None:
        available = ", ".join(sorted(_REGISTRY.keys()))
        raise KeyError(f"Unknown metric '{name}'. Available: {available}")

    for dep in getattr(cls, "dependencies", []):
        if not importlib.util.find_spec(dep):
            raise ImportError(f"{cls.__name__} requires '{dep}'. "
                              f"Install with: {_install_hint(name, dep)}")

    return cls(**kwargs)

fastvideo.eval.list_metrics ¶

list_metrics() -> list[str]

Return sorted list of all registered metric names.

Source code in fastvideo/eval/registry.py

def list_metrics() -> list[str]:
    """Return sorted list of all registered metric names."""
    return sorted(_REGISTRY.keys())

fastvideo.eval.register ¶

register(name: str)

Decorator to register a metric class.

Usage::

@register("ssim")
class SSIMMetric(BaseMetric):
    ...

Source code in fastvideo/eval/registry.py

def register(name: str):
    """Decorator to register a metric class.

    Usage::

        @register("ssim")
        class SSIMMetric(BaseMetric):
            ...
    """

    def wrapper(cls):
        _REGISTRY[name] = cls
        return cls

    return wrapper

fastvideo.eval.samples_from ¶

samples_from(*, video: PathSpec | None = None, reference: PathSpec | None = None, audio: PathSpec | None = None, reference_audio: PathSpec | None = None, text_prompt: str | None = None, text_prompts: str | Path | list[str] | None = None, fps: float | None = None, auxiliary_info: dict | list[dict] | None = None, extras: dict | list[dict] | None = None, extract_audio: bool | str | Path = False, extract_workers: int = 4) -> list[dict]

Build a samples list from path-style inputs.

Parameters¶

video, reference, audio, reference_audio : File path, directory of files (sorted by name), or any iterable of paths. Pass whichever modalities apply to the metrics you plan to run — they attach to sample["video"] / sample["reference"] / sample["audio"] / sample["reference_audio"] respectively. Video paths are wrapped in :class:Video so :class:VideoPool decodes them lazily in parallel; audio paths stay as strings (audio metrics each load with their own resample / preprocess). text_prompt : A single prompt string broadcast onto every sample. text_prompts : A list of strings (one per sample), or a path to a .jsonl / .json file containing per-sample prompts. fps : Scalar fps broadcast onto every sample. auxiliary_info : Single dict (broadcast) or list of dicts (zipped) for sample["auxiliary_info"] — vbench structured-prompt metrics read this. extras : Catch-all per-sample attachments. Use for metric-specific keys the dedicated kwargs don't cover (scenario, view, actions, calibration, reference_take2, ...). Pass a single dict to broadcast or a list-of-dicts to zip; the keys merge into each sample dict. extract_audio : If truthy, auto-extract audio from each video / reference source into .wav files via PyAV and attach the paths under sample["audio"] / sample["reference_audio"]. Pass a path for a persistent cache, True for a tempdir. Skipped silently for videos with no audio stream; ignored wherever audio / reference_audio is already explicit. extract_workers : Parallel workers for extract_audio.

Returns¶

list[dict] Canonical samples shape — hand directly to :meth:Evaluator.evaluate.

Cardinality and shape¶

Let N = len(generated inputs) (the agreed length of whichever of video / audio you passed). References are attached 1:1 onto the first N samples; any extras (when |ref| > N) become standalone role-tagged samples at the end of the list, so set metrics like FVD see the full reference corpus while per-sample paired metrics like LPIPS only run on the first N pairs.

Notes¶

"Missing" keys are simply absent from the sample dict. Metrics handle them per their own contract (sample.get(...) for optional, sample[...] raises for required, or :meth:BaseMetric._skip for opt-in skip behavior). One fat samples list with many keys can serve many metrics — each reads its subset.

Source code in fastvideo/eval/io/inputs.py

def samples_from(
    *,
    # Modality inputs — pass whichever you have, in any combination.
    video: PathSpec | None = None,
    reference: PathSpec | None = None,
    audio: PathSpec | None = None,
    reference_audio: PathSpec | None = None,
    # Per-sample attachments.  Scalars broadcast to every sample; lists
    # / jsonl paths zip per-sample.
    text_prompt: str | None = None,
    text_prompts: str | Path | list[str] | None = None,
    fps: float | None = None,
    auxiliary_info: dict | list[dict] | None = None,
    # Catch-all for exotic metric inputs (physics_iq scenarios,
    # synthetic_optical_flow actions, ...).  Single dict broadcasts;
    # list of dicts zips.  Keys merge into each sample dict.
    extras: dict | list[dict] | None = None,
    # Sugar: pull audio off the video sources via PyAV.  Pass a path
    # to use a persistent on-disk cache, ``True`` for a system tempdir.
    extract_audio: bool | str | Path = False,
    extract_workers: int = 4,
) -> list[dict]:
    """Build a samples list from path-style inputs.

    Parameters
    ----------
    video, reference, audio, reference_audio :
        File path, directory of files (sorted by name), or any iterable
        of paths.  Pass whichever modalities apply to the metrics you
        plan to run — they attach to ``sample["video"]`` /
        ``sample["reference"]`` / ``sample["audio"]`` /
        ``sample["reference_audio"]`` respectively.  Video paths are
        wrapped in :class:`Video` so :class:`VideoPool` decodes them
        lazily in parallel; audio paths stay as strings (audio metrics
        each load with their own resample / preprocess).
    text_prompt :
        A single prompt string broadcast onto every sample.
    text_prompts :
        A list of strings (one per sample), or a path to a ``.jsonl`` /
        ``.json`` file containing per-sample prompts.
    fps :
        Scalar fps broadcast onto every sample.
    auxiliary_info :
        Single dict (broadcast) or list of dicts (zipped) for
        ``sample["auxiliary_info"]`` — vbench structured-prompt
        metrics read this.
    extras :
        Catch-all per-sample attachments.  Use for metric-specific keys
        the dedicated kwargs don't cover (``scenario``, ``view``,
        ``actions``, ``calibration``, ``reference_take2``, ...).  Pass
        a single dict to broadcast or a list-of-dicts to zip; the keys
        merge into each sample dict.
    extract_audio :
        If truthy, auto-extract audio from each ``video`` /
        ``reference`` source into ``.wav`` files via PyAV and attach
        the paths under ``sample["audio"]`` / ``sample["reference_audio"]``.
        Pass a path for a persistent cache, ``True`` for a tempdir.
        Skipped silently for videos with no audio stream; ignored
        wherever ``audio`` / ``reference_audio`` is already explicit.
    extract_workers :
        Parallel workers for ``extract_audio``.

    Returns
    -------
    list[dict]
        Canonical samples shape — hand directly to
        :meth:`Evaluator.evaluate`.

    Cardinality and shape
    ---------------------
    Let ``N = len(generated inputs)`` (the agreed length of whichever
    of ``video`` / ``audio`` you passed).  References are attached
    1:1 onto the first N samples; any extras (when ``|ref| > N``)
    become standalone role-tagged samples at the end of the list, so
    set metrics like FVD see the full reference corpus while per-sample
    paired metrics like LPIPS only run on the first N pairs.

    Notes
    -----
    "Missing" keys are simply absent from the sample dict.  Metrics
    handle them per their own contract (``sample.get(...)`` for
    optional, ``sample[...]`` raises for required, or
    :meth:`BaseMetric._skip` for opt-in skip behavior).  One fat samples
    list with many keys can serve many metrics — each reads its subset.
    """
    gen_video = _expand(video, _VIDEO_EXTS) if video is not None else None
    gen_audio = _expand(audio, _AUDIO_EXTS) if audio is not None else None
    ref_video = _expand(reference, _VIDEO_EXTS) if reference is not None else None
    ref_audio = _expand(reference_audio, _AUDIO_EXTS) if reference_audio is not None else None

    if gen_video is None and gen_audio is None:
        raise ValueError("samples_from: pass at least one of video= or audio=")

    # All "generated" inputs must agree on length — they describe the
    # same N samples in different modalities.
    n_candidates = [len(x) for x in (gen_video, gen_audio) if x is not None]
    if len({*n_candidates}) != 1:
        raise ValueError(f"samples_from: generated inputs have inconsistent lengths {n_candidates}")
    n = n_candidates[0]

    prompts = _broadcast(text_prompts if text_prompts is not None else text_prompt, n, loader=_load_prompts)
    aux = _broadcast(auxiliary_info, n)
    extras_per_sample = _broadcast(extras, n)
    if (text_prompts is not None) and (text_prompt is not None):
        raise ValueError("Pass either text_prompt (broadcast) or text_prompts (per-sample), not both.")

    samples: list[dict[str, Any]] = []
    for i in range(n):
        s: dict[str, Any] = {}
        if gen_video is not None:
            s["video"] = as_video(gen_video[i])
        if gen_audio is not None:
            s["audio"] = str(gen_audio[i])
        if ref_video is not None and i < len(ref_video):
            s["reference"] = as_video(ref_video[i])
        if ref_audio is not None and i < len(ref_audio):
            s["reference_audio"] = str(ref_audio[i])
        if prompts is not None:
            s["text_prompt"] = prompts[i]
        if fps is not None:
            s["fps"] = fps
        if aux is not None:
            s["auxiliary_info"] = aux[i]
        if extras_per_sample is not None:
            s.update(extras_per_sample[i])
        samples.append(s)

    # Unmatched references → role-tagged set samples.  Per-sample paired
    # metrics skip these via the worker's role-skip rule; set metrics
    # (FVD, FAD) accumulate the features.
    if ref_video is not None and len(ref_video) > n:
        for v in ref_video[n:]:
            samples.append({"video": as_video(v), "role": "reference"})
    if ref_audio is not None and len(ref_audio) > n:
        for a in ref_audio[n:]:
            samples.append({"audio": str(a), "role": "reference"})

    if extract_audio:
        _attach_extracted_audio(samples, extract_audio, extract_workers)
    return samples