inputs

Input-shape helpers: paths → samples list.

The :class:Evaluator and every metric consume the same internal representation — a list[dict] with one dict per sample. Building that list by hand is the largest source of ceremony in user scripts. :func:samples_from is a pure function that turns path-style inputs (generated videos / audio, optional paired references, optional per-sample prompts / fps / metadata) into the canonical samples list, ready to hand to :meth:Evaluator.evaluate.

Design rules:

  • No Evaluator state, no metric introspection, no I/O beyond reading the prompts file when given. Just dict assembly.
  • "Extra" keys on a sample are free — metrics each read what they need and ignore the rest. So one fat samples list naturally serves many metrics in one Evaluator.
  • No "primary modality" axis. Each modality is a named kwarg (video=, audio=); pass whichever you have. Both is fine.
  • No mode axis. The shape of the output is determined by cardinality: when |gen| == |ref| the samples list is pair-zipped; when |gen| < |ref| the unmatched references become role-tagged set samples for corpus-shaped metrics (FVD / FAD) without disturbing per-sample paired metrics (LPIPS / PSNR / SSIM / gt_optical_flow).
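The "extra keys are free" rule can be illustrated with a small self-contained sketch. The two functions below are hypothetical stand-ins for metrics, not part of the library; they only show how one sample dict can serve a metric that treats a key as optional and another that treats a key as required.

```python
from __future__ import annotations

from typing import Any

# One sample with more keys than any single metric needs, and without
# "reference" or "text_prompt" (missing keys are simply absent).
sample: dict[str, Any] = {"video": "gen_0.mp4", "fps": 24.0, "scenario": "ball_drop"}


def describe_prompt(sample: dict[str, Any]) -> str:
    """Hypothetical metric input step: the prompt is optional."""
    prompt = sample.get("text_prompt")  # absent key -> None, no error
    return "unconditional" if prompt is None else f"conditioned on {prompt!r}"


def paired_inputs(sample: dict[str, Any]) -> tuple[Any, Any]:
    """Hypothetical metric input step: the reference is required."""
    return sample["video"], sample["reference"]  # absent key -> KeyError


print(describe_prompt(sample))
try:
    paired_inputs(sample)
except KeyError as missing:
    print(f"paired metric cannot run, missing key: {missing}")
```

Neither function looks at `fps` or `scenario`, so adding those keys costs nothing. The opt-in skip path via :meth:BaseMetric._skip is not modeled here because its signature is not shown on this page.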

Functions

fastvideo.eval.io.inputs.as_video

as_video(x: str | Path | Tensor | Video) -> Video

Coerce path/tensor/Video → :class:Video for the pool to decode.

Path strings and :class:pathlib.Path become Video(source=str(x)); the pool then calls :func:load_video on first use. Tensors become Video(source=None, frames=x) — the pool sees .frames already populated and forwards untouched. :class:Video instances pass through.

Source code in fastvideo/eval/io/inputs.py
def as_video(x: str | Path | torch.Tensor | Video) -> Video:
    """Coerce path/tensor/Video → :class:`Video` for the pool to decode.

    Path strings and :class:`pathlib.Path` become ``Video(source=str(x))``;
    the pool then calls :func:`load_video` on first use.  Tensors become
    ``Video(source=None, frames=x)`` — the pool sees ``.frames`` already
    populated and forwards untouched.  :class:`Video` instances pass through.
    """
    if isinstance(x, Video):
        return x
    if isinstance(x, str | Path):
        return Video(source=str(x))
    if isinstance(x, torch.Tensor):
        return Video(source=None, frames=x)
    raise TypeError(f"Cannot coerce {type(x).__name__} to Video")

fastvideo.eval.io.inputs.samples_from

samples_from(*, video: PathSpec | None = None, reference: PathSpec | None = None, audio: PathSpec | None = None, reference_audio: PathSpec | None = None, text_prompt: str | None = None, text_prompts: str | Path | list[str] | None = None, fps: float | None = None, auxiliary_info: dict | list[dict] | None = None, extras: dict | list[dict] | None = None, extract_audio: bool | str | Path = False, extract_workers: int = 4) -> list[dict]

Build a samples list from path-style inputs.

Parameters

video, reference, audio, reference_audio : File path, directory of files (sorted by name), or any iterable of paths. Pass whichever modalities apply to the metrics you plan to run; they attach to sample["video"] / sample["reference"] / sample["audio"] / sample["reference_audio"] respectively. Video paths are wrapped in :class:Video so :class:VideoPool decodes them lazily in parallel; audio paths stay as strings (audio metrics each load with their own resample / preprocess).

text_prompt : A single prompt string broadcast onto every sample.

text_prompts : A list of strings (one per sample), or a path to a .jsonl / .json file containing per-sample prompts.

fps : Scalar fps broadcast onto every sample.

auxiliary_info : Single dict (broadcast) or list of dicts (zipped) for sample["auxiliary_info"]; vbench structured-prompt metrics read this.

extras : Catch-all per-sample attachments. Use for metric-specific keys the dedicated kwargs don't cover (scenario, view, actions, calibration, reference_take2, ...). Pass a single dict to broadcast or a list-of-dicts to zip; the keys merge into each sample dict.

extract_audio : If truthy, auto-extract audio from each video / reference source into .wav files via PyAV and attach the paths under sample["audio"] / sample["reference_audio"]. Pass a path for a persistent cache, True for a tempdir. Skipped silently for videos with no audio stream; ignored wherever audio / reference_audio is already explicit.

extract_workers : Parallel workers for extract_audio.
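The broadcast-versus-zip convention for per-sample attachments can be sketched as follows. `broadcast_or_zip` is a hypothetical stand-in for the private `_broadcast` helper, whose implementation is not shown on this page; in particular, the length check on lists is an assumption, and plain strings replace :class:Video objects.

```python
from __future__ import annotations

from typing import Any


def broadcast_or_zip(value: Any, n: int) -> list[Any] | None:
    """Hypothetical stand-in for ``_broadcast`` (assumed behavior)."""
    if value is None:
        return None  # the key stays absent from every sample
    if isinstance(value, list):
        # Assumption: a per-sample list must match the sample count.
        if len(value) != n:
            raise ValueError(f"expected {n} items, got {len(value)}")
        return value  # zipped: item i goes to sample i
    return [value] * n  # broadcast: the same value on every sample


n = 3
prompts = broadcast_or_zip("a cat surfing", n)  # scalar -> broadcast
extras = broadcast_or_zip(
    [{"view": "front"}, {"view": "side"}, {"view": "top"}], n
)  # list of dicts -> zipped

samples: list[dict[str, Any]] = []
for i in range(n):
    s: dict[str, Any] = {"video": f"gen_{i}.mp4"}
    if prompts is not None:
        s["text_prompt"] = prompts[i]
    if extras is not None:
        s.update(extras[i])  # extras keys merge into the sample dict
    samples.append(s)

for s in samples:
    print(s)
```

Every sample ends up with the same `text_prompt` but its own `view`, which mirrors how a scalar and a list-of-dicts are described above.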

Returns

list[dict] : Canonical samples shape; hand directly to :meth:Evaluator.evaluate.

Cardinality and shape

Let N = len(generated inputs) (the agreed length of whichever of video / audio you passed). References are attached 1:1 onto the first N samples. Surplus references (when |ref| > N) become standalone role-tagged samples at the end of the list, so set metrics like FVD see the full reference corpus while per-sample paired metrics like LPIPS only run on the first N pairs. When |ref| < N, samples past the last reference simply carry no reference key.
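As a rough illustration of the cardinality rule, the sketch below mirrors the video branch of the source with plain strings instead of :class:Video objects. `zip_with_surplus` is a hypothetical name used only for this example.

```python
from __future__ import annotations

from typing import Any


def zip_with_surplus(gen: list[str], ref: list[str]) -> list[dict[str, Any]]:
    """Simplified stand-in for the video branch of samples_from."""
    n = len(gen)
    samples: list[dict[str, Any]] = []
    for i in range(n):
        s: dict[str, Any] = {"video": gen[i]}
        if i < len(ref):  # references attach 1:1 while they last
            s["reference"] = ref[i]
        samples.append(s)
    # Surplus references become standalone role-tagged samples at the end.
    for path in ref[n:]:
        samples.append({"video": path, "role": "reference"})
    return samples


samples = zip_with_surplus(
    ["g0.mp4", "g1.mp4"],
    ["r0.mp4", "r1.mp4", "r2.mp4", "r3.mp4"],
)
paired = [s for s in samples if "reference" in s]
surplus = [s for s in samples if s.get("role") == "reference"]
print(len(samples), len(paired), len(surplus))  # 4 2 2
```

With two generated videos and four references, the first two samples are pairs and the last two are role-tagged reference-only samples. With fewer references than generated videos, the tail of the list would hold samples that have no `reference` key at all.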

Notes

"Missing" keys are simply absent from the sample dict. Metrics handle them per their own contract (sample.get(...) for optional, sample[...] raises for required, or :meth:BaseMetric._skip for opt-in skip behavior). One fat samples list with many keys can serve many metrics — each reads its subset.

Source code in fastvideo/eval/io/inputs.py
def samples_from(
    *,
    # Modality inputs — pass whichever you have, in any combination.
    video: PathSpec | None = None,
    reference: PathSpec | None = None,
    audio: PathSpec | None = None,
    reference_audio: PathSpec | None = None,
    # Per-sample attachments.  Scalars broadcast to every sample; lists
    # / jsonl paths zip per-sample.
    text_prompt: str | None = None,
    text_prompts: str | Path | list[str] | None = None,
    fps: float | None = None,
    auxiliary_info: dict | list[dict] | None = None,
    # Catch-all for exotic metric inputs (physics_iq scenarios,
    # synthetic_optical_flow actions, ...).  Single dict broadcasts;
    # list of dicts zips.  Keys merge into each sample dict.
    extras: dict | list[dict] | None = None,
    # Sugar: pull audio off the video sources via PyAV.  Pass a path
    # to use a persistent on-disk cache, ``True`` for a system tempdir.
    extract_audio: bool | str | Path = False,
    extract_workers: int = 4,
) -> list[dict]:
    """Build a samples list from path-style inputs.

    Parameters
    ----------
    video, reference, audio, reference_audio :
        File path, directory of files (sorted by name), or any iterable
        of paths.  Pass whichever modalities apply to the metrics you
        plan to run — they attach to ``sample["video"]`` /
        ``sample["reference"]`` / ``sample["audio"]`` /
        ``sample["reference_audio"]`` respectively.  Video paths are
        wrapped in :class:`Video` so :class:`VideoPool` decodes them
        lazily in parallel; audio paths stay as strings (audio metrics
        each load with their own resample / preprocess).
    text_prompt :
        A single prompt string broadcast onto every sample.
    text_prompts :
        A list of strings (one per sample), or a path to a ``.jsonl`` /
        ``.json`` file containing per-sample prompts.
    fps :
        Scalar fps broadcast onto every sample.
    auxiliary_info :
        Single dict (broadcast) or list of dicts (zipped) for
        ``sample["auxiliary_info"]`` — vbench structured-prompt
        metrics read this.
    extras :
        Catch-all per-sample attachments.  Use for metric-specific keys
        the dedicated kwargs don't cover (``scenario``, ``view``,
        ``actions``, ``calibration``, ``reference_take2``, ...).  Pass
        a single dict to broadcast or a list-of-dicts to zip; the keys
        merge into each sample dict.
    extract_audio :
        If truthy, auto-extract audio from each ``video`` /
        ``reference`` source into ``.wav`` files via PyAV and attach
        the paths under ``sample["audio"]`` / ``sample["reference_audio"]``.
        Pass a path for a persistent cache, ``True`` for a tempdir.
        Skipped silently for videos with no audio stream; ignored
        wherever ``audio`` / ``reference_audio`` is already explicit.
    extract_workers :
        Parallel workers for ``extract_audio``.

    Returns
    -------
    list[dict]
        Canonical samples shape — hand directly to
        :meth:`Evaluator.evaluate`.

    Cardinality and shape
    ---------------------
    Let ``N = len(generated inputs)`` (the agreed length of whichever
    of ``video`` / ``audio`` you passed).  References are attached
    1:1 onto the first N samples.  Surplus references (when
    ``|ref| > N``) become standalone role-tagged samples at the end of
    the list, so set metrics like FVD see the full reference corpus
    while per-sample paired metrics like LPIPS only run on the first N
    pairs.  When ``|ref| < N``, samples past the last reference simply
    carry no ``reference`` key.

    Notes
    -----
    "Missing" keys are simply absent from the sample dict.  Metrics
    handle them per their own contract (``sample.get(...)`` for
    optional, ``sample[...]`` raises for required, or
    :meth:`BaseMetric._skip` for opt-in skip behavior).  One fat samples
    list with many keys can serve many metrics — each reads its subset.
    """
    gen_video = _expand(video, _VIDEO_EXTS) if video is not None else None
    gen_audio = _expand(audio, _AUDIO_EXTS) if audio is not None else None
    ref_video = _expand(reference, _VIDEO_EXTS) if reference is not None else None
    ref_audio = _expand(reference_audio, _AUDIO_EXTS) if reference_audio is not None else None

    if gen_video is None and gen_audio is None:
        raise ValueError("samples_from: pass at least one of video= or audio=")

    # All "generated" inputs must agree on length — they describe the
    # same N samples in different modalities.
    n_candidates = [len(x) for x in (gen_video, gen_audio) if x is not None]
    if len(set(n_candidates)) != 1:
        raise ValueError(f"samples_from: generated inputs have inconsistent lengths {n_candidates}")
    n = n_candidates[0]

    # Reject the ambiguous combination up front, before any prompts file
    # is read or any broadcasting happens.
    if (text_prompts is not None) and (text_prompt is not None):
        raise ValueError("Pass either text_prompt (broadcast) or text_prompts (per-sample), not both.")

    prompts = _broadcast(text_prompts if text_prompts is not None else text_prompt, n, loader=_load_prompts)
    aux = _broadcast(auxiliary_info, n)
    extras_per_sample = _broadcast(extras, n)

    samples: list[dict[str, Any]] = []
    for i in range(n):
        s: dict[str, Any] = {}
        if gen_video is not None:
            s["video"] = as_video(gen_video[i])
        if gen_audio is not None:
            s["audio"] = str(gen_audio[i])
        if ref_video is not None and i < len(ref_video):
            s["reference"] = as_video(ref_video[i])
        if ref_audio is not None and i < len(ref_audio):
            s["reference_audio"] = str(ref_audio[i])
        if prompts is not None:
            s["text_prompt"] = prompts[i]
        if fps is not None:
            s["fps"] = fps
        if aux is not None:
            s["auxiliary_info"] = aux[i]
        if extras_per_sample is not None:
            s.update(extras_per_sample[i])
        samples.append(s)

    # Unmatched references → role-tagged set samples.  Per-sample paired
    # metrics skip these via the worker's role-skip rule; set metrics
    # (FVD, FAD) accumulate the features.
    if ref_video is not None and len(ref_video) > n:
        for v in ref_video[n:]:
            samples.append({"video": as_video(v), "role": "reference"})
    if ref_audio is not None and len(ref_audio) > n:
        for a in ref_audio[n:]:
            samples.append({"audio": str(a), "role": "reference"})

    if extract_audio:
        _attach_extracted_audio(samples, extract_audio, extract_workers)
    return samples