
vbench

VBench metrics. Bootstraps the upstream submodule onto sys.path and installs runtime compatibility shims for modern torch, transformers, numpy, and timm.

The upstream vbench source lives as a git submodule at fastvideo/third_party/eval/vbench (pinned to a specific Vchitect/VBench SHA). We do not pip-install it — we only need its Python modules importable. Its runtime deps (clip, transformers, etc.) are already in FastVideo's main env.

Compat with modern dependency versions is achieved at import time, in this file, instead of via on-disk patches to upstream files. Each shim below corresponds to a specific drift between vbench's pinned-2023 deps and FastVideo's current pins. Adding a new shim is preferable to editing the submodule.
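The two mechanisms described above (making a source tree importable without installing it, and patching a dependency at import time rather than on disk) can be sketched as follows. This is a minimal illustration, not the package's actual code: `add_to_sys_path`, `install_alias`, and `fake_dep` are hypothetical names, and the real shims target specific torch/transformers/numpy/timm changes that are not listed here.

```python
import sys
import types
from pathlib import Path


def add_to_sys_path(root: Path) -> bool:
    """Prepend root to sys.path if it is not already present."""
    entry = str(root)
    if entry in sys.path:
        return False
    sys.path.insert(0, entry)
    return True


def install_alias(module, name: str, replacement) -> bool:
    """Restore a name that a newer dependency no longer provides.

    Leaves the module untouched when the name already exists, so the
    shim is a no-op on dependency versions that still ship it.
    """
    if hasattr(module, name):
        return False
    setattr(module, name, replacement)
    return True


# Hypothetical stand-in for a dependency that renamed a function.
fake_dep = types.SimpleNamespace(new_name=lambda x: x * 2)
install_alias(fake_dep, "old_name", fake_dep.new_name)
```

Because both helpers are idempotent, running them again on a repeated import changes nothing.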

Modules

fastvideo.eval.metrics.vbench.aesthetic_quality

Modules

fastvideo.eval.metrics.vbench.aesthetic_quality.metric

VBench Aesthetic Quality — CLIP ViT-L/14 + LAION aesthetic predictor.

Encodes frames through CLIP, passes the L2-normalized features through a linear aesthetic head (768 → 1), and divides the mean per-frame score by 10.
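The scoring arithmetic can be sketched in plain Python. This assumes the CLIP features are already computed and uses toy 2-dimensional vectors in place of the 768-dimensional ones; `aesthetic_score`, `weights`, and `bias` are illustrative names, not the metric's API.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def aesthetic_score(frame_features, weights, bias):
    """Apply a linear head to each normalized feature, then mean / 10."""
    per_frame = []
    for feat in frame_features:
        unit = l2_normalize(feat)
        per_frame.append(sum(w * x for w, x in zip(weights, unit)) + bias)
    return (sum(per_frame) / len(per_frame)) / 10.0


score = aesthetic_score([[3.0, 4.0], [0.0, 2.0]], weights=[1.0, 1.0], bias=5.0)
```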

Classes
Functions

fastvideo.eval.metrics.vbench.appearance_style

Modules

fastvideo.eval.metrics.vbench.appearance_style.metric

VBench Appearance Style — CLIP ViT-B/32 text-image alignment.

Per-frame cosine similarity between CLIP image features and a text prompt describing the expected style. Requires sample["text_prompt"].
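A minimal sketch of the per-frame comparison, assuming the image and text features have already been extracted. Averaging the per-frame similarities into one score is an assumption here, and `appearance_style_score` is a hypothetical helper name.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def appearance_style_score(frame_features, text_feature):
    """Mean cosine similarity between each frame and the style prompt."""
    sims = [cosine(f, text_feature) for f in frame_features]
    return sum(sims) / len(sims)


score = appearance_style_score([[1.0, 0.0], [0.0, 1.0]], [1.0, 0.0])
```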

Classes
Functions

fastvideo.eval.metrics.vbench.background_consistency

Modules

fastvideo.eval.metrics.vbench.background_consistency.metric

VBench Background Consistency — CLIP ViT-B/32 temporal feature similarity.

Measures background stability via cosine similarity of CLIP features: each frame is compared with both the preceding frame and the first frame.
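One plausible reading of this temporal comparison is sketched below. It assumes that, for every frame after the first, the similarity to the previous frame and the similarity to the first frame are averaged with equal weight, and that the final score is the mean over frames. The weighting and the function names are assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def temporal_consistency(features):
    """Average of (sim-to-previous, sim-to-first) over frames 1..N-1."""
    if len(features) < 2:
        raise ValueError("need at least two frames")
    first = features[0]
    sims = []
    for prev, cur in zip(features, features[1:]):
        sims.append((cosine(prev, cur) + cosine(first, cur)) / 2)
    return sum(sims) / len(sims)


score = temporal_consistency([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
```

Anchoring to the first frame penalizes slow drift that frame-to-frame similarity alone would miss.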

Classes
Functions

fastvideo.eval.metrics.vbench.color

Modules

fastvideo.eval.metrics.vbench.color.metric

VBench Color — GRiT dense captioning for color accuracy.

Detects the target object via GRiT and checks if the expected color keyword appears in the object's caption. Score = frames_with_correct_color / frames_with_object_detected.
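The ratio can be illustrated with captions that are already available. In this sketch each frame is a list of captions for detections of the target object (an empty list means the object was not detected), and the color keyword is matched against whole words. Both the data layout and the word-level matching are assumptions.

```python
def color_score(frame_captions, color):
    """frames_with_correct_color / frames_with_object_detected."""
    detected = [caps for caps in frame_captions if caps]
    if not detected:
        return 0.0  # assumption: no detections at all scores zero
    correct = sum(
        1 for caps in detected
        if any(color in caption.lower().split() for caption in caps)
    )
    return correct / len(detected)


frames = [["a red car"], [], ["a blue car"], ["a dark red car", "a tree"]]
score = color_score(frames, "red")
```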

Classes
Functions

fastvideo.eval.metrics.vbench.dynamic_degree

Modules

fastvideo.eval.metrics.vbench.dynamic_degree.metric

VBench Dynamic Degree — RAFT optical flow motion detection.

For each consecutive frame pair, computes optical flow via RAFT and takes the mean of the top 5% flow magnitudes. If enough pairs exceed an adaptive threshold, the video is classified as dynamic (1.0) vs static (0.0).
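The classification rule can be sketched without RAFT by treating each flow field as a list of (u, v) vectors. The threshold values are passed in explicitly because the source does not state how the adaptive threshold is derived; `magnitude_threshold` and `min_moving_pairs` are hypothetical parameter names.

```python
import math


def top_fraction_mean(flow, fraction=0.05):
    """Mean of the largest `fraction` of flow magnitudes (at least one)."""
    mags = sorted((math.hypot(u, v) for u, v in flow), reverse=True)
    k = max(1, int(len(mags) * fraction))
    return sum(mags[:k]) / k


def dynamic_degree(pair_flows, magnitude_threshold, min_moving_pairs):
    """1.0 if enough frame pairs show strong motion, else 0.0."""
    moving = sum(
        1 for flow in pair_flows
        if top_fraction_mean(flow) > magnitude_threshold
    )
    return 1.0 if moving >= min_moving_pairs else 0.0


moving_pair = [(0.0, 0.0)] * 19 + [(3.0, 4.0)]
static_pair = [(0.0, 0.0)] * 20
result = dynamic_degree([moving_pair, static_pair, moving_pair], 1.0, 2)
```

Using only the top 5% of magnitudes lets a small moving object register as motion even when most of the frame is still.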

Classes
fastvideo.eval.metrics.vbench.dynamic_degree.metric.DynamicDegreeMetric
DynamicDegreeMetric()

Bases: BaseMetric

Source code in fastvideo/eval/metrics/vbench/dynamic_degree/metric.py
def __init__(self) -> None:
    super().__init__()
    self._model: Any = None
    self._chunk_size = 16
Functions

fastvideo.eval.metrics.vbench.human_action

Modules

fastvideo.eval.metrics.vbench.human_action.metric

VBench Human Action — UMT ViT-L/16 action classification (Kinetics-400).

Classifies human actions in 16-frame clips. Top-5 predictions with confidence >= 0.85 are compared against the ground-truth action label. Score = 1.0 if match found, 0.0 otherwise.
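The matching rule can be sketched over a dictionary of class confidences. The classifier itself is omitted, and `human_action_score` with its parameters is an illustrative name rather than the metric's interface.

```python
def human_action_score(confidences, target, top_k=5, min_conf=0.85):
    """1.0 if target is among the top-k predictions with enough confidence."""
    ranked = sorted(confidences.items(), key=lambda kv: kv[1], reverse=True)
    top = ranked[:top_k]
    hit = any(label == target and conf >= min_conf for label, conf in top)
    return 1.0 if hit else 0.0


confident = {"running": 0.90, "walking": 0.06, "jumping": 0.04}
uncertain = {"running": 0.60, "walking": 0.30, "jumping": 0.10}
```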

Classes
Functions

fastvideo.eval.metrics.vbench.imaging_quality

Modules

fastvideo.eval.metrics.vbench.imaging_quality.metric

VBench Imaging Quality — MUSIQ-based per-frame technical quality.

Uses MUSIQ (Multi-Scale Image Quality) from pyiqa. Frames are resized so the longer side is at most 512px. Score = mean(MUSIQ_scores) / 100.

Classes
Functions

fastvideo.eval.metrics.vbench.motion_smoothness

Modules

fastvideo.eval.metrics.vbench.motion_smoothness.metric

VBench Motion Smoothness — AMT-S frame interpolation quality.

Takes every-other frame, uses AMT-S to interpolate the missing middle frames, then compares interpolated vs actual frames. Score = (255 - mean_pixel_diff) / 255. Higher = smoother motion.
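The scoring loop can be illustrated with a stand-in interpolator. Here `midpoint` is a plain pixel average used in place of AMT-S purely so the example runs without model weights, and frames are flat lists of pixel values in [0, 255]. All names are hypothetical.

```python
def midpoint(frame_a, frame_b):
    """Stand-in interpolator: per-pixel average (NOT the AMT-S model)."""
    return [(a + b) / 2 for a, b in zip(frame_a, frame_b)]


def motion_smoothness(frames, interpolate=midpoint):
    """(255 - mean_pixel_diff) / 255 between interpolated and real frames."""
    kept = frames[::2]  # every other frame
    if len(kept) < 2:
        raise ValueError("need at least three frames")
    diffs = []
    for i in range(len(kept) - 1):
        predicted = interpolate(kept[i], kept[i + 1])
        actual = frames[2 * i + 1]  # the frame that was dropped
        diffs.append(sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual))
    return (255 - sum(diffs) / len(diffs)) / 255


smooth = motion_smoothness([[0, 0], [10, 10], [20, 20]])
jerky = motion_smoothness([[0, 0], [255, 255], [0, 0]])
```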

Classes
fastvideo.eval.metrics.vbench.motion_smoothness.metric.MotionSmoothnessMetric
MotionSmoothnessMetric()

Bases: BaseMetric

Source code in fastvideo/eval/metrics/vbench/motion_smoothness/metric.py
def __init__(self) -> None:
    super().__init__()
    self._model: Any = None
    self._embt: Any = None
    self._chunk_size = 8
Functions

fastvideo.eval.metrics.vbench.multiple_objects

Modules

fastvideo.eval.metrics.vbench.multiple_objects.metric

VBench Multiple Objects — GRiT detection for dual-object presence.

Checks if BOTH target objects are detected in each of 16 sampled frames. Score = matching_frames / total_frames.

Classes
Functions

fastvideo.eval.metrics.vbench.object_class

Modules

fastvideo.eval.metrics.vbench.object_class.metric

VBench Object Class — GRiT object detection for class matching.

Checks if a target object class is detected in each of 16 sampled frames. Score = matching_frames / total_frames.
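This metric and Multiple Objects reduce to the same per-frame check: a frame matches when every target label is among its detections. A minimal sketch, assuming the detector output has already been reduced to one set of labels per sampled frame (`detection_score` is a hypothetical helper):

```python
def detection_score(frame_labels, targets):
    """matching_frames / total_frames, where a match needs all targets."""
    wanted = set(targets)
    matching = sum(1 for labels in frame_labels if wanted <= set(labels))
    return matching / len(frame_labels)


frames = [{"dog", "cat"}, {"dog"}, {"cat", "dog", "car"}, set()]
single = detection_score(frames, ["dog"])       # Object Class style
dual = detection_score(frames, ["dog", "cat"])  # Multiple Objects style
```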

Classes
Functions

fastvideo.eval.metrics.vbench.overall_consistency

Modules

fastvideo.eval.metrics.vbench.overall_consistency.metric

VBench Overall Consistency — ViCLIP text-video alignment.

Encodes 8 sampled video frames via ViCLIP vision encoder and a text prompt via ViCLIP text encoder, then computes cosine similarity.
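The source does not say how the 8 frames are chosen. As one possible scheme, the sketch below picks evenly spaced indices that include the first and last frame; this is an assumption for illustration, and the actual sampling strategy may differ.

```python
def sample_frame_indices(num_frames, num_samples=8):
    """Evenly spaced frame indices (assumed scheme, endpoints included)."""
    if num_frames <= num_samples:
        return list(range(num_frames))
    step = (num_frames - 1) / (num_samples - 1)
    return [round(i * step) for i in range(num_samples)]


indices = sample_frame_indices(15)
```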

Classes
Functions

fastvideo.eval.metrics.vbench.scene

Modules

fastvideo.eval.metrics.vbench.scene.metric

VLM-based scene matching using AVoCaDO (Qwen2.5-Omni).

Replaces VBench's Tag2Text-based scene metric with a modern VLM caption. The algorithm follows VBench:

1. Caption the video.
2. Check if all scene keywords appear in the caption.
3. Score = 1.0 if all match, 0.0 otherwise.

Unlike VBench (which captions each frame separately with Tag2Text), AVoCaDO captions the entire video in one pass with rich natural language, making the keyword check more robust.
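The keyword check in steps 2 and 3 can be sketched as follows, with the captioning step replaced by a fixed string. Matching on lowercase whole words is an assumption; the real metric may match substrings or normalize differently.

```python
import re


def scene_score(caption, keywords):
    """1.0 if every scene keyword occurs as a word in the caption."""
    words = set(re.findall(r"[a-z0-9]+", caption.lower()))
    return 1.0 if all(k.lower() in words for k in keywords) else 0.0


caption = "A quiet beach at sunset, with waves rolling in."
```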

Classes
fastvideo.eval.metrics.vbench.scene.metric.SceneMetric
SceneMetric(model_path: str = 'AVoCaDO-Captioner/AVoCaDO')

Bases: BaseMetric

Source code in fastvideo/eval/metrics/vbench/scene/metric.py
def __init__(self, model_path: str = "AVoCaDO-Captioner/AVoCaDO") -> None:
    super().__init__()
    self._model: Any = None
    self._processor: Any = None
    self._model_path = model_path
Functions

fastvideo.eval.metrics.vbench.spatial_relationship

Modules

fastvideo.eval.metrics.vbench.spatial_relationship.metric

VBench Spatial Relationship — GRiT detection + bbox position scoring.

Detects two target objects via GRiT and checks if their bounding boxes satisfy the expected spatial relationship (left/right/above/below).
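A simplified version of the position check compares bounding-box centers. This sketch returns a plain boolean and assumes `(x1, y1, x2, y2)` boxes in image coordinates, where y grows downward; the actual scoring may be graded rather than binary.

```python
def center(box):
    """Center point of an (x1, y1, x2, y2) box."""
    x1, y1, x2, y2 = box
    return (x1 + x2) / 2, (y1 + y2) / 2


def satisfies(box_a, box_b, relation):
    """True if object A is <relation> object B, judged by box centers."""
    ax, ay = center(box_a)
    bx, by = center(box_b)
    if relation == "left":
        return ax < bx
    if relation == "right":
        return ax > bx
    if relation == "above":
        return ay < by  # smaller y is higher in the image
    if relation == "below":
        return ay > by
    raise ValueError(f"unknown relation: {relation}")
```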

Classes
Functions

fastvideo.eval.metrics.vbench.subject_consistency

Modules

fastvideo.eval.metrics.vbench.subject_consistency.metric

VBench Subject Consistency — DINO ViT-B/16 temporal feature similarity.

Measures how well the main subject maintains its appearance throughout the video via cosine similarity of DINO features: each frame is compared with both the preceding frame and the first frame.

Classes
Functions

fastvideo.eval.metrics.vbench.temporal_flickering

Modules

fastvideo.eval.metrics.vbench.temporal_flickering.metric

VBench Temporal Flickering — measures frame-to-frame stability.

Score = (255 - mean_MAE) / 255, where MAE is computed between consecutive frames in uint8 [0, 255] space. Higher = less flickering.
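The formula can be worked through on toy data. Frames here are flat lists of uint8-range pixel values, and `temporal_flickering` is an illustrative name.

```python
def temporal_flickering(frames):
    """(255 - mean_MAE) / 255 over consecutive frame pairs."""
    if len(frames) < 2:
        raise ValueError("need at least two frames")
    maes = []
    for prev, cur in zip(frames, frames[1:]):
        maes.append(sum(abs(a - b) for a, b in zip(prev, cur)) / len(cur))
    return (255 - sum(maes) / len(maes)) / 255


# Pair MAEs are 10 and 0, so mean_MAE = 5 and the score is 250 / 255.
score = temporal_flickering([[10, 20], [20, 10], [20, 10]])
```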

Classes
Functions

fastvideo.eval.metrics.vbench.temporal_style

Modules

fastvideo.eval.metrics.vbench.temporal_style.metric

VBench Temporal Style — ViCLIP text-video alignment (style focus).

Identical logic to overall_consistency — same ViCLIP cosine similarity. The difference is semantic: overall_consistency measures general prompt alignment while temporal_style measures style consistency over time. VBench uses different prompts for each from its metadata JSON.

Functions