metric

VLM-based scene matching using AVoCaDO (Qwen2.5-Omni).

Replaces VBench's Tag2Text-based scene metric with one that uses a modern VLM captioner. The algorithm follows VBench:

1. Caption the video.
2. Check whether all scene keywords appear in the caption.
3. Score = 1.0 if all keywords match, 0.0 otherwise.

Unlike VBench (which captions each frame separately with Tag2Text), AVoCaDO captions the entire video in one pass with rich natural language, making the keyword check more robust.
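The keyword check and binary score described above can be sketched as follows. This is a minimal illustration, not the FastVideo implementation: the function name `scene_score`, the lowercase whole-word tokenisation, and the splitting of multi-word keywords are all assumptions made for the example.

```python
import re
from typing import Iterable


def scene_score(caption: str, keywords: Iterable[str]) -> float:
    """Return 1.0 if every keyword word occurs in the caption, else 0.0.

    Hypothetical helper: the matching rule (case-insensitive, whole-word)
    is an assumption, not taken from the FastVideo source.
    """
    # Normalise the caption to a set of lowercase word tokens.
    caption_words = set(re.findall(r"[a-z0-9]+", caption.lower()))
    # Split multi-word keywords (e.g. "living room") into individual words.
    required = [w for kw in keywords for w in re.findall(r"[a-z0-9]+", kw.lower())]
    # All-or-nothing scoring, as in the three steps above.
    return 1.0 if all(w in caption_words for w in required) else 0.0


print(scene_score("A sunny beach with waves rolling onto the sand.", ["beach"]))
print(scene_score("A quiet kitchen with a wooden table.", ["living room"]))
```

Under these assumptions the first call scores 1.0 and the second 0.0. Whole-word matching is strict: "beaches" would not satisfy the keyword "beach", so a real implementation may prefer stemming or substring matching.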

Classes

fastvideo.eval.metrics.vbench.scene.metric.SceneMetric

SceneMetric(model_path: str = 'AVoCaDO-Captioner/AVoCaDO')

Bases: BaseMetric

Source code in fastvideo/eval/metrics/vbench/scene/metric.py
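def __init__(self, model_path: str = "AVoCaDO-Captioner/AVoCaDO") -> None:
    super().__init__()
    # The model and processor are not loaded here; both start as None.
    self._model: Any = None
    self._processor: Any = None
    # Identifier or path of the captioning model, kept for later use.
    self._model_path = model_path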
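The constructor stores only the model path and leaves `_model` unset, which suggests the weights are loaded later rather than at construction. The excerpt does not show where that happens, so the following is only a sketch of the general deferred-loading pattern. `LazyCaptioner`, `_ensure_loaded`, `caption`, and the injected `loader` callable are all hypothetical names and not part of FastVideo.

```python
from typing import Any, Callable, Optional


class LazyCaptioner:
    """Hypothetical illustration of deferred model loading; not FastVideo code."""

    def __init__(self, model_path: str, loader: Callable[[str], Any]) -> None:
        # Nothing heavy happens here: only the path and loader are stored.
        self._model: Optional[Any] = None
        self._model_path = model_path
        self._loader = loader

    def _ensure_loaded(self) -> Any:
        # Load the model on first use and reuse it afterwards.
        if self._model is None:
            self._model = self._loader(self._model_path)
        return self._model

    def caption(self, video: str) -> str:
        model = self._ensure_loaded()
        return model(video)


def fake_loader(path: str) -> Callable[[str], str]:
    # Stand-in for an expensive model load, used only for this example.
    print(f"loading {path}")
    return lambda video: f"caption for {video}"


captioner = LazyCaptioner("AVoCaDO-Captioner/AVoCaDO", fake_loader)
print(captioner.caption("clip_a.mp4"))
print(captioner.caption("clip_b.mp4"))
```

With this pattern, constructing the object is cheap and the loader runs once, on the first `caption` call.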

Functions