metric

Synchformer DeSync — average audio-video desynchronization in seconds.

Per-sample. Ports hkchengrex/av-benchmark's DeSync against the 24-01-04T16-39-21 Synchformer checkpoint (the one MMAudio reports against). The predicted offset is the argmax over the 21-class grid in [-2, +2] seconds, computed separately for the first 14 and the last 14 segments. Lower is better; 0 is ideal.
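As a rough illustration of the decoding step described above, the sketch below builds a 21-point grid over [-2, +2] seconds (a 0.2 s spacing, assuming the classes are evenly spaced) and maps an argmax class index to an offset magnitude. The names `offset_grid` and `decode_offset` and the plain-list logits are hypothetical; the actual metric works on Synchformer's tensor outputs, and how it combines the two windows is not shown here.

```python
NUM_CLASSES = 21       # size of the offset grid, from the description above
MAX_OFFSET_SEC = 2.0   # grid spans [-2, +2] seconds


def offset_grid(num_classes=NUM_CLASSES, max_offset=MAX_OFFSET_SEC):
    """Evenly spaced offsets in seconds (assumed spacing, hypothetical helper)."""
    step = 2 * max_offset / (num_classes - 1)
    # Round to suppress floating-point noise such as 0.2000000000000002.
    return [round(-max_offset + i * step, 6) for i in range(num_classes)]


def decode_offset(logits):
    """Return the absolute offset (seconds) of the highest-scoring class."""
    grid = offset_grid(len(logits))
    best = max(range(len(logits)), key=lambda i: logits[i])
    return abs(grid[best])


if __name__ == "__main__":
    example_logits = [0.0] * NUM_CLASSES
    example_logits[12] = 5.0  # pretend class 12 wins
    print(decode_offset(example_logits))
```

With this grid, class 10 corresponds to a zero offset, so a perfectly synchronized prediction decodes to 0.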

Synchformer's transformer carries a fixed positional embedding sized for ~14 segments per modality, so keep input clips in the 3-10 s range: shorter clips raise an error in _segment_video, and for longer clips the middle of the clip is ignored.
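To make the "first 14 and last 14" windowing concrete, here is a minimal sketch of which segment indices such a scheme would cover. It is an assumption-laden illustration, not the logic of `_segment_video`: the helper names are hypothetical, and the mapping from clip duration to segment count is deliberately left out because the source does not state it.

```python
WINDOW = 14  # segments per window, from the description above


def first_and_last_windows(num_segments, window=WINDOW):
    """Indices of the first and last `window` segments (hypothetical helper)."""
    if num_segments < window:
        # Mirrors the documented behavior that too-short clips raise.
        raise ValueError(
            f"need at least {window} segments, got {num_segments}"
        )
    first = list(range(0, window))
    last = list(range(num_segments - window, num_segments))
    return first, last


def ignored_middle(num_segments, window=WINDOW):
    """Segment indices that fall in neither window."""
    first, last = first_and_last_windows(num_segments, window)
    covered = set(first) | set(last)
    return [i for i in range(num_segments) if i not in covered]


if __name__ == "__main__":
    print(ignored_middle(14))  # both windows coincide
    print(ignored_middle(30))  # two middle segments are never scored
```

Under these assumptions the two windows overlap or coincide for clips of up to 28 segments, and anything beyond that leaves a growing uncovered middle.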

Classes

fastvideo.eval.metrics.audio.desync.metric.DeSyncMetric

DeSyncMetric(src_fps: float | None = None)

Bases: BaseMetric

Per-sample Synchformer DeSync magnitude in seconds (lower is better).

Source code in fastvideo/eval/metrics/audio/desync/metric.py
def __init__(self, src_fps: float | None = None) -> None:
    super().__init__()
    # Defaults to 25 fps (Synchformer's target) when the pool's decode fps is unknown.
    self._src_fps = src_fps
    self._model: Any = None
    self._mel: Any = None
    self._grid: torch.Tensor | None = None
    self._resample_cache: dict[int, Any] = {}

Functions