audio

Modules

fastvideo.eval.metrics.audio.audiobox_aesthetics

Modules

fastvideo.eval.metrics.audio.audiobox_aesthetics.metric

AudioBox Aesthetics (CE, CU, PC, PQ).

Thin wrapper around Meta's audiobox_aesthetics predictor. It returns four per-clip dimensions: CE (Content Enjoyment), CU (Content Usefulness), PC (Production Complexity), and PQ (Production Quality). score exposes PQ, the dimension V2A papers typically report; the remaining three are surfaced under details.

The earlier Verse-Bench combined score (CE + CU + PQ + (11 − PC)) / 4 is non-standard and is deliberately not used.
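For comparison with numbers quoted elsewhere, the Verse-Bench combination can be reproduced by hand from the four dimensions. The sketch below is illustrative only: `verse_bench_combined` is a hypothetical helper that is not part of fastvideo, and the input values are made up.

```python
def verse_bench_combined(ce: float, cu: float, pc: float, pq: float) -> float:
    """Verse-Bench style aggregate: (CE + CU + PQ + (11 - PC)) / 4."""
    return (ce + cu + pq + (11.0 - pc)) / 4.0


# Made-up per-clip values, for illustration only.
dims = {"ce": 6.0, "cu": 7.0, "pc": 3.0, "pq": 8.0}

combined = verse_bench_combined(**dims)  # the non-standard aggregate
primary = dims["pq"]                     # the dimension this metric reports as its score
print(combined, primary)
```

Note that the aggregate inverts PC, so a clip with high production complexity is penalized; reporting PQ alone avoids that design choice.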

Classes
fastvideo.eval.metrics.audio.audiobox_aesthetics.metric.AudioBoxAestheticsMetric
AudioBoxAestheticsMetric()

Bases: BaseMetric

AudioBox Aesthetics: PQ as the primary score, CE/CU/PC/PQ in details.

Source code in fastvideo/eval/metrics/audio/audiobox_aesthetics/metric.py
def __init__(self) -> None:
    super().__init__()
    self._predictor: Any = None
Functions

fastvideo.eval.metrics.audio.clap_score

Modules

fastvideo.eval.metrics.audio.clap_score.metric

CLAP Score (CS) — text-audio cosine similarity.

Uses Hugging Face transformers.ClapModel with the laion/clap-htsat-fused checkpoint, the closest HF mirror of the 630k-audioset-fusion-best.pt weights used by hkchengrex/av-benchmark and the V2A literature. A byte-exact comparison would still require the laion_clap pip package, which pins numpy<2 and would force a fastvideo-wide downgrade, so fastvideo does not depend on it. Scores from this metric should be close to, but not identical to, av-benchmark's LAION_CLAP numbers.
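The score itself is a plain cosine similarity between a text embedding and an audio embedding. A minimal sketch of that final step follows, assuming the two embeddings have already been extracted; `cosine_similarity` is a hypothetical helper and the vectors are made up, not real CLAP outputs.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimensionality")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Made-up 3-d stand-ins for a text embedding and an audio embedding.
text_emb = [0.6, 0.8, 0.0]
audio_emb = [0.6, 0.0, 0.8]
score = cosine_similarity(text_emb, audio_emb)
print(round(score, 4))
```

Because the similarity is scale-invariant, it does not matter whether the embeddings were L2-normalized beforehand.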

Classes
fastvideo.eval.metrics.audio.clap_score.metric.ClapScoreMetric
ClapScoreMetric(model_name: str = DEFAULT_CLAP_REPO)

Bases: BaseMetric

CLAP score — text-audio cosine similarity via HF CLAP.

Source code in fastvideo/eval/metrics/audio/clap_score/metric.py
def __init__(self, model_name: str = DEFAULT_CLAP_REPO) -> None:
    super().__init__()
    self._model_name = model_name
    self._model: Any = None
    self._processor: Any = None
Functions

fastvideo.eval.metrics.audio.desync

Modules

fastvideo.eval.metrics.audio.desync.metric

Synchformer DeSync — average audio-video desynchronization in seconds.

Per-sample. Ports hkchengrex/av-benchmark's DeSync against the 24-01-04T16-39-21 Synchformer checkpoint (the one MMAudio reports against). The predicted offset is the argmax over the 21-class grid spanning [-2, +2] seconds, taken separately for the first 14 and the last 14 segments. Lower is better; 0 is ideal.

Synchformer's transformer carries a fixed positional embedding sized for ~14 segments per modality, so keep input clips in the 3-10 s range: shorter clips raise an error in _segment_video, and longer clips have their middle ignored.
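To make the classification grid concrete, the sketch below maps a vector of 21 class logits to an offset in seconds. It assumes the grid is evenly spaced over [-2, +2] (a 0.2 s step); `offset_grid` and `predicted_offset` are hypothetical helpers, the logits are made up, and how the two 14-segment windows are combined into the final score is not shown.

```python
from typing import List, Sequence


def offset_grid(num_classes: int = 21, max_offset: float = 2.0) -> List[float]:
    """Evenly spaced offsets in seconds from -max_offset to +max_offset (assumed layout)."""
    step = 2.0 * max_offset / (num_classes - 1)
    return [-max_offset + i * step for i in range(num_classes)]


def predicted_offset(logits: Sequence[float]) -> float:
    """Offset in seconds of the highest-scoring class."""
    grid = offset_grid(len(logits))
    best = max(range(len(logits)), key=lambda i: logits[i])
    return grid[best]


# Made-up logits for one window: the peak sits three classes right of centre.
window_logits = [0.0] * 21
window_logits[13] = 5.0

offset = predicted_offset(window_logits)
print(round(offset, 2), round(abs(offset), 2))  # signed offset and its magnitude
```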

Classes
fastvideo.eval.metrics.audio.desync.metric.DeSyncMetric
DeSyncMetric(src_fps: float | None = None)

Bases: BaseMetric

Per-sample Synchformer DeSync magnitude in seconds (lower is better).

Source code in fastvideo/eval/metrics/audio/desync/metric.py
def __init__(self, src_fps: float | None = None) -> None:
    super().__init__()
    # Defaults to 25 fps (Synchformer's target) when the pool's decode fps is unknown.
    self._src_fps = src_fps
    self._model: Any = None
    self._mel: Any = None
    self._grid: torch.Tensor | None = None
    self._resample_cache: dict[int, Any] = {}
Functions

fastvideo.eval.metrics.audio.frechet_distance

Modules

fastvideo.eval.metrics.audio.frechet_distance.metric

Fréchet Audio Distance over PaSST embeddings (FD_PaSST).

Corpus-vs-corpus Fréchet distance between Gaussian moments of two PaSST 768-d embedding sets. Ports av_bench.metrics.fad.compute_fd 1:1.

References are supplied per-sample via reference_audio / role="reference", or once from a .pt cache at $FASTVIDEO_FAD_REF_FEATURES. The metric is skipped when either side has fewer than two finite embeddings.
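For orientation, here is a minimal NumPy sketch of the standard Fréchet distance between two embedding sets, including the skip rule for fewer than two finite embeddings. It is not the ported compute_fd: `frechet_distance` is a hypothetical helper, and the trace of the matrix square root is taken from eigenvalues here, which is an assumption of this sketch rather than a description of the fastvideo code.

```python
from typing import Optional

import numpy as np


def frechet_distance(gen: np.ndarray, ref: np.ndarray) -> Optional[float]:
    """Frechet distance between Gaussians fitted to two (n, d) embedding sets."""
    gen = np.asarray(gen, dtype=np.float64)
    ref = np.asarray(ref, dtype=np.float64)
    # Keep only rows whose entries are all finite.
    gen = gen[np.isfinite(gen).all(axis=1)]
    ref = ref[np.isfinite(ref).all(axis=1)]
    if len(gen) < 2 or len(ref) < 2:
        return None  # not enough data to estimate a covariance

    mu_g, mu_r = gen.mean(axis=0), ref.mean(axis=0)
    sigma_g = np.cov(gen, rowvar=False)
    sigma_r = np.cov(ref, rowvar=False)

    # tr(sqrt(sigma_g @ sigma_r)) equals the sum of square roots of the product's eigenvalues.
    eigvals = np.linalg.eigvals(sigma_g @ sigma_r)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    diff = mu_g - mu_r
    return float(diff @ diff + np.trace(sigma_g) + np.trace(sigma_r) - 2.0 * tr_sqrt)


rng = np.random.default_rng(0)
ref_emb = rng.normal(size=(200, 4))  # made-up 4-d embeddings, not PaSST outputs
gen_emb = ref_emb + 1.0              # same covariance, mean shifted by 1 in each dimension
fd = frechet_distance(gen_emb, ref_emb)
print(fd)
```

With identical covariances the covariance terms cancel, so the result reduces to the squared distance between the two means.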

Classes
fastvideo.eval.metrics.audio.frechet_distance.metric.FrechetAudioDistanceMetric
FrechetAudioDistanceMetric()

Bases: BaseMetric

Corpus-vs-corpus Fréchet Audio Distance with PaSST embeddings.

Source code in fastvideo/eval/metrics/audio/frechet_distance/metric.py
def __init__(self) -> None:
    super().__init__()
    self._model: Any = None
    self._gen_buf: list[np.ndarray] = []
    self._ref_buf: list[np.ndarray] = []
    self._cached_ref_path: str | None = os.environ.get(REF_FEATURES_ENV)
    self._cached_ref_mu: np.ndarray | None = None
    self._cached_ref_sigma: np.ndarray | None = None
    self._n_cached_ref: int = 0
Functions

fastvideo.eval.metrics.audio.imagebind_score

Modules

fastvideo.eval.metrics.audio.imagebind_score.metric

ImageBind audio↔video cosine similarity (IB-Score).

Per-sample. Reads sample["video_path"] (or sample["video"].source for a Video wrapper) and sample["audio"]. ImageBind decodes its own clips, so the file path is required rather than the pool-decoded tensor.
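The path lookup described above can be sketched as follows. This is an illustration under stated assumptions, not the metric's actual code: `resolve_video_path` is a hypothetical helper, and `VideoStub` is a stand-in that models only the `source` attribute of the Video wrapper.

```python
from dataclasses import dataclass
from typing import Any, Mapping


@dataclass
class VideoStub:
    """Stand-in for the Video wrapper; only the `source` attribute is modelled."""
    source: str


def resolve_video_path(sample: Mapping[str, Any]) -> str:
    """Prefer sample["video_path"], else fall back to sample["video"].source."""
    path = sample.get("video_path")
    if path is None:
        path = getattr(sample.get("video"), "source", None)
    if path is None:
        raise KeyError("sample needs 'video_path' or a 'video' object with a .source path")
    return str(path)


print(resolve_video_path({"video_path": "clip_a.mp4"}))
print(resolve_video_path({"video": VideoStub(source="clip_b.mp4")}))
```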

Classes
fastvideo.eval.metrics.audio.imagebind_score.metric.ImageBindScoreMetric
ImageBindScoreMetric()

Bases: BaseMetric

ImageBind audio↔video cosine similarity, per-sample.

Source code in fastvideo/eval/metrics/audio/imagebind_score/metric.py
def __init__(self) -> None:
    super().__init__()
    self._model: Any = None
Functions

fastvideo.eval.metrics.audio.kl_divergence

Modules

fastvideo.eval.metrics.audio.kl_divergence.metric

PaSST KL divergence KL(gt || pred) on AudioSet-527 logits.

Ports av_bench.metrics.kl.compute_kl 1:1. The primary score is the softmax variant (reported as "MKL" / "KL_PaSST" by V2A papers); the sigmoid variant is exposed in details["kl_sigmoid"].
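As a reference for the softmax variant, the sketch below computes KL(gt || pred) from two logit vectors using the textbook definition. It is not the ported compute_kl: `softmax` and `kl_softmax` are hypothetical helpers, the `eps` guard is an assumption of this sketch, and the three-class logits are made up (the real metric works on 527 AudioSet classes).

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a 1-d list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_softmax(gt_logits: Sequence[float], pred_logits: Sequence[float],
               eps: float = 1e-12) -> float:
    """KL(p || q) with p = softmax(gt_logits) and q = softmax(pred_logits)."""
    p = softmax(gt_logits)
    q = softmax(pred_logits)
    return sum(pi * (math.log(pi + eps) - math.log(qi + eps)) for pi, qi in zip(p, q))


gt = [1.0, 2.0, 3.0]    # made-up ground-truth logits
pred = [3.0, 2.0, 1.0]  # made-up logits for the generated audio
kl = kl_softmax(gt, pred)
print(round(kl, 4))
```

The divergence is asymmetric, so swapping the ground-truth and generated arguments generally gives a different value.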

Classes
fastvideo.eval.metrics.audio.kl_divergence.metric.KLDivergenceMetric
KLDivergenceMetric()

Bases: BaseMetric

PaSST KL divergence KL(gt || pred) on AudioSet-527 logits.

Per-sample. Requires sample["audio"] (generated) and sample["reference_audio"] (ground truth).

Source code in fastvideo/eval/metrics/audio/kl_divergence/metric.py
def __init__(self) -> None:
    super().__init__()
    self._model: Any = None
Functions

fastvideo.eval.metrics.audio.wer

Modules

fastvideo.eval.metrics.audio.wer.metric

Word Error Rate for generated audio.

Per-sample. Default backend is Whisper-base; glm_asr (vendored under third_party/eval/glmasr/) and sensevoice (FunASR, install-on-demand) are selectable via asr_backend=. Text is normalized per the MagiHuman convention and CJK inputs are scored at the character level.
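The scoring step can be illustrated with a small edit-distance sketch that switches to character-level tokens for CJK text. This is a simplified stand-in, not the metric's implementation: `tokenize`, `edit_distance`, and `wer` are hypothetical helpers, normalization here is only lowercasing and whitespace handling (the MagiHuman rules are not reproduced), and CJK detection is limited to the basic CJK Unified Ideographs block.

```python
from typing import List, Sequence


def _is_cjk(ch: str) -> bool:
    # Assumption: only the basic CJK Unified Ideographs block is checked.
    return "\u4e00" <= ch <= "\u9fff"


def tokenize(text: str) -> List[str]:
    """Characters for CJK text, whitespace-separated words otherwise."""
    text = text.lower().strip()
    if any(_is_cjk(ch) for ch in text):
        return [ch for ch in text if not ch.isspace()]
    return text.split()


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance (substitutions, insertions, deletions) over tokens."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Edit distance divided by the number of reference tokens."""
    ref, hyp = tokenize(reference), tokenize(hypothesis)
    if not ref:
        raise ValueError("reference text is empty after tokenization")
    return edit_distance(ref, hyp) / len(ref)


print(wer("the cat sat", "the cat sit"))
print(wer("你好世界", "你好世间"))
```

Because insertions count as errors, the rate can exceed 1.0 when the transcript is much longer than the reference.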

Classes
fastvideo.eval.metrics.audio.wer.metric.WERMetric
WERMetric(asr_backend: str = 'whisper', model_name: str | None = None, instruction: str = 'Please transcribe this audio into text')

Bases: BaseMetric

WER metric with selectable ASR backend.

Sample fields: audio, reference_text (or text_prompt), optional language. instruction is passed to GLM-ASR.

Source code in fastvideo/eval/metrics/audio/wer/metric.py
def __init__(
    self,
    asr_backend: str = "whisper",
    model_name: str | None = None,
    instruction: str = "Please transcribe this audio into text",
) -> None:
    super().__init__()
    if asr_backend not in ("whisper", "glm_asr", "sensevoice"):
        raise ValueError(f"Unknown ASR backend '{asr_backend}'. "
                         "Supported: ['whisper', 'glm_asr', 'sensevoice']")
    self._asr_backend = asr_backend
    self._model_name = model_name
    self._instruction = instruction
    self._model: Any = None
    self._processor: Any = None
Functions