audio
Modules
fastvideo.eval.metrics.audio.audiobox_aesthetics
Modules
fastvideo.eval.metrics.audio.audiobox_aesthetics.metric
AudioBox Aesthetics (CE, CU, PC, PQ).
Thin wrapper around Meta's audiobox_aesthetics predictor. Returns
four per-clip dimensions: CE (Content Enjoyment), CU (Content
Usefulness), PC (Production Complexity), and PQ (Production Quality).
score exposes PQ, the dimension V2A papers typically report.
The remaining three are surfaced under details.
The earlier Verse-Bench combined score (CE + CU + PQ + (11 − PC)) / 4
is non-standard and is deliberately not used.
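As a rough illustration of the result layout described above, the sketch below packages four axis values so that PQ becomes the primary score and the axes sit under details (the class summary lists all four there). The function names, the dictionary keys, and the example values are assumptions made for illustration, not the metric's actual return type. The Verse-Bench formula is included only to show what the metric deliberately does not compute.

```python
from typing import Dict

AXES = ("CE", "CU", "PC", "PQ")


def to_result(axes: Dict[str, float]) -> Dict[str, object]:
    """Hypothetical packaging: PQ as the score, all four axes in details."""
    missing = [name for name in AXES if name not in axes]
    if missing:
        raise KeyError(f"missing axes: {missing}")
    return {"score": axes["PQ"], "details": {name: axes[name] for name in AXES}}


def verse_bench_combined(axes: Dict[str, float]) -> float:
    """The non-standard combined score quoted above, for comparison only."""
    return (axes["CE"] + axes["CU"] + axes["PQ"] + (11 - axes["PC"])) / 4


# Made-up axis values, not real predictor output.
example = {"CE": 6.0, "CU": 7.0, "PC": 3.0, "PQ": 8.0}
result = to_result(example)
combined = verse_bench_combined(example)
```

Keeping the axes separate leaves any downstream aggregation to the caller instead of baking one weighting into the metric.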
Classes
fastvideo.eval.metrics.audio.audiobox_aesthetics.metric.AudioBoxAestheticsMetric
Bases: BaseMetric
AudioBox Aesthetics: PQ as the primary score, CE/CU/PC/PQ in details.
Source code in fastvideo/eval/metrics/audio/audiobox_aesthetics/metric.py
Functions
fastvideo.eval.metrics.audio.clap_score
Modules
fastvideo.eval.metrics.audio.clap_score.metric
CLAP Score (CS) — text-audio cosine similarity.
Uses HuggingFace transformers.ClapModel with the
laion/clap-htsat-fused checkpoint, the closest HF mirror of the
630k-audioset-fusion-best.pt weights used by
hkchengrex/av-benchmark and the V2A literature. A byte-exact
comparison would still require the laion_clap pip package, which
pins numpy<2 and would force a fastvideo-wide downgrade, so
fastvideo does not depend on it. Numbers from this metric are in the
same ballpark as av-benchmark's LAION_CLAP but are not guaranteed to match it exactly.
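The score itself is a plain cosine similarity between a text embedding and an audio embedding. The sketch below shows only that final step on toy vectors; it does not load the CLAP model, and the function name and the example embeddings are assumptions for illustration.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy 3-d vectors standing in for real text and audio embeddings.
text_emb = [1.0, 0.0, 1.0]
audio_emb = [1.0, 1.0, 0.0]
clap_like_score = cosine_similarity(text_emb, audio_emb)
```

Because cosine similarity ignores vector length, only the direction of the two embeddings affects the score.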
Classes
fastvideo.eval.metrics.audio.clap_score.metric.ClapScoreMetric
ClapScoreMetric(model_name: str = DEFAULT_CLAP_REPO)
Bases: BaseMetric
CLAP score — text-audio cosine similarity via HF CLAP.
Source code in fastvideo/eval/metrics/audio/clap_score/metric.py
Functions
fastvideo.eval.metrics.audio.desync
Modules
fastvideo.eval.metrics.audio.desync.metric
Synchformer DeSync — average audio-video desynchronization in seconds.
Per-sample. Ports hkchengrex/av-benchmark's DeSync against the
24-01-04T16-39-21 Synchformer checkpoint (the one MMAudio reports
against). The predicted offset is the argmax over the 21-class grid in
[-2, +2] seconds, computed separately for the first 14 and the last 14
segments; lower is better, and 0 is ideal.
Synchformer's transformer carries a fixed positional embedding sized for
~14 segments per modality, so keep input clips in the 3-10 s range:
shorter clips raise an error in _segment_video, and the middle of longer
clips is ignored.
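To make the class-to-offset step concrete, here is a minimal sketch that maps an argmax class index on a 21-point grid over [-2, +2] seconds to an offset and then combines the two windows. Two things are assumptions rather than facts from this page: that the grid is uniformly spaced (which gives a 0.2 s step), and that the two windows are combined as the mean of their absolute offsets. All names are hypothetical.

```python
from typing import Sequence

NUM_CLASSES = 21
MIN_OFFSET, MAX_OFFSET = -2.0, 2.0
# Assumed uniform spacing: 4 s spread over 20 intervals gives 0.2 s per class.
STEP = (MAX_OFFSET - MIN_OFFSET) / (NUM_CLASSES - 1)


def class_to_offset(index: int) -> float:
    """Map a class index in [0, 20] to an offset in seconds."""
    if not 0 <= index < NUM_CLASSES:
        raise ValueError(f"class index out of range: {index}")
    return MIN_OFFSET + index * STEP


def argmax(values: Sequence[float]) -> int:
    """Index of the largest value (first one on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def desync_seconds(first_logits: Sequence[float], last_logits: Sequence[float]) -> float:
    """Assumed combination: mean absolute offset over the two windows."""
    offsets = [class_to_offset(argmax(logits)) for logits in (first_logits, last_logits)]
    return sum(abs(offset) for offset in offsets) / len(offsets)


# Toy logits: the first window peaks at class 10 (0.0 s), the last at class 15 (+1.0 s).
first = [0.0] * NUM_CLASSES
first[10] = 5.0
last = [0.0] * NUM_CLASSES
last[15] = 5.0
example_desync = desync_seconds(first, last)
```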
Classes
fastvideo.eval.metrics.audio.desync.metric.DeSyncMetric
DeSyncMetric(src_fps: float | None = None)
Bases: BaseMetric
Per-sample Synchformer DeSync magnitude in seconds (lower is better).
Source code in fastvideo/eval/metrics/audio/desync/metric.py
Functions
fastvideo.eval.metrics.audio.frechet_distance
Modules
fastvideo.eval.metrics.audio.frechet_distance.metric
Fréchet Audio Distance over PaSST embeddings (FD_PaSST).
Corpus-vs-corpus Fréchet distance between Gaussian moments of two PaSST
768-d embedding sets. Ports av_bench.metrics.fad.compute_fd 1:1.
References are supplied per-sample via reference_audio /
role="reference", or once from a .pt cache at
$FASTVIDEO_FAD_REF_FEATURES. Skips when either side has fewer than
two finite embeddings.
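For reference, the Fréchet distance between two Gaussians is ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1 S2)^(1/2)). The sketch below computes it with NumPy and mirrors the skip rule by returning None when either side has fewer than two finite rows. It is not the ported compute_fd: the trace of the matrix square root is obtained from the eigenvalues of S1 S2 instead of a matrix square root routine, and the function names, the None return, and the 3-d toy data (in place of 768-d PaSST embeddings) are assumptions.

```python
from typing import Optional

import numpy as np


def gaussian_moments(x: np.ndarray):
    """Mean vector and covariance matrix of an (n_samples, dim) array."""
    return x.mean(axis=0), np.cov(x, rowvar=False)


def frechet_distance(mu1, sigma1, mu2, sigma2) -> float:
    """Frechet distance between N(mu1, sigma1) and N(mu2, sigma2)."""
    diff = mu1 - mu2
    # Tr((S1 S2)^(1/2)) equals the sum of square roots of the eigenvalues of S1 S2,
    # which are real and non-negative for PSD inputs; clip tiny negative noise.
    eigvals = np.linalg.eigvals(sigma1 @ sigma2)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    return float(diff @ diff + np.trace(sigma1) + np.trace(sigma2) - 2.0 * trace_sqrt)


def fd_from_embeddings(gen: np.ndarray, ref: np.ndarray) -> Optional[float]:
    """Drop non-finite rows, then skip (None) if either side has fewer than two."""
    gen = gen[np.isfinite(gen).all(axis=1)]
    ref = ref[np.isfinite(ref).all(axis=1)]
    if len(gen) < 2 or len(ref) < 2:
        return None
    return frechet_distance(*gaussian_moments(gen), *gaussian_moments(ref))


# Toy data: the reference is the generated set shifted by (3, 4, 0), so the
# covariances match and the distance reduces to the squared shift length.
rng = np.random.default_rng(0)
gen = rng.normal(size=(64, 3))
ref = gen + np.array([3.0, 4.0, 0.0])
fd = fd_from_embeddings(gen, ref)
```

Because the statistic is computed from corpus-level moments, it describes a whole set of clips and has no meaningful per-clip value.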
Classes
fastvideo.eval.metrics.audio.frechet_distance.metric.FrechetAudioDistanceMetric
Bases: BaseMetric
Corpus-vs-corpus Fréchet Audio Distance with PaSST embeddings.
Source code in fastvideo/eval/metrics/audio/frechet_distance/metric.py
Functions
fastvideo.eval.metrics.audio.imagebind_score
Modules
fastvideo.eval.metrics.audio.imagebind_score.metric
ImageBind audio↔video cosine similarity (IB-Score).
Per-sample. Reads sample["video_path"] (or sample["video"].source
for a Video wrapper) and sample["audio"]. ImageBind decodes
its own clips, so the file path is required rather than the pool-decoded tensor.
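The path lookup described above might look like the following sketch. The Video dataclass here is a hypothetical stand-in for the real wrapper, the helper name is invented, and checking video_path before video.source is an assumed precedence that this page does not confirm.

```python
from dataclasses import dataclass
from typing import Any, Mapping


@dataclass
class Video:
    """Hypothetical stand-in for the Video wrapper; only .source is modelled."""
    source: str


def resolve_video_path(sample: Mapping[str, Any]) -> str:
    """Return an on-disk video path from a sample, or raise if none is present."""
    path = sample.get("video_path")
    if path is None:
        # Fall back to the wrapper's source attribute when no explicit path is given.
        path = getattr(sample.get("video"), "source", None)
    if not path:
        raise ValueError("sample needs 'video_path' or a 'video' with a .source path")
    return str(path)


# File names below are placeholders.
from_path = resolve_video_path({"video_path": "clip.mp4", "audio": "clip.wav"})
from_wrapper = resolve_video_path({"video": Video(source="other.mp4"), "audio": "other.wav"})
```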
Classes
fastvideo.eval.metrics.audio.imagebind_score.metric.ImageBindScoreMetric
Bases: BaseMetric
ImageBind audio↔video cosine similarity, per-sample.
Source code in fastvideo/eval/metrics/audio/imagebind_score/metric.py
Functions
fastvideo.eval.metrics.audio.kl_divergence
Modules
fastvideo.eval.metrics.audio.kl_divergence.metric
PaSST KL divergence KL(gt || pred) on AudioSet-527 logits.
Ports av_bench.metrics.kl.compute_kl 1:1. The primary score is
the softmax variant (reported as "MKL" / "KL_PaSST" by V2A papers);
the sigmoid variant is exposed in details["kl_sigmoid"].
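As a small worked example of the softmax variant, the sketch below turns two logit vectors into distributions and evaluates KL(gt || pred) = sum_i p_gt[i] * (log p_gt[i] - log p_pred[i]). It uses toy 3-class logits instead of 527 AudioSet classes, the epsilon guard is an assumption, and the sigmoid variant is not shown because its exact form is not given here.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a 1-d list of logits."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def kl_softmax(gt_logits: Sequence[float], pred_logits: Sequence[float],
               eps: float = 1e-12) -> float:
    """KL(gt || pred) between the softmax distributions of two logit vectors."""
    if len(gt_logits) != len(pred_logits):
        raise ValueError("logit vectors must have the same length")
    p_gt = softmax(gt_logits)
    p_pred = softmax(pred_logits)
    # eps (an assumed guard) keeps log() finite if a probability underflows to 0.
    return sum(p * (math.log(p + eps) - math.log(q + eps)) for p, q in zip(p_gt, p_pred))


# Toy logits: ground truth favours class 0, the prediction favours class 2.
gt = [2.0, 0.0, 0.0]
pred = [0.0, 0.0, 2.0]
kl_value = kl_softmax(gt, pred)
```

The direction matters: KL(gt || pred) penalizes predictions that put little mass where the ground truth puts a lot.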
Classes
fastvideo.eval.metrics.audio.kl_divergence.metric.KLDivergenceMetric
Bases: BaseMetric
PaSST KL divergence KL(gt || pred) on AudioSet-527 logits.
Per-sample. Requires sample["audio"] (generated) and
sample["reference_audio"] (ground truth).
Source code in fastvideo/eval/metrics/audio/kl_divergence/metric.py
Functions
fastvideo.eval.metrics.audio.wer
Modules
fastvideo.eval.metrics.audio.wer.metric
Word Error Rate for generated audio.
Per-sample. The default backend is Whisper-base; glm_asr (vendored under
third_party/eval/glmasr/) and sensevoice (FunASR, installed on demand)
are selectable via asr_backend=. Text is normalized per the
MagiHuman convention, and CJK inputs are scored at the character level.
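The scoring step after transcription can be sketched as an edit distance over tokens divided by the reference length, with CJK text split into characters. The normalization below (lowercasing and stripping punctuation) is only a stand-in, since the MagiHuman rules are not spelled out here; the CJK ranges and all function names are likewise assumptions, and no ASR backend is involved.

```python
import re
from typing import List, Sequence


def is_cjk(ch: str) -> bool:
    """Rough CJK check: common Han, kana, and Hangul syllable ranges (assumed)."""
    return ("\u4e00" <= ch <= "\u9fff" or "\u3040" <= ch <= "\u30ff"
            or "\uac00" <= ch <= "\ud7a3")


def tokenize(text: str) -> List[str]:
    """Stand-in normalization, then words (or characters for CJK text)."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    if any(is_cjk(ch) for ch in text):
        return [ch for ch in text if not ch.isspace()]
    return text.split()


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance over tokens, using a rolling row."""
    prev = list(range(len(hyp) + 1))
    for i, ref_tok in enumerate(ref, 1):
        cur = [i]
        for j, hyp_tok in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,                           # deletion
                           cur[j - 1] + 1,                        # insertion
                           prev[j - 1] + (ref_tok != hyp_tok)))   # substitution
        prev = cur
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Edits needed to turn the hypothesis into the reference, per reference token."""
    ref, hyp = tokenize(reference), tokenize(hypothesis)
    if not ref:
        raise ValueError("reference has no tokens after normalization")
    return edit_distance(ref, hyp) / len(ref)


english_wer = wer("The cat sat on the mat.", "the cat sit on mat")
cjk_wer = wer("你好世界", "你好世间")
```

Since insertions count as errors, the value can exceed 1.0 when the hypothesis is much longer than the reference.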
Classes
fastvideo.eval.metrics.audio.wer.metric.WERMetric
WERMetric(asr_backend: str = 'whisper', model_name: str | None = None, instruction: str = 'Please transcribe this audio into text')
Bases: BaseMetric
WER metric with selectable ASR backend.
Sample fields: audio, reference_text (or text_prompt),
optional language. instruction is passed to GLM-ASR.