metric
CLAP Score (CS): cosine similarity between text and audio embeddings.
Uses HuggingFace transformers.ClapModel with the
laion/clap-htsat-fused checkpoint, the closest HF mirror of the
630k-audioset-fusion-best.pt weights used by
hkchengrex/av-benchmark and the V2A literature. A byte-exact
comparison would still require the laion_clap pip package, which
pins numpy<2 and would force a fastvideo-wide downgrade, so we do
not depend on it. Numbers from this metric are in the same ballpark
as av-benchmark's LAION_CLAP and should be treated as approximately
comparable, not as exact reproductions.
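As an illustration of the quantity being reported, here is a minimal sketch of cosine similarity between one text embedding and one audio embedding. It is not fastvideo's implementation: the helper name `cosine_similarity` and the toy 3-d vectors are hypothetical stand-ins for the embeddings a CLAP model would produce, and any scaling or clamping the real metric may apply is not shown.

```python
import math
from typing import Sequence


def cosine_similarity(text_emb: Sequence[float], audio_emb: Sequence[float]) -> float:
    """Cosine similarity between one text and one audio embedding."""
    if len(text_emb) != len(audio_emb):
        raise ValueError("embeddings must have the same dimensionality")
    dot = sum(t * a for t, a in zip(text_emb, audio_emb))
    text_norm = math.sqrt(sum(t * t for t in text_emb))
    audio_norm = math.sqrt(sum(a * a for a in audio_emb))
    if text_norm == 0.0 or audio_norm == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (text_norm * audio_norm)


# Toy vectors standing in for CLAP embeddings (hypothetical values).
text_embedding = [1.0, 0.0, 0.0]
matching_audio = [2.0, 0.0, 0.0]   # same direction, different magnitude
unrelated_audio = [0.0, 3.0, 0.0]  # orthogonal direction

score_match = cosine_similarity(text_embedding, matching_audio)
score_unrelated = cosine_similarity(text_embedding, unrelated_audio)
```

Because the score depends only on direction, the magnitude of either embedding does not affect it: the aligned pair scores at the top of the range and the orthogonal pair scores zero.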
Classes
fastvideo.eval.metrics.audio.clap_score.metric.ClapScoreMetric
ClapScoreMetric(model_name: str = DEFAULT_CLAP_REPO)
Bases: BaseMetric
CLAP score — text-audio cosine similarity via HF CLAP.