clap_score

Modules

fastvideo.eval.metrics.audio.clap_score.metric

CLAP Score (CS) — text-audio cosine similarity.

Uses HuggingFace transformers.ClapModel with the laion/clap-htsat-fused checkpoint, the closest HF mirror of the 630k-audioset-fusion-best.pt weights used by hkchengrex/av-benchmark and the V2A literature. A byte-exact comparison would still require the laion_clap pip package, which pins numpy<2 and would force a fastvideo-wide downgrade, so this metric does not depend on it. Scores from this metric are in the same ballpark as av-benchmark's LAION_CLAP, but are not guaranteed to match exactly.
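As a rough illustration of the quantity being reported, the sketch below computes a cosine similarity between one text embedding and one audio embedding. It is a minimal pure-Python example, not the metric's actual implementation: the function name `cosine_similarity` and the toy vectors are hypothetical, and in the real metric the embeddings would come from the CLAP text and audio encoders. Any rescaling or averaging the metric applies on top of the raw similarity is not shown here.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy stand-ins for CLAP embeddings (hypothetical values, not model output).
text_emb = [1.0, 0.0, 1.0]
matching_audio_emb = [2.0, 0.0, 2.0]   # same direction as text_emb
unrelated_audio_emb = [0.0, 3.0, 0.0]  # orthogonal to text_emb

print(cosine_similarity(text_emb, matching_audio_emb))
print(cosine_similarity(text_emb, unrelated_audio_emb))
```

The score depends only on the direction of the two vectors, not their magnitude, so the aligned pair scores close to 1 and the orthogonal pair scores 0.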

Classes

fastvideo.eval.metrics.audio.clap_score.metric.ClapScoreMetric
ClapScoreMetric(model_name: str = DEFAULT_CLAP_REPO)

Bases: BaseMetric

CLAP score — text-audio cosine similarity via HF CLAP.

Source code in fastvideo/eval/metrics/audio/clap_score/metric.py
def __init__(self, model_name: str = DEFAULT_CLAP_REPO) -> None:
    super().__init__()
    self._model_name = model_name
    self._model: Any = None
    self._processor: Any = None

Functions