ajd12342/paraspeechclap-eval-combined
Source: Hugging Face. Updated 2026-03-27; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/ajd12342/paraspeechclap-eval-combined
---
language:
- en
license: cc-by-nc-sa-4.0
tags:
- speech
- audio
- style
- CLAP
- dual-encoder
- evaluation
- benchmark
- compositional
- intrinsic
- situational
source_datasets:
- ajd12342/paraspeechcaps
task_categories:
- audio-classification
size_categories:
- 1K<n<10K
dataset_info:
  features:
  - name: source
    dtype: string
  - name: relative_audio_path
    dtype: string
  - name: text_description
    dtype: string
  - name: transcription
    dtype: string
  - name: intrinsic_tags
    sequence: string
  - name: situational_tags
    dtype: string
  - name: basic_tags
    sequence: string
  - name: all_tags
    sequence: string
  - name: speakerid
    dtype: string
  - name: name
    dtype: string
  - name: duration
    dtype: float64
  - name: gender
    dtype: string
  - name: accent
    dtype: string
  - name: pitch
    dtype: string
  - name: speaking_rate
    dtype: string
  - name: noise
    dtype: string
  - name: utterance_pitch_mean
    dtype: float32
  - name: snr
    dtype: float64
  - name: phonemes
    dtype: string
  splits:
  - name: test
    num_bytes: 1326439
    num_examples: 1432
  download_size: 352784
  dataset_size: 1326439
configs:
- config_name: default
  data_files:
  - split: test
    path: data/test-*
---
# Combined Evaluation Dataset for ParaSpeechCLAP Models
Evaluation dataset for **combined (compositional)** style attributes, used in the paper:
*ParaSpeechCLAP: A Dual-Encoder Speech-Text Model for Rich Stylistic Language-Audio Pretraining*
Anuj Diwan, Eunsol Choi, David Harwath
## Overview
This dataset is the **combined evaluation set** for the ParaSpeechCLAP model family, used for **retrieval** evaluation. It contains speech clips from the **Expresso-EARS** portion of the [ParaSpeechCaps](https://huggingface.co/datasets/ajd12342/paraspeechcaps) holdout set, each paired with its **original** compositional style caption, which includes both intrinsic and situational tags.
## Installation
Install the `datasets` package to load the dataset:
```bash
pip install datasets
```
To run retrieval evaluation with ParaSpeechCLAP, install the [ParaSpeechCLAP GitHub repository](https://github.com/ajd12342/paraspeechclap):
```bash
git clone https://github.com/ajd12342/paraspeechclap.git
cd paraspeechclap
pip install -r requirements.txt
```
### Setting up audio files
The dataset contains a `relative_audio_path` column but not the audio files themselves. Resolving audio paths requires specifying `data.audio_root`, a common root directory organized as `${audio_root}/{source}/`, where `{source}` matches the value of the `source` column in the dataset.
This dataset uses the **Expresso** and **EARS** sources. Follow the [ParaSpeechCaps audio setup instructions](https://github.com/ajd12342/paraspeechcaps/tree/main/dataset#22-processing-dataset-audio) for those sources, with the following adjustment: instead of placing each source at its own root directory, place them under a common root:
- `${audio_root}/expresso/` (instead of `${expresso_root}`)
- `${audio_root}/ears/` (instead of `${ears_root}`)
Then pass `data.audio_root=${audio_root}` when running any ParaSpeechCLAP script.
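The directory layout above can be sketched in Python as a small path helper. This is an illustration rather than code from the ParaSpeechCLAP repository: the function name `resolve_audio_path` is hypothetical, the example row values are made up, and it assumes that `relative_audio_path` is relative to `${audio_root}/{source}/`.

```python
from pathlib import Path


def resolve_audio_path(audio_root: str, source: str, relative_audio_path: str) -> Path:
    """Join the common root, the row's source, and its relative path.

    Assumption: relative_audio_path is relative to ${audio_root}/{source}/.
    """
    return Path(audio_root) / source / relative_audio_path


# Hypothetical row values, for illustration only.
row = {"source": "ears", "relative_audio_path": "p001/clip.wav"}
path = resolve_audio_path("/path/to/audio_root", row["source"], row["relative_audio_path"])
print(path.as_posix())
```

Checking `path.exists()` for a handful of rows before launching an evaluation run is a cheap way to confirm the audio root is laid out as expected.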
## Usage with ParaSpeechCLAP
### Retrieval evaluation
```bash
python scripts/evaluate_retrieval.py \
--config-name eval/retrieval \
checkpoint_path=./checkpoints/paraspeechclap-combined.pth.tar \
data.dataset_name=ajd12342/paraspeechclap-eval-combined \
data.audio_root=/path/to/audio_root \
meta.results=./results_retrieval/paraspeechclap-eval-combined/ajd12342-paraspeechclap-combined
```
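The metrics reported by `scripts/evaluate_retrieval.py` are not described in this card. As a rough illustration of what a retrieval metric measures, the sketch below computes a generic recall@k from a query-by-candidate similarity matrix. The function name `recall_at_k`, its inputs, and the toy numbers are all assumptions made for this example and may differ from what the repository actually computes.

```python
from typing import Sequence, Set


def recall_at_k(similarity: Sequence[Sequence[float]],
                relevant: Sequence[Set[int]],
                k: int) -> float:
    """Fraction of queries with at least one relevant item among the top-k candidates.

    similarity[i][j] is the score between query i and candidate j;
    relevant[i] holds the candidate indices counted as correct for query i.
    """
    hits = 0
    for scores, gold in zip(similarity, relevant):
        # Rank candidate indices by descending score and keep the first k.
        top_k = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]
        if gold.intersection(top_k):
            hits += 1
    return hits / len(similarity)


# Toy scores for 2 text queries against 3 audio clips (made-up numbers).
toy_similarity = [[0.9, 0.1, 0.3], [0.2, 0.4, 0.8]]
toy_relevant = [{0}, {1}]
print(recall_at_k(toy_similarity, toy_relevant, k=1))
print(recall_at_k(toy_similarity, toy_relevant, k=2))
```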
### Loading the dataset
```python
from datasets import load_dataset
dataset = load_dataset("ajd12342/paraspeechclap-eval-combined", split="test")
print(f"Number of clips: {len(dataset)}")
print(f"Number of unique prompts: {len(set(dataset['text_description']))}")
print(dataset[0])
```
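Because the number of unique prompts can be smaller than the number of clips, it may help to know which clips share a caption. The following is a minimal sketch that groups row indices by `text_description`. The helper name `clips_by_caption` is hypothetical, and the toy rows stand in for real dataset rows so the example runs without downloading anything.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List, Mapping


def clips_by_caption(rows: Iterable[Mapping[str, Any]]) -> Dict[str, List[int]]:
    """Map each distinct text_description to the row indices that carry it."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for index, row in enumerate(rows):
        groups[row["text_description"]].append(index)
    return dict(groups)


# Made-up rows standing in for dataset rows.
toy_rows = [
    {"text_description": "caption A"},
    {"text_description": "caption B"},
    {"text_description": "caption A"},
]
groups = clips_by_caption(toy_rows)
print(groups)
```

Iterating over a loaded `datasets.Dataset` yields one dictionary per row, so the same helper should work when passed the `dataset` object directly.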
## Related Resources
- **GitHub Repository:** [https://github.com/ajd12342/paraspeechclap](https://github.com/ajd12342/paraspeechclap)
- **Models:** [ajd12342/paraspeechclap-intrinsic](https://huggingface.co/ajd12342/paraspeechclap-intrinsic), [ajd12342/paraspeechclap-situational](https://huggingface.co/ajd12342/paraspeechclap-situational) and [ajd12342/paraspeechclap-combined](https://huggingface.co/ajd12342/paraspeechclap-combined)
- **Parent Dataset:** [https://huggingface.co/datasets/ajd12342/paraspeechcaps](https://huggingface.co/datasets/ajd12342/paraspeechcaps)
- **Training Datasets:** [https://huggingface.co/datasets/ajd12342/paraspeechcaps-intrinsic-train](https://huggingface.co/datasets/ajd12342/paraspeechcaps-intrinsic-train) and [https://huggingface.co/datasets/ajd12342/paraspeechcaps-situational-train](https://huggingface.co/datasets/ajd12342/paraspeechcaps-situational-train)
- **Evaluation Datasets:** [https://huggingface.co/datasets/ajd12342/paraspeechclap-eval-intrinsic](https://huggingface.co/datasets/ajd12342/paraspeechclap-eval-intrinsic), [https://huggingface.co/datasets/ajd12342/paraspeechclap-eval-situational](https://huggingface.co/datasets/ajd12342/paraspeechclap-eval-situational) and [https://huggingface.co/datasets/ajd12342/paraspeechclap-eval-combined](https://huggingface.co/datasets/ajd12342/paraspeechclap-eval-combined)
## Citation
```bibtex
@inproceedings{diwan2026paraspeechclap,
title={{ParaSpeechCLAP}: A Dual-Encoder Speech-Text Model for Rich Stylistic Language-Audio Pretraining},
author={Diwan, Anuj and Choi, Eunsol and Harwath, David},
journal={Under Review},
year={2026}
}
```
Provider:
ajd12342



