ajd12342/paraspeechclap-eval-intrinsic
Hugging Face · Updated 2026-03-27 · Indexed 2026-03-29
---
language:
- en
license: cc-by-nc-sa-4.0
tags:
- speech
- audio
- style
- CLAP
- dual-encoder
- evaluation
- benchmark
- intrinsic
- speaker-level
- classification
source_datasets:
- ajd12342/paraspeechcaps
task_categories:
- audio-classification
size_categories:
- 1K<n<10K
dataset_info:
  features:
  - name: source
    dtype: string
  - name: relative_audio_path
    dtype: string
  - name: text_description
    dtype: string
  - name: classification_tag
    dtype: string
  - name: transcription
    dtype: string
  - name: all_tags
    sequence: string
  - name: speakerid
    dtype: string
  - name: name
    dtype: string
  - name: duration
    dtype: float64
  - name: gender
    dtype: string
  - name: accent
    dtype: string
  - name: pitch
    dtype: string
  - name: speaking_rate
    dtype: string
  - name: noise
    dtype: string
  - name: utterance_pitch_mean
    dtype: float64
  - name: snr
    dtype: float64
  - name: phonemes
    dtype: string
  splits:
  - name: test
    num_bytes: 1821680
    num_examples: 2819
  - name: classification_clarity
    num_bytes: 482928
    num_examples: 875
  - name: classification_pitch
    num_bytes: 811642
    num_examples: 1438
  - name: classification_rhythm
    num_bytes: 1117149
    num_examples: 1956
  - name: classification_texture
    num_bytes: 598245
    num_examples: 1093
  - name: classification_volume
    num_bytes: 693780
    num_examples: 1216
  download_size: 1924265
  dataset_size: 5525424
configs:
- config_name: default
  data_files:
  - split: test
    path: data/test-*
  - split: classification_clarity
    path: data/classification_clarity-*
  - split: classification_pitch
    path: data/classification_pitch-*
  - split: classification_rhythm
    path: data/classification_rhythm-*
  - split: classification_texture
    path: data/classification_texture-*
  - split: classification_volume
    path: data/classification_volume-*
---
# Intrinsic Evaluation Dataset for ParaSpeechCLAP Models
Evaluation dataset for **intrinsic (speaker-level)** style attributes, from the paper:
*ParaSpeechCLAP: A Dual-Encoder Speech-Text Model for Rich Stylistic Language-Audio Pretraining*
Anuj Diwan, Eunsol Choi, David Harwath
## Overview
This dataset is the **intrinsic evaluation dataset** for the ParaSpeechCLAP model family. It contains speech clips paired with intrinsic-tag style captions, drawn from the **VoxCeleb** portion of the [ParaSpeechCaps](https://huggingface.co/datasets/ajd12342/paraspeechcaps) holdout set. The captions are generated solely from the intrinsic and basic tags.
The dataset has **six splits**: `test` for **retrieval** evaluation, and `classification_clarity`, `classification_pitch`, `classification_rhythm`, `classification_texture`, and `classification_volume` for per-attribute **classification** evaluation. Retrieval uses the `text_description` column as its key; classification uses `classification_tag`. Classification is evaluated separately for each attribute using the template *"A person is speaking in a {label} style"*.
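As a rough illustration of how template-based classification can work, the sketch below fills the template once per candidate label and picks the label whose prompt embedding is closest to the audio embedding. The label names and the 2-D embeddings are made up for the example, and the cosine-similarity argmax is an assumption about the scoring rule, not a description of the ParaSpeechCLAP scripts.

```python
import math

TEMPLATE = "A person is speaking in a {label} style"


def build_prompts(labels):
    """Fill the template once per candidate label."""
    return [TEMPLATE.format(label=label) for label in labels]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def classify(audio_embedding, prompt_embeddings, labels):
    """Return the label whose prompt embedding best matches the audio."""
    scores = [cosine(audio_embedding, p) for p in prompt_embeddings]
    best = max(range(len(labels)), key=scores.__getitem__)
    return labels[best]


# Hypothetical labels and toy embeddings, for illustration only.
labels = ["high-pitched", "deep"]
prompts = build_prompts(labels)
toy_prompt_embeddings = [[1.0, 0.0], [0.0, 1.0]]
predicted = classify([0.2, 0.9], toy_prompt_embeddings, labels)
print(prompts)
print(predicted)
```

In a real run, the embeddings would come from the model's text and audio encoders, and the candidate labels would be the distinct `classification_tag` values of the split being evaluated.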
## Installation
Install the `datasets` package to load the dataset:
```bash
pip install datasets
```
To run retrieval and classification evaluation with ParaSpeechCLAP, install the [ParaSpeechCLAP GitHub repository](https://github.com/ajd12342/paraspeechclap):
```bash
git clone https://github.com/ajd12342/paraspeechclap.git
cd paraspeechclap
pip install -r requirements.txt
```
### Setting up audio files
The dataset contains a `relative_audio_path` column but not the audio files themselves. Resolving audio paths requires specifying `data.audio_root`, a common root directory organized as `${audio_root}/{source}/`, where `{source}` matches the value of the `source` column in the dataset.
This dataset uses the **VoxCeleb** source. Follow the [ParaSpeechCaps audio setup instructions](https://github.com/ajd12342/paraspeechcaps/tree/main/dataset#22-processing-dataset-audio) for VoxCeleb, with the following adjustment: instead of placing VoxCeleb at its own `${voxceleb_root}`, place it under a common root:
- `${audio_root}/voxceleb/` (instead of `${voxceleb_root}`)
Then pass `data.audio_root=${audio_root}` when running any ParaSpeechCLAP script.
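Before launching an evaluation, it can help to check that the audio layout resolves as expected. The sketch below assumes that `relative_audio_path` is relative to `${audio_root}/{source}/`; the helper names and the example row are hypothetical, so verify the join against a few real rows.

```python
from pathlib import Path


def resolve_audio_path(audio_root, row):
    """Join the common root, the row's source, and its relative path."""
    return Path(audio_root) / row["source"] / row["relative_audio_path"]


def find_missing(audio_root, rows):
    """Return the resolved paths that do not exist on disk."""
    paths = (resolve_audio_path(audio_root, row) for row in rows)
    return [path for path in paths if not path.exists()]


# Hypothetical row; real values come from the dataset's columns.
example_row = {"source": "voxceleb", "relative_audio_path": "speaker/clip.wav"}
resolved = resolve_audio_path("/path/to/audio_root", example_row)
print(resolved.as_posix())
```

Calling `find_missing(audio_root, dataset)` on a loaded split should return an empty list once the VoxCeleb audio is in place.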
## Usage with ParaSpeechCLAP
### Retrieval evaluation
```bash
python scripts/evaluate_retrieval.py \
  --config-name eval/retrieval \
  checkpoint_path=./checkpoints/paraspeechclap-intrinsic.pth.tar \
  data.dataset_name=ajd12342/paraspeechclap-eval-intrinsic \
  data.audio_root=/path/to/audio_root \
  meta.results=./results_retrieval/paraspeechclap-eval-intrinsic/ajd12342-paraspeechclap-intrinsic
```
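The metrics reported by the retrieval script are not described in this card. As one illustration of what a retrieval evaluation over the `test` split could compute, the sketch below implements recall@k for clip-to-prompt retrieval: each clip is scored against every unique `text_description`, and a hit is counted when the clip's own prompt is among the top k. The function name, the similarity matrix, and the target indices are all made up for the example.

```python
def recall_at_k(similarity, true_prompt_index, k):
    """Fraction of clips whose own prompt is among the k best-scoring prompts.

    similarity[i][j] is the score between clip i and unique prompt j;
    true_prompt_index[i] is the column of clip i's own text_description.
    """
    hits = 0
    for scores, target in zip(similarity, true_prompt_index):
        # Rank prompt indices from highest to lowest score for this clip.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if target in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Toy scores for 3 clips and 2 unique prompts (made-up numbers).
toy_similarity = [[0.9, 0.1], [0.4, 0.6], [0.7, 0.3]]
toy_targets = [0, 1, 1]
r1 = recall_at_k(toy_similarity, toy_targets, k=1)
r2 = recall_at_k(toy_similarity, toy_targets, k=2)
print(r1, r2)
```

Several clips can share one prompt in this dataset, which is why the sketch ranks unique prompts per clip rather than assuming a one-to-one pairing.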
### Classification evaluation
```bash
# Evaluate all attributes
for attr in clarity pitch rhythm texture volume; do
  python scripts/evaluate_classification.py \
    --config-name eval/classification/${attr} \
    checkpoint_path=./checkpoints/paraspeechclap-intrinsic.pth.tar \
    data.audio_root=/path/to/audio_root \
    meta.results=./results_classification/paraspeechclap-eval-intrinsic/ajd12342-paraspeechclap-intrinsic/${attr}
done
```
### Loading the dataset
```python
from datasets import load_dataset
# Retrieval split
retrieval = load_dataset("ajd12342/paraspeechclap-eval-intrinsic", split="test")
print(f"Retrieval clips: {len(retrieval)}")
print(f"Unique prompts: {len(set(retrieval['text_description']))}")
# Classification split (e.g., pitch)
cls_pitch = load_dataset("ajd12342/paraspeechclap-eval-intrinsic", split="classification_pitch")
print(f"Classification pitch clips: {len(cls_pitch)}")
print(f"Labels: {set(cls_pitch['classification_tag'])}")
```
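Once a split is loaded, a quick look at how labels and speakers are distributed can help when interpreting per-attribute accuracy. The helpers below are a minimal sketch that works on any iterable of row dictionaries; the function names and the toy rows are hypothetical, and only the column names come from the dataset schema.

```python
from collections import Counter


def label_counts(rows, column="classification_tag"):
    """Count how many rows carry each value of the given column."""
    return Counter(row[column] for row in rows)


def clips_per_speaker(rows):
    """Count how many clips each speakerid contributes."""
    return Counter(row["speakerid"] for row in rows)


# Toy rows standing in for a classification split (made-up values).
toy_rows = [
    {"classification_tag": "label_a", "speakerid": "spk1"},
    {"classification_tag": "label_a", "speakerid": "spk2"},
    {"classification_tag": "label_b", "speakerid": "spk1"},
]
tag_counts = label_counts(toy_rows)
speaker_counts = clips_per_speaker(toy_rows)
print(tag_counts.most_common())
print(speaker_counts.most_common())
```

Iterating over a loaded `datasets.Dataset` yields one dictionary per row, so `label_counts(cls_pitch)` should work directly on the split loaded above.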
## Related Resources
- **GitHub Repository:** [https://github.com/ajd12342/paraspeechclap](https://github.com/ajd12342/paraspeechclap)
- **Models:** [ajd12342/paraspeechclap-intrinsic](https://huggingface.co/ajd12342/paraspeechclap-intrinsic), [ajd12342/paraspeechclap-combined](https://huggingface.co/ajd12342/paraspeechclap-combined), and [ajd12342/paraspeechclap-situational](https://huggingface.co/ajd12342/paraspeechclap-situational)
- **Parent Dataset:** [https://huggingface.co/datasets/ajd12342/paraspeechcaps](https://huggingface.co/datasets/ajd12342/paraspeechcaps)
- **Training Dataset:** [https://huggingface.co/datasets/ajd12342/paraspeechcaps-intrinsic-train](https://huggingface.co/datasets/ajd12342/paraspeechcaps-intrinsic-train)
- **Evaluation Datasets:** [https://huggingface.co/datasets/ajd12342/paraspeechclap-eval-combined](https://huggingface.co/datasets/ajd12342/paraspeechclap-eval-combined) and [https://huggingface.co/datasets/ajd12342/paraspeechclap-eval-situational](https://huggingface.co/datasets/ajd12342/paraspeechclap-eval-situational)
## Citation
```bibtex
@inproceedings{diwan2026paraspeechclap,
title={{ParaSpeechCLAP}: A Dual-Encoder Speech-Text Model for Rich Stylistic Language-Audio Pretraining},
author={Diwan, Anuj and Choi, Eunsol and Harwath, David},
journal={Under Review},
year={2026}
}
```