ajd12342/paraspeechcaps-intrinsic-train

Name: ajd12342/paraspeechcaps-intrinsic-train
Creator: ajd12342
Published: 2026-04-06 01:12:25
License: cc-by-nc-sa-4.0 (as stated in the dataset card metadata)

Hugging Face | Updated 2026-04-06 | Indexed 2026-03-29

Download link:

https://hf-mirror.com/datasets/ajd12342/paraspeechcaps-intrinsic-train


Resource description:

---
language:
- en
license: cc-by-nc-sa-4.0
tags:
- speech
- audio
- style
- CLAP
- dual-encoder
- contrastive-learning
- intrinsic
- speaker-level
source_datasets:
- ajd12342/paraspeechcaps
task_categories:
- audio-classification
size_categories:
- 100K<n<1M
dataset_info:
  features:
  - name: source
    dtype: string
  - name: relative_audio_path
    dtype: string
  - name: text_description
    sequence: string
  - name: transcription
    dtype: string
  - name: intrinsic_tags
    sequence: string
  - name: situational_tags
    sequence: string
  - name: basic_tags
    sequence: string
  - name: all_tags
    sequence: string
  - name: speakerid
    dtype: string
  - name: name
    dtype: string
  - name: duration
    dtype: float64
  - name: gender
    dtype: string
  - name: accent
    dtype: string
  - name: pitch
    dtype: string
  - name: speaking_rate
    dtype: string
  - name: noise
    dtype: string
  - name: utterance_pitch_mean
    dtype: float64
  - name: snr
    dtype: float64
  - name: phonemes
    dtype: string
  splits:
  - name: train
    num_bytes: 925936580
    num_examples: 944820
  download_size: 321934506
  dataset_size: 925936580
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---

# ParaSpeechCaps Intrinsic Training Dataset

Training dataset for the **ParaSpeechCLAP-Intrinsic** and **ParaSpeechCLAP-Combined** models, from the paper:

[*ParaSpeechCLAP: A Dual-Encoder Speech-Text Model for Rich Stylistic Language-Audio Pretraining*](https://arxiv.org/abs/2603.28737)

Anuj Diwan, Eunsol Choi, David Harwath

*Under review*

This dataset contains the **intrinsic-tag subset** of [ParaSpeechCaps](https://huggingface.co/datasets/ajd12342/paraspeechcaps), filtered to include examples annotated with at least one **intrinsic (speaker-level)** style tag. It is used to train the ParaSpeechCLAP-Intrinsic model with a contrastive + classification multitask loss and class-balanced sampling and the ParaSpeechCLAP-Combined model with a contrastive loss.

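
The card mentions class-balanced sampling over intrinsic tags but does not describe the scheme. The following is a minimal sketch of one possible approach, not the authors' implementation: it assumes each example is weighted by the mean inverse frequency of its `intrinsic_tags`. The function name `class_balanced_weights` and the toy tag names are hypothetical and used only for illustration.

```python
from collections import Counter


def class_balanced_weights(examples):
    """Return one sampling weight per example from inverse tag frequency.

    Assumed scheme (not confirmed by the dataset card): the weight of an
    example is the mean of 1 / count(tag) over its intrinsic tags.
    """
    # Count how many examples carry each intrinsic tag.
    tag_counts = Counter(
        tag for ex in examples for tag in set(ex["intrinsic_tags"])
    )
    weights = []
    for ex in examples:
        tags = set(ex["intrinsic_tags"])
        if not tags:
            # Should not occur in this subset, which requires at least one tag.
            weights.append(0.0)
            continue
        weights.append(sum(1.0 / tag_counts[t] for t in tags) / len(tags))
    return weights


# Toy records mimicking the `intrinsic_tags` column; tag names are illustrative.
toy = [
    {"intrinsic_tags": ["deep"]},
    {"intrinsic_tags": ["deep"]},
    {"intrinsic_tags": ["deep"]},
    {"intrinsic_tags": ["shrill"]},
]
weights = class_balanced_weights(toy)
print(weights)
```

Weights of this kind could be passed to a weighted sampler such as `torch.utils.data.WeightedRandomSampler`, so that examples with rare tags are drawn more often than their raw share of the data.
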
## Installation

Install the `datasets` package to load the dataset:

```bash
pip install datasets
```

To train ParaSpeechCLAP models using this dataset, install the [ParaSpeechCLAP GitHub repository](https://github.com/ajd12342/paraspeechclap):

```bash
git clone https://github.com/ajd12342/paraspeechclap.git
cd paraspeechclap
pip install -r requirements.txt
```

### Setting up audio files

The dataset contains a `relative_audio_path` column but not the audio files themselves. Resolving audio paths requires specifying `data.audio_root`, a common root directory organized as `${audio_root}/{source}/`, where `{source}` matches the value of the `source` column in the dataset.

This dataset includes examples from **VoxCeleb**, **Expresso**, **EARS**, and **Emilia**. Follow the [ParaSpeechCaps audio setup instructions](https://github.com/ajd12342/paraspeechcaps/tree/main/dataset#22-processing-dataset-audio) for those sources, with the following adjustment: instead of placing each source at its own root directory, place them under a common root:

- `${audio_root}/voxceleb/` (instead of `${voxceleb_root}`)
- `${audio_root}/expresso/` (instead of `${expresso_root}`)
- `${audio_root}/ears/` (instead of `${ears_root}`)
- `${audio_root}/emilia/` (instead of `${emilia_root}`)

Then pass `data.audio_root=${audio_root}` when running any ParaSpeechCLAP script.

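
The directory layout above can be sketched in code. This is an illustration under stated assumptions, not the repository's loader: it assumes the full path is `audio_root / source / relative_audio_path`, and that `source` values match the lowercase directory names listed above. The helper `resolve_audio_path` and the sample record are hypothetical.

```python
from pathlib import Path


def resolve_audio_path(audio_root, example):
    """Join the common root, the source name, and the relative path.

    Assumes `relative_audio_path` is relative to the per-source directory.
    """
    return Path(audio_root) / example["source"] / example["relative_audio_path"]


# Hypothetical record; real rows come from the dataset's `source` and
# `relative_audio_path` columns.
example = {"source": "voxceleb", "relative_audio_path": "id00001/clip.wav"}
resolved = resolve_audio_path("/path/to/audio_root", example)
print(resolved.as_posix())
```

Checking `resolved.exists()` for a handful of rows from each source is a quick way to confirm the layout before starting a training run.
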
## Usage with ParaSpeechCLAP

### Training

```bash
torchrun --nproc_per_node=4 scripts/train.py \
  --config-name train/intrinsic \
  data.audio_root=/path/to/audio_root \
  meta.results=./experiments
```

### Loading the dataset

```python
from datasets import load_dataset

dataset = load_dataset("ajd12342/paraspeechcaps-intrinsic-train", split="train")
print(f"Number of examples: {len(dataset)}")
print(dataset[0])
```

## Related Resources

- **GitHub Repository:** [https://github.com/ajd12342/paraspeechclap](https://github.com/ajd12342/paraspeechclap)
- **Models:** [ajd12342/paraspeechclap-intrinsic](https://huggingface.co/ajd12342/paraspeechclap-intrinsic), [ajd12342/paraspeechclap-situational](https://huggingface.co/ajd12342/paraspeechclap-situational) and [ajd12342/paraspeechclap-combined](https://huggingface.co/ajd12342/paraspeechclap-combined)
- **Parent Dataset:** [https://huggingface.co/datasets/ajd12342/paraspeechcaps](https://huggingface.co/datasets/ajd12342/paraspeechcaps)
- **Training Datasets:** [https://huggingface.co/datasets/ajd12342/paraspeechcaps-intrinsic-train](https://huggingface.co/datasets/ajd12342/paraspeechcaps-intrinsic-train) and [https://huggingface.co/datasets/ajd12342/paraspeechcaps-situational-train](https://huggingface.co/datasets/ajd12342/paraspeechcaps-situational-train)
- **Evaluation Datasets:** [https://huggingface.co/datasets/ajd12342/paraspeechclap-eval-intrinsic](https://huggingface.co/datasets/ajd12342/paraspeechclap-eval-intrinsic), [https://huggingface.co/datasets/ajd12342/paraspeechclap-eval-situational](https://huggingface.co/datasets/ajd12342/paraspeechclap-eval-situational) and [https://huggingface.co/datasets/ajd12342/paraspeechclap-eval-combined](https://huggingface.co/datasets/ajd12342/paraspeechclap-eval-combined)

## Citation

```bibtex
@misc{diwan2026paraspeechclapdualencoderspeechtextmodel,
      title={ParaSpeechCLAP: A Dual-Encoder Speech-Text Model for Rich Stylistic Language-Audio Pretraining},
      author={Anuj Diwan and Eunsol Choi and David Harwath},
      year={2026},
      eprint={2603.28737},
      archivePrefix={arXiv},
      primaryClass={eess.AS},
      url={https://arxiv.org/abs/2603.28737},
}
```

Provider:

ajd12342
