yfyeung/FCaps

Name: yfyeung/FCaps
Creator: yfyeung
Published: 2026-02-07 16:29:59
License: CC BY-NC-SA 4.0

Source: Hugging Face. Updated: 2026-02-07. Indexed: 2026-03-29.

Download link:

https://hf-mirror.com/datasets/yfyeung/FCaps


Description:

---
language:
- en
license: cc-by-nc-sa-4.0
pretty_name: FCaps
viewer: true
size_categories:
- 10M<n<100M
dataset_info:
- config_name: FCaps
configs:
- config_name: FCaps
  data_files:
  - split: Emilia
    path: data/fcaps-emilia.jsonl.gz
  - split: PSCBase
    path: data/fcaps-pscbase-train_base.jsonl.gz
  - split: dev
    path: data/fcaps-dev.jsonl.gz
  - split: test
    path: data/fcaps-test.jsonl.gz
---

# FCaps

### Dataset Description

**FCaps** is a large-scale dataset with open-ended and fine-grained speaking style descriptions, encompassing 47k hours of speech and 19M captions.

The source audio files for FCaps-PSCBase, dev, and test splits are available via [Google Drive](https://drive.google.com/drive/folders/1dIADNOg5l370tAvT6dUvpdecEbszmkmr?usp=sharing).

For FCaps-Emilia, use the following command to download the source audio files:

```bash
huggingface-cli download \
  amphion/Emilia-Dataset \
  --repo-type dataset \
  --include "Emilia/EN/**" \
  --local-dir . \
  --local-dir-use-symlinks False
```

Expected directory structure:

```
download
├── Emilia
│   └── EN
├── ears
├── expresso
├── voxceleb1
└── voxceleb2
```

### Dataset Statistics

| Split | Number of Speech Clips | Number of Captions | Duration (hours) |
|-------|-------------------|------------------|------------------|
| FCaps-Emilia | 18,131,371 | 18,131,371 | 46,787 |
| FCaps-PSCBase (train_base) | 114,684 | 1,071,519 | 267 |
| dev | 11,772 | 23,544 | - |
| test | 241 | 482 | - |

### Data Fields

The dataset follows the Lhotse MonoCut schema. Each item contains the following fields:

- id (string): Unique identifier for the audio cut.
- start (float): The start time of the cut relative to the underlying recording (always 0.0).
- duration (float): The duration of the audio cut in seconds.
- channel (int): The channel index (always 0).
- type (string): The type of the cut ("MonoCut").
- recording (dict)
  - sampling_rate (int): The sampling rate of the audio.
  - num_samples (int): Total number of samples in the cut.
  - sources (list): A list containing the file path or URL to the raw audio.
- supervisions (list of dict)
  - id (string): Unique identifier for the supervision segment.
  - text (string): The transcript of the spoken content.
  - speaker (string): The name or ID of the speaker.
  - gender (string): The gender of the speaker.
  - custom (dict)
    - accent (string): The accent of the speaker.
    - pitch (string): The perceived pitch level.
    - speaking_rate (string): The perceived speed of speech.
    - intrinsic_tags (list of strings): The intrinsic tags of speech.
    - situational_tags (list of strings): The situational tags of speech.
    - global_captions (list of strings): The global captions of speech.
    - finegrained_captions (list of strings): The fine-grained captions of speech.

### Caption Taxonomy

We define two types of textual supervision:

- *Global captions* provide a holistic description of the speech that summarizes speaker-related attributes, encompassing intrinsic traits tied to a speaker’s identity and stable across utterances, and situational traits that may vary across utterances. Such descriptions are atemporal in nature and do not narrate intra-utterance variations.
- *Fine-grained captions* extend beyond a holistic speaker profile by providing a temporal and narrative structure that tracks within-clip dynamics such as style shifts, prosodic variations, emphasis patterns, and non-verbal vocalizations, and may further encode the speaker’s delivery style, communicative role, and communicative intent.

Together, they provide multi-granular views of the same speech signal, thereby supporting fine-grained contrastive learning of a unified representation across multiple granularities.
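
The fields and caption types described above can be read with the Python standard library alone. The following is a minimal sketch, not the authors' loader. It assumes that each line of a split file such as `data/fcaps-dev.jsonl.gz` is one JSON object in the MonoCut layout, and that the caption lists sit under each supervision's `custom` dict. The helper names `iter_cuts` and `collect_captions` are hypothetical and belong to neither Lhotse nor this dataset.

```python
import gzip
import json
from typing import Dict, Iterator, List


def iter_cuts(path: str) -> Iterator[Dict]:
    """Yield one cut (a dict) per non-empty line of a gzip-compressed JSONL file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def collect_captions(cut: Dict) -> Dict[str, List[str]]:
    """Gather global and fine-grained captions from all supervisions of a cut.

    Assumption: captions are stored in supervisions[i]["custom"]; missing
    keys are treated as empty lists rather than raising an error.
    """
    captions: Dict[str, List[str]] = {"global": [], "finegrained": []}
    for supervision in cut.get("supervisions", []):
        custom = supervision.get("custom") or {}
        captions["global"].extend(custom.get("global_captions", []))
        captions["finegrained"].extend(custom.get("finegrained_captions", []))
    return captions
```

With a downloaded split file, `collect_captions(next(iter_cuts("data/fcaps-dev.jsonl.gz")))` would return both caption lists for the first cut, which is the pairing of granularities that the taxonomy describes for a single speech clip.
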

### Acknowledgment

- ParaSpeechCaps PSC-Base: https://huggingface.co/datasets/ajd12342/paraspeechcaps
- Emilia: https://huggingface.co/datasets/amphion/Emilia-Dataset
- EARS: https://github.com/facebookresearch/ears_dataset
- Expresso: https://github.com/facebookresearch/textlesslib/tree/main/examples/expresso/dataset
- VoxCeleb: https://mm.kaist.ac.kr/datasets/voxceleb

### Citation

Please cite our paper if you find this work useful:

```bibtex
@misc{yang2026clsp,
  title={Towards Fine-Grained and Multi-Granular Contrastive Language-Speech Pre-training},
  author={Yifan Yang and Bing Han and Hui Wang and Wei Wang and Ziyang Ma and Long Zhou and Zengrui Jin and Guanrou Yang and Tianrui Wang and Xu Tan and Xie Chen},
  year={2026},
  eprint={2601.03065},
  archivePrefix={arXiv},
  primaryClass={eess.AS},
  url={https://arxiv.org/abs/2601.03065},
}
```

### License

This dataset is released under the [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) (Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International) license.

This is a derivative work that aggregates and modifies the following datasets. Users of this dataset must adhere to the terms of the original licenses as follows:

- ParaSpeechCaps: Licensed under CC BY-NC-SA 4.0.
- Emilia: Licensed under CC BY-NC 4.0. The copyright remains with the original owners of the videos or audio.
- EARS: Licensed under CC BY-NC 4.0.
- Expresso: Licensed under CC BY-NC 4.0.
- VoxCeleb: Licensed under CC BY 4.0. The copyright remains with the original owners of the video.

