amphion/Debatts-Data
Source: Hugging Face | Updated: 2024-10-23 | Indexed: 2025-04-12
Download link:
https://hf-mirror.com/datasets/amphion/Debatts-Data
Resource description:
---
language:
- zh
license: cc-by-nc-4.0
size_categories:
- 10B<n<100B
task_categories:
- text-to-speech
pretty_name: Debatts-Data
tags:
- AI
- Debating
- Expressive
dataset_info:
  features:
  - name: Rebuttal Subject
    dtype: string
  - name: Audio Name
    dtype: string
  - name: audio
    dtype: audio
  - name: text
    dtype: string
  - name: json
    struct:
    - name: duration
      dtype: float64
    - name: key
      dtype: string
    - name: language
      dtype: string
    - name: prompt0_wav_path
      dtype: string
    - name: style_feature
      dtype: string
    - name: text
      dtype: string
    - name: wav_path
      dtype: string
  splits:
  - name: train
    num_bytes: 9878829.0
    num_examples: 8
  download_size: 8784465
  dataset_size: 9878829.0
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---
# Debatts-Data: The First Mandarin Rebuttal Speech Dataset for Expressive Text-to-Speech Synthesis
Debatts-Data is the first Mandarin rebuttal speech dataset for expressive text-to-speech synthesis. It is built from a large collection of professional Mandarin speech gathered from video platforms and podcasts on the Internet. Because the recordings are collected in the wild, the rebuttal speech is real and natural. The dataset also includes annotations for transcription, duration, and style embedding.
The table and chart below provide statistics for the dataset. For dataset samples and more information about the Debatts system, please visit the [Debatts project page](https://amphionspace.github.io/debatts/).
## Dataset Specifications
| Attribute | Value |
|----------------------|---------------|
| Language | ZH |
| Number of Speakers | 2,350 (est.) |
| Duration (hrs) | 111 |
| Type | Text + Speech |
| Sample Rate (kHz) | 16 |
| Recording Method | In the wild |
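To check that a clip matches the 16 kHz sample rate listed in the table, something like the following could be used. This is a minimal sketch using Python's standard `wave` module, and it assumes the clips are uncompressed PCM WAV files, which the dataset card does not confirm. The helper name `wav_info` and the synthetic demo clip are illustrative only.

```python
import os
import tempfile
import wave

EXPECTED_RATE_HZ = 16_000  # 16 kHz, from the specification table


def wav_info(path):
    """Return (sample_rate_hz, duration_seconds) for a PCM WAV file."""
    with wave.open(path, "rb") as wf:
        rate = wf.getframerate()
        return rate, wf.getnframes() / rate


# Demo on a synthetic one-second silent clip (not real dataset audio).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.wav")
    with wave.open(demo_path, "wb") as wf:
        wf.setnchannels(1)                 # mono (an assumption)
        wf.setsampwidth(2)                 # 16-bit samples (an assumption)
        wf.setframerate(EXPECTED_RATE_HZ)
        wf.writeframes(b"\x00\x00" * EXPECTED_RATE_HZ)
    demo_rate, demo_seconds = wav_info(demo_path)

print(demo_rate == EXPECTED_RATE_HZ, demo_seconds)
```

For real data, `wav_info` would be pointed at an extracted clip instead of the synthetic file.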
The JSON files in the dataset contain the following keys:
| Key | Description |
|-------------------|----------------------------------------------------------|
| `key` | Unique identifier for each sample in the dataset |
| `text` | Text transcription of the audio |
| `duration` | Duration of the audio clip in seconds |
| `language` | Language of the audio content |
| `wav_path` | Path to the corresponding WAV file |
| `prompt0_wav_path`| Path to the WAV file used as a prompt |
| `style_feature` | Style features associated with the audio sample |
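As an illustration of how these keys might be read in Python, here is a minimal sketch that parses one metadata record into a typed object. The `DebattsRecord` class, the `parse_record` helper, and every value in the sample record are hypothetical; only the key names and their types come from the dataset card.

```python
import json
from dataclasses import dataclass


@dataclass
class DebattsRecord:
    key: str               # unique identifier of the sample
    text: str              # transcription of the audio
    duration: float        # clip length in seconds
    language: str          # language of the audio content
    wav_path: str          # path to the corresponding WAV file
    prompt0_wav_path: str  # path to the WAV file used as a prompt
    style_feature: str     # style features stored as a string


def parse_record(raw):
    """Parse one JSON metadata record; raises KeyError if a key is missing."""
    data = json.loads(raw)
    return DebattsRecord(
        key=data["key"],
        text=data["text"],
        duration=float(data["duration"]),
        language=data["language"],
        wav_path=data["wav_path"],
        prompt0_wav_path=data["prompt0_wav_path"],
        style_feature=data["style_feature"],
    )


# Hypothetical record for illustration; the values are not taken from the dataset.
sample = json.dumps({
    "key": "sample_0001",
    "text": "placeholder transcription",
    "duration": 3.5,
    "language": "zh",
    "wav_path": "wavs/sample_0001.wav",
    "prompt0_wav_path": "wavs/sample_0000.wav",
    "style_feature": "placeholder style description",
})
record = parse_record(sample)
print(record.key, record.duration)
```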
## README 🔥🔥🔥
## Dataset Usage
To use the Debatts-Data dataset, download the raw audio files from the "Files and versions" tab of the dataset repository. `Debatts-Data.tar.gz` contains the training data, while `Debatts-Data_test.tar.gz` contains the test data, which additionally includes speaker prompt speech.
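Once an archive has been downloaded, it can be unpacked with Python's standard `tarfile` module. The sketch below is one possible approach, not an official loader: the helper name `extract_archive` is hypothetical, and the demo builds a tiny stand-in archive so the example runs without the real files.

```python
import io
import tarfile
import tempfile
from pathlib import Path


def extract_archive(archive_path, out_dir):
    """Extract a .tar.gz archive and return the extracted file paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    # Use the safer "data" extraction filter where this Python version supports it.
    kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(out, **kwargs)
    return sorted(p.relative_to(out).as_posix() for p in out.rglob("*") if p.is_file())


# Demo on a tiny stand-in archive (not the real dataset archive).
with tempfile.TemporaryDirectory() as tmp:
    archive = Path(tmp) / "demo.tar.gz"
    payload = b"placeholder"
    with tarfile.open(archive, "w:gz") as tar:
        info = tarfile.TarInfo("demo/sample.wav")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    extracted = extract_archive(archive, Path(tmp) / "out")

print(extracted)
```

For the real data, the call would look like `extract_archive("Debatts-Data.tar.gz", "debatts_train")`, where the output directory name is an arbitrary choice.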
*Please note that Debatts-Data does not own the copyright to the audio files; the copyright remains with the original owners of the videos or audio. Users may use this dataset only for non-commercial purposes under the CC BY-NC 4.0 license.*
Provider:
amphion



