---
annotations_creators:
- expert-generated
language_creators:
- crowdsourced
language:
- en
license:
- cc-by-4.0
multilinguality:
- monolingual
pretty_name: VCTK
size_categories:
- 10K<n<100K
source_datasets:
- original
task_categories:
- automatic-speech-recognition
- text-to-speech
- text-to-audio
task_ids: []
paperswithcode_id: vctk
train-eval-index:
- config: main
  task: automatic-speech-recognition
  task_id: speech_recognition
  splits:
    train_split: train
  col_mapping:
    file: path
    text: text
  metrics:
  - type: wer
    name: WER
  - type: cer
    name: CER
dataset_info:
  features:
  - name: speaker_id
    dtype: string
  - name: audio
    dtype:
      audio:
        sampling_rate: 48000
  - name: file
    dtype: string
  - name: text
    dtype: string
  - name: text_id
    dtype: string
  - name: age
    dtype: string
  - name: gender
    dtype: string
  - name: accent
    dtype: string
  - name: region
    dtype: string
  - name: comment
    dtype: string
  config_name: main
  splits:
  - name: train
    num_bytes: 40103111
    num_examples: 88156
  download_size: 11747302977
  dataset_size: 40103111
---
# Dataset Card for VCTK
## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)
## Dataset Description
- **Homepage:** [Edinburgh DataShare](https://doi.org/10.7488/ds/2645)
- **Repository:**
- **Paper:**
- **Leaderboard:**
- **Point of Contact:**
### Dataset Summary
The CSTR VCTK Corpus includes around 44 hours of speech data uttered by 110 English speakers with various accents. Each speaker reads about 400 sentences, which were selected from a newspaper, the Rainbow Passage, and an elicitation paragraph used for the Speech Accent Archive.
### Supported Tasks and Leaderboards
- `automatic-speech-recognition`, `speaker-identification`: The dataset can be used to train a model for Automatic Speech Recognition (ASR). The model is presented with an audio file and asked to transcribe it to written text. The most common evaluation metric is the word error rate (WER). The `speaker_id` labels can also serve as targets for speaker identification.
- `text-to-speech`, `text-to-audio`: The dataset can also be used to train a model for Text-To-Speech (TTS).
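The word error rate mentioned above is the word-level edit distance between a reference transcript and a model's hypothesis, divided by the number of reference words. The following is a minimal, dependency-free sketch of that definition. The function name `word_error_rate` and the plain whitespace tokenization are illustrative assumptions; established evaluation libraries may normalize case and punctuation differently.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] is the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# One substituted word out of three reference words.
print(word_error_rate("Please call Stella.", "Please call Stellar."))
```

Because the denominator is the reference length, a hypothesis with many inserted words can score above 1.0.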
### Languages
The audio and transcriptions are in English (`en`), read by speakers with various accents.
## Dataset Structure
### Data Instances
A data point comprises the path to the audio file (`file`), the decoded audio (`audio`), its transcription (`text`), and metadata about the speaker.
```
{
  'speaker_id': 'p225',
  'text_id': '001',
  'text': 'Please call Stella.',
  'age': '23',
  'gender': 'F',
  'accent': 'English',
  'region': 'Southern England',
  'file': '/datasets/downloads/extracted/8ed7dad05dfffdb552a3699777442af8e8ed11e656feb277f35bf9aea448f49e/wav48_silence_trimmed/p225/p225_001_mic1.flac',
  'audio':
  {
    'path': '/datasets/downloads/extracted/8ed7dad05dfffdb552a3699777442af8e8ed11e656feb277f35bf9aea448f49e/wav48_silence_trimmed/p225/p225_001_mic1.flac',
    'array': array([0.00485229, 0.00689697, 0.00619507, ..., 0.00811768, 0.00836182, 0.00854492], dtype=float32),
    'sampling_rate': 48000
  },
  'comment': ''
}
```
Each audio file is a single-channel FLAC with a sample rate of 48000 Hz.
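The example path above suggests that file names follow a `<speaker_id>_<text_id>_<microphone>.flac` pattern. Assuming that pattern holds for every file (it is inferred from this single example, not documented here), a sketch like the following recovers the identifiers from a path. `parse_vctk_path` and the `microphone` key are hypothetical names introduced for illustration.

```python
import re
from pathlib import PurePosixPath

# Pattern inferred from the single example 'p225_001_mic1.flac' (an assumption).
_NAME_PATTERN = re.compile(
    r"(?P<speaker_id>[^_]+)_(?P<text_id>[^_]+)_(?P<microphone>[^_]+)"
)


def parse_vctk_path(path: str) -> dict:
    """Split a file name into speaker ID, text ID, and microphone label."""
    stem = PurePosixPath(path).stem  # drops directories and the '.flac' suffix
    match = _NAME_PATTERN.fullmatch(stem)
    if match is None:
        raise ValueError(f"unexpected file name: {path!r}")
    return match.groupdict()


print(parse_vctk_path("wav48_silence_trimmed/p225/p225_001_mic1.flac"))
```

In practice the `speaker_id` and `text_id` fields already carry this information, so parsing is mainly useful as a consistency check or when working with the raw files directly.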
### Data Fields
Each row consists of the following fields:
- `speaker_id`: Speaker ID
- `audio`: Audio recording
- `file`: Path to audio file
- `text`: Text transcription of corresponding audio
- `text_id`: Text ID
- `age`: Speaker's age
- `gender`: Speaker's gender
- `accent`: Speaker's accent
- `region`: Speaker's region, if annotation exists
- `comment`: Miscellaneous comments, if any
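According to the card metadata, every field other than `audio` is stored as a string, including `age`. A small typed wrapper can make downstream code less error-prone. The sketch below is illustrative only: `SpeakerInfo` and `speaker_info_from_row` are hypothetical names, and treating empty or non-numeric strings as missing values is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SpeakerInfo:
    speaker_id: str
    age: Optional[int]
    gender: str
    accent: str
    region: Optional[str]


def speaker_info_from_row(row: dict) -> SpeakerInfo:
    """Convert the string-typed speaker metadata of one row into typed values."""
    age_text = (row.get("age") or "").strip()
    region = (row.get("region") or "").strip()
    return SpeakerInfo(
        speaker_id=row["speaker_id"],
        # Assumption: an empty or non-numeric age is treated as unknown.
        age=int(age_text) if age_text.isdigit() else None,
        gender=row["gender"],
        accent=row["accent"],
        # Assumption: an empty string means no region annotation exists.
        region=region or None,
    )


row = {
    "speaker_id": "p225",
    "age": "23",
    "gender": "F",
    "accent": "English",
    "region": "Southern England",
    "comment": "",
}
print(speaker_info_from_row(row))
```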
### Data Splits
The dataset has no predefined train/validation/test splits. All 88,156 examples are provided in a single `train` split.
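Since no validation or test split is provided, users who need one have to create it themselves. For a multi-speaker corpus, one option is to keep each speaker in exactly one split so that evaluation reflects performance on unseen voices. The following is one possible sketch, not an official protocol: `assign_split`, the hash-based bucketing, and the default 80/10/10 proportions are all assumptions.

```python
import hashlib


def assign_split(speaker_id: str, valid_pct: int = 10, test_pct: int = 10) -> str:
    """Deterministically map a speaker to 'train', 'validation', or 'test'."""
    # A stable hash (unlike the built-in hash()) gives the same bucket on every run.
    digest = hashlib.sha256(speaker_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + valid_pct:
        return "validation"
    return "train"


# Illustrative rows; only the speaker ID matters for the assignment.
rows = [
    {"speaker_id": "p225", "text_id": "001"},
    {"speaker_id": "p225", "text_id": "002"},
    {"speaker_id": "p226", "text_id": "001"},
]
splits = {"train": [], "validation": [], "test": []}
for row in rows:
    splits[assign_split(row["speaker_id"])].append(row)

# Every speaker ends up in exactly one split.
print({name: len(items) for name, items in splits.items()})
```

With only 110 speakers, hash-based bucketing yields proportions that are only approximately 80/10/10; an explicit, hand-picked speaker list is a reasonable alternative when exact sizes matter.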
## Dataset Creation
### Curation Rationale
[More Information Needed]
### Source Data
#### Initial Data Collection and Normalization
[More Information Needed]
#### Who are the source language producers?
[More Information Needed]
### Annotations
#### Annotation process
[More Information Needed]
#### Who are the annotators?
[More Information Needed]
### Personal and Sensitive Information
The dataset consists of recordings from people who have donated their voices online. You agree not to attempt to determine the identity of speakers in this dataset.
## Considerations for Using the Data
### Social Impact of Dataset
[More Information Needed]
### Discussion of Biases
[More Information Needed]
### Other Known Limitations
[More Information Needed]
## Additional Information
### Dataset Curators
[More Information Needed]
### Licensing Information
Creative Commons Attribution 4.0 International Public License ([CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/legalcode))
### Citation Information
```bibtex
@inproceedings{Veaux2017CSTRVC,
  title = {CSTR VCTK Corpus: English Multi-speaker Corpus for CSTR Voice Cloning Toolkit},
  author = {Christophe Veaux and Junichi Yamagishi and Kirsten MacDonald},
  year = 2017
}
```
### Contributions
Thanks to [@jaketae](https://github.com/jaketae) for adding this dataset.