CSTR-Edinburgh/vctk

Name: CSTR-Edinburgh/vctk
Creator: CSTR-Edinburgh
Published: 2024-08-14 11:27:34
License: No description provided

Hugging Face · Updated 2024-08-14 · Indexed 2024-06-15

Download link:

https://hf-mirror.com/datasets/CSTR-Edinburgh/vctk

Resource description:

---
annotations_creators:
- expert-generated
language_creators:
- crowdsourced
language:
- en
license:
- cc-by-4.0
multilinguality:
- monolingual
pretty_name: VCTK
size_categories:
- 10K<n<100K
source_datasets:
- original
task_categories:
- automatic-speech-recognition
- text-to-speech
- text-to-audio
task_ids: []
paperswithcode_id: vctk
train-eval-index:
- config: main
  task: automatic-speech-recognition
  task_id: speech_recognition
  splits:
    train_split: train
  col_mapping:
    file: path
    text: text
  metrics:
  - type: wer
    name: WER
  - type: cer
    name: CER
dataset_info:
  features:
  - name: speaker_id
    dtype: string
  - name: audio
    dtype:
      audio:
        sampling_rate: 48000
  - name: file
    dtype: string
  - name: text
    dtype: string
  - name: text_id
    dtype: string
  - name: age
    dtype: string
  - name: gender
    dtype: string
  - name: accent
    dtype: string
  - name: region
    dtype: string
  - name: comment
    dtype: string
  config_name: main
  splits:
  - name: train
    num_bytes: 40103111
    num_examples: 88156
  download_size: 11747302977
  dataset_size: 40103111
---

# Dataset Card for VCTK

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** [Edinburgh DataShare](https://doi.org/10.7488/ds/2645)
- **Repository:**
- **Paper:**
- **Leaderboard:**
- **Point of Contact:**

### Dataset Summary

The CSTR VCTK Corpus includes around 44 hours of speech data uttered by 110 English speakers with various accents. Each speaker reads out about 400 sentences, which were selected from a newspaper, the Rainbow Passage, and an elicitation paragraph used for the Speech Accent Archive.

### Supported Tasks and Leaderboards

- `automatic-speech-recognition`, `speaker-identification`: The dataset can be used to train a model for Automatic Speech Recognition (ASR). The model is presented with an audio file and asked to transcribe it to written text. The most common evaluation metric is the word error rate (WER).
- `text-to-speech`, `text-to-audio`: The dataset can also be used to train a model for Text-To-Speech (TTS).

### Languages

[More Information Needed]

## Dataset Structure

### Data Instances

A data point comprises the path to the audio file, called `file`, and its transcription, called `text`.

```
{
  'speaker_id': 'p225',
  'text_id': '001',
  'text': 'Please call Stella.',
  'age': '23',
  'gender': 'F',
  'accent': 'English',
  'region': 'Southern England',
  'file': '/datasets/downloads/extracted/8ed7dad05dfffdb552a3699777442af8e8ed11e656feb277f35bf9aea448f49e/wav48_silence_trimmed/p225/p225_001_mic1.flac',
  'audio': {
    'path': '/datasets/downloads/extracted/8ed7dad05dfffdb552a3699777442af8e8ed11e656feb277f35bf9aea448f49e/wav48_silence_trimmed/p225/p225_001_mic1.flac',
    'array': array([0.00485229, 0.00689697, 0.00619507, ..., 0.00811768, 0.00836182, 0.00854492], dtype=float32),
    'sampling_rate': 48000
  },
  'comment': ''
}
```

Each audio file is a single-channel FLAC with a sample rate of 48000 Hz.

### Data Fields

Each row consists of the following fields:

- `speaker_id`: Speaker ID
- `audio`: Audio recording
- `file`: Path to audio file
- `text`: Text transcription of corresponding audio
- `text_id`: Text ID
- `age`: Speaker's age
- `gender`: Speaker's gender
- `accent`: Speaker's accent
- `region`: Speaker's region, if annotation exists
- `comment`: Miscellaneous comments, if any

### Data Splits

The dataset has no predefined splits.

## Dataset Creation

### Curation Rationale

[More Information Needed]

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed]

#### Who are the source language producers?

[More Information Needed]

### Annotations

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

The dataset consists of people who have donated their voice online. You agree to not attempt to determine the identity of speakers in this dataset.

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

[More Information Needed]

### Licensing Information

Public Domain, Creative Commons Attribution 4.0 International Public License ([CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/legalcode))

### Citation Information

```bibtex
@inproceedings{Veaux2017CSTRVC,
  title  = {CSTR VCTK Corpus: English Multi-speaker Corpus for CSTR Voice Cloning Toolkit},
  author = {Christophe Veaux and Junichi Yamagishi and Kirsten MacDonald},
  year   = 2017
}
```

### Contributions

Thanks to [@jaketae](https://github.com/jaketae) for adding this dataset.


Provider:

CSTR-Edinburgh

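Summary of the original information

Dataset card: VCTK

Dataset description

Dataset summary

The VCTK dataset contains around 44 hours of English speech recorded by 110 speakers with various accents. Each speaker reads about 400 sentences selected from a newspaper, the Rainbow Passage, and an elicitation paragraph used for the Speech Accent Archive.

Supported tasks

automatic-speech-recognition: The dataset can be used to train an automatic speech recognition (ASR) model. The model receives an audio file and transcribes it to written text. The most common evaluation metric is the word error rate (WER).
text-to-speech: The dataset can also be used to train a text-to-speech (TTS) model.

Languages

The language of the dataset is English.

Dataset structure

Data instances

Each data point comprises the path to the audio file (called file) and its transcription (called text).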

    {
      "speaker_id": "p225",
      "text_id": "001",
      "text": "Please call Stella.",
      "age": "23",
      "gender": "F",
      "accent": "English",
      "region": "Southern England",
      "file": "/datasets/downloads/extracted/8ed7dad05dfffdb552a3699777442af8e8ed11e656feb277f35bf9aea448f49e/wav48_silence_trimmed/p225/p225_001_mic1.flac",
      "audio": {
        "path": "/datasets/downloads/extracted/8ed7dad05dfffdb552a3699777442af8e8ed11e656feb277f35bf9aea448f49e/wav48_silence_trimmed/p225/p225_001_mic1.flac",
        "array": [0.00485229, 0.00689697, 0.00619507, ..., 0.00811768, 0.00836182, 0.00854492],
        "sampling_rate": 48000
      },
      "comment": ""
    }

Each audio file is a single-channel FLAC file with a sample rate of 48000 Hz.
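The example path ends in `p225_001_mic1.flac`, which appears to encode the speaker ID, the text ID, and a microphone label. The sketch below splits a file name on that pattern and converts a sample count to seconds at the stated 48000 Hz rate. The naming pattern is inferred from this single example, so treating it as general is an assumption; `parse_vctk_path` and `duration_seconds` are hypothetical helper names, not part of any library.

```python
import re
from pathlib import PurePosixPath

# Assumed pattern "<speaker>_<text id>_<mic>", inferred from the one example
# file name in the card (p225_001_mic1.flac). Other files may differ.
_NAME_RE = re.compile(r"^(?P<speaker_id>[^_]+)_(?P<text_id>\d+)_(?P<mic>mic\d+)$")


def parse_vctk_path(path: str) -> dict:
    """Split a file name into speaker_id, text_id, and mic (hypothetical helper)."""
    stem = PurePosixPath(path).stem  # drops directories and the .flac suffix
    match = _NAME_RE.match(stem)
    if match is None:
        raise ValueError(f"unexpected file name: {stem!r}")
    return match.groupdict()


def duration_seconds(num_samples: int, sampling_rate: int = 48000) -> float:
    """Length of a mono clip in seconds: samples divided by samples per second."""
    return num_samples / sampling_rate


info = parse_vctk_path("wav48_silence_trimmed/p225/p225_001_mic1.flac")
print(info["speaker_id"], info["text_id"], info["mic"])
print(duration_seconds(96000))  # 96000 samples at 48000 Hz
```

For a loaded row, `len(row["audio"]["array"])` would give the sample count, assuming the array is one-dimensional as in the example instance.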

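Data fields

Each row contains the following fields:

speaker_id: Speaker ID
audio: Audio recording
file: Path to the audio file
text: Text transcription of the corresponding audio
text_id: Text ID
age: Speaker's age
gender: Speaker's gender
accent: Speaker's accent
region: Speaker's region (if annotated)
comment: Miscellaneous comments (if any)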
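All metadata fields, including `age`, are stored as strings, and `region` and `comment` may be empty. A minimal sketch of normalizing the speaker metadata of one row into typed values follows. `SpeakerInfo` and `speaker_info_from_row` are hypothetical names introduced for illustration, and mapping empty strings to `None` is a design choice of this sketch, not something the card prescribes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SpeakerInfo:
    """Typed view of the speaker metadata in one row (hypothetical type)."""
    speaker_id: str
    age: Optional[int]
    gender: str
    accent: str
    region: Optional[str]


def speaker_info_from_row(row: dict) -> SpeakerInfo:
    # The card declares every metadata field as a string, so convert defensively.
    age_text = (row.get("age") or "").strip()
    region = (row.get("region") or "").strip()
    return SpeakerInfo(
        speaker_id=row["speaker_id"],
        age=int(age_text) if age_text.isdigit() else None,
        gender=row["gender"],
        accent=row["accent"],
        region=region or None,  # empty string means no region annotation
    )


# Metadata copied from the example instance in the card.
example_row = {
    "speaker_id": "p225",
    "age": "23",
    "gender": "F",
    "accent": "English",
    "region": "Southern England",
    "comment": "",
}
print(speaker_info_from_row(example_row))
```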

Data splits

The dataset has no predefined splits.
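Because no split is provided, users have to create their own. One common choice, shown here as a sketch rather than a recommended protocol, is to split by speaker so that no voice appears in both the training and the evaluation portion. The function name `split_by_speaker`, the 10% default fraction, and the seed are assumptions made for illustration.

```python
import random


def split_by_speaker(speaker_ids, eval_fraction=0.1, seed=0):
    """Return (train_speakers, eval_speakers) as disjoint sets of speaker IDs."""
    unique = sorted(set(speaker_ids))      # sort first so the shuffle is reproducible
    rng = random.Random(seed)
    rng.shuffle(unique)
    n_eval = max(1, round(len(unique) * eval_fraction))  # hold out at least one speaker
    return set(unique[n_eval:]), set(unique[:n_eval])


# Toy speaker column with repeats, standing in for the dataset's speaker_id field.
ids = ["p225", "p225", "p226", "p227", "p227", "p228", "p229"]
train_speakers, eval_speakers = split_by_speaker(ids, eval_fraction=0.2)

# Rows are then routed by membership of their speaker_id.
train_rows = [i for i, s in enumerate(ids) if s in train_speakers]
eval_rows = [i for i, s in enumerate(ids) if s in eval_speakers]
print(len(train_rows), len(eval_rows))
```

A speaker-disjoint split suits speaker-independent ASR evaluation. For speaker identification, a per-speaker utterance split would be needed instead, since every speaker must be seen during training.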

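Dataset creation

Personal and sensitive information

The dataset consists of people who have donated their voice online. You agree to not attempt to determine the identity of speakers in this dataset.

Additional information

Licensing information

Public Domain, Creative Commons Attribution 4.0 International Public License (CC-BY-4.0)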

Citation information

    @inproceedings{Veaux2017CSTRVC,
      title  = {CSTR VCTK Corpus: English Multi-speaker Corpus for CSTR Voice Cloning Toolkit},
      author = {Christophe Veaux and Junichi Yamagishi and Kirsten MacDonald},
      year   = 2017
    }

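Collected summary

Dataset introduction

Construction

The VCTK dataset was recorded by 110 English speakers. Each speaker reads about 400 sentences drawn from newspaper text, the Rainbow Passage, and the elicitation paragraph used for the Speech Accent Archive. The audio is stored as FLAC at a 48 kHz sampling rate. The dataset is intended as a multi-accent speech resource for automatic speech recognition (ASR) and text-to-speech (TTS) tasks.

Characteristics

The dataset's distinguishing feature is its range of accents, with speakers from different regions and age groups. Each audio file comes with speaker metadata (age, gender, accent, and region), which gives speech recognition and synthesis research several dimensions to condition on or analyze. The open license (CC-BY-4.0) permits broad research and application use.

Usage

The VCTK dataset is suited to automatic speech recognition and text-to-speech tasks. Models are trained on the audio files and their corresponding transcriptions; common evaluation metrics are the word error rate (WER) and the character error rate (CER). Each row pairs an audio recording with its text, so rows can be fed to a speech-processing pipeline with little preprocessing.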
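WER and CER are both edit distances normalized by the length of the reference: WER counts word-level substitutions, insertions, and deletions, and CER counts the same operations on characters. The sketch below is a plain dynamic-programming implementation for illustration. It applies no text normalization (case, punctuation), which real evaluations usually do, and the function names are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences, keeping one DP row at a time."""
    prev = list(range(len(hyp) + 1))       # distance from empty ref prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]                         # distance to empty hyp prefix
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Reference text taken from the example instance; the hypothesis is made up.
print(wer("Please call Stella.", "Please call Stela."))  # one of three words wrong
print(cer("Please call Stella.", "Please call Stela."))  # one of 19 characters deleted
```

Because insertions count as errors, both metrics can exceed 1.0 when the hypothesis is much longer than the reference.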

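Background and challenges

Background

The VCTK dataset, created by the CSTR lab at the University of Edinburgh, is an English multi-speaker corpus containing around 44 hours of speech. It was recorded by 110 English speakers with different accents, each reading about 400 sentences selected from a newspaper, the Rainbow Passage, and the elicitation paragraph used for the Speech Accent Archive. Its purpose is to provide varied speech data for research on tasks such as automatic speech recognition (ASR) and text-to-speech (TTS), addressing the lack of speaker diversity in speech-processing data.

Current challenges

Building a dataset like VCTK involves several challenges. First, the speakers must be diverse in accent, age, and gender if models trained on the data are to generalize. Second, collecting and annotating speech takes expertise and time, particularly to ensure recording quality and accurate transcriptions. Third, personal and sensitive information must be handled so that speakers' privacy is protected. On the application side, using the dataset effectively to train models with lower recognition error rates (WER and CER) remains an active research problem.

Common scenarios

Typical use cases

The VCTK dataset is widely used in speech processing, especially for automatic speech recognition (ASR) and text-to-speech (TTS). It contains speech from 110 English speakers, each reading about 400 sentences, covering a range of accents. This variety in the training input helps models recognize and generate speech across different accents.

Academic problems addressed

The VCTK dataset supports research on handling multiple accents in speech recognition. Its varied speech samples help researchers build recognition systems that are more robust to accent differences. The dataset has also contributed to progress in text-to-speech synthesis toward more natural-sounding output.

Derived work

Researchers have developed a variety of speech-processing models and algorithms using the VCTK dataset. Some work trains deep learning models on it to improve the accuracy and robustness of automatic speech recognition; other work uses it to build more natural and fluent speech synthesis systems.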

The content above was collected and summarized by 遇见数据集.
