alexandrainst/coral-tts

Name: alexandrainst/coral-tts
Creator: alexandrainst
Published: 2024-10-14 11:14:38
License: CC0 1.0 (cc0-1.0)

Hugging Face · Updated 2024-10-14 · Indexed 2024-04-19

Download link:

https://hf-mirror.com/datasets/alexandrainst/coral-tts

Resource description:

---
dataset_info:
  features:
  - name: speaker_id
    dtype: string
  - name: transcription_id
    dtype: int64
  - name: text
    dtype: string
  - name: audio
    dtype:
      audio:
        sampling_rate: 44100
  splits:
  - name: train
    num_bytes: 12163543668.45736
    num_examples: 18863
  download_size: 10460673849
  dataset_size: 12163543668.45736
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
license: cc0-1.0
task_categories:
- text-to-speech
language:
- da
pretty_name: CoRal TTS
size_categories:
- 10K<n<100K
---

# Dataset Card for CoRal TTS

## Dataset Description

- **Repository:** <https://github.com/alexandrainst/coral>
- **Point of Contact:** [Dan Saattrup Nielsen](mailto:dan.nielsen@alexandra.dk)
- **Size of downloaded dataset files:** 14.63 GB
- **Size of the generated dataset:** 15.25 GB
- **Total amount of disk used:** 29.88 GB

### Dataset Summary

This dataset consists of two professional Danish speakers, female and male, recording roughly 17 hours of Danish speech each. The dataset is part of the [CoRal project](https://alexandra.dk/coral/), which is funded by the [Danish Innovation Fund](https://innovationsfonden.dk/en).

The text data was selected by the [Alexandra Institute](https://alexandra.dk/about-the-alexandra-institute/) ([GitHub repo for the dataset creation](https://github.com/alexandrainst/tts_text)) and consists of sentences from [sundhed.dk](https://sundhed.dk/), [borger.dk](https://borger.dk/), names of bus stops and stations, manually filtered Reddit comments, and dates and times.

The audio data was recorded by the public institution [Nota](https://nota.dk/), which is part of the Danish Ministry of Culture.

### Supported Tasks and Leaderboards

Speech synthesis is the intended task for this dataset. No leaderboard is active at this point.

### Languages

The dataset is available in Danish (`da`).

## Dataset Structure

### Data Instances

- **Size of downloaded dataset files:** 14.63 GB
- **Size of the generated dataset:** 15.25 GB
- **Total amount of disk used:** 29.88 GB

An example from the dataset looks as follows.

```
{
    'speaker_id': 'mic',
    'transcription_id': 0,
    'text': '26 rigtige.',
    'audio': {
        'path': 'mic_00001.wav',
        'array': array([-0.00054932, -0.00054932, -0.00061035, ...,
                         0.00027466,  0.00036621,  0.00030518]),
        'sampling_rate': 44100
    }
}
```

### Data Fields

The data fields are the same among all splits.

- `speaker_id`: a `string` feature.
- `transcription_id`: an `int` feature.
- `text`: a `string` feature.
- `audio`: an `Audio` feature.

### Dataset Statistics

There are 18,863 samples in the dataset.

## Additional Information

### Dataset Curators

[Dan Saattrup Nielsen](https://saattrupdan.github.io/) from [the Alexandra Institute](https://alexandra.dk/) uploaded it to the Hugging Face Hub.

### Licensing Information

The dataset is licensed under the [CC0 license](https://creativecommons.org/share-your-work/public-domain/cc0/).
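
Since each `audio` entry carries both the waveform and its sampling rate, a clip's duration follows from dividing the number of samples by the rate. The snippet below is a minimal sketch of that calculation; the helper name `clip_duration_seconds` and the toy record (one second of silence) are illustrative assumptions, not part of the dataset.

```python
def clip_duration_seconds(sample: dict) -> float:
    """Return the length of one decoded sample in seconds."""
    audio = sample["audio"]
    return len(audio["array"]) / audio["sampling_rate"]


# Hypothetical stand-in for a decoded record: one second of silence at 44.1 kHz.
toy_sample = {
    "speaker_id": "mic",
    "transcription_id": 0,
    "text": "26 rigtige.",
    "audio": {
        "path": "mic_00001.wav",
        "array": [0.0] * 44100,
        "sampling_rate": 44100,
    },
}

print(clip_duration_seconds(toy_sample))
```

The same function should work on real records as long as `array` supports `len()`, which holds for the NumPy arrays shown in the example instance.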

Provider:

alexandrainst

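Summary of original information

Dataset overview

Dataset name

CoRal TTS

Dataset description

The dataset contains recordings of two professional Danish speakers, one female and one male, each contributing roughly 17 hours of Danish speech. It is part of the CoRal project, which is funded by the Danish Innovation Fund.

Languages

Danish (da)

Dataset structure

Data instances

Each data instance contains the following fields:

speaker_id: string
transcription_id: integer
text: string
audio: audio, sampled at 44100 Hz

Data fields

speaker_id: string feature
transcription_id: integer feature
text: string feature
audio: audio feature, including the sampling rate

Dataset statistics

The dataset contains 18,863 samples.

Dataset size

Download size: 14.63 GB
Generated dataset size: 15.25 GB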
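
As a rough sanity check, the figures from the dataset card can be combined into per-sample averages. The sketch below assumes the two speakers together contribute about 34 hours (2 × 17, from the card's "roughly 17 hours each") and uses the `num_examples` and `dataset_size` values from the card metadata. The results are back-of-envelope estimates, not measured values.

```python
# Figures taken from the dataset card; the hour count is approximate.
NUM_EXAMPLES = 18_863
HOURS_PER_SPEAKER = 17            # "roughly 17 hours each" per the card
NUM_SPEAKERS = 2
NUM_BYTES = 12_163_543_668.45736  # dataset_size from the card metadata

total_seconds = HOURS_PER_SPEAKER * NUM_SPEAKERS * 3600
avg_clip_seconds = total_seconds / NUM_EXAMPLES
avg_bytes_per_example = NUM_BYTES / NUM_EXAMPLES

print(f"average clip length: {avg_clip_seconds:.1f} s")
print(f"average size per example: {avg_bytes_per_example / 1e6:.2f} MB")
```

Under these assumptions the average clip comes out at roughly 6.5 seconds and about 0.64 MB, which is consistent with a corpus of short, sentence-level recordings.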

License

The dataset is licensed under the CC0 license.

Task category

Speech synthesis

Dataset curators

Dan Saattrup Nielsen, from the Alexandra Institute.

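Collected summary

Dataset introduction

Construction

In speech synthesis, high-quality paired text and audio data is the foundation for building natural, fluent TTS systems. The CoRal TTS dataset was recorded by two professional Danish speakers, each contributing roughly 17 hours of Danish speech. The text material was selected by the Alexandra Institute and covers sentences from sundhed.dk and borger.dk, names of bus stops and stations, manually filtered Reddit comments, and date and time expressions, giving the corpus a mix of domains. The audio was recorded by Nota, a public institution under the Danish Ministry of Culture, and the resulting corpus contains 18,863 samples at a 44.1 kHz sampling rate.

Usage

Researchers can load the dataset with the Hugging Face Datasets library. The default configuration has a single train split, so no further split selection is needed. Calling load_dataset('alexandrainst/coral-tts') returns a DatasetDict whose train split holds the samples. Each sample has a speaker_id identifying the speaker, a transcription_id, a text field with the transcription, and an audio field that decodes to a dictionary containing the file path, the waveform array, and the sampling rate. The sampling rate is fixed at 44100 Hz, so the data can be used for training or fine-tuning TTS models, with resampling where a model expects a different rate. The generated dataset is about 15.25 GB, so sufficient storage and compute should be available.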
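
A minimal loading sketch follows. It assumes the Hugging Face `datasets` package is installed and that the audio column decodes to the dictionary layout shown in the dataset card. The helper names `load_coral_tts`, `preview` and `summarize` are illustrative, and streaming mode is chosen here only to avoid downloading the full dataset up front.

```python
def load_coral_tts(streaming: bool = True):
    """Load the train split of alexandrainst/coral-tts (needs network access).

    With streaming=True, samples are fetched lazily instead of
    downloading the whole dataset first.
    """
    # Imported lazily so the formatting helper below works without `datasets`.
    from datasets import load_dataset

    return load_dataset("alexandrainst/coral-tts", split="train", streaming=streaming)


def summarize(sample: dict) -> str:
    """Format the fields named in the dataset card as a one-line summary."""
    rate = sample["audio"]["sampling_rate"]
    return f"{sample['speaker_id']} #{sample['transcription_id']}: {sample['text']!r} @ {rate} Hz"


def preview(n: int = 3) -> None:
    """Print summaries of the first n streamed samples (not called here)."""
    for sample in load_coral_tts(streaming=True).take(n):
        print(summarize(sample))


# Offline demonstration on a hypothetical record shaped like the card's example.
toy_sample = {
    "speaker_id": "mic",
    "transcription_id": 0,
    "text": "26 rigtige.",
    "audio": {"path": "mic_00001.wav", "array": [0.0, 0.0], "sampling_rate": 44100},
}
print(summarize(toy_sample))
```

Calling `preview()` would stream the first few real samples. Passing `streaming=False` to `load_coral_tts` instead downloads and caches the full train split.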

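Background and challenges

Background overview

The CoRal TTS dataset was published in 2024 by Dan Saattrup Nielsen and colleagues at the Alexandra Institute in Denmark, as part of the CoRal project funded by the Danish Innovation Fund. It targets text-to-speech research for Danish, a language with comparatively little openly available speech data, and aims to help fill the gap in high-quality recordings. The central research question is how professionally recorded speech can improve the naturalness and intelligibility of synthesized Danish. The dataset contains roughly 17 hours of recordings from each of two professional Danish speakers (one female, one male), 18,863 samples in total. The texts come from government websites, bus stop and station names, Reddit comments, and dates and times, and the recordings were made by the public institution Nota. The dataset is released under the CC0 license, giving Danish TTS research an openly licensed common resource.

Current challenges

The main challenges associated with the CoRal TTS dataset are: (1) Danish is a smaller language whose phonetic characteristics, such as the reduction of unstressed syllables, are easily overlooked in TTS modeling, and models built mainly on English data tend to transfer poorly, so dedicated data is needed to improve synthesis quality. (2) During construction, text selection has to balance domain coverage against phonetic balance: sentences taken from specialist sites such as sundhed.dk involve complex terminology and varied syntax, while Reddit comments have to be filtered for noise and non-standard language. (3) Although the recording environment was professional, differences between the speakers (for example in speaking rate and intonation) and fatigue over roughly 17 hours of recording per speaker may introduce inconsistencies that affect how well models generalize.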

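Common scenarios

Classic usage scenarios

The CoRal TTS dataset provides paired text and audio for Danish speech synthesis research: two professional speakers (one female, one male), roughly 17 hours of recordings each, at a 44.1 kHz sampling rate. Its most typical use is training end-to-end text-to-speech (TTS) models, for example architectures such as Tacotron, FastSpeech, or VITS. Researchers can use the text and audio pairs to study Danish prosody, syllable durations, and intonation patterns, and to generate natural, fluent synthetic speech.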
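
Because the dataset ships only a train split, a training recipe usually has to carve out its own validation set. The sketch below shows one way to do that, holding out a fixed fraction per speaker so that both voices appear in validation. The function name, the 5% fraction, the seed and the toy speaker IDs are illustrative assumptions, not part of the dataset or of any particular TTS toolkit.

```python
import random
from collections import defaultdict


def split_per_speaker(records, val_fraction=0.05, seed=0):
    """Split records into (train, val), holding out val_fraction of each speaker.

    `records` is any iterable of dicts with a 'speaker_id' key, such as
    metadata rows; the audio itself is not needed to decide the split.
    """
    by_speaker = defaultdict(list)
    for record in records:
        by_speaker[record["speaker_id"]].append(record)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, val = [], []
    for speaker in sorted(by_speaker):
        items = by_speaker[speaker]
        rng.shuffle(items)
        # Hold out at least one clip per speaker.
        n_val = max(1, round(len(items) * val_fraction))
        val.extend(items[:n_val])
        train.extend(items[n_val:])
    return train, val


# Toy metadata with two hypothetical speaker IDs, 100 rows each.
toy = [
    {"speaker_id": speaker, "transcription_id": i}
    for speaker in ("spk_a", "spk_b")
    for i in range(100)
]
train, val = split_per_speaker(toy)
print(len(train), len(val))
```

Splitting per speaker rather than over the pooled data keeps the validation set balanced between the two voices, which matters when validation loss is used to compare per-speaker quality.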

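Academic problems addressed

The dataset addresses the shortage of high-quality annotated data for Danish speech synthesis. In academic work it supports training and evaluating TTS models for lower-resource languages, and it can serve research directions such as multi-speaker synthesis, cross-speaker voice transfer, and prosody-controllable generation. By providing professionally recorded speech in a consistent style, CoRal TTS helps researchers analyze how acoustic features in the speech signal map to the underlying text, and gives a data foundation for building more robust synthesis systems.

Practical applications

In practice, models trained on the CoRal TTS dataset can be deployed in Danish voice assistants, automatic audiobook generation, public transport announcement systems, and accessible reading tools. For example, a TTS system built on this dataset could read text aloud for visually impaired users or provide reference pronunciations for learners of Danish. In addition, the CC0 license permits commercial use, which lowers the barrier for companies developing Danish voice products.

Recent research on the dataset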