Jaspernl/The_Spoken_Wikipedia_Corpora_Dutch_ASR_Hiidden

Name: Jaspernl/The_Spoken_Wikipedia_Corpora_Dutch_ASR_Hiidden
Creator: Jaspernl
Published: 2024-07-06 09:14:41
License: CC BY-NC-SA 4.0

Source: Hugging Face. Updated: 2024-07-06. Indexed: 2024-07-06.

Download link:

https://hf-mirror.com/datasets/Jaspernl/The_Spoken_Wikipedia_Corpora_Dutch_ASR_Hiidden

Overview:

The Spoken Wikipedia Corpora (SWC) is a collection of aligned spoken Wikipedia articles including articles in Dutch. It includes approximately 210 hours of audio, transcriptions, and metadata. The corpus is licensed under CC BY-NC-SA 4.0 and includes valuable resources for research in speech recognition, prosody, and language studies. The dataset structure includes audio files and their corresponding transcriptions for each instance. The creation process involves collecting data from the Spoken Wikipedia project, aligning audio with text, and cleaning and processing audio fragments. The dataset does not contain personal or sensitive information and is primarily used for developing accessible technologies, though it may reflect biases present in Wikipedia content and the demographic characteristics of its volunteer readers.

Provided by:

Jaspernl

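Summary of original information

Dataset overview

Dataset summary

The Spoken Wikipedia Corpora (SWC) is an aligned collection of spoken Wikipedia articles that includes Dutch. The dataset contains about 210 hours of audio, transcriptions, and metadata, and is suited to research on speech recognition, prosody, and language.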

Supported tasks and leaderboards

Automatic speech recognition (ASR)

Languages

Dutch

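Dataset structure

Data instances

Each instance represents a segment of a spoken Wikipedia article and contains an audio file and its transcription.

Example instance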

{
  "file_name": "data/audio_1_0.mp3",
  "transcription": "ziek is een buurtschap in de nederlandse gemeente montferland"
}

Data fields

file_name: path to the audio file
transcription: transcription of the audio
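As a minimal sketch of working with these two fields, the snippet below parses records shaped like the example instance from JSON Lines text and keeps only those where both fields are present. The `parse_records` helper and the JSON Lines layout are assumptions made for illustration; the dataset card does not say how the metadata is stored.

```python
import json
from typing import Dict, List

# The two fields named in the dataset card.
REQUIRED_FIELDS = ("file_name", "transcription")


def parse_records(jsonl_text: str) -> List[Dict[str, str]]:
    """Parse JSON Lines text into records, keeping only well-formed ones.

    Hypothetical helper: the one-object-per-line layout is an assumption.
    """
    records = []
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # Keep the record only if both fields are non-empty strings.
        if all(isinstance(record.get(f), str) and record[f] for f in REQUIRED_FIELDS):
            records.append({f: record[f] for f in REQUIRED_FIELDS})
    return records


sample = (
    '{"file_name": "data/audio_1_0.mp3", '
    '"transcription": "ziek is een buurtschap in de nederlandse gemeente montferland"}\n'
)
records = parse_records(sample)
print(records[0]["file_name"])
```

Records that lack either field are dropped rather than raising an error, which may be convenient given that the card mentions incomplete transcriptions.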

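Data splits

The dataset is not explicitly divided into training, validation, and test sets; users can create their own splits according to their needs.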
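Since no official split exists, one possible approach is a seeded shuffle followed by fixed-size cuts. The sketch below is only an illustration: the `split_dataset` function, the 80/10/10 proportions, and the generated file names are assumptions, not something the dataset card prescribes.

```python
import random
from typing import Dict, List, Sequence


def split_dataset(
    items: Sequence[str],
    val_fraction: float = 0.1,
    test_fraction: float = 0.1,
    seed: int = 0,
) -> Dict[str, List[str]]:
    """Shuffle items with a fixed seed and cut them into three disjoint splits."""
    if val_fraction < 0 or test_fraction < 0 or val_fraction + test_fraction >= 1:
        raise ValueError("fractions must be non-negative and sum to less than 1")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # same seed gives the same order
    n_val = int(len(shuffled) * val_fraction)
    n_test = int(len(shuffled) * test_fraction)
    return {
        "validation": shuffled[:n_val],
        "test": shuffled[n_val:n_val + n_test],
        "train": shuffled[n_val + n_test:],
    }


# Hypothetical file names following the pattern of the example instance.
files = [f"data/audio_1_{i}.mp3" for i in range(20)]
splits = split_dataset(files)
print({name: len(part) for name, part in splits.items()})
```

A random split at segment level can place segments of the same article or reader in both training and test data. Grouping by article before splitting would avoid that, but the card does not document how to recover article identity from the file names.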

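Dataset creation

Data collection and normalization

The data comes from the Spoken Wikipedia project, in which volunteers read Wikipedia articles aloud; audio and text are aligned at the sentence level.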

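Data processing

Sentences with timestamps are extracted using Whisper v3-large.
They are matched against the original (normalized) transcription, and the original text is kept based on a similarity score.
The audio files are cropped and the corresponding transcription is added.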
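The matching step above can be sketched roughly as follows. This is not the author's implementation: the card does not name the similarity measure or the cutoff, so `difflib.SequenceMatcher`, the `normalize` and `best_match` helpers, the 0.8 threshold, and the second sentence in the sample list are all assumptions for illustration. Timestamp handling and audio cropping are left out.

```python
from difflib import SequenceMatcher
from typing import List, Optional, Tuple


def normalize(text: str) -> str:
    """Lowercase and keep only letters, digits and single spaces."""
    kept = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in text.lower())
    return " ".join(kept.split())


def best_match(
    asr_sentence: str, originals: List[str], threshold: float = 0.8
) -> Optional[Tuple[str, float]]:
    """Return the original sentence most similar to the ASR output, or None.

    The threshold is a hypothetical value, not taken from the dataset card.
    """
    best_text, best_score = None, 0.0
    for original in originals:
        score = SequenceMatcher(None, normalize(asr_sentence), normalize(original)).ratio()
        if score > best_score:
            best_text, best_score = original, score
    if best_text is None or best_score < threshold:
        return None  # no original sentence is close enough
    return best_text, best_score


originals = [
    "ziek is een buurtschap in de nederlandse gemeente montferland",
    "het ligt ten noorden van zeddam",  # made-up sentence for illustration
]
match = best_match("Ziek is een buurtschap in de Nederlandse gemeente Montferland.", originals)
print(match)
```

Returning the original sentence instead of the ASR text mirrors the card's statement that the original text is kept; segments with no match above the threshold would simply be discarded in this sketch.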

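Annotations

Annotation process

The annotations consist of sentence-level alignments that link the audio to the corresponding text. They were created by the SWC project developers using automated tools and verified manually to ensure accuracy.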

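Personal and sensitive information

The dataset does not contain personal or sensitive information; it consists of public Wikipedia articles read aloud by volunteers.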

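Considerations for using the data

Social impact of the dataset

The dataset supports the development of accessible technologies for people who need audio content, such as people with visual impairments or reading difficulties.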

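Biases in the dataset

The dataset may reflect the characteristics of Wikipedia content and of the volunteer readers, which affects the diversity of speech patterns and topics.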

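Other known limitations

The dataset may contain incomplete transcriptions and occasional alignment errors. Some introductory and disclaimer text is not included in the original transcriptions and needs to be handled manually.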

Additional information

Licensing information

The dataset is released under the CC BY-NC-SA 4.0 license.

Citation information

If you use this dataset, please cite the following paper:

@inproceedings{schweitzer2016spwc,
  title={The Spoken Wikipedia Corpora},
  author={Schweitzer, Arne and Lewin, Florian and Lensch, Armin and Mandera, Paweł and Matuschek, Michael and Schiel, Florian},
  booktitle={Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016)},
  year={2016},
  organization={European Language Resources Association (ELRA)},
  note={Dataset available at Hugging Face: @JasperNL https://huggingface.co/datasets/Jaspernl/The_Spoken_Wikipedia_Corpora_Dutch_ASR_Hiidden}
}