ciempiess/voxforge_spanish

Name: ciempiess/voxforge_spanish
Creator: ciempiess
Published: 2024-10-16 05:19:29
License: GPL-3.0

Source: Hugging Face. Updated 2024-10-16; indexed 2024-06-11.

Download link:

https://hf-mirror.com/datasets/ciempiess/voxforge_spanish

Description:

---
license: gpl-3.0
dataset_info:
  config_name: voxforge_spanish
  features:
  - name: audio_id
    dtype: string
  - name: audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: speaker_id
    dtype: string
  - name: country
    dtype: string
  - name: gender
    dtype: string
  - name: duration
    dtype: float32
  - name: normalized_text
    dtype: string
  splits:
  - name: train
    num_bytes: 3796458902.464
    num_examples: 21692
  download_size: 3441019616
  dataset_size: 3796458902.464
configs:
- config_name: voxforge_spanish
  data_files:
  - split: train
    path: voxforge_spanish/train-*
  default: true
task_categories:
- automatic-speech-recognition
language:
- es
tags:
- voxforge spanish
- read speech
- ciempiess-unam project
- ciempiess
- spanish speech
pretty_name: VOXFORGE SPANISH CORPUS
size_categories:
- 10K<n<100K
---

# Dataset Card for voxforge_spanish

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks](#supported-tasks)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** [CIEMPIESS-UNAM Project](https://ciempiess.org/)
- **Repository:** [VOXFORGE SPANISH CORPUS at HuggingFace](https://huggingface.co/datasets/ciempiess/voxforge_spanish)
- **Point of Contact:** [Carlos Mena](mailto:carlos.mena@ciempiess.org)

### Dataset Summary

[VoxForge](https://www.voxforge.org/) was set up to collect transcribed speech for use with Free and Open Source Speech Recognition Engines (on Linux, Windows and Mac). VoxForge promises to make all submitted audio files available under the GPL license, and then to 'compile' them into acoustic models for use with Open Source speech recognition engines such as CMU Sphinx, ISIP, Julius and HTK.

On that basis, we downloaded the Spanish recordings of VoxForge in 2016 to create the VOXFORGE SPANISH CORPUS.

The VOXFORGE SPANISH CORPUS has a duration of 49 hours and consists of read speech recorded by more than 2,000 speakers. Most speakers contribute 10 recordings of approximately 10 seconds each. Data is divided by speaker, by gender (male/female) and by country (Argentina/Chile/LatinAmerica/Mexico/Spain/Unknown).

### Example Usage

The VOXFORGE SPANISH CORPUS contains only the train split:

```python
from datasets import load_dataset

# Loads every available split (here, only "train").
voxforge_spanish = load_dataset("ciempiess/voxforge_spanish")
```

It is also valid to do:

```python
from datasets import load_dataset

# Requests the train split explicitly.
voxforge_spanish = load_dataset("ciempiess/voxforge_spanish", split="train")
```

### Supported Tasks

automatic-speech-recognition: The dataset can be used to test a model for Automatic Speech Recognition (ASR). The model is presented with an audio file and asked to transcribe the audio file to written text. The most common evaluation metric is the word error rate (WER).

### Languages

The language of the corpus is Spanish.
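The word error rate mentioned under Supported Tasks is the word-level edit distance between a reference transcription and a model hypothesis, divided by the number of reference words. The following is a minimal sketch in plain Python, assuming both strings are already normalized and whitespace-tokenized like `normalized_text`; the function name `word_error_rate` is illustrative and not part of the corpus or of any library.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


reference = "todo estaba lo mismo que una hora antes"
hypothesis = "todo estaba lo mismo una hora antes"  # hypothetical model output
print(word_error_rate(reference, hypothesis))
```

Here the hypothesis drops one of the eight reference words, so the result is 1/8. For real evaluations an established implementation is usually preferable to a hand-written one.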
## Dataset Structure

### Data Instances

```python
{
  'audio_id': 'VXSP_F_0019_ARG_0033',
  'audio': {
    'path': '/home/carlos/.cache/HuggingFace/datasets/downloads/extracted/7486eaf05a10c7554cd5de3d32c720fa206d11ad5f76e7f277553b34b1fbb58b/argentina/female/F_0019/VXSP_F_0019_ARG_0033.flac',
    'array': array([-0.01412964, -0.02548218, -0.00692749, ..., -0.03274536,
       -0.03857422, -0.03134155], dtype=float32),
    'sampling_rate': 16000
  },
  'speaker_id': 'F_0019',
  'country': 'argentina',
  'gender': 'female',
  'duration': 11.5,
  'normalized_text': 'todo estaba lo mismo que una hora antes cuando el té humeaba en la taza de ojeda ahora vacía y blanqueaban sobre la mesa los pliegos'
}
```

### Data Fields

* `audio_id` (string) - id of audio segment
* `audio` (datasets.Audio) - a dictionary containing the path to the audio, the decoded audio array, and the sampling rate. In non-streaming mode (default), the path points to the locally extracted audio. In streaming mode, the path is the relative path of an audio inside its archive (as files are not downloaded and extracted locally).
* `speaker_id` (string) - id of speaker
* `country` (string) - country of birth of the speaker.
* `gender` (string) - gender of speaker (male or female)
* `duration` (float32) - duration of the audio file in seconds.
* `normalized_text` (string) - normalized audio segment transcription

### Data Splits

The corpus has only the train split, which contains a total of 21692 speech files from 467 female speakers and 1713 male speakers, with a total duration of 49 hours and 42 minutes.

## Dataset Creation

### Curation Rationale

The VOXFORGE SPANISH CORPUS (VSC) has the following characteristics:

* The VSC has a duration of 49 hours and 42 minutes. It has 21692 audio files. 17053 of those files come from male speakers and 4639 come from female speakers. All recordings come from read speech.
* Male speakers contribute 39h16m56s and female speakers contribute 10h25m53s.
* The VSC has 2180 different speakers: 1713 men and 467 women.
* Every audio file in the VSC has a duration of approximately 10 seconds. Almost every speaker contributes 10 recordings.
* Data in the VSC is classified by speaker: all the recordings of a single speaker are stored in a single directory.
* Data is also classified according to the gender (male/female) of the speakers and according to the nationality of the speaker (Argentina/Chile/LatinAmerica/Mexico/Spain/Unknown).
* Audio files in the VSC are distributed in a 16kHz@16bit mono format.
* Information about a specific speaker can be tracked using the "Speaker_Info.xls" file to locate the Voxforge user name of that speaker; that user can then be located on the Voxforge website.
* Every audio file has an ID that is compatible with ASR engines such as Kaldi and CMU-Sphinx.

### Source Data

#### Initial Data Collection and Normalization

The VOXFORGE SPANISH CORPUS is a speech corpus designed to train acoustic models for automatic speech recognition, and it is made up of recordings taken from [VoxForge](https://www.voxforge.org/).

### Annotations

#### Annotation process

The annotation process is as follows:

1. A whole podcast is manually segmented, keeping just the portions containing good quality speech.
2. A second pass of segmentation is performed, this time to separate speakers and put them in different folders.
3. The resulting speech files between 5 and 10 seconds are transcribed by students from different departments (computing, engineering, linguistics). Most of them are native speakers but have no particular training as transcribers.

#### Who are the annotators?

The VOXFORGE SPANISH CORPUS was created under the umbrella of the social service program ["Desarrollo de Tecnologías del Habla"](http://profesores.fi-b.unam.mx/carlos_mena/servicio.html) of the ["Facultad de Ingeniería"](https://www.ingenieria.unam.mx/) (FI) in the ["Universidad Nacional Autónoma de México"](https://www.unam.mx/) (UNAM) between 2016 and 2017 by Carlos Daniel Hernández Mena, head of the program.

### Personal and Sensitive Information

The dataset could contain names revealing the identity of some speakers; on the other hand, the recordings come from publicly available podcasts, so the participants had no real intent to be anonymized. In any case, you agree not to attempt to determine the identity of speakers in this dataset.

## Considerations for Using the Data

### Social Impact of Dataset

This dataset is valuable because it contains well pronounced speech with low noise.

### Discussion of Biases

The dataset is not gender balanced. It comprises 467 female speakers and 1713 male speakers.

### Other Known Limitations

VOXFORGE SPANISH CORPUS by Carlos Daniel Hernández Mena is licensed under a [GPLv3](https://www.gnu.org/licenses/gpl.txt) license and it utilizes material from [Voxforge](http://www.voxforge.org). This work was done with the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

## Additional Information

### Dataset Curators

The dataset was collected by students belonging to the social service program ["Desarrollo de Tecnologías del Habla"](http://profesores.fi-b.unam.mx/carlos_mena/servicio.html). It was curated by [Carlos Daniel Hernández Mena](https://huggingface.co/carlosdanielhernandezmena) in 2017.
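The per-gender figures reported under Curation Rationale and Discussion of Biases can be cross-checked against the corpus totals with a few lines of arithmetic. The sketch below only uses numbers stated in this card; the variable names are illustrative.

```python
from datetime import timedelta

# Figures copied from the Curation Rationale section of this card.
files = {"male": 17053, "female": 4639}
speakers = {"male": 1713, "female": 467}
speech = {
    "male": timedelta(hours=39, minutes=16, seconds=56),
    "female": timedelta(hours=10, minutes=25, seconds=53),
}

total_files = sum(files.values())
total_speakers = sum(speakers.values())
total_speech = sum(speech.values(), timedelta())

# Derived quantities that the card describes only approximately.
files_per_speaker = total_files / total_speakers
mean_seconds_per_file = total_speech.total_seconds() / total_files
female_speaker_share = speakers["female"] / total_speakers

print(total_files, total_speakers, total_speech.total_seconds())
print(round(files_per_speaker, 2))
print(round(mean_seconds_per_file, 2))
print(round(female_speaker_share, 3))
```

The sums reproduce the stated totals of 21692 files, 2180 speakers, and 49 hours 42 minutes (plus 49 seconds). They also give about 9.95 recordings per speaker, a mean recording length of about 8.25 seconds, and a female share of about 21.4% of speakers.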
### Licensing Information

[GPLv3](https://www.gnu.org/licenses/gpl.txt)

### Citation Information

```
@misc{carlosmena2017voxforgespanish,
      title={VOXFORGE SPANISH CORPUS: Audio and Transcriptions taken from Voxforge.org},
      author={Hernandez Mena, Carlos Daniel},
      organization={CIEMPIESS-UNAM Project},
      year={2017},
      url={https://huggingface.co/datasets/ciempiess/voxforge_spanish},
}
```

### Contributions

The author would like to thank Alejandro V. Mena, Elena Vera and Angélica Gutiérrez for their support of the social service program "Desarrollo de Tecnologías del Habla". He also thanks the social service students for all their hard work.

Special thanks to the VOXFORGE Team for publishing all the recordings that constitute the VOXFORGE SPANISH CORPUS.

This dataset card was created as part of the objectives of the 16th edition of the Severo Ochoa Mobility Program (PN039300 - Severo Ochoa 2021 - E&T).


Provider:

ciempiess

Summary of original information

Dataset overview

License information

License type: GPL-3.0
