ciempiess/wikipedia_spanish

Name: ciempiess/wikipedia_spanish
Creator: ciempiess
Published: 2024-10-16 02:56:07
License: no description provided

Source: Hugging Face. Updated 2024-10-16. Indexed 2024-06-11.

Download link:

https://hf-mirror.com/datasets/ciempiess/wikipedia_spanish


Resource description:

---
license: cc-by-sa-3.0
dataset_info:
  config_name: wikipedia_spanish
  features:
  - name: audio_id
    dtype: string
  - name: audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: speaker_id
    dtype: string
  - name: gender
    dtype: string
  - name: duration
    dtype: float32
  - name: normalized_text
    dtype: string
  splits:
  - name: train
    num_bytes: 1916035098.198
    num_examples: 11569
  download_size: 1887119130
  dataset_size: 1916035098.198
configs:
- config_name: wikipedia_spanish
  data_files:
  - split: train
    path: wikipedia_spanish/train-*
  default: true
task_categories:
- automatic-speech-recognition
language:
- es
tags:
- wikipedia grabada
- wikipedia spanish
- ciempiess-unam
- ciempiess-unam project
- read speech
- spanish speech
pretty_name: WIKIPEDIA SPANISH CORPUS
size_categories:
- 10K<n<100K
---

# Dataset Card for wikipedia_spanish

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks](#supported-tasks)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** [CIEMPIESS-UNAM Project](https://ciempiess.org/)
- **Repository:** [WIKIPEDIA SPANISH CORPUS at LDC](https://catalog.ldc.upenn.edu/LDC2021S07)
- **Point of Contact:** [Carlos Mena](mailto:carlos.mena@ciempiess.org)

### Dataset Summary

According to the project page of the [WikiProject Spoken Wikipedia](https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Spoken_Wikipedia): "The WikiProject Spoken Wikipedia aims to produce recordings of Wikipedia articles being read aloud." The WIKIPEDIA SPANISH CORPUS is a dataset created from the Spanish version of the WikiProject Spoken Wikipedia, called [Wikipedia Grabada](https://es.wikipedia.org/wiki/Wikiproyecto:Wikipedia_grabada).

The WIKIPEDIA SPANISH CORPUS is intended for the Automatic Speech Recognition (ASR) task. It is a gender-unbalanced corpus of roughly 25 hours. It contains read speech from several Wikipedia Grabada articles, most of them recorded by male speakers. The transcriptions in this corpus were produced from scratch by native speakers.

### Example Usage

The WIKIPEDIA SPANISH CORPUS contains only the train split:

```python
from datasets import load_dataset

# Loads every available split (here, only "train") as a DatasetDict.
wikipedia_spanish = load_dataset("ciempiess/wikipedia_spanish")
```

It is also valid to request the split explicitly:

```python
from datasets import load_dataset

# Loads the train split directly as a Dataset.
wikipedia_spanish = load_dataset("ciempiess/wikipedia_spanish", split="train")
```

### Supported Tasks

`automatic-speech-recognition`: The dataset can be used to test a model for Automatic Speech Recognition (ASR). The model is presented with an audio file and asked to transcribe it to written text. The most common evaluation metric is the word error rate (WER).

### Languages

The language of the corpus is Spanish.

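The word error rate mentioned under Supported Tasks is usually defined as the word-level edit distance (substitutions, deletions, and insertions) between a reference transcription and a model hypothesis, divided by the number of reference words. The following is a minimal sketch of that definition, not the evaluation code used by the corpus authors. The function name `word_error_rate` and the example hypothesis are assumptions made for illustration; the reference sentence is the `normalized_text` shown in the card's data instance.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


reference = (
    "donde revelaba de sus placas de vidrio al colodión "
    "controversias y equivocaciones"
)
# Hypothetical model output that drops the word "al".
hypothesis = (
    "donde revelaba de sus placas de vidrio colodión "
    "controversias y equivocaciones"
)
wer = word_error_rate(reference, hypothesis)
```

Because `normalized_text` is already normalized, a comparison like this is only meaningful if the model output is normalized the same way (casing, punctuation) before scoring.
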
## Dataset Structure

### Data Instances

```python
{
  'audio_id': 'WKSP_F_0019_E1_0023',
  'audio': {
    'path': '/home/carlos/.cache/HuggingFace/datasets/downloads/extracted/d31e2de01af3d9cdc4d4b6397720048caa39f6e552a8d38b6617a50af250bdcb/train/female/F_0019/WKSP_F_0019_E1_0023.flac',
    'array': array([0.08535767, 0.13946533, 0.11572266, ..., 0.13168335, 0.12426758, 0.14508057], dtype=float32),
    'sampling_rate': 16000
  },
  'speaker_id': 'F_0019',
  'gender': 'female',
  'duration': 8.170000076293945,
  'normalized_text': 'donde revelaba de sus placas de vidrio al colodión controversias y equivocaciones'
}
```

### Data Fields

* `audio_id` (string) - ID of the audio segment.
* `audio` (datasets.Audio) - a dictionary containing the path to the audio, the decoded audio array, and the sampling rate. In non-streaming mode (the default), the path points to the locally extracted audio. In streaming mode, the path is the relative path of the audio file inside its archive, because files are not downloaded and extracted locally.
* `speaker_id` (string) - ID of the speaker.
* `gender` (string) - gender of the speaker (male or female).
* `duration` (float32) - duration of the audio file in seconds.
* `normalized_text` (string) - normalized transcription of the audio segment.

### Data Splits

The corpus has only a train split, with a total of 11569 speech files from 43 female speakers and 150 male speakers and a total duration of 25 hours and 37 minutes.

## Dataset Creation

### Curation Rationale

The WIKIPEDIA SPANISH CORPUS (WSC) has the following characteristics:

* The WSC has a total duration of 25 hours and 37 minutes. It has 11569 audio files.
* The WSC has 193 different speakers: 150 men and 43 women.
* Every audio file in the WSC has a duration of approximately 3 to 10 seconds.
* Data in the WSC is organized by speaker: all the recordings of a single speaker are stored in a single directory.
* Data is also organized by the gender (male/female) of the speakers.
* Audio and transcriptions in the WSC were segmented and transcribed from scratch by native speakers of Spanish.
* Audio files in the WSC are distributed in a 16 kHz, 16-bit mono format.
* Every audio file has an ID that is compatible with ASR engines such as Kaldi and CMU-Sphinx.

### Source Data

#### Initial Data Collection and Normalization

The WIKIPEDIA SPANISH CORPUS is a speech corpus designed to train acoustic models for automatic speech recognition. It is made up of several articles from [Wikipedia Grabada](https://es.wikipedia.org/wiki/Wikiproyecto:Wikipedia_grabada) read by volunteers.

### Annotations

#### Annotation process

The annotation process is as follows:

1. A whole podcast is manually segmented, keeping only the portions that contain good-quality speech.
2. A second segmentation pass is performed, this time to separate speakers and put them in different folders.
3. The resulting speech files of between 5 and 10 seconds are transcribed by students from different departments (computing, engineering, linguistics). Most of them are native speakers, but they have no particular training as transcribers.

#### Who are the annotators?

The WIKIPEDIA SPANISH CORPUS was created between 2018 and 2020 under the umbrella of the social service program ["Desarrollo de Tecnologías del Habla"](http://profesores.fi-b.unam.mx/carlos_mena/servicio.html) of the ["Facultad de Ingeniería"](https://www.ingenieria.unam.mx/) (FI) at the ["Universidad Nacional Autónoma de México"](https://www.unam.mx/) (UNAM) by Carlos Daniel Hernández Mena, head of the program.

### Personal and Sensitive Information

The dataset could contain names that reveal the identity of some speakers. On the other hand, the recordings come from publicly available podcasts, so the participants did not have a real intent to remain anonymous. In any case, by using this dataset you agree not to attempt to determine the identity of its speakers.

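The speaker, file, and duration figures quoted under Data Splits and Curation Rationale can in principle be recomputed from the `speaker_id`, `gender`, and `duration` fields. The sketch below shows one way to do that over plain dictionaries shaped like the card's data instance. The helper name `summarize_by_gender` and the toy records (including the speaker ID `M_0001` and all durations except the first) are hypothetical and used only for illustration.

```python
from collections import defaultdict


def summarize_by_gender(records):
    """Return {gender: {"speakers": int, "files": int, "hours": float}}."""
    speakers = defaultdict(set)
    files = defaultdict(int)
    seconds = defaultdict(float)
    for record in records:
        gender = record["gender"]
        speakers[gender].add(record["speaker_id"])
        files[gender] += 1
        seconds[gender] += record["duration"]
    return {
        gender: {
            "speakers": len(speakers[gender]),
            "files": files[gender],
            "hours": seconds[gender] / 3600.0,
        }
        for gender in speakers
    }


# Hypothetical toy records; only the field names come from the dataset card.
toy_records = [
    {"speaker_id": "F_0019", "gender": "female", "duration": 8.17},
    {"speaker_id": "F_0019", "gender": "female", "duration": 5.0},
    {"speaker_id": "M_0001", "gender": "male", "duration": 3600.0},
]
summary = summarize_by_gender(toy_records)
```

When running this over the real train split, it is probably worth dropping the `audio` column first (for example with `Dataset.remove_columns`) so that iterating over the rows does not decode every audio file.
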
## Considerations for Using the Data

### Social Impact of Dataset

This dataset is valuable because it contains well-pronounced speech with low noise.

### Discussion of Biases

The dataset is not gender balanced. It comprises 43 female speakers and 150 male speakers.

### Other Known Limitations

WIKIPEDIA SPANISH CORPUS by Carlos Daniel Hernández Mena is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported license ([CC BY-SA 3.0](https://creativecommons.org/licenses/by-sa/3.0/deed.en)) and it uses material from [Wikipedia Grabada](https://es.wikipedia.org/wiki/Wikiproyecto:Wikipedia_grabada). This work was done with the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

## Additional Information

### Dataset Curators

The dataset was collected by students belonging to the social service program ["Desarrollo de Tecnologías del Habla"](http://profesores.fi-b.unam.mx/carlos_mena/servicio.html). It was curated by [Carlos Daniel Hernández Mena](https://huggingface.co/carlosdanielhernandezmena) in 2020.

### Licensing Information

[CC BY-SA 3.0](https://creativecommons.org/licenses/by-sa/3.0/deed.en)

### Citation Information

```
@misc{carlosmena2021wikipediaspanish,
  title={WIKIPEDIA SPANISH CORPUS: Audio and Transcriptions taken from Wikipedia Grabada},
  ldc_catalog_no={LDC2021S07},
  DOI={https://doi.org/10.35111/7m1j-sa17},
  author={Hernandez Mena, Carlos Daniel and Meza Ruiz, Ivan Vladimir},
  journal={Linguistic Data Consortium, Philadelphia},
  year={2021},
  url={https://catalog.ldc.upenn.edu/LDC2021S07},
}
```

### Contributions

The authors would like to thank Alberto Templos Carbajal, Elena Vera and Angélica Gutiérrez for their support of the social service program "Desarrollo de Tecnologías del Habla" at the Facultad de Ingeniería (FI) of the Universidad Nacional Autónoma de México (UNAM). We also thank the social service students for all their hard work.

Special thanks to the team of "Wikipedia Grabada" for publishing all the recordings that constitute the WIKIPEDIA SPANISH CORPUS.

This dataset card was created as part of the objectives of the 16th edition of the Severo Ochoa Mobility Program (PN039300 - Severo Ochoa 2021 - E&T).


Provider:

ciempiess

Summary of original information

Dataset overview

License information

License type: CC-BY-SA-3.0
