ciempiess/librivox_spanish
Source: Hugging Face. Updated 2024-10-16; indexed 2024-06-11.
Download link: https://hf-mirror.com/datasets/ciempiess/librivox_spanish
---
license: cc-by-sa-4.0
dataset_info:
  config_name: librivox_spanish
  features:
  - name: audio_id
    dtype: string
  - name: audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: speaker_id
    dtype: string
  - name: speaker_group
    dtype: string
  - name: gender
    dtype: string
  - name: duration
    dtype: float32
  - name: normalized_text
    dtype: string
  splits:
  - name: train
    num_bytes: 6481120844.144
    num_examples: 36338
  download_size: 5089499872
  dataset_size: 6481120844.144
configs:
- config_name: librivox_spanish
  data_files:
  - split: train
    path: librivox_spanish/train-*
  default: true
task_categories:
- automatic-speech-recognition
language:
- es
tags:
- librivox spanish
- ciempiess-unam project
- ciempiess-unam
- read speech
- spanish speech
pretty_name: LIBRIVOX SPANISH CORPUS
size_categories:
- 10K<n<100K
---
# Dataset Card for librivox_spanish
## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks](#supported-tasks)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)
## Dataset Description
- **Homepage:** [CIEMPIESS-UNAM Project](https://ciempiess.org/)
- **Repository:** [LIBRIVOX SPANISH CORPUS at LDC](https://catalog.ldc.upenn.edu/LDC2020S01)
- **Point of Contact:** [Carlos Mena](mailto:carlos.mena@ciempiess.org)
### Dataset Summary
Librivox is a non-commercial, non-profit, ad-free project dedicated to making all books in the public domain available, for free, in audio format on the internet. We downloaded 300 of its titles in Spanish to create the LIBRIVOX SPANISH CORPUS.
The LIBRIVOX SPANISH CORPUS has a duration of 73 hours and consists of manually segmented audio files between 3 and 10 seconds long. The transcriptions were also made manually, by native Spanish speakers. The recordings are divided into male/female and native/non-native speakers.
### Example Usage
The LIBRIVOX SPANISH CORPUS contains only the train split:
```python
from datasets import load_dataset
librivox_spanish = load_dataset("ciempiess/librivox_spanish")
```
Passing `split="train"` returns the train split directly as a `Dataset` instead of a `DatasetDict`:
```python
from datasets import load_dataset
librivox_spanish = load_dataset("ciempiess/librivox_spanish", split="train")
```
### Supported Tasks
`automatic-speech-recognition`: The dataset can be used to test a model for Automatic Speech Recognition (ASR). The model is presented with an audio file and asked to transcribe it to written text. The most common evaluation metric is the word error rate (WER).
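As an illustration of the metric mentioned above, here is a minimal sketch of WER computed as the word-level edit distance divided by the number of reference words. The function name `word_error_rate` is hypothetical, and the sketch assumes both strings are already normalized and whitespace-tokenized; it is not the evaluation code used by the corpus authors.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] is the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# One substituted word out of three reference words.
print(word_error_rate("su madre besa", "su padre besa"))
```

Because the count of errors is divided by the reference length, WER can exceed 1.0 when the hypothesis contains many inserted words.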
### Languages
The language of the corpus is Spanish.
## Dataset Structure
### Data Instances
```python
{
    'audio_id': 'LBVX_F_69_NNT_0035',
    'audio': {
        'path': '/home/carlos/.cache/HuggingFace/datasets/downloads/extracted/a506b24788064c4a89c858f829b408b0d2445c9cc30e52087e38ceee60fa03d7/non_native/female/F_69/LBVX_F_69_NNT_0035.flac',
        'array': array([ 2.4414062e-04, -6.1035156e-05, -2.1362305e-04, ...,
               -6.1035156e-04, -4.8828125e-04, -7.6293945e-04], dtype=float32),
        'sampling_rate': 16000
    },
    'speaker_id': 'F_69',
    'speaker_group': 'non_native',
    'gender': 'female',
    'duration': 9.975000381469727,
    'normalized_text': 'del pequeño dormido en la mejilla que con timido afán su madre besa y se refleja alegre en la fajilla'
}
```
### Data Fields
* `audio_id` (string) - id of audio segment
* `audio` (datasets.Audio) - a dictionary containing the path to the audio, the decoded audio array, and the sampling rate. In non-streaming mode (default), the path points to the locally extracted audio. In streaming mode, the path is the relative path of an audio inside its archive (as files are not downloaded and extracted locally).
* `speaker_id` (string) - id of speaker
* `speaker_group` (string) - whether the speaker is native or non-native (for example, `non_native`)
* `gender` (string) - gender of speaker (male or female)
* `duration` (float32) - duration of the audio file in seconds.
* `normalized_text` (string) - normalized audio segment transcription
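The `duration` field can be cross-checked against the decoded audio, since a clip's length in seconds is its number of samples divided by the sampling rate. The sketch below uses a synthetic stand-in example rather than real corpus audio; the helper names and the tolerance value are assumptions made for illustration.

```python
def duration_seconds(num_samples: int, sampling_rate: int) -> float:
    """Length of a clip in seconds: samples divided by samples per second."""
    if sampling_rate <= 0:
        raise ValueError("sampling_rate must be positive")
    return num_samples / sampling_rate


def duration_matches(example: dict, tolerance: float = 0.01) -> bool:
    """Check the stored `duration` against the decoded array length."""
    audio = example["audio"]
    computed = duration_seconds(len(audio["array"]), audio["sampling_rate"])
    return abs(computed - example["duration"]) <= tolerance


# Synthetic stand-in with the same field layout as a corpus example:
# 159,600 silent samples at 16 kHz last 9.975 seconds.
fake_example = {
    "audio": {"array": [0.0] * 159_600, "sampling_rate": 16_000},
    "duration": 9.975,
}
print(duration_matches(fake_example))
```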
### Data Splits
The corpus has only a train split, which contains 36338 speech files from 77 female speakers and 77 male speakers, with a total duration of 73 hours and 1 minute.
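Since only a train split is provided, anyone who needs a held-out set has to carve one out. One common option, sketched below, is to split by speaker so that no voice appears on both sides. This is an assumption about how a user might proceed, not a split defined by the corpus; `split_speakers` and the speaker IDs in the example are hypothetical.

```python
import random


def split_speakers(speaker_ids, held_out_fraction=0.1, seed=0):
    """Partition speaker IDs into disjoint (train, held_out) sets."""
    unique = sorted(set(speaker_ids))  # sort first so the shuffle is reproducible
    rng = random.Random(seed)
    rng.shuffle(unique)
    n_held_out = max(1, round(len(unique) * held_out_fraction))
    return set(unique[n_held_out:]), set(unique[:n_held_out])


# Hypothetical speaker IDs written in the style of 'F_69'.
speakers = [f"F_{i}" for i in range(1, 11)] + [f"M_{i}" for i in range(1, 11)]
train_speakers, held_out_speakers = split_speakers(
    speakers, held_out_fraction=0.2, seed=13
)
print(len(train_speakers), len(held_out_speakers))
```

The resulting sets can then be used to select rows by their `speaker_id` value, for example with `Dataset.filter`.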
## Dataset Creation
### Curation Rationale
The LIBRIVOX SPANISH CORPUS (LSC) has the following characteristics:
* The LSC has a total duration of 73 hours and 1 minute across 36338 audio files.
* The LSC has 154 different speakers: 77 men and 77 women.
* Every audio file in the LSC is approximately 3 to 10 seconds long.
* Data in the LSC is organized by speaker: all the recordings of a single speaker are stored in a single directory.
* Data is also classified by speaker gender (male/female) and by speaker group (native/non-native).
* Audio in the LSC was segmented and transcribed by native speakers of Spanish.
* Audio files in the LSC are distributed in 16 kHz, 16-bit mono format.
* Every audio file has an ID that is compatible with ASR engines such as Kaldi and CMU-Sphinx.
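To illustrate how the IDs might be used with Kaldi, the sketch below builds the lines of Kaldi-style `text` and `utt2spk` files from examples with the fields of this dataset. The helper `kaldi_lines` is hypothetical, and the second record is invented for illustration; only the first record comes from this card.

```python
def kaldi_lines(examples):
    """Build the lines of Kaldi-style `text` and `utt2spk` files.

    Each `text` line is "<utterance-id> <transcript>" and each `utt2spk`
    line is "<utterance-id> <speaker-id>". Lines are sorted by utterance ID.
    """
    ordered = sorted(examples, key=lambda ex: ex["audio_id"])
    text = [f"{ex['audio_id']} {ex['normalized_text']}" for ex in ordered]
    utt2spk = [f"{ex['audio_id']} {ex['speaker_id']}" for ex in ordered]
    return text, utt2spk


records = [
    {
        # Taken from the data instance shown in this card.
        "audio_id": "LBVX_F_69_NNT_0035",
        "speaker_id": "F_69",
        "normalized_text": "del pequeño dormido en la mejilla que con timido afán "
                           "su madre besa y se refleja alegre en la fajilla",
    },
    {
        # Hypothetical record, invented for this example.
        "audio_id": "LBVX_F_69_NNT_0001",
        "speaker_id": "F_69",
        "normalized_text": "texto de ejemplo",
    },
]

text_lines, utt2spk_lines = kaldi_lines(records)
print("\n".join(utt2spk_lines))
```

A real Kaldi recipe needs more than these two files (for example `wav.scp`), which this sketch leaves out.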
### Source Data
#### Initial Data Collection and Normalization
The LIBRIVOX SPANISH CORPUS is a speech corpus designed to train acoustic models for automatic speech recognition. It is built from 300 audiobooks taken from [Librivox](https://librivox.org/).
### Annotations
#### Annotation process
The annotation process is as follows:
1. A whole recording is manually segmented, keeping only the portions that contain good-quality speech.
2. A second pass of segmentation is performed, this time to separate speakers and put them in different folders.
3. The resulting speech files, between 5 and 10 seconds long, are transcribed by students from different departments (computing, engineering, linguistics). Most of them are native speakers, but they have no particular training as transcribers.
#### Who are the annotators?
The LIBRIVOX SPANISH CORPUS was created under the umbrella of the social service program ["Desarrollo de Tecnologías del Habla"](http://profesores.fi-b.unam.mx/carlos_mena/servicio.html) of the ["Facultad de Ingeniería"](https://www.ingenieria.unam.mx/) (FI) in the ["Universidad Nacional Autónoma de México"](https://www.unam.mx/) (UNAM) between 2016 and 2019 by Carlos Daniel Hernández Mena, head of the program.
### Personal and Sensitive Information
The dataset could contain names revealing the identity of some speakers. On the other hand, the recordings come from publicly available Librivox audiobooks, so there is no real expectation of anonymity on the part of the participants. In any case, you agree not to attempt to determine the identity of speakers in this dataset.
## Considerations for Using the Data
### Social Impact of Dataset
This dataset is valuable because it contains well-pronounced speech with low noise.
### Discussion of Biases
The dataset is gender-balanced: it comprises 77 female speakers and 77 male speakers.
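Balance in the number of speakers does not by itself say how recording time is distributed, and this card does not report hours per gender. The sketch below shows one way a user might tally both from the `gender`, `speaker_id`, and `duration` fields. The function name and the toy rows are hypothetical, not corpus data.

```python
from collections import defaultdict


def summarize_by_gender(examples):
    """Return {gender: (number of distinct speakers, total seconds)}."""
    speakers = defaultdict(set)
    seconds = defaultdict(float)
    for ex in examples:
        speakers[ex["gender"]].add(ex["speaker_id"])
        seconds[ex["gender"]] += ex["duration"]
    return {gender: (len(ids), seconds[gender]) for gender, ids in speakers.items()}


# Toy rows invented for illustration only.
rows = [
    {"speaker_id": "F_1", "gender": "female", "duration": 7.5},
    {"speaker_id": "F_1", "gender": "female", "duration": 7.5},
    {"speaker_id": "M_1", "gender": "male", "duration": 4.0},
    {"speaker_id": "M_2", "gender": "male", "duration": 8.0},
]
summary = summarize_by_gender(rows)
print(summary)
```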
### Other Known Limitations
LIBRIVOX SPANISH CORPUS by Carlos Daniel Hernández Mena is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License [CC-BY-SA-4.0](http://creativecommons.org/licenses/by-sa/4.0/) and it utilizes material from [Librivox](https://librivox.org/). This work was done with the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
## Additional Information
### Dataset Curators
The dataset was collected by students belonging to the social service program ["Desarrollo de Tecnologías del Habla"](http://profesores.fi-b.unam.mx/carlos_mena/servicio.html). It was curated by [Carlos Daniel Hernández Mena](https://huggingface.co/carlosdanielhernandezmena) in 2019.
### Licensing Information
[CC-BY-SA-4.0](http://creativecommons.org/licenses/by-sa/4.0/)
### Citation Information
```
@misc{carlosmena2020librivoxspanish,
      title={LIBRIVOX SPANISH CORPUS: Audio and Transcriptions taken from Librivox.org},
      ldc_catalog_no={LDC2020S01},
      DOI={https://doi.org/10.35111/a44z-6x49},
      author={Hernandez Mena, Carlos Daniel},
      journal={Linguistic Data Consortium, Philadelphia},
      year={2020},
      url={https://catalog.ldc.upenn.edu/LDC2020S01},
}
```
### Contributions
The author would like to thank Alejandro V. Mena, Elena Vera and Angélica Gutiérrez for their support of the social service program "Desarrollo de Tecnologías del Habla." He also thanks the social service students for all their hard work.
Special thanks to the Librivox team for publishing all the recordings that constitute the LIBRIVOX SPANISH CORPUS.
This dataset card was created as part of the objectives of the 16th edition of the Severo Ochoa Mobility Program (PN039300 - Severo Ochoa 2021 - E&T).



