ciempiess/ciempiess_fem

Name: ciempiess/ciempiess_fem
Creator: ciempiess
Published: 2024-08-03 22:37:59
License: no description available

Hugging Face. Updated 2024-08-03. Indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/ciempiess/ciempiess_fem

Resource description:

---
annotations_creators:
- expert-generated
language_creators:
- other
language:
- es
license:
- cc-by-sa-4.0
multilinguality:
- monolingual
size_categories:
- 10K<n<100K
source_datasets:
- original
task_categories:
- automatic-speech-recognition
task_ids: []
pretty_name: 'CIEMPIESS FEM CORPUS: Audio and Transcripts of Female Speakers in Spanish.'
tags:
- ciempiess
- spanish
- mexican spanish
- ciempiess project
- ciempiess-unam project
dataset_info:
  config_name: ciempiess_fem
  features:
  - name: audio_id
    dtype: string
  - name: audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: speaker_id
    dtype: string
  - name: gender
    dtype: string
  - name: duration
    dtype: float32
  - name: country
    dtype: string
  - name: normalized_text
    dtype: string
  splits:
  - name: train
    num_bytes: 781435185.285
    num_examples: 6505
  download_size: 958145084
  dataset_size: 781435185.285
configs:
- config_name: ciempiess_fem
  data_files:
  - split: train
    path: ciempiess_fem/train-*
  default: true
---

# Dataset Card for ciempiess_fem

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)
## Dataset Description

- **Homepage:** [CIEMPIESS-UNAM Project](https://ciempiess.org/)
- **Repository:** [CIEMPIESS FEM at LDC](https://catalog.ldc.upenn.edu/LDC2019S07)
- **Point of Contact:** [Carlos Mena](mailto:carlos.mena@ciempiess.org)

### Dataset Summary

Since the publication of the [CIEMPIESS Corpus (LDC2015S07)](https://catalog.ldc.upenn.edu/LDC2015S07) in 2015, we have noticed a lack of female speakers in the sources from which we traditionally take audio to create new CIEMPIESS datasets. That is why we decided to create a corpus that helps to balance future gender-unbalanced datasets.

The CIEMPIESS FEM Corpus was created from recordings and human transcripts of 21 different women. Sixteen of these women are Mexican. The others come from other Latin American countries.

The CIEMPIESS FEM Corpus is considered a CIEMPIESS dataset because it only contains audio from the same source as the first CIEMPIESS Corpus. It is "FEM" because it only contains recordings of female speakers.

This corpus is part of the [CIEMPIESS Experimentation](https://catalog.ldc.upenn.edu/LDC2019S07), which is a set of three different datasets, specifically [CIEMPIESS COMPLEMENTARY](https://huggingface.co/datasets/ciempiess/ciempiess_complementary), [CIEMPIESS FEM](https://huggingface.co/datasets/ciempiess/ciempiess_fem) and [CIEMPIESS TEST](https://huggingface.co/datasets/ciempiess/ciempiess_test).

CIEMPIESS is the acronym for "Corpus de Investigación en Español de México del Posgrado de Ingeniería Eléctrica y Servicio Social".

### Example Usage

The CIEMPIESS FEM contains only the train split:

```python
from datasets import load_dataset

# Without `split`, a DatasetDict keyed by split name is returned.
ciempiess_fem = load_dataset("ciempiess/ciempiess_fem")
```

It is also valid to do:

```python
from datasets import load_dataset

# With `split="train"`, the train Dataset itself is returned.
ciempiess_fem = load_dataset("ciempiess/ciempiess_fem", split="train")
```

### Supported Tasks

automatic-speech-recognition: The dataset can be used to test a model for Automatic Speech Recognition (ASR).
The model is presented with an audio file and asked to transcribe it to written text. The most common evaluation metric is the word error rate (WER).

### Languages

The language of the corpus is Spanish.

## Dataset Structure

### Data Instances

```python
{
  'audio_id': 'CMPF_F_05_MEX_0387',
  'audio': {
    'path': '/home/carlos/.cache/HuggingFace/datasets/downloads/extracted/8a3e27631315b39636ac51affc04585335f9699f9635269c49f7938936aa60b8/train/mexican/F_05/CMPF_F_05_MEX_0387.flac',
    'array': array([0.0090332 , 0.0151062 , 0.01257324, ..., 0.01861572, 0.01797485,
       0.02017212], dtype=float32),
    'sampling_rate': 16000
  },
  'speaker_id': 'F_05',
  'gender': 'female',
  'duration': 4.979000091552734,
  'country': 'Mexico',
  'normalized_text': 'entre dos o más personas pero eh tienen que darse de manera'
}
```

### Data Fields

* `audio_id` (string) - id of the audio segment.
* `audio` (datasets.Audio) - a dictionary containing the path to the audio, the decoded audio array, and the sampling rate. In non-streaming mode (default), the path points to the locally extracted audio. In streaming mode, the path is the relative path of an audio file inside its archive (as files are not downloaded and extracted locally).
* `speaker_id` (string) - id of the speaker.
* `gender` (string) - gender of the speaker (male or female).
* `duration` (float32) - duration of the audio file in seconds.
* `country` (string) - country of the speaker.
* `normalized_text` (string) - normalized transcription of the audio segment.

### Data Splits

The corpus has only a train split, with a total of 6505 speech files from 21 female speakers and a total duration of 13 hours and 54 minutes.

## Dataset Creation

### Curation Rationale

The CIEMPIESS FEM (CF) Corpus has the following characteristics:

* The CF has a total of 6505 audio files from 21 different women. It has a total duration of 13 hours and 54 minutes.
* Every audio file in the CF has a duration of approximately 5 to 10 seconds.
* Data in the CF is classified by speaker and also by country, so one can easily select audio from a particular set of speakers for experiments.
* Audio files in the CF and in the first CIEMPIESS Corpus are all of the same type. In both, speakers talk about legal topics and matters concerning lawyers. They also talk about things related to the [UNAM University](https://www.unam.mx/) and the [Facultad de Derecho de la UNAM](https://www.derecho.unam.mx/).
* As in the first CIEMPIESS Corpus, transcriptions in the CF were made by humans.
* Speakers in the CF are not present in any other CIEMPIESS dataset.
* Audio files in the CF are distributed in a 16 kHz, 16-bit mono format.

### Source Data

#### Initial Data Collection and Normalization

The CIEMPIESS FEM is a radio corpus designed to train acoustic models for automatic speech recognition. It is made up of recordings of spontaneous conversations in Spanish between a radio moderator and his guests. Most of the speech in these conversations has the accent of Central Mexico.

All the recordings that constitute the CIEMPIESS FEM come from [RADIO-IUS](https://www.derecho.unam.mx/cultura-juridica/radio.php), a radio station belonging to [UNAM](https://www.unam.mx/). Recordings were donated by Lic. Cesar Gabriel Alanis Merchand and Mtro. Ricardo Rojas Arevalo from the [Facultad de Derecho de la UNAM](https://www.derecho.unam.mx/) on the condition that they be used for academic and research purposes only.

### Annotations

#### Annotation process

The annotation process is as follows:

1. A whole podcast is manually segmented, keeping only the portions that contain good-quality speech.
2. A second pass of segmentation is performed, this time to separate the speakers and put them in different folders.
3. The resulting speech files, between 5 and 10 seconds long, are transcribed by students from different departments (computing, engineering, linguistics). Most of them are native speakers, but they have no particular training as transcribers.
#### Who are the annotators?

The CIEMPIESS FEM Corpus was created between 2016 and 2018 by the social service program ["Desarrollo de Tecnologías del Habla"](http://profesores.fi-b.unam.mx/carlos_mena/servicio.html) of the ["Facultad de Ingeniería"](https://www.ingenieria.unam.mx/) (FI) at the ["Universidad Nacional Autónoma de México"](https://www.unam.mx/) (UNAM), under Carlos Daniel Hernández Mena, head of the program.

### Personal and Sensitive Information

The dataset could contain names revealing the identity of some speakers. On the other hand, the recordings come from publicly available podcasts, so the participants had no real intention of remaining anonymous. In any case, by using this dataset you agree not to attempt to determine the identity of its speakers.

## Considerations for Using the Data

### Social Impact of Dataset

This dataset is valuable because it contains spontaneous speech.

### Discussion of Biases

The dataset is not gender balanced. It comprises 6505 audio files from 21 different women, and the vocabulary is limited to legal issues.

### Other Known Limitations

"CIEMPIESS FEM CORPUS" by Carlos Daniel Hernández Mena is licensed under a [Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)](https://creativecommons.org/licenses/by-sa/4.0/) License with the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

## Additional Information

### Dataset Curators

The dataset was collected by students belonging to the social service program ["Desarrollo de Tecnologías del Habla"](http://profesores.fi-b.unam.mx/carlos_mena/servicio.html). It was curated by [Carlos Daniel Hernández Mena](https://huggingface.co/carlosdanielhernandezmena) in 2018.
### Licensing Information

[CC-BY-SA-4.0](https://creativecommons.org/licenses/by-sa/4.0/)

### Citation Information

```
@misc{carlosmenaciempiessfem2019,
      title={CIEMPIESS FEM CORPUS: Audio and Transcripts of Female Speakers in Spanish.},
      ldc_catalog_no={LDC2019S07},
      DOI={https://doi.org/10.35111/xdx5-n815},
      author={Hernandez Mena, Carlos Daniel},
      journal={Linguistic Data Consortium, Philadelphia},
      year={2019},
      url={https://catalog.ldc.upenn.edu/LDC2019S07},
}
```

### Contributions

The authors want to thank Alejandro V. Mena, Elena Vera and Angélica Gutiérrez for their support of the social service program "Desarrollo de Tecnologías del Habla". We also thank the social service students for all their hard work.

Provider:

ciempiess

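Summary of Original Information

Dataset Card for ciempiess_fem

Dataset Description

Dataset Summary

The CIEMPIESS FEM Corpus was created from recordings and human transcripts of 21 different women. Sixteen of them are from Mexico; the others come from other Latin American countries. The corpus is intended to help balance future gender-unbalanced datasets.

Supported Tasks

Automatic speech recognition (ASR): the dataset can be used to test the performance of an ASR model, which receives an audio file and transcribes it into written text. The main evaluation metric is the word error rate (WER).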
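As a rough illustration of the metric named above, WER can be computed as the word-level edit distance between a reference transcript and a model hypothesis, divided by the number of reference words. The sketch below is a minimal pure-Python version; the function name `word_error_rate` and the hypothesis string are assumptions made for illustration, and the reference is the sample transcript from the card.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # At the start of each outer iteration, prev[j] is the edit distance
    # between the reference words seen so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Reference taken from the card's sample; the hypothesis is invented for
# illustration and simply drops the filler word "eh".
reference = "entre dos o más personas pero eh tienen que darse de manera"
hypothesis = "entre dos o más personas pero tienen que darse de manera"
print(word_error_rate(reference, hypothesis))
```

Here the hypothesis misses one of the twelve reference words, so the score is one deletion over twelve words. In practice an established evaluation package would normally be used instead of a hand-written function.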

Languages

The language of the corpus is Spanish.

Dataset Structure

Data Instances

{
  'audio_id': 'CMPF_F_05_MEX_0387',
  'audio': {
    'path': '/home/carlos/.cache/HuggingFace/datasets/downloads/extracted/8a3e27631315b39636ac51affc04585335f9699f9635269c49f7938936aa60b8/train/mexican/F_05/CMPF_F_05_MEX_0387.flac',
    'array': array([0.0090332 , 0.0151062 , 0.01257324, ..., 0.01861572, 0.01797485, 0.02017212], dtype=float32),
    'sampling_rate': 16000
  },
  'speaker_id': 'F_05',
  'gender': 'female',
  'duration': 4.979000091552734,
  'country': 'Mexico',
  'normalized_text': 'entre dos o más personas pero eh tienen que darse de manera'
}

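Data Fields

audio_id (string) - ID of the audio segment
audio (datasets.Audio) - a dictionary containing the audio path, the decoded audio array, and the sampling rate
speaker_id (string) - ID of the speaker
gender (string) - gender of the speaker (male or female)
duration (float32) - duration of the audio file in seconds
country (string) - country of the speaker
normalized_text (string) - normalized transcription of the audio segment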
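The field list can be turned into a small sanity check on individual records. The sketch below is only an illustration: `EXPECTED_FIELDS`, `AUDIO_KEYS` and `check_record` are hypothetical names, and the sample record reuses the card's example with the audio path shortened to the file name and the array cut to the first three values shown there.

```python
from typing import Any

# Field names and Python types corresponding to the list above; "audio" is
# the dictionary with path, array and sampling_rate.
EXPECTED_FIELDS = {
    "audio_id": str,
    "audio": dict,
    "speaker_id": str,
    "gender": str,
    "duration": float,
    "country": str,
    "normalized_text": str,
}
AUDIO_KEYS = {"path", "array", "sampling_rate"}


def check_record(record: dict[str, Any]) -> list[str]:
    """Return a list of problems found in one record (empty if none)."""
    problems = []
    for name, expected_type in EXPECTED_FIELDS.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected_type):
            problems.append(f"{name} is not {expected_type.__name__}")
    audio = record.get("audio")
    if isinstance(audio, dict) and not AUDIO_KEYS.issubset(audio):
        problems.append("audio lacks path, array or sampling_rate")
    return problems


# Shortened stand-in for the card's sample record (path and array truncated).
record = {
    "audio_id": "CMPF_F_05_MEX_0387",
    "audio": {
        "path": "CMPF_F_05_MEX_0387.flac",
        "array": [0.0090332, 0.0151062, 0.01257324],
        "sampling_rate": 16000,
    },
    "speaker_id": "F_05",
    "gender": "female",
    "duration": 4.979000091552734,
    "country": "Mexico",
    "normalized_text": "entre dos o más personas pero eh tienen que darse de manera",
}
print(check_record(record))
```

A check like this is mainly useful after custom preprocessing, where a renamed or dropped column would otherwise surface only later in training code.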

Data Splits

The corpus contains only a train split, with 6505 speech files from 21 female speakers and a total duration of 13 hours and 54 minutes.
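As a quick consistency check on these figures, the mean clip length implied by the totals can be worked out directly. The snippet below uses only the numbers stated in the card; the variable names are arbitrary.

```python
total_seconds = 13 * 3600 + 54 * 60  # 13 hours 54 minutes
num_files = 6505

mean_duration = total_seconds / num_files
print(f"{total_seconds} s over {num_files} files: {mean_duration:.2f} s per file")
# prints "50040 s over 6505 files: 7.69 s per file"
```

The result falls inside the 5 to 10 second range that the card gives for individual audio files.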

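Dataset Creation

Curation Rationale

The CIEMPIESS FEM (CF) Corpus has the following characteristics:

It contains 6505 audio files from 21 different women, with a total duration of 13 hours and 54 minutes.
Each audio file lasts approximately 5 to 10 seconds.
Data is classified by speaker and by country, which makes it easy to select audio from particular speakers for experiments.
The audio files are of the same type as those in the first CIEMPIESS Corpus: speakers discuss legal topics and matters concerning lawyers, as well as topics related to the UNAM University and the UNAM Faculty of Law.
Transcriptions were made by humans.
Speakers in the CF do not appear in any other CIEMPIESS dataset.
Audio files are distributed in a 16 kHz, 16-bit mono format.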
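The per-speaker and per-country organization mentioned in the list above can be used to pick a subset of records. The following is a minimal sketch over plain dictionaries; the helper `select` and all rows other than the first `audio_id` are hypothetical and stand in for real metadata.

```python
from typing import Iterable, Optional


def select(records: Iterable[dict], country: Optional[str] = None,
           speakers: Optional[set] = None) -> list:
    """Keep records matching the given country and/or set of speaker IDs."""
    kept = []
    for record in records:
        if country is not None and record["country"] != country:
            continue
        if speakers is not None and record["speaker_id"] not in speakers:
            continue
        kept.append(record)
    return kept


# Hypothetical metadata rows; only the first mirrors the card's sample.
records = [
    {"audio_id": "CMPF_F_05_MEX_0387", "speaker_id": "F_05", "country": "Mexico"},
    {"audio_id": "example_a", "speaker_id": "F_05", "country": "Mexico"},
    {"audio_id": "example_b", "speaker_id": "F_99", "country": "Elsewhere"},
]

mexican = select(records, country="Mexico")
print([r["audio_id"] for r in mexican])
```

With the `datasets` library, the same kind of predicate could presumably be passed to `Dataset.filter`, for example `ciempiess_fem.filter(lambda ex: ex["country"] == "Mexico")` on the train split.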

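Source Data

CIEMPIESS FEM is a radio corpus designed to train acoustic models for automatic speech recognition. It consists of recordings of spontaneous conversations in Spanish between a radio host and guests. Most of the speech has a Central Mexican accent. All recordings come from RADIO-IUS, a radio station belonging to UNAM, and are for academic and research use only.

Annotations

The annotation process is as follows:

A whole podcast is manually segmented, keeping only the portions that contain good-quality speech.
A second segmentation pass separates the speakers and places them in different folders.
Speech files between 5 and 10 seconds long are transcribed by students from different departments (computing, engineering, linguistics). Most of them are native speakers, but they have no specific training as transcribers.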

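Personal and Sensitive Information

The dataset may contain names that reveal the identity of some speakers. However, because the recordings come from publicly available podcasts, the participants had no intention of remaining anonymous. Users agree not to attempt to determine the identity of speakers in this dataset.

Considerations for Using the Data

Social Impact of Dataset

This dataset is valuable because it contains spontaneous speech.

Discussion of Biases

The dataset is not gender balanced: it contains 6505 audio files from 21 different women, and its vocabulary is limited to legal topics.

Other Known Limitations

"CIEMPIESS FEM CORPUS" by Carlos Daniel Hernández Mena is licensed under a Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) License, in the hope that it will be useful, but without any warranty, express or implied, including warranties of merchantability and fitness for a particular purpose.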

Dataset Curators

The dataset was collected by students of the social service program "Desarrollo de Tecnologías del Habla" and curated by Carlos Daniel Hernández Mena in 2018.

Licensing Information

CC-BY-SA-4.0

Citation Information

@misc{carlosmenaciempiessfem2019,
      title={CIEMPIESS FEM CORPUS: Audio and Transcripts of Female Speakers in Spanish.},
      ldc_catalog_no={LDC2019S07},
      DOI={https://doi.org/10.35111/xdx5-n815},
      author={Hernandez Mena, Carlos Daniel},
      journal={Linguistic Data Consortium, Philadelphia},
      year={2019},
      url={https://catalog.ldc.upenn.edu/LDC2019S07},
}

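Contributions

The authors thank Alejandro V. Mena, Elena Vera and Angélica Gutiérrez for their support of the social service program "Desarrollo de Tecnologías del Habla", and the social service students for their hard work.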
