johnatanebonilla/ameresco-asr

Name: johnatanebonilla/ameresco-asr
Creator: johnatanebonilla
Published: 2024-01-03 12:24:24
License: No description available

Source: Hugging Face. Updated 2024-01-03. Indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/johnatanebonilla/ameresco-asr

Resource description:

---
dataset_info:
  features:
  - name: audio
    dtype: audio
  - name: id
    dtype: int64
  - name: time
    dtype: string
  - name: sentence
    dtype: string
  - name: orig_file_name
    dtype: string
  splits:
  - name: train
    num_bytes: 1440415495.732
    num_examples: 19588
  - name: validation
    num_bytes: 169864351.65
    num_examples: 2449
  - name: test
    num_bytes: 175082366.835
    num_examples: 2449
  download_size: 1551213354
  dataset_size: 1785362214.2170002
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: validation
    path: data/validation-*
  - split: test
    path: data/test-*
task_categories:
- automatic-speech-recognition
- conversational
language:
- es
pretty_name: AMERESCO-ASR Subset
size_categories:
- 10K<n<100K
---

# Introduction

The "Ameresco-ASR" Subset is a specialized extract from the "Corpus Ameresco" (Albelda and Estellés, online), focusing on colloquial Spanish spoken in various cities across Spain and the Americas. The dataset was curated to facilitate fine-tuning of Whisper, an automatic speech recognition system. To that end, audio and text segments ranging from 3 to 30 seconds were automatically extracted from the Ameresco corpus, offering diverse samples of colloquial Spanish from different sociolects and regions. To keep processing manageable and efficient, a maximum of 1024 tokens was used in the dataset, balancing comprehensive coverage against computational cost.

# Content and Geographic Focus

The original Ameresco corpus was initiated as a collaborative project led by Antonio Briz, focusing on the study of colloquial Spanish in European and American geolects. The project included initiatives such as incorporating American usages and Americanisms into the Dictionary of Discourse Particles of Spanish (www.dpde.es), directed by Antonio Briz, Salvador Pons, and José Portolés, as well as the study of attenuation in various Spanish dialects (Projects Es. Var. Atenuación [IP Marta Albelda], Es VaG. Atenuación [IP Marta Albelda, Maria Estellés]). The project Es.Por.Atenuación, led by Antonio Briz, was also part of this effort; it studied attenuation in Portuguese and compared it with Spanish. Currently, the project is funded by the Esprint project of the Ministry of Science and Innovation (PID2020-114805GB-100, IP Marta Albelda, Maria Estellés).

The primary outcome of the Ameresco project is the compilation of the Ameresco corpus, which aims to gather samples of colloquial conversations from major cities in Spain and the Americas, including Santiago de Chile, Tegucigalpa, Temuco, Tucumán, Barranquilla, Buenos Aires, Ciudad de México, Ciudad de Panamá, Iquique, La Habana, Las Palmas, Loja, Medellín, Monterrey, Querétaro, and Santa Cruz. This corpus provides a rich resource for studying colloquial Spanish across different regions and sociolects.

# Transcription approach

See the working documents at https://esvaratenuacion.es/protocolo-de-trabajo

# References

Albelda, M. and Estellés, M. (coords.): Corpus Ameresco, Universitat de València, ISSN: 2659-8337, www.corpusameresco.com

Provided by:

johnatanebonilla

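Original information summary

Dataset overview

Data features

Audio (audio): audio data
ID (id): integer (int64)
Time (time): string
Sentence (sentence): string
Original file name (orig_file_name): string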
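
The feature list above, together with the 3 to 30 second segment range stated in the introduction, can be sketched as a small record type. This is only an illustration: `AmerescoRecord`, `num_samples`, and `sampling_rate` are hypothetical names standing in for the decoded `audio` feature, and the format of the `time` string is not documented here, so it is carried as opaque text.

```python
from dataclasses import dataclass

# Segment-length bounds taken from the dataset introduction (3 to 30 seconds).
MIN_SECONDS = 3.0
MAX_SECONDS = 30.0


@dataclass
class AmerescoRecord:
    """Hypothetical container mirroring the listed feature names and dtypes."""

    id: int               # int64 in the dataset card
    time: str             # string; format not documented here
    sentence: str         # transcription text
    orig_file_name: str   # source recording name
    num_samples: int      # assumption: length of the decoded waveform
    sampling_rate: int    # assumption: samples per second (Hz)

    def duration_seconds(self) -> float:
        """Clip length derived from the waveform length and sampling rate."""
        return self.num_samples / self.sampling_rate

    def within_stated_range(self) -> bool:
        """True if the clip falls inside the 3 to 30 second range."""
        return MIN_SECONDS <= self.duration_seconds() <= MAX_SECONDS
```

A check like this can be useful as a sanity pass before fine-tuning, since it flags any clip that falls outside the range the card describes.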

Data splits

Training set:
- Bytes: 1440415495.732
- Examples: 19588
Validation set:
- Bytes: 169864351.65
- Examples: 2449
Test set:
- Bytes: 175082366.835
- Examples: 2449

Data size

Download size: 1551213354 bytes
Dataset size: 1785362214.2170002 bytes
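
The split and size figures above can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers listed in this page; the helper names (`split_fractions`, `mean_bytes_per_example`) are illustrative and not part of any library.

```python
# Figures copied from the split and size listing above.
SPLITS = {
    "train": {"num_bytes": 1440415495.732, "num_examples": 19588},
    "validation": {"num_bytes": 169864351.65, "num_examples": 2449},
    "test": {"num_bytes": 175082366.835, "num_examples": 2449},
}
DOWNLOAD_SIZE = 1551213354          # bytes
DATASET_SIZE = 1785362214.2170002   # bytes


def split_fractions(splits):
    """Share of examples in each split, as a fraction of all examples."""
    total = sum(s["num_examples"] for s in splits.values())
    return {name: s["num_examples"] / total for name, s in splits.items()}


def mean_bytes_per_example(splits):
    """Average stored size of one example in each split."""
    return {name: s["num_bytes"] / s["num_examples"] for name, s in splits.items()}


total_examples = sum(s["num_examples"] for s in SPLITS.values())
total_bytes = sum(s["num_bytes"] for s in SPLITS.values())
fractions = split_fractions(SPLITS)
download_ratio = DOWNLOAD_SIZE / DATASET_SIZE

print(total_examples)                                   # 24486
print({k: round(v, 2) for k, v in fractions.items()})   # train 0.8, validation 0.1, test 0.1
```

The per-split byte counts add up to the listed dataset size, and the example counts correspond to roughly an 80/10/10 train/validation/test division.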

Configuration

Default configuration:
- Training set path: data/train-*
- Validation set path: data/validation-*
- Test set path: data/test-*
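
A minimal loading sketch follows, assuming the Hugging Face `datasets` library is installed and the repository is reachable. `load_split` and `SPLIT_PATTERNS` are illustrative names; the repository ID and path patterns are the ones listed on this page. The `datasets` import is deferred so that the split name is validated before any network access is attempted.

```python
REPO_ID = "johnatanebonilla/ameresco-asr"

# Split names and file patterns from the default configuration above.
SPLIT_PATTERNS = {
    "train": "data/train-*",
    "validation": "data/validation-*",
    "test": "data/test-*",
}


def load_split(split: str = "train", streaming: bool = True):
    """Load one split of the dataset (hypothetical helper).

    With streaming=True, examples are fetched lazily rather than
    downloading the full archive first.
    """
    if split not in SPLIT_PATTERNS:
        raise ValueError(
            f"unknown split {split!r}; expected one of {sorted(SPLIT_PATTERNS)}"
        )
    # Imported lazily: requires `pip install datasets` and network access.
    from datasets import load_dataset

    return load_dataset(REPO_ID, split=split, streaming=streaming)
```

For example, `next(iter(load_split("validation")))` would be expected to yield a single record with the `audio`, `id`, `time`, `sentence`, and `orig_file_name` fields, though the exact shape of the decoded audio depends on the installed `datasets` version.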

Task categories

Automatic speech recognition
Conversational

Language

Spanish

Dataset name

AMERESCO-ASR Subset

Dataset size category

10K<n<100K