johnatanebonilla/ameresco-asr
Source: Hugging Face. Updated 2024-01-03; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/johnatanebonilla/ameresco-asr
Resource description:
---
dataset_info:
  features:
  - name: audio
    dtype: audio
  - name: id
    dtype: int64
  - name: time
    dtype: string
  - name: sentence
    dtype: string
  - name: orig_file_name
    dtype: string
  splits:
  - name: train
    num_bytes: 1440415495.732
    num_examples: 19588
  - name: validation
    num_bytes: 169864351.65
    num_examples: 2449
  - name: test
    num_bytes: 175082366.835
    num_examples: 2449
  download_size: 1551213354
  dataset_size: 1785362214.2170002
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: validation
    path: data/validation-*
  - split: test
    path: data/test-*
task_categories:
- automatic-speech-recognition
- conversational
language:
- es
pretty_name: AMERESCO-ASR Subset
size_categories:
- 10K<n<100K
---
# Introduction
The "Ameresco-ASR" Subset is a specialized extract from the "Corpus Ameresco" (Albelda and Estellés, online) that focuses on colloquial Spanish spoken in various cities across Spain and the Americas. It was curated to facilitate fine-tuning of Whisper, an automatic speech recognition system. To that end, audio and text segments of 3 to 30 seconds were automatically extracted from the Ameresco corpus, giving diverse samples of colloquial Spanish from different sociolects and regions. To keep processing manageable, a maximum of 1024 tokens was used in the dataset, balancing coverage against computational cost.
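As a minimal sketch of how the subset might be loaded and the stated 3 to 30 second segment range checked, the snippet below uses the Hugging Face `datasets` library. The repository id is taken from the download link above; the helper names `load_ameresco`, `clip_duration_seconds`, and `within_stated_range` are hypothetical and are not part of the dataset or of any library.

```python
MIN_SECONDS = 3.0   # lower bound on segment length stated in the card
MAX_SECONDS = 30.0  # upper bound on segment length stated in the card


def load_ameresco(split="train"):
    """Load one split of the dataset (requires the `datasets` package and network access)."""
    # Imported lazily so the pure-Python helpers below work without the dependency.
    from datasets import load_dataset
    return load_dataset("johnatanebonilla/ameresco-asr", split=split)


def clip_duration_seconds(num_samples, sampling_rate):
    """Return the duration of a clip from its sample count and sampling rate in Hz."""
    if sampling_rate <= 0:
        raise ValueError("sampling_rate must be positive")
    return num_samples / sampling_rate


def within_stated_range(duration_s):
    """Check a duration against the 3 to 30 second range described in the card."""
    return MIN_SECONDS <= duration_s <= MAX_SECONDS
```

With network access, `ds = load_ameresco("validation")` followed by `clip_duration_seconds(len(ds[0]["audio"]["array"]), ds[0]["audio"]["sampling_rate"])` would give the length of the first clip, assuming the audio column decodes to a mapping with `array` and `sampling_rate` keys in the installed `datasets` version.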
# Content and Geographic Focus
The original Ameresco corpus began as a collaborative project led by Antonio Briz on the study of colloquial Spanish in European and American geolects. The project included initiatives such as incorporating American usages and Americanisms into the Dictionary of Discourse Particles of Spanish (www.dpde.es), directed by Antonio Briz, Salvador Pons, and José Portolés, as well as the study of attenuation in various Spanish dialects (projects Es. Var. Atenuación [PI: Marta Albelda] and Es VaG. Atenuación [PIs: Marta Albelda, Maria Estellés]). The project Es.Por.Atenuación, led by Antonio Briz, which studied attenuation in Portuguese and compared it with Spanish, was also part of this effort. The work is currently funded through the Esprint project of the Ministry of Science and Innovation (PID2020-114805GB-I00, PIs: Marta Albelda, Maria Estellés).
The primary outcome of the Ameresco project is the Ameresco corpus, which gathers samples of colloquial conversations from major cities in Spain and the Americas, including Santiago de Chile, Tegucigalpa, Temuco, Tucumán, Barranquilla, Buenos Aires, Ciudad de México, Ciudad de Panamá, Iquique, La Habana, Las Palmas, Loja, Medellín, Monterrey, Querétaro, and Santa Cruz. The corpus is a rich resource for studying colloquial Spanish across different regions and sociolects.
# Transcription approach
See the working documents at https://esvaratenuacion.es/protocolo-de-trabajo
# References
Albelda, M. y Estellés, M. (coords.): Corpus Ameresco, Universitat de València, ISSN: 2659-8337, www.corpusameresco.com
Provider:
johnatanebonilla
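Original information summary
Dataset overview
Data features
- Audio: audio data
- ID: integer
- Time: string
- Sentence: string
- Original file name: string
Data splits
- Train:
  - Bytes: 1440415495.732
  - Examples: 19588
- Validation:
  - Bytes: 169864351.65
  - Examples: 2449
- Test:
  - Bytes: 175082366.835
  - Examples: 2449
Data size
- Download size: 1551213354
- Dataset size: 1785362214.2170002
Configuration
- Default configuration:
  - Train path: data/train-*
  - Validation path: data/validation-*
  - Test path: data/test-*
Task categories
- Automatic speech recognition
- Conversational
Language
- Spanish
Dataset name
- AMERESCO-ASR Subset
Dataset size category
- 10K<n<100K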
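
The split statistics listed in the metadata can be cross-checked with a few lines of arithmetic. The sketch below copies the example and byte counts from the card; the variable names are illustrative only. It suggests that the per-split byte counts sum to the reported dataset size and that the splits follow roughly an 80/10/10 proportion.

```python
# Figures copied from the dataset metadata.
splits = {
    "train": {"num_bytes": 1440415495.732, "num_examples": 19588},
    "validation": {"num_bytes": 169864351.65, "num_examples": 2449},
    "test": {"num_bytes": 175082366.835, "num_examples": 2449},
}
reported_dataset_size = 1785362214.2170002

total_examples = sum(s["num_examples"] for s in splits.values())
total_bytes = sum(s["num_bytes"] for s in splits.values())

# Share of examples in each split.
shares = {name: s["num_examples"] / total_examples for name, s in splits.items()}

for name, share in shares.items():
    print(f"{name}: {share:.1%}")
# train: 80.0%
# validation: 10.0%
# test: 10.0%
```

The total of 24486 examples also falls inside the declared `10K<n<100K` size category.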



