humyn-labs/LATAM-High-Fidelity-ASR
Hugging Face · Updated 2026-03-13 · Indexed 2026-04-05
Download link:
https://hf-mirror.com/datasets/humyn-labs/LATAM-High-Fidelity-ASR
Description:
---
dataset_info:
  features:
  - name: language
    dtype: string
  - name: file_name
    dtype: string
  - name: audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: transcript_json
    dtype: string
  - name: type
    dtype: string
  splits:
  - name: train
    num_bytes: 1393845879
    num_examples: 28
  download_size: 1011490115
  dataset_size: 1393845879
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
language:
- es
- pt
tags:
- conversational_speech
- multi-speaker
- ASR
- LATAM-languages
size_categories:
- n<1K
license: cc-by-4.0
task_categories:
- automatic-speech-recognition
---
## Dataset Overview
This dataset contains high-quality conversational audio samples curated for **Automatic Speech Recognition** tasks in regional varieties of Spanish and in Brazilian Portuguese.
The dataset includes:
* Paired **audio + transcripts**
* Natural, non-scripted conversational speech
* Single-speaker and dual-speaker recordings
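
As a starting point for exploring the paired audio and transcripts, the sketch below loads the `train` split with the Hugging Face `datasets` library and tallies rows by a metadata column. The column names (`language`, `type`, `audio`, `transcript_json`) come from the card metadata; the helper names `load_train_split` and `count_by` are hypothetical, and the values stored in `language` and `type` are not documented here, so inspect the output rather than assuming them.

```python
from collections import Counter
from typing import Iterable, Mapping

REPO_ID = "humyn-labs/LATAM-High-Fidelity-ASR"


def load_train_split():
    """Download the train split (needs `pip install datasets` and network access)."""
    # Imported lazily so count_by() below stays usable without the library.
    from datasets import load_dataset

    return load_dataset(REPO_ID, split="train")


def count_by(rows: Iterable[Mapping], field: str) -> Counter:
    """Count how many rows share each value of `field`."""
    return Counter(row[field] for row in rows)


# Typical use (not executed here because it downloads about 1 GB):
#   ds = load_train_split()
#   meta = ds.remove_columns(["audio", "transcript_json"])  # skip audio decoding
#   print(count_by(meta, "language"))
#   print(count_by(meta, "type"))
```

Dropping the `audio` column before iterating avoids decoding every recording just to read the metadata.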
### Audio Specifications
* **Sampling Rate:** 16 kHz – 24 kHz (the `audio` feature in the dataset metadata is declared at 16 kHz)
* **Bit Depth:** 16-bit
* **Audio Type:** Non-scripted conversational speech
---
## Supported Languages
| Language            |
| ------------------- |
| Spanish (Peru)      |
| Spanish (Venezuela) |
| Spanish (Argentina) |
| Portuguese (Brazil) |
---
## Speaker Representation
* Natural, spontaneous dialogue
* Balanced gender representation
---
# Dataset Creation Methodology
## Data Collection
Speech data was collected from native speakers across diverse regions:
* **Spanish – Peru**: Urban and semi-urban communities with regional dialect coverage.
* **Spanish – Venezuela**: Metro and non-metro regions reflecting standard and colloquial usage.
* **Spanish – Argentina**: Cross-regional accent variation, including voseo and phonetic nuances.
* **Portuguese – Brazil**: Cross-regional accents with a balance of formal and informal speech.
This ensured:
* Accent diversity
* Natural conversational flow
* Real-world dialogue patterns
---
## Recording Setup
* Non-scripted, dual-speaker conversations
* Duration: **10–30 minutes per recording**
* Topics include:
  * Business
  * Finance
  * Politics
  * Everyday life discussions
  * Social topics
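
The stated 10–30 minute duration can be sanity-checked against the size figures in the card metadata. The sketch below is a rough estimate only: it assumes the stored audio is uncompressed, mono, 16-bit PCM at 16 kHz and that transcripts contribute a negligible share of `dataset_size`. None of those storage details are confirmed by the card; compressed audio would mean longer recordings, and stereo or 24 kHz audio would mean shorter ones.

```python
NUM_BYTES = 1_393_845_879  # dataset_size from the card metadata
NUM_EXAMPLES = 28          # num_examples from the card metadata
SAMPLE_RATE_HZ = 16_000    # sampling_rate declared for the audio feature
BYTES_PER_SAMPLE = 2       # 16-bit audio
CHANNELS = 1               # assumption: mono


def pcm_seconds(num_bytes: float) -> float:
    """Seconds of audio that `num_bytes` of raw PCM would hold."""
    return num_bytes / (SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS)


avg_minutes = pcm_seconds(NUM_BYTES / NUM_EXAMPLES) / 60
total_hours = pcm_seconds(NUM_BYTES) / 3600

print(f"about {avg_minutes:.1f} min per recording, {total_hours:.1f} h in total")
# prints: about 25.9 min per recording, 12.1 h in total
```

Under these assumptions the average falls inside the stated 10–30 minute range, which is consistent with the card but not a measurement of the actual recordings.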
---
## Transcription Process
* Manual transcription by native speakers
* Reviewed for linguistic accuracy
* Preserves:
  * Conversational fillers
  * Natural pauses
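
Transcripts are stored in the `transcript_json` column as strings, so they need to be parsed before use. The card does not document the JSON schema, so the sketch below only reports the top-level structure instead of assuming field names. The helper name `describe_transcript` and the sample string are made up for illustration.

```python
import json
from typing import Any


def describe_transcript(raw: str) -> dict[str, Any]:
    """Parse a transcript_json string and summarize its top-level shape."""
    parsed = json.loads(raw)
    summary: dict[str, Any] = {"top_level_type": type(parsed).__name__}
    if isinstance(parsed, dict):
        summary["keys"] = sorted(parsed)
    elif isinstance(parsed, list):
        summary["length"] = len(parsed)
        if parsed and isinstance(parsed[0], dict):
            summary["first_item_keys"] = sorted(parsed[0])
    return summary


# Synthetic input; the real schema may differ.
sample = '[{"speaker": "A", "text": "eh, bueno"}]'
print(describe_transcript(sample))
```

Running this on one real row is a reasonable first step before writing code that depends on specific keys such as speaker labels or timestamps.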
---
# Dataset Intended Purpose
## Intended Uses
This dataset is designed for:
* Training and fine-tuning **Automatic Speech Recognition** models
* Conversational ASR benchmarking
* Speaker turn detection and interruption modeling
* Informal speech modeling
* Conversational AI research
* Academic and open-source research
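
For the benchmarking use case, word error rate (WER) is the usual metric: the word-level edit distance between a reference transcript and a model hypothesis, divided by the number of reference words. The following is a minimal sketch, not an evaluation protocol endorsed by the dataset authors. The function name `word_error_rate` is hypothetical, and the normalization (lowercasing and whitespace splitting) is an assumption; because the transcripts preserve fillers, whether fillers are kept or stripped before scoring will change the result.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing in hypothesis)
                curr[j - 1] + 1,     # insertion (extra word in hypothesis)
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of three reference words.
print(word_error_rate("hola como estas", "hola como esta"))
```

With only 28 recordings in the split, scores computed on this dataset alone are best read as indicative rather than statistically robust.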
---
## Out-of-Scope Uses
This dataset is **not intended for**:
* Safety-critical or real-time production systems without additional validation
* Commercial deployment without proper attribution and compliance with **CC BY 4.0**
* Medical, clinical, legal, or diagnostic applications
---
# License
This dataset is released under the **Creative Commons Attribution 4.0 International (CC BY 4.0)** license.
---
# 📬 Contact
For dataset-related queries, please contact:
**[support@humynlabs.ai](mailto:support@humynlabs.ai)**
Provider:
humyn-labs



