humyn-labs/LATAM-High-Fidelity-ASR

Name: humyn-labs/LATAM-High-Fidelity-ASR
Creator: humyn-labs
Published: 2026-03-13 10:03:11
License: No description available

Source: Hugging Face. Updated: 2026-03-13. Indexed: 2026-04-05.

Download link:

https://hf-mirror.com/datasets/humyn-labs/LATAM-High-Fidelity-ASR
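The mirror link above can also be used programmatically. The following is a minimal sketch, assuming the `huggingface_hub` package is installed and that hf-mirror.com serves the repository in the same layout as the Hub. The `local_dir` default is a placeholder chosen for illustration.

```python
from pathlib import Path

REPO_ID = "humyn-labs/LATAM-High-Fidelity-ASR"
MIRROR_ENDPOINT = "https://hf-mirror.com"  # host taken from the download link above


def download_dataset(local_dir: str = "latam_asr") -> Path:
    """Fetch the dataset repository through the mirror and return its local path.

    `local_dir` is a hypothetical target directory, not something the listing specifies.
    """
    # Imported lazily so this module can be imported without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    path = snapshot_download(
        repo_id=REPO_ID,
        repo_type="dataset",       # this is a dataset repository, not a model
        endpoint=MIRROR_ENDPOINT,  # route requests through the mirror host
        local_dir=local_dir,
    )
    return Path(path)
```

Calling `download_dataset()` transfers roughly 1 GB, going by the `download_size` of 1011490115 bytes declared in the dataset card.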

Resource description:

---
dataset_info:
  features:
  - name: language
    dtype: string
  - name: file_name
    dtype: string
  - name: audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: transcript_json
    dtype: string
  - name: type
    dtype: string
  splits:
  - name: train
    num_bytes: 1393845879
    num_examples: 28
  download_size: 1011490115
  dataset_size: 1393845879
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
language:
- es
- pt
tags:
- conversational_speech
- multi-speaker
- ASR
- LATAM-languages
size_categories:
- n<1K
license: cc-by-4.0
task_categories:
- automatic-speech-recognition
---

## Dataset Overview

This dataset contains high-quality conversational audio samples curated for **Automatic Speech Recognition** tasks in Spanish variants and Portuguese.

The dataset includes:

* Paired **audio + transcripts**
* Natural, non-scripted conversational speech
* Single-speaker and dual-speaker interactions

### Audio Specifications

* **Sampling Rate:** 16 kHz – 24 kHz
* **Bit Depth:** 16-bit
* **Audio Type:** Non-scripted conversational speech

---

## Supported Languages

| Language             |
| -------------------- |
| Spanish (Peru)       |
| Spanish (Venezuela)  |
| Spanish (Argentina)  |
| Portuguese (Brazil)  |

---

## Speaker Representation

* Natural, spontaneous dialogue
* Balanced gender representation

---

# Dataset Creation Methodology

## Data Collection

Speech data was collected from native speakers across diverse regions:

* **Spanish – Peru**: Urban and semi-urban communities with regional dialect coverage.
* **Spanish – Venezuela**: Metro and non-metro regions reflecting standard and colloquial usage.
* **Spanish – Argentina**: Cross-regional accent variation, including voseo and phonetic nuances.
* **Portuguese – Brazil**: Cross-regional accents with a balance of formal and informal speech.

This ensured:

* Accent diversity
* Natural conversational flow
* Real-world dialogue patterns

---

## Recording Setup

* Non-scripted, dual-speaker conversations
* Duration: **10–30 minutes per recording**
* Topics include:
  * Business
  * Finance
  * Politics
  * Everyday life discussions
  * Social topics

---

## Transcription Process

* Manual transcription by native speakers
* Reviewed for linguistic accuracy
* Preserves:
  * Conversational fillers
  * Natural pauses

---

# Dataset Intended Purpose

## Intended Uses

This dataset is designed for:

* Training and fine-tuning **Automatic Speech Recognition** models
* Conversational ASR benchmarking
* Speaker turn detection and interruption modeling
* Informal speech modeling
* Conversational AI research
* Academic and open-source research

---

## Out-of-Scope Uses

This dataset is **not intended for**:

* Safety-critical or real-time production systems without additional validation
* Commercial deployment without proper attribution and compliance with **CC BY 4.0**
* Medical, clinical, legal, or diagnostic applications

---

# License

This dataset is released under the **Creative Commons Attribution 4.0 International (CC BY 4.0)** license.

---

# 📬 Contact

For dataset-related queries, please contact: **[support@humynlabs.ai](mailto:support@humynlabs.ai)**
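The feature list in the card metadata (`language`, `file_name`, `audio`, `transcript_json`, `type`) suggests how a record might be read. The sketch below assumes the Hugging Face `datasets` library is installed and that `transcript_json` holds a JSON-encoded string. The card does not document the decoded structure or the exact values stored in `language`, so both should be inspected before relying on them. The helper names `parse_transcript` and `iter_examples` are illustrative, not part of the dataset.

```python
import json
from typing import Any, Dict, Iterator, Optional

REPO_ID = "humyn-labs/LATAM-High-Fidelity-ASR"


def parse_transcript(raw: str) -> Any:
    """Decode a `transcript_json` string.

    Assumption: the field is valid JSON. Its internal layout is not
    described in the card, so the result is returned as-is.
    """
    return json.loads(raw)


def iter_examples(language: Optional[str] = None) -> Iterator[Dict[str, Any]]:
    """Stream the train split, optionally keeping one `language` value."""
    # Imported lazily so this module can be imported without `datasets` installed.
    from datasets import load_dataset

    # Streaming avoids downloading the full split before the first record arrives.
    stream = load_dataset(REPO_ID, split="train", streaming=True)
    for example in stream:
        if language is None or example["language"] == language:
            yield example


def preview_first() -> None:
    """Print the scalar fields of the first record and the transcript's top-level type."""
    for example in iter_examples():
        print(example["language"], example["file_name"], example["type"])
        print(type(parse_transcript(example["transcript_json"])).__name__)
        break
```

With only 28 examples in the train split, iterating the whole stream once to collect the distinct `language` and `type` values is a cheap first step before filtering.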

Provider:

humyn-labs
