
humyn-labs/LATAM-High-Fidelity-ASR

Hugging Face · Updated 2026-03-13 · Indexed 2026-04-05
Download link:
https://hf-mirror.com/datasets/humyn-labs/LATAM-High-Fidelity-ASR
Description:
---
dataset_info:
  features:
  - name: language
    dtype: string
  - name: file_name
    dtype: string
  - name: audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: transcript_json
    dtype: string
  - name: type
    dtype: string
  splits:
  - name: train
    num_bytes: 1393845879
    num_examples: 28
  download_size: 1011490115
  dataset_size: 1393845879
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
language:
- es
- pt
tags:
- conversational_speech
- multi-speaker
- ASR
- LATAM-languages
size_categories:
- n<1K
license: cc-by-4.0
task_categories:
- automatic-speech-recognition
---

## Dataset Overview

This dataset contains high-quality conversational audio samples curated for **Automatic Speech Recognition** tasks in Spanish variants and Portuguese.

The dataset includes:

* Paired **audio + transcripts**
* Natural, non-scripted conversational speech
* Single-speaker and dual-speaker interactions

### Audio Specifications

* **Sampling Rate:** 16 kHz – 24 kHz
* **Bit Depth:** 16-bit
* **Audio Type:** Non-scripted conversational speech

---

## Supported Languages

| Language             |
| -------------------- |
| Spanish (Peru)       |
| Spanish (Venezuela)  |
| Spanish (Argentina)  |
| Portuguese (Brazil)  |

---

## Speaker Representation

* Natural, spontaneous dialogue
* Balanced gender representation

---

# Dataset Creation Methodology

## Data Collection

Speech data was collected from native speakers across diverse regions:

* **Spanish – Peru**: Urban and semi-urban communities with regional dialect coverage.
* **Spanish – Venezuela**: Metro and non-metro regions reflecting standard and colloquial usage.
* **Spanish – Argentina**: Cross-regional accent variation, including voseo and phonetic nuances.
* **Portuguese – Brazil**: Cross-regional accents with a balance of formal and informal speech.

This ensured:

* Accent diversity
* Natural conversational flow
* Real-world dialogue patterns

---

## Recording Setup

* Non-scripted, dual-speaker conversations
* Duration: **10–30 minutes per recording**
* Topics include:
  * Business
  * Finance
  * Politics
  * Everyday life discussions
  * Social topics

---

## Transcription Process

* Manual transcription by native speakers
* Reviewed for linguistic accuracy
* Preserves:
  * Conversational fillers
  * Natural pauses

---

# Dataset Intended Purpose

## Intended Uses

This dataset is designed for:

* Training and fine-tuning **Automatic Speech Recognition** models
* Conversational ASR benchmarking
* Speaker turn detection and interruption modeling
* Informal speech modeling
* Conversational AI research
* Academic and open-source research

---

## Out-of-Scope Uses

This dataset is **not intended for**:

* Safety-critical or real-time production systems without additional validation
* Commercial deployment without proper attribution and compliance with **CC BY 4.0**
* Medical, clinical, legal, or diagnostic applications

---

# License

This dataset is released under the **Creative Commons Attribution 4.0 International (CC BY 4.0)** license.

---

# 📬 Contact

For dataset-related queries, please contact: **[support@humynlabs.ai](mailto:support@humynlabs.ai)**
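
---

As a rough consistency check, the size figures in the metadata can be compared against the stated recording length of 10–30 minutes. The sketch below is only an estimate: it assumes mono audio stored as uncompressed 16-bit PCM at the declared `sampling_rate` of 16000, none of which the card confirms, and it ignores the bytes taken by transcripts and other fields. The function name `estimated_minutes_per_recording` is made up for this illustration.

```python
# Figures copied from the dataset card metadata.
DATASET_SIZE_BYTES = 1_393_845_879  # dataset_size / num_bytes of the train split
NUM_EXAMPLES = 28                   # num_examples of the train split

# Assumptions (NOT stated in the card): mono, uncompressed PCM storage.
SAMPLE_RATE_HZ = 16_000  # declared sampling_rate of the audio feature
BYTES_PER_SAMPLE = 2     # 16-bit depth
CHANNELS = 1             # assumed mono


def estimated_minutes_per_recording(total_bytes, num_examples,
                                    sample_rate=SAMPLE_RATE_HZ,
                                    bytes_per_sample=BYTES_PER_SAMPLE,
                                    channels=CHANNELS):
    """Rough mean recording length in minutes under the assumptions above."""
    bytes_per_example = total_bytes / num_examples
    bytes_per_second = sample_rate * bytes_per_sample * channels
    return bytes_per_example / bytes_per_second / 60


minutes = estimated_minutes_per_recording(DATASET_SIZE_BYTES, NUM_EXAMPLES)
print(f"~{minutes:.1f} minutes per recording")  # ~25.9 minutes per recording
```

Under these assumptions the mean comes out near 26 minutes, which falls inside the stated 10–30 minute range. If the audio is actually stored compressed, in stereo, or at 24 kHz, the real durations would differ, so treat the number as a plausibility check rather than a measurement.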

Provider:
humyn-labs